Okay. The talk I am going to give is about how to make uncertainty propagation run fast and also consume less memory.
My name is M. W. Mak, and I am from The Hong Kong Polytechnic University.
So here is the outline of my presentation. I will first give an overview of i-vector PLDA and explain how uncertainty propagation can model the uncertainty of the i-vectors, and how to make uncertainty propagation run faster and possibly use less memory. Then we evaluate the proposed approach on the NIST 2010 SRE, and finally we conclude. Okay, so here is the i-vector PLDA framework. Probably you all already know this, so I will go through it very quickly.
We use the posterior mean of the latent factor as a low-dimensional representation of the speaker. Given the MFCC vectors of an utterance, we compute the posterior mean of the latent factor, and we call this the i-vector. Here T is the total variability matrix that defines the channel and speaker subspace; that is, it represents the subspace in which the i-vectors vary.
So here is the procedure for i-vector extraction. Given a sequence of MFCC vectors, we extract the i-vector as the posterior mean of the latent factor. And because we would like to use Gaussian PLDA, we need to suppress the non-Gaussian behavior of the i-vectors through some preprocessing, for example whitening and also length normalization.
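As an illustration, here is a minimal sketch of what this preprocessing step could look like, assuming the whitening transform is estimated from a pool of development i-vectors (the function names are illustrative, not from the talk):

```python
import numpy as np

def train_whitener(dev_ivectors):
    """Estimate a whitening transform from development i-vectors."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    # Eigendecomposition of the covariance gives W such that
    # W @ (x - mu) has identity covariance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return mu, W

def preprocess(ivector, mu, W):
    """Whiten, then length-normalize an i-vector."""
    x = W @ (ivector - mu)
    return x / np.linalg.norm(x)   # project onto the unit sphere
```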
After this preprocessing step, we obtain the preprocessed i-vectors, and these preprocessed i-vectors can be modeled by PLDA. In this model, V represents the speaker subspace and h_i is the speaker factor. As you can see, for the j-th session of the i-th speaker we only have one latent factor h_i, and epsilon_ij represents the variability that cannot be represented by the speaker subspace.
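In symbols, this is the standard simplified Gaussian PLDA model:

```latex
\mathbf{w}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij},
\qquad
\mathbf{h}_i \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\quad
\boldsymbol{\epsilon}_{ij} \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}),
```

where w_ij is the j-th preprocessed i-vector of the i-th speaker, m is the global mean, V spans the speaker subspace, and Sigma is the residual covariance.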
Now let us turn to the scoring. At test time we have a test i-vector w_t, and we also have the target-speaker i-vector w_s. We compute the likelihood ratio, assuming under one hypothesis that w_s and w_t come from the same speaker, and under the alternative hypothesis that w_s and w_t come from different speakers.
After some mathematical manipulation, the score becomes this very nice equation. In this equation we only have matrix and vector multiplications, and the nice thing is that the matrices can all be pre-computed, as you can see from this set of equations at the bottom. All these terms, Sigma_ac, Sigma_tot, and so on, can be pre-computed from the PLDA model parameters. That explains why PLDA scoring is very fast.
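For reference, one standard closed form of this likelihood-ratio score, consistent with the model above, is:

```latex
S(\mathbf{w}_s,\mathbf{w}_t)
  = \mathbf{w}_s^{\mathsf T}\mathbf{Q}\,\mathbf{w}_s
  + \mathbf{w}_t^{\mathsf T}\mathbf{Q}\,\mathbf{w}_t
  + 2\,\mathbf{w}_s^{\mathsf T}\mathbf{P}\,\mathbf{w}_t + \mathrm{const},
```

where, with the across-class and total covariances

```latex
\boldsymbol{\Sigma}_{\mathrm{ac}} = \mathbf{V}\mathbf{V}^{\mathsf T},\qquad
\boldsymbol{\Sigma}_{\mathrm{tot}} = \mathbf{V}\mathbf{V}^{\mathsf T} + \boldsymbol{\Sigma},
```

the scoring matrices are

```latex
\mathbf{Q} = \boldsymbol{\Sigma}_{\mathrm{tot}}^{-1}
 - \bigl(\boldsymbol{\Sigma}_{\mathrm{tot}}
 - \boldsymbol{\Sigma}_{\mathrm{ac}}\boldsymbol{\Sigma}_{\mathrm{tot}}^{-1}\boldsymbol{\Sigma}_{\mathrm{ac}}\bigr)^{-1},
\qquad
\mathbf{P} = \boldsymbol{\Sigma}_{\mathrm{tot}}^{-1}\boldsymbol{\Sigma}_{\mathrm{ac}}
 \bigl(\boldsymbol{\Sigma}_{\mathrm{tot}}
 - \boldsymbol{\Sigma}_{\mathrm{ac}}\boldsymbol{\Sigma}_{\mathrm{tot}}^{-1}\boldsymbol{\Sigma}_{\mathrm{ac}}\bigr)^{-1}.
```

Both P and Q depend only on the model parameters, so they can be computed once before scoring.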
But one problem of this conventional i-vector PLDA framework is that it does not have the ability to represent the reliability of the i-vector. Whether the utterance is very long or very short, we still use a low-dimensional i-vector to represent the speaker characteristics of the whole utterance. This poses a problem for short-utterance speaker verification. It is not a problem for very long utterances, say a few minutes of speech. But if the utterance is only about ten seconds or three seconds, then the variability, or uncertainty, of the i-vector will be so high that the PLDA score will favor the same-speaker hypothesis even if the test utterance is given by an impostor. The reason is that if the utterance is very short, we will not have enough acoustic vectors for the MAP estimation; in other words, we do not have enough acoustic vectors to compute the posterior mean of the latent factor in the factor analysis model.
So the idea of uncertainty propagation is that we not only extract the i-vector but also compute the posterior covariance matrix of the latent factor. This diagram illustrates the idea. This Gaussian represents the posterior density of the latent factor, and the i-vector is its mean, so it is a point estimate; this equation shows the procedure for computing it. Here T_c is the c-th partition of the total variability matrix. As you can see, if the variance of this Gaussian is very large, then the point estimate will not be very accurate, and this happens when the utterance is not long enough. If the utterance is very short, N_c, which is the zeroth-order sufficient statistic, will be very small, so the posterior covariance matrix, L inverse, will be very big. That means the variance will be large, and as a result the point estimate will not be very reliable.
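Written out, these are the standard i-vector posterior formulas. With N_c and f-tilde_c the zeroth- and first-order (centered) sufficient statistics for mixture component c:

```latex
\mathbf{L} = \mathbf{I} + \sum_{c=1}^{C} N_c\,\mathbf{T}_c^{\mathsf T}\boldsymbol{\Sigma}_c^{-1}\mathbf{T}_c,
\qquad
\mathbf{w} = \mathbf{L}^{-1}\sum_{c=1}^{C}\mathbf{T}_c^{\mathsf T}\boldsymbol{\Sigma}_c^{-1}\tilde{\mathbf{f}}_c,
\qquad
\operatorname{cov}(\mathbf{w}) = \mathbf{L}^{-1}.
```

When the N_c are small, L stays close to the identity and L inverse stays large, which is exactly the unreliable short-utterance case just described.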
That is why, in 2013, Kenny proposed the idea of PLDA with uncertainty propagation. In addition to extracting the i-vector, we also extract the posterior covariance matrix of the latent factor, and that represents the uncertainty of the i-vector.
And with some preprocessing, as I have mentioned, because we want to use Gaussian PLDA as the final stage for the modeling and scoring, we also need to preprocess this matrix, which gives a preprocessed version of the posterior covariance matrix, and with that we can do the PLDA modeling. Now, where does this uncertainty propagation come from? It comes from the generative model.
In the generative model we have m plus V h_i plus, as you can see, this extra term. It looks like the conventional PLDA model with eigenchannels, so this plays the role of my eigenchannel matrix, but instead of being fixed, it depends on the i-th speaker and on the j-th session of the i-th speaker. As a result, the latent factor z also depends on i and j.
Now, the trouble with this is that for every test utterance we also need to compute this U_ij. Unlike the eigenchannel situation, where we only need to pre-compute the matrix once and make use of it during scoring, in uncertainty propagation this U_ij has to be computed during scoring time, because it is session dependent. To compute this U_ij, we perform a Cholesky decomposition of the posterior covariance matrix, and that is why we have this intra-speaker covariance matrix, like this.
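In symbols, following Kenny's formulation, the generative model with uncertainty propagation is:

```latex
\mathbf{w}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}_{ij}\mathbf{z}_{ij} + \boldsymbol{\epsilon}_{ij},
\qquad
\mathbf{U}_{ij}\mathbf{U}_{ij}^{\mathsf T} = \operatorname{cov}(\mathbf{w}_{ij}),
\qquad
\mathbf{z}_{ij} \sim \mathcal{N}(\mathbf{0},\mathbf{I}),
```

where U_ij is the Cholesky factor of the (preprocessed) posterior covariance matrix of session j of speaker i. The intra-speaker covariance for that session therefore becomes U_ij U_ij^T + Sigma, which is session dependent.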
So finally, during scoring with PLDA with UP, we have this equation, which is very similar to the scoring equation of the conventional PLDA; as you can see, it is still matrix and vector multiplications. But the difference is that this time the matrices A, B, C, and D all depend on the test utterance. As you can see from this set of equations, A_st, B_st, C_st, and D_st all depend on the test utterance. That means they cannot be pre-computed: only a very small number of matrices can be computed before scoring time, while this set has to be computed during scoring time. So we cannot save much computation, because of these session-dependent covariance matrices.
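Schematically (the exact expressions are in the paper; the point here is only where the session subscripts appear), the UP score keeps the quadratic structure of the PLDA score, but with session-dependent matrices:

```latex
S_{\mathrm{UP}}(\mathbf{w}_s,\mathbf{w}_t)
  = \mathbf{w}_s^{\mathsf T}\mathbf{A}_{st}\,\mathbf{w}_s
  + \mathbf{w}_t^{\mathsf T}\mathbf{B}_{st}\,\mathbf{w}_t
  + 2\,\mathbf{w}_s^{\mathsf T}\mathbf{C}_{st}\,\mathbf{w}_t + D_{st},
```

where A_st, B_st, C_st, and D_st are built from Sigma plus the session-dependent terms U_s U_s^T and U_t U_t^T, so none of them can be fully pre-computed.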
This slide summarizes the computations that need to be performed. For conventional PLDA we have almost nothing to compute; we only need to compute these matrix and vector multiplications. But for PLDA with UP, we have to compute this whole set of matrices on the right. As you can see, this increases the computational complexity a lot, and it also increases the memory requirement, because we need to store these matrices for every target speaker.
So we propose a way of speeding up the computation and also reducing the memory consumption. The whole idea comes from this equation. As you can see, the posterior covariance matrix depends only on N_c, and at testing time N_c will be the zeroth-order sufficient statistic of the test utterance.
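As a sketch of this dependence (names are illustrative), the posterior precision can be assembled from the zeroth-order statistics alone:

```python
import numpy as np

def posterior_covariance(N, T_blocks, Sigma_inv):
    # N[c]: zeroth-order statistic (soft frame count) of component c.
    # T_blocks[c]: F x R block of the total variability matrix T.
    # Sigma_inv[c]: inverse covariance of UBM component c.
    R = T_blocks[0].shape[1]
    L = np.eye(R)                       # posterior precision L
    for c, Tc in enumerate(T_blocks):
        L += N[c] * (Tc.T @ Sigma_inv[c] @ Tc)
    return np.linalg.inv(L)             # posterior covariance L^{-1}
```

Two utterances with similar frame counts N therefore have similar posterior covariances, which is exactly the hypothesis used next.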
well
you okay so you the two i-vectors are also meeting that's integration
we assume that all we
i think okay
the composed here are covariance matrix a similar because as you can see we plot
and the mfcc audible acoustic that only
the zero all the sufficient statistic
so having this hypothesis
we and
proposals
to roll direct a according to their be activity
now can be
we find w happy that we you we used a scalar to define the we're
not be by facing for each scroll the i-vector reliability is modeled by performance vehicle
right matrix
and we obtain the posterior covariance matrix from the development data
Okay, so here the index k stands for the k-th group, and this U_k is independent of the session. If you look at the bottom of the slide, we have U_ij, which depends on the session. But now, if you look here, we have successfully turned U_ij, which was session dependent, into U_k, which is session independent. And with U_k being session independent, we can do a lot of pre-computation.
We group the i-vectors using three approaches. The first is based on the utterance duration, which is intuitive, because we believe the duration is related to the uncertainty, or the reliability, of the i-vector. We have also tried using the mean of the diagonal elements of the posterior covariance matrix; this is a nice thing to do because the mean of the diagonal elements is a scalar, so the grouping becomes very easy. And the last one we have tried is the largest eigenvalue of the posterior covariance matrix. This slide basically shows how we perform the grouping. For example, if this is the duration axis, then this group corresponds to extremely short utterances, this one to medium-length utterances, and the last one to very long utterances, and for each group we find one representative matrix, from U_1 to U_K, that represents the whole group.
So U_1 represents the posterior covariance matrix, or uncertainty, of extremely short utterances, and U_K correspondingly represents the posterior covariance matrix, or uncertainty, of very long utterances.
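Here is a minimal sketch of this development-time grouping, using the second criterion (mean of the diagonal of the posterior covariance) as the scalar reliability measure; the equal-occupancy binning and the group-average representative are illustrative assumptions, not necessarily the exact recipe in the paper:

```python
import numpy as np

def reliability(post_cov):
    # Criterion 2 from the talk: mean of the diagonal elements of the
    # posterior covariance matrix (a scalar, so grouping is easy).
    return float(np.mean(np.diag(post_cov)))

def build_groups(dev_post_covs, K):
    # Quantize development utterances into K reliability groups and
    # keep one representative covariance per group.
    scores = np.array([reliability(C) for C in dev_post_covs])
    edges = np.quantile(scores, np.linspace(0.0, 1.0, K + 1)[1:-1])
    labels = np.digitize(scores, edges)          # group index 0..K-1
    reps = [np.mean([C for C, g in zip(dev_post_covs, labels) if g == k],
                    axis=0) for k in range(K)]
    return edges, reps    # U_k = Cholesky factor of reps[k]
```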
Now, during scoring, all we need to do is find the reliability. By using one of the three approaches to quantify the reliability of the i-vector, we can find which group it belongs to, so that we replace all the session-dependent matrices with A_mn, B_mn, C_mn, and D_mn. Compare this with the conventional PLDA with UP: there, the matrices are all session dependent, because t stands for the test utterance and s stands for the target-speaker utterance. Now we switch to A_mn, B_mn, C_mn, and D_mn, and all of these have been pre-computed already using the development data. So, as you can see, there will be a large computation saving, because we use the pre-computed matrices rather than computing the covariance matrices on the fly.
This slide shows in more detail the computation savings that we can achieve. For PLDA with UP using the proposed fast scoring, we only need to determine the group IDs m and n, but for the conventional PLDA with uncertainty propagation, we have to compute all of these matrices during scoring.
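Continuing the sketch above, scoring then reduces to two group lookups plus the same quadratic form as conventional PLDA; here A, B, C, and D are the pre-computed K-by-K tables of scoring matrices (illustrative names):

```python
def group_id(post_cov, edges):
    # Map a new utterance to its reliability group (criterion 2),
    # reusing reliability() and edges from the grouping sketch above.
    return int(np.digitize(reliability(post_cov), edges))

def fast_score(w_s, w_t, m, n, A, B, C, D):
    # m, n: group IDs of the target-speaker and test utterances.
    # All matrices were pre-computed on development data, so scoring
    # involves only matrix-vector products, as in conventional PLDA.
    return (w_s @ A[m][n] @ w_s + w_t @ B[m][n] @ w_t
            + 2.0 * (w_s @ C[m][n] @ w_t) + D[m][n])
```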
We performed experiments on the NIST SRE 2010 common condition 2, using classical 60-dimensional MFCC vectors, 1024 Gaussians, and 500 total factors in the total variability matrix. And we tried the three different ways of grouping the i-vectors, that is, of quantizing the posterior covariance matrices.
Okay, so this table summarizes the results. The conventional PLDA is of course ultra fast; this figure represents the scoring time, by which I mean the total time for the whole evaluation on common condition 2. But unfortunately the performance is not very good. The reason the EER is not very good is that we used utterances of arbitrary duration: we did segmentation, cutting the utterances into short, medium-length, and long segments. So we did not use the original data for training and testing; instead, some of the utterances are very short, some are of medium length, and some are very long. In this way we created a situation with a big variety of durations in both the training and the test utterances.
Now, PLDA with UP performs extremely well, but unfortunately the scoring time is also very high. With our fast scoring approach, we successfully reduce the scoring time from here to here, with only a very small increase in the EER. If we use more groups, that is, if we make the number of groups larger, we can make the EER almost the same as the one achieved by PLDA with full uncertainty propagation. So what happens is that we successfully reduce the computation time without increasing the EER much. The same situation occurs for the minimum DCF; the details are in the paper. Also, we show only system two here, because the performance of system two and system three is very similar.
System one is based on the utterance duration. Now let us look at the memory consumption. The memory consumption shows a similar trend: PLDA uses a very small amount of memory, and PLDA with UP uses a much larger amount of memory, because we need to store all of the posterior covariance matrices of the utterances.
We are talking about gigabytes here. System one reduces the memory consumption almost by half, and systems two and three have about the same memory consumption. If we increase the number of groups, obviously the memory requirement will increase, but even with a number of groups as large as forty-five, it still uses less memory than the original PLDA with uncertainty propagation.
Here is the DET curve. As you can see, this line corresponds to the conventional PLDA, which reports poor performance, and all the other systems, systems one, two, and three, and also the one with UP, are much better, because with uncertainty propagation you can deal with utterances of arbitrary duration. And what we have found is that system one is slightly poorer than systems two and three, but system one has the largest reduction in terms of computation time.
So, in conclusion, we propose a very fast scoring method for PLDA with uncertainty propagation. The whole idea is to quantize the posterior covariance matrices, or the loading matrices representing the reliability of the i-vectors, and to pre-compute all of them as much as possible. To do this pre-computation, we need to perform the grouping first, at development time, and we defined three ways of performing the grouping. All of these groupings are based on a scalar; just like the k-means algorithm, where you use the distance to the centroid as the criterion for assigning a cluster, here we use the mean of the diagonal elements of the posterior covariance matrix, or the maximum eigenvalue of the posterior covariance matrix, or the utterance duration, as the criterion for the grouping. All of these are computationally light. As a result, the proposed fast scoring performs very similarly to the standard UP, but requires only 2.3 percent of the scoring time.
Thank you.
[Session chair] We have time for questions.
[Speaker] Yes. We do not truncate them at random points; rather, we use a fixed set of durations at one-second intervals, so three seconds, four seconds, five seconds, and so on, and we randomly extract segments of those durations from the speech data.
[Audience] So the durations range between three seconds and how much?
[Speaker] Up to the length of the original test utterance; some utterances are longer than others, so for different utterances we will have a different upper limit.
[Audience] I wonder if you could just comment on this. My experience with this method is that I found that it works well in situations other than the specific problem for which it was intended. If there is a gross mismatch between enrollment and test, such as telephone enrollment and microphone test channels, or a huge mismatch in the duration, then I found that this works well; but I was a bit disappointed with the performance on the specific problem that you are addressing here, which is just the problem of duration variability.
[Speaker] In fact, in our experiments we also have duration mismatch, because we deliberately generated duration mismatch in order to create a situation with utterances of arbitrary duration. Therefore the test utterance and the target-speaker utterance will have different group indices k. Of course, if we use a very small number of groups, say only U_1 and U_2, the utterances within each group will still have varied durations. But because everything is random, there will be a lot of utterances of various durations in both enrollment and test, so there will be duration mismatch in the trials.
[Audience] I would be very interested to see what happens in the upcoming NIST evaluation, where this problem is going to be at the forefront.
[Speaker] Excellent, thanks.
[Audience] In that evaluation, you know, the durations will be truncated to between ten seconds and sixty seconds. So I think we are all looking at up to five percent equal error rate, even before we move to the Cantonese and Tagalog verification trials.
[Session chair] Okay, then let us thank the speaker.