All right. So I'm going to present something we have been working on during the last CLSP workshop at Hopkins: trying to explore whether there is any useful information in the GMM weights, because the i-vectors, as you probably know, only try to adapt the means.
As you probably all know by now, the i-vector is related to adapting the means of the GMM, and it has been very successfully applied for speaker, language, dialect and many other applications.
The story behind only adapting the means goes back to GMM MAP adaptation with the UBM, the universal background model, as a basis, where usually only the means are adapted. So we wanted to revisit whether, beyond what the i-vector captures, there is useful information in the weights, or even in the variances; Patrick probably already tried the variances for JFA.
So here, in this work, we try to do something with the weights. There have already been a lot of techniques proposed for the weights, and we tried to build a new one called non-negative factor analysis, which was actually done with Hasan, who was a student in Belgium and was visiting me at MIT. We first tried it for language ID, where we actually had some success with it.
The reason is that, for language ID, when you have a UBM, your Gaussians are supposedly something like phonemes; so if for some language a phoneme does not appear, the corresponding counts can be close to zero, and the weights of those Gaussians can carry useful information. That is what we found out, and that is what motivated us to check, for speakers, whether there is also information in the GMM weights that can be used for speaker recognition. That is ultimately the topic of this work.
We also compared this non-negative factor analysis, NFA, to an already existing technique that was proposed at BUT, the subspace multinomial model, and essentially this presentation is a comparison between the two in the case of GMM weight adaptation.
so
For adapting the GMM means there have already been a lot of techniques: maximum a posteriori, maximum likelihood linear regression, eigenvoices, which were the starting point of all the newer technology like JFA and i-vectors. There have also been a number of weight adaptation techniques, like for example maximum likelihood, non-negative matrix factorization and the subspace multinomial model, and then the one we propose, non-negative factor analysis.
so
The idea behind the i-vector concept, and I don't want to bore you with this, is that for a given utterance there is a UBM, which is a prior over all the sounds, over what the sounds look like, and the i-vector tries to model the shift from this UBM to a given recording. That shift can be modeled by a low-rank matrix, and the coordinates of the recording in this low-dimensional space are what we call the i-vector.
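In my own notation (not taken from the slides), this is the standard total variability model, where the utterance-dependent mean supervector is a low-rank shift from the UBM means:

```latex
% M: utterance mean supervector, m: UBM mean supervector,
% T: low-rank total variability matrix, w: the i-vector of the recording
M = m + T w
```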
So we tried to use the same concept, which was done for the means, to do the same thing with the weights. The only difference we were facing is that the weights should all be positive and they should sum to one.
I can come back to that later. So, in order to model the weights, the first thing is that when you have a UBM, a universal background model, and a sequence of features, you can compute some counts, which are the posterior probabilities of occupation of each Gaussian given each frame, as given here in the equation.
The objective function in the weight case is of this form: it is essentially the Kullback-Leibler divergence between the counts and the weights that you want to model, which we try to minimize. And if you take these counts and normalize them by the length of your utterance, you get the maximum likelihood estimate of the weights, which is easy to do.
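A minimal sketch (not the authors' code) of these zero-order statistics and the maximum-likelihood weight estimate, assuming the UBM is available as a fitted scikit-learn GaussianMixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ml_weights(ubm: GaussianMixture, features: np.ndarray) -> np.ndarray:
    """Maximum likelihood estimate of the per-utterance GMM weights."""
    # gamma[t, c] = posterior probability of Gaussian c given frame t, under the UBM
    gamma = ubm.predict_proba(features)
    # zero-order statistics: soft counts per Gaussian
    counts = gamma.sum(axis=0)
    # normalizing by the utterance length gives the ML weights
    return counts / counts.sum()
```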
So, for example, one weight adaptation technique, which unfortunately we could not compare with for this paper, is non-negative matrix factorization: you take the weights and you say that this weight matrix can be split into two non-negative matrices, where the first one is the basis of your space and the second one holds the coordinates in that space, and this decomposition is found by optimizing an auxiliary function.
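A hedged sketch of that idea using scikit-learn's off-the-shelf NMF; the matrix and variable names here are illustrative, not from the paper:

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_weight_subspace(W: np.ndarray, n_components: int = 50):
    """Factor a non-negative matrix of per-utterance weights, W ~ coords @ basis."""
    model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    coords = model.fit_transform(W)   # low-dimensional, non-negative coordinates
    basis = model.components_         # non-negative basis of the weight space
    return coords, basis
```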
Okay, so that is the first approach, and we did not have time to do a comparison with it. What we did compare with is the subspace multinomial model, because that is what BUT actually did, so we tried to compare against it.
The idea behind the subspace multinomial model is that you have the counts here, and you try to find a multinomial distribution that fits this distribution. It is defined by a low-rank matrix, an i-vector-like subspace, on top of the UBM weights, and it is normalized so that the weights sum to one. They have several papers on how to do the optimization; they have a Hessian-based solution for that.
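A minimal sketch of the SMM parameterization as described here: the per-recording weights are a softmax of the UBM log-weights plus a low-rank shift. The symbols m, T and r are my notation, not from the slides:

```python
import numpy as np

def smm_weights(m: np.ndarray, T: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Per-recording weights under an SMM-style model.

    m: (C,) offsets (roughly the UBM log-weights), T: (C, R) subspace, r: (R,) latent vector.
    """
    z = m + T @ r
    z = z - z.max()        # numerical stability
    w = np.exp(z)
    return w / w.sum()     # positive and sums to one by construction
```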
So, for example, for the SMM, suppose you have two Gaussians, and each point here is the maximum likelihood estimate of the weights for a given recording. For this example the points were actually generated from the subspace multinomial distribution, so we generated them from that model, because I tend to believe that in a high-dimensional space the data should be distributed like this, not spread all over the place: if you take a lot of data and train only two Gaussians, the data would be everywhere, but not in a high-dimensional space. So I tried to simulate that, to simulate a high-dimensional GMM with two Gaussians; that is what we did, similar to what the people at BUT did. We generated data from this model, and we show the difference between this model and non-negative factor analysis.
So for non-negative factor analysis, what we say is essentially the same as for the i-vectors: we suppose that we have a UBM, and for each recording the weights can be explained by a shift from the UBM in the direction of the data. This is the same as the i-vector: the matrix can be low rank, and r plays the role of a new i-vector in this new space. The only problem we were facing is that the weights for each recording should always be positive and should sum to one.
So here we developed a kind of EM-like algorithm: we first fix L, we accumulate some statistics, and we run a gradient ascent to estimate the r for each utterance; then, once we have the r's, we update L with a projected gradient ascent, where the projection we use enforces the constraints that the resulting weights always sum to one and always stay positive.
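A rough sketch, under my own assumptions, of the per-utterance step: gradient ascent on r for a fixed L, followed by a crude projection that keeps the implied weights non-negative and summing to one. The projection used in the published work may differ:

```python
import numpy as np

def estimate_r(n, b, L, n_iter=5, lr=1e-3):
    """Gradient ascent on r for fixed L, with a crude feasibility projection.

    n: (C,) soft counts, b: (C,) UBM weights (assumed > 0), L: (C, R) weight subspace.
    """
    R = L.shape[1]
    r = np.zeros(R)
    for _ in range(n_iter):
        w = b + L @ r
        grad = L.T @ (n / w)                 # gradient of sum_c n_c * log(w_c) w.r.t. r
        r = r + lr * grad
        # crude projection: clip the implied weights, renormalise to the simplex,
        # then map back to r by least squares (an assumption of this sketch,
        # not necessarily the published projection)
        w = np.clip(b + L @ r, 1e-8, None)
        w = w / w.sum()
        r, *_ = np.linalg.lstsq(L, w - b, rcond=None)
    return r
```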
That is what we actually did; if you want more explanation I don't have time for it here, but you can find the details in the paper. So remember: these are the soft counts, this is the auxiliary function for the GMM weight case, these are our weights, and we would like to estimate these parameters subject to the constraint that the weights sum to one. So what we did is simply multiply by a vector of ones, so that they must sum to one, and they should all be positive.
Okay.
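Written out in my own notation, the constrained problem described here is roughly:

```latex
% n_c: soft counts, b: UBM weights, L: weight subspace, r: per-utterance factor
\max_{L,\,r}\; \sum_{c=1}^{C} n_c \log\big(b_c + (Lr)_c\big)
\quad \text{s.t.} \quad \mathbf{1}^{\top}(b + Lr) = 1,
\qquad b_c + (Lr)_c \ge 0 \;\; \forall c
```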
So these are the two constraints that allow us to keep the weights summing to one and positive. Now, if you compare what the non-negative factor analysis does against the subspace multinomial model, and what each model is doing: in this case, for example, the SMM is definitely fitting the data well, because the data was generated from it, while the NFA gives an approximation of the data. That has a benefit and a disadvantage: the SMM has a tendency to overfit the data, because it models the distribution of the training data really well, but when you go to the LID task it sometimes does not generalize well.
What BUT did to control this overfitting is to use regularization: they have a regularization term that you have to tune. In our case we do not suffer too much from this; we do not fit the training data very well, but we approximate it and sometimes generalize better than the SMM. Honestly it depends on the application: we compared them for several applications, and sometimes one is a bit better, sometimes the opposite. But anyway, the difference is that the SMM can fit the training data really well but can have an overfitting problem that you need to control with regularization, while the NFA approximates the data and sometimes generalizes better.
So this is the setup we experimented with. We first trained i-vectors on all the data that we have, and we tested on the telephone condition of NIST 2010. We have a UBM of 2048 Gaussians, nothing special technically; we extract i-vectors and we use the LDA, length normalization, PLDA scheme that everybody uses. Then we take the i-vector for the means and the weight vectors from the SMM and from the NFA.
And we tried fusion, to see how we can combine them. A simple score fusion did not help at all, so we dropped it and kept the i-vector-level fusion, which seems to be a little bit better, but not by much for speaker, which was a little disappointing; for language ID, on the other hand, it was helping a lot.
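A hedged sketch of what the i-vector-level fusion could look like, assuming it means concatenating the mean i-vector with the weight-subspace vector before the LDA / length-normalization / PLDA backend; the helper names are illustrative:

```python
import numpy as np

def fuse_vectors(mean_ivec: np.ndarray, weight_vec: np.ndarray) -> np.ndarray:
    """Concatenate the two per-utterance representations before the backend."""
    return np.concatenate([mean_ivec, weight_vec])

def length_normalize(x: np.ndarray) -> np.ndarray:
    """Project to the unit sphere (the usual pre-PLDA length normalization)."""
    return x / np.linalg.norm(x)
```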
So, for example, I also tried to see the effect of dimensionality, how this new weight adaptation behaves compared to, for example, the i-vectors. I took the non-negative factor analysis and trained it with five hundred, one thousand and one thousand five hundred dimensions; remember that the starting UBM was 2048 Gaussians. Then we do LDA first for dimensionality reduction before length normalization. And you see that the differences are not really big when varying the LDA dimension, and even if you compare between five hundred and a thousand dimensions the difference is not really big. We were a little surprised by that, especially for the NFA, and we have seen the same behaviour for the SMM as well. Sometimes the SMM needs to be more low-dimensional, while the non-negative factor analysis tends to work better with a higher dimension compared to the other one.
So here, for example, we compare the best result that we obtained from non-negative factor analysis with the best one from the subspace multinomial model, for the core condition, male and female, and for the eight-conversation condition. You can see that there is actually not too much difference; sometimes the NFA is a little better, sometimes a little worse than the SMM. But you can see that, for the eight-conversation condition, you can get very nice results even without using the GMM means, just the weights.
Now compare with the i-vectors. Here I also show the maximum likelihood estimate of the weights: we take the maximum likelihood weights, take the log, and feed that to LDA; maybe that is not the best way to do it, maybe you can do something cleverer, but it seems that the maximum likelihood weights were worse compared to the SMM and the NFA weights, for all the conditions: eight conversations, male, female, and the core condition as well.
Now we remove the maximum likelihood from the loop and put the i-vectors here, and we can see that the i-vectors are usually about twice as good as the weight vectors. So the i-vectors are definitely much better than the weights. But the gap is not that large if you go to the eight-conversation condition; there it is actually pretty good, because the error rate is very low. So even when you have a lot of recordings from a speaker, the weights can give you almost as much useful information as the i-vector can. That was sort of surprising to us, for that reason.
So here we show the EER and the minimum DCFs, the old one and the new one, and you have the baseline, which is the i-vectors, for female and male. Then we fuse the i-vectors with the weights, using the i-vector-level fusion. This is the NFA: when we add the NFA we win a little bit here, we gain a little in EER, but not too much. For female, for example, when we fuse with the SMM, we get a small gain again for the new DCF, at that operating point, and even in the EER. So for female the SMM was the best to fuse with, and for male you can see that the NFA was better for most of these, but not really in the new minimum DCF. So the fusion was not really exciting, to be honest; it gave a little improvement, but really small compared to what we have seen for language ID.
Now, since the i-vectors are tied to the dimensionality of the supervector, we cannot really keep increasing the UBM size; for the GMM weights, the dimensionality is only related to how many Gaussians you have. So we tried to increase and decrease the UBM size and see what happens; we did that only for the non-negative factor analysis. You can see that if we increase the number of Gaussians in the UBM we get a very nice improvement for both male and female, especially in the EER and the new minDCF. So since the weight representation is not tied to the size of the supervector, you can increase the number of Gaussians in the UBM; you could even think about using a speech recognizer and trying its senones if you want.
So what we also did here: we took the baseline, sorry, the standard i-vectors, and we tried to fuse them with the weight vectors obtained from different UBM sizes. And you can see that the conclusions are not really consistent: even though you get better results with more Gaussians when the weights are used alone, the fusion, for example for female, did not help much, and to be honest was actually worse; for male it was a little bit worse as well, particularly for the core condition. So getting better results with the weights alone does not mean the fusion will improve over using only the i-vectors.
So, as a conclusion: we tried to use the weights, and to see whether it is worth finding a better way of using and updating the weights as well, not only the means, which is what the i-vector is doing. We have seen some slight improvements when combining them; maybe we need to find a better way to combine them, for example something similar to what subspace GMMs are doing for speech recognition. I don't know; we are working on that, and hopefully we will make some progress. I also tried doing it iteratively: you estimate the GMM weights, you update the GMM weights of the UBM, and then you extract the Baum-Welch statistics again and the i-vectors. It did not help for speaker, to be honest; I tried it and it gave the same results, no improvement. I have not tried it for language ID, only for speaker.
Thank you.
So, you will have plenty of time to understand my question. You know, we worked a lot on the weights, in Avignon mainly, and we are also looking at the weights with other approaches, and Michel has some results as well. It has seemed to me since the beginning, maybe it was a gut feeling, that the weights are a very interesting, very nice source of information; but in fact it is binary information.
Why? If you come back to GMM-UBM, and go back to Doug's results when he proposed the top-Gaussian scoring approximation: you are using only the top Gaussian, putting a one on that one and a zero on all the others, and the loss of performance was quite small. After that, if you look at later results where people did a lot of things very close to what you presented, in the end the best solution was to use a rank-based normalization, and the rank-based approach is very close to putting a one on some Gaussians and a zero on all the others, in terms of weights and counts.
And now, if you look at more recent results, it seems that most of the time, using just the zero-and-one information from the weights, we are able to find the same thing. So, according to me, the way the weights represent information is binary, the information is there or not, yes or no, and not a continuous quantity like the one you are trying to model.
So, that is a good point, because when I started working with non-negative factor analysis, my first thought was exactly about that kind of work: I wanted to put sparsity into the weights. That is not what we are able to do with what we are doing now. Because I agree: with top-one or top-five scoring, what wins is the top five; so I would like to have some sparsity in the weights, meaning most of them go to zero and you keep only the top five, for example, or something like that. But for this system, for this model that we have, we are not doing that. That was actually my first comment when we started: how can we make it sparse, because of exactly what you are saying.
Extract the i-vectors adaptively: you adapt the UBM before you extract, and then for each frame there are very few Gaussians active; that is what happens. I am not claiming it is the solution to your problem, but you will get sparsity that way.
Okay, thanks.
So this kind of follows up on Patrick's question. You are doing sequential estimation for the L and the r's; how many iterations do you go through to get that?
Around ten of the EM-style iterations, and inside each one there is a gradient ascent: I think it is five for the r and three for the L.
I am asking this because to me it is interesting to see the rate of convergence you actually hit, and I know it is extra work. In your evaluations, I believe you evaluate once you believe you have converged; did you run any of the earlier systems? Let's say that before hitting five iterations you try it, just to see where you actually are; maybe there are certain dimensions of the vector that get active earlier. You might actually see something; there might be some insight there.
I tried this, but not in this context, not with these constraints enforced. The thing is, it is a bit sensitive: if you iterate more, sometimes, when you go to something like fifteen iterations, you see the results start to degrade; after some point the degradation becomes visible. Usually between five and eight iterations you are already saturated. Yes, we need to control that a little bit. If we let it go further... actually the SMM is sometimes better, especially for sparsity the SMM is much better, because it will really fit the data exactly, while the NFA will not do that, because it is an approximation. So that is my issue with the NFA: the SMM would definitely get some sparsity if you know how to control it, because otherwise you might overfit.
Probably Marcel can answer that better than me; Marcel, you probably know more about it than I do, because you were doing this, right?
Actually, when we did this work we tried different optimization algorithms. With the approximate Hessian it converges in a few iterations quite well, and also, like the question before, we saw that even after a few iterations you already get quite good results, and if you keep iterating you get some degradation, so it looks like it starts overfitting the model. So I guess it is all similar to...