I will present the work that we did for our first participation in the i-vector challenge. There is actually some material in these slides that was not presented in the paper, but it was included in the system description that we submitted, and I think it is worth sharing with you.
Here is the outline of my talk. First I will present the progress of our system, and then I will detail two ideas: the clustering used for class training, and the score normalization used for computing the scores.
This slide shows the timeline of the minDCF of our system. Starting from the baseline, which has a minDCF of 0.386, we end up with a minDCF of 0.247, which corresponds to a relative improvement of about thirty-six percent.
I am going to present the main ideas in a graphical manner. We have the development set, and the evaluation set, which is split into enrollment and test. We have the three steps of the baseline: whitening, length normalization and cosine scoring. As we can see, only the whitening needs training, so we do not really need the labels of the development set for that.
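To make the baseline concrete, here is a minimal sketch of these three steps, assuming the i-vectors are given as NumPy arrays; this is illustrative code, not the official baseline script.

```python
import numpy as np

def train_whitening(dev_ivectors):
    """Estimate the mean and whitening matrix from the (unlabeled) development set."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors - mu, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    return mu, eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T

def length_normalize(ivectors, mu, W):
    """Whiten the i-vectors and project them onto the unit sphere."""
    x = (ivectors - mu) @ W
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_score(enroll, test):
    """After length normalization, cosine scoring reduces to a dot product."""
    return float(np.dot(enroll, test))
```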
Starting from this baseline, something that can be done is to better choose the data for the whitening. If we take only the utterances with more than thirty-five seconds of speech, we already get some improvement in our experiments, with a minDCF of 0.372.
Afterwards, I am going to use these conditioned i-vectors, that is this reduced development set, in the later experiments and systems.
The next step that we did is the clustering. We actually tried different kinds of clustering, and I will come back to this later on, but one of the best clusterings that we obtained is what we call the cosine-PLDA clustering. After this clustering, we keep only the clusters that have more than two i-vectors in them, and we can then apply supervised techniques like LDA, PLDA, WCCN and so on. Here we just put LDA and the clustering in the loop, and you can see that we already get some improvement, with a minDCF of 0.356.
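As an illustration of how the cluster labels can drive these supervised transforms, here is a small sketch that trains LDA and WCCN on the pseudo-speaker labels produced by the clustering; the arrays `ivecs` and `labels` and the target dimensionality are assumptions, not values from the actual system.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_wccn(ivecs, labels):
    """WCCN: Cholesky factor of the inverse average within-class covariance."""
    dim = ivecs.shape[1]
    Sw = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        Sw += np.cov(ivecs[labels == c], rowvar=False, bias=True)
    Sw /= len(classes)
    return np.linalg.cholesky(np.linalg.inv(Sw))

def train_lda_wccn(ivecs, labels, lda_dim=200):
    """LDA followed by WCCN, trained on pseudo-speaker labels from clustering."""
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    projected = lda.fit_transform(ivecs, labels)
    B = train_wccn(projected, labels)
    return lda, B, projected @ B
```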
What we tried next was to replace the cosine scoring by other kinds of scoring. The first of them was the SVM. Here we trained a linear SVM for every target speaker, where we have only one positive sample, the length-normalized average i-vector of the target speaker, and the negative samples are the length-normalized i-vectors of the processed development set. With this we get another jump in performance, with a minDCF of 0.302.
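A minimal scikit-learn sketch of this per-speaker SVM is given below; the class weights are the ones mentioned later in the discussion (0.1 for the positive class and 0.9 for the negative class), and everything else is illustrative rather than the exact training setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_speaker_svm(target_avg_ivec, dev_ivecs):
    """One positive sample (the target average i-vector) against the dev set."""
    X = np.vstack([target_avg_ivec[None, :], dev_ivecs])
    y = np.concatenate([[1], np.zeros(len(dev_ivecs), dtype=int)])
    return LinearSVC(C=1.0, class_weight={1: 0.1, 0: 0.9}).fit(X, y)

def svm_score(svm, test_ivec):
    """The signed distance to the hyperplane is used as the verification score."""
    return float(svm.decision_function(test_ivec[None, :])[0])
```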
Next we added WCCN in the loop, just after the LDA. For the SVM we did not get any improvement, but for the PLDA, which I will explain on the next slide, the WCCN was helpful.
Here we use our scalable implementation of the standard PLDA. The scores are the likelihood ratio between the average i-vector of the target speaker and the test i-vector; note that the average i-vectors are not length-normalized in this case, which is not the case for the SVM. Here again we get an additional improvement, with a minDCF of 0.292.
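The actual system relies on our scalable PLDA implementation; just to illustrate the kind of likelihood ratio that is computed, here is a minimal two-covariance sketch, where the between-speaker covariance B, the within-speaker covariance W and the global mean mu are assumed to be already trained.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(enroll_avg, test, mu, B, W):
    """log p(pair | same speaker) - log p(pair | different speakers)."""
    T = B + W                                  # total covariance of one i-vector
    e, t = enroll_avg - mu, test - mu
    joint_cov = np.block([[T, B], [B, T]])     # same-speaker joint covariance
    log_same = multivariate_normal.logpdf(np.concatenate([e, t]),
                                          mean=np.zeros(2 * len(e)), cov=joint_cov)
    log_diff = (multivariate_normal.logpdf(e, mean=np.zeros(len(e)), cov=T)
                + multivariate_normal.logpdf(t, mean=np.zeros(len(t)), cov=T))
    return float(log_same - log_diff)
```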
Afterwards we tried some score normalization ideas. I tried the s-norm and others, and the one that was working the best is the adaptive symmetric normalization (AS-norm); I will also come back to this in a later slide. The AS-norm is usually used only at the recognition level, but I also applied it to the clustering. When we apply it to the clustering, we get an additional improvement, to a minDCF of 0.286. Then I applied it after the PLDA scoring, and we get another jump in performance, with a minDCF of 0.258; this was the system that was submitted at the deadline of the evaluation.
Afterwards I also had the idea of replacing this cosine-based clustering by an SVM-based clustering, which is also done in a hierarchical manner, and again we get an additional improvement, to a minDCF of 0.247, which is very close to the best performing system.
So this is more or less where our system stands now; note that, unlike others, we do not use any quality measure functions.
Let me now move to the clustering. Clustering of i-vectors has already been studied in the state of the art, in either an unsupervised or a supervised manner. For example, there is the work from MIT on cosine-based k-means clustering, in which the number of clusters is known a priori, and which was applied to conversational telephone speech. They then improved the system by using cosine-based spectral clustering, along with a simple heuristic for computing the number of clusters automatically.
Other work, from CRIM, uses cosine-based mean shift clustering. So most of these methods, if not all of them, use cosine-based scoring. Other methods were also used to perform the clustering, like the one where they used integer linear programming; but their distance metric, I think the Mahalanobis distance, requires labeled training data in order to compute the within-class covariance matrix. Other works use PLDA-based clustering, but of course the PLDA needs labeled external data to train the PLDA model, which is then used to compute the similarity measure and to do the hierarchical clustering.
We actually tried different kinds of clustering; I am not going to go into all of them. One of them was Ward clustering, a well-known agglomerative clustering whose goal is to optimize an overall objective function by minimizing the within-class scatter. This clustering is very fast, since it uses the Lance-Williams algorithm in a recursive manner. The problem of this algorithm is that it needs the Euclidean distance to work well, and it was shown in previous work that the Euclidean distance is not as good as the cosine distance for i-vectors.
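For reference, Ward clustering as described here can be run directly with SciPy, which implements it through the Lance-Williams recurrence; the array `ivecs` and the number of clusters are illustrative assumptions.

```python
from scipy.cluster.hierarchy import linkage, fcluster

def ward_cluster(ivecs, n_clusters):
    Z = linkage(ivecs, method="ward")                      # Euclidean by construction
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```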
The other clustering that we tried is what I call the cosine-PLDA clustering, which is a two-step clustering. The first step is based on the cosine measure: after each iteration, the similarity measure is updated by computing the cosine measure between the average i-vectors of the resulting clusters, and we decide to stop early in the clustering process in order to ensure high-purity clusters. Once we have this first set of clusters, we move to the second step of clustering, which is based on the PLDA. We did it somewhat differently from others: after each iteration we retrain the PLDA model and recompute the similarity matrix, but since doing this at every merge is somewhat costly, we only do it every five hundred merges.
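A minimal sketch of the first, cosine-based agglomerative step is shown below; the stopping threshold and the quadratic search over cluster pairs are for clarity only and are not the actual implementation.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_ahc(ivecs, stop_threshold=0.6):
    """Merge clusters by highest cosine similarity between their average i-vectors."""
    clusters = [[i] for i in range(len(ivecs))]
    means = [ivecs[i].copy() for i in range(len(ivecs))]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(means[i], means[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < stop_threshold:            # stop early to keep high-purity clusters
            break
        i, j = pair
        clusters[i].extend(clusters.pop(j))
        means.pop(j)
        means[i] = ivecs[clusters[i]].mean(axis=0)   # update the average i-vector
    return clusters
```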
This figure shows the evolution of the minDCF along the clustering process, on the progress set, using PLDA model scoring as the backend. As we can see, with both Ward clustering, which is in blue, and cosine-PLDA clustering, which is in red, we get better performance than the baseline system, and we can also see that the cosine-PLDA clustering is much better than the Ward clustering. In this experiment, the best results were obtained with a number of clusters of around sixteen thousand.
Let me now look a bit at the score normalization. As I said, we tried different kinds of normalization; one of the most successful ones is the adaptive symmetric normalization (AS-norm), introduced by Professor Kenny and his colleagues. It works quite nicely with an unlabeled cohort set, which is the case in our scenario.
so as a set i use it for both a recognition and clustering so few
for recognition
the core set
that i used was all the development set
so the thirty six on the
i-vectors and what i took that the top-k neighbours
neighbours the i-vector to the propose but target the speech i-vector and the test i-vector
so use the formalize you see it's a symmetric form a lot
so we have mu and sigma involve this formal or more you the mean you
kate by for instance just means that
we take the top the one thousand five hundred the scores
that are scores that of the highest for
target speaker for the target speaker and then we do this and c and the
same for some there's the duration and that's one
We use more or less the same formula for the clustering, but here of course it is computed between a pair of clusters, and the cohort set in this case is all the average i-vectors of the clusters that are not concerned by this measure, that is all the other clusters but not these two.
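Putting the two uses together, here is a small sketch of the AS-norm as described; K = 1500 follows the talk, and the cohort scores are assumed to be precomputed (for recognition they are scores against the development set, for clustering they are scores against the average i-vectors of the other clusters).

```python
import numpy as np

def top_k_stats(cohort_scores, k=1500):
    """Mean and standard deviation of the K highest cohort scores."""
    top = np.sort(np.asarray(cohort_scores))[-k:]
    return top.mean(), top.std()

def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, k=1500):
    """Symmetric adaptive normalization of a single raw score."""
    mu_e, sigma_e = top_k_stats(enroll_cohort_scores, k)
    mu_t, sigma_t = top_k_stats(test_cohort_scores, k)
    return 0.5 * ((raw_score - mu_e) / sigma_e + (raw_score - mu_t) / sigma_t)
```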
With that, I am going to conclude. This evaluation was very helpful for us; we learned a lot of things, and our participation was quite successful. We also learned that the clustering is very important, and so is the adaptive symmetric normalization. These results can be reproduced with our open-source libraries, which you can find at this link, and you can also find more details in our ICASSP paper.
As future work, and we have actually started working on it, there is the question of how to automatically determine the stopping criterion of the clustering process. We have some ideas that I hope can later be shared with you, such as looking at the variation of the minDCF on the development set and the variation of the number of newly created clusters, and also the possible use of spectral clustering. One good idea that could be considered, maybe for the next evaluation, because of its potential applications, is semi-supervised clustering; there are many machine learning techniques that could be used here, like co-training and others.
Thank you.
Congratulations, that was a very good system; without fusion, getting these results is amazing. I have the slight impression that you make a distinction between supervised and unsupervised; could you go back a few slides? Well, I think this distinction is a little bit arbitrary. On the unsupervised side, like in Mohammed's work, we still try to use labels, and of course the best results we demonstrated were obtained with labels; it is always a good idea to use some labels if you have them. My impression is that the only way to get a fully unsupervised clustering, without knowing the number of classes, is something like a model-based Bayesian method, although even for the mean shift there are some tricks: if you check the original paper of Comaniciu, you will see that there are tricks, successful in image processing, with which you can somehow estimate the number of classes. But I think also that the guys from LIUM, who have the most supervised system, used the standard prewhitening without even caring about the labels, and the system works fine as well. So for me this distinction is a little bit arbitrary.
I think that, in my sense, supervised and unsupervised are just meant in the sense of labeled versus unlabeled training data.
Actually, I just have a question about your SVM: you said you used a single positive example, the averaged i-vector, instead of the five enrollment examples; did you try both?
I tried a number of combinations, actually many of them, and this one was working the best. I also forgot to mention that the SVM was weighted: the weights were not equal, something like 0.1 for the positive class and 0.9 for the negative class. But I think we only gained a little bit by doing this; it is more or less the same if you use the five examples.
That is what I wanted to say about the SVM. I just have a comment: I never had a progress slide like the one you showed on your third slide. When you were developing the system you had only progress, which is a wonderful, really wonderful situation to be in. It would also be very interesting for us to know your negative trials, what you tried and what was not efficient during the development of your systems; that is interesting for me. That is true; well, if I wanted to talk about the things that did not work, I think it would take a long time.
You showed some different approaches for clustering, but in the final system the backend was only the PLDA; did you try a combination of different backends? I tried a combination of different backends, and it was a bit different from the others, but for me it did not bring much, maybe some small gain from the fusion, I guess with logistic regression. And did you use quality measures? I think the adaptive score normalization was doing the work that the quality measure function was doing for others; I think that is what we found.