Hi. First, I want to thank you all for staying here, because I thought that only me and the cameraman would be here. The work was done by my student, and he is the one who should have presented it, but unfortunately, about ten days ago he got married, so he preferred to go to Costa Rica, or somewhere like that, rather than come here and present. The work also interested me, so I am stuck with presenting it, and you will have to suffer me for about ten minutes.
Also, most of this workshop is on DNNs, and especially after some very good talks I felt that I also wanted to say something about that, so I will briefly talk about it a little bit. Then I will give a motivation, talk about the clustering problem, the basic mean shift algorithm and the modifications we made, and then I will present the clustering system, some experiments, and a summary.
So, that was regarding the DNNs. Okay, next.
So, our problem: say we have a taxi station. There are many taxis, and each taxi has more than one driver; the drivers also change. We have recordings over quite a while, two or three days: the drivers start to talk and stop talking, so we know exactly where the start and the end of each segment is. We collected these recordings with recording devices, and at the end of the day we want to know which segments were said by one speaker and which by another.
Anyone can come and talk, and we don't know when: one speaker can talk now and then the next time speak after two or three hours, maybe from a different car. So what we have is a bag of segments which are unlabeled, and we want to cluster them. Mostly these segments are very short, one and a half to two seconds on average. So we want to cluster short segments, and usually the population will be thirty speakers, forty speakers, and so on.
So this is our problem: given mainly short segments, we want to cluster them into homogeneous groups. That means we want good cluster purity, so that each cluster is occupied mostly by one speaker only, but we also want good speaker purity: we don't want the same speaker to be spread over ten clusters.
The basic mean shift algorithm works like this: we have many vectors; we choose a vector and find its neighbours, that is, take all the vectors whose distance is below some threshold, and then shift the point to the weighted mean of these vectors. We take that mean as the new reference point, again look for the neighbours below the threshold, calculate the mean, and so on, until we converge to some stable point. This is what we do for each vector.
This algorithm has been discussed many times, so for more details please refer to the literature.
After we find the stable point of each vector, we group all the stable points which are close to each other, according to some threshold. The number of groups we get is the number of clusters, and the points in each group are the members of that cluster. But we know that Euclidean distance is not very good for the purpose of speaker clustering.
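As a reference, here is a minimal sketch of this basic version (Euclidean distance, fixed threshold, flat window, so the weighted mean reduces to a plain mean); the function names are mine, not from the paper.

```python
import numpy as np

def mean_shift_point(x, X, threshold, max_iter=100, tol=1e-6):
    """Shift one starting vector x to its stable point (mode)."""
    y = x.copy()
    for _ in range(max_iter):
        # neighbours: all vectors within the fixed Euclidean threshold
        dists = np.linalg.norm(X - y, axis=1)
        neighbours = X[dists < threshold]
        if len(neighbours) == 0:
            break
        y_new = neighbours.mean(axis=0)  # flat kernel: plain mean
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

def cluster(X, threshold, group_threshold):
    """Run mean shift from every vector, then group nearby stable points."""
    modes = np.array([mean_shift_point(x, X, threshold) for x in X])
    labels = -np.ones(len(X), dtype=int)
    n_clusters = 0
    for i, m in enumerate(modes):
        for j in range(i):
            # stable points closer than group_threshold share a cluster
            if np.linalg.norm(m - modes[j]) < group_threshold:
                labels[i] = labels[j]
                break
        if labels[i] < 0:
            labels[i] = n_clusters
            n_clusters += 1
    return labels
```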
So, whereas the previous work used the cosine distance, here we use the PLDA scoring instead of cosine. Instead of looking for the closest vectors in the sense of Euclidean distance, we look for the highest PLDA scores, and calculate the new mean, where the weighting function g is basically the PLDA score.
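Written out (in my notation, not necessarily the paper's), the shift moves the current point $y_t$ to the weighted mean

$$y_{t+1} = \frac{\sum_{i \in \mathcal{N}(y_t)} g(y_t, x_i)\, x_i}{\sum_{i \in \mathcal{N}(y_t)} g(y_t, x_i)}$$

where the sum runs over the selected neighbour i-vectors $x_i$ and the weight $g(y_t, x_i)$ is the PLDA score between them.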
The other difference we made is that we do not use a threshold to select the close vectors below it; instead, we use k-nearest neighbours: we set a k and take the k vectors which have the highest PLDA score.
So basically we run the same iterations, but in this case the threshold is not fixed like in the original algorithm; it depends on the distance of the k-th nearest vector. Otherwise we run the same mean shift algorithm: we calculate the mean according to these k nearest vectors, shift the mean, and continue the process.
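A minimal sketch of this modified update, assuming some function plda_score(y, X) that returns a PLDA score between a point and each row of X (the scorer itself is not shown; plda_score is a placeholder, not a real API). In the full mean shift this procedure is started from every i-vector; the random variant mentioned below starts it only from a randomly chosen subset.

```python
import numpy as np

def knn_plda_mean_shift(x, X, k, plda_score, max_iter=100, tol=1e-6):
    """Mean shift with k-nearest-neighbour selection by PLDA score.

    Instead of a fixed distance threshold, at every iteration we keep
    the k vectors with the highest PLDA score against the current
    point, so the effective window adapts to the data.
    """
    y = x.copy()
    for _ in range(max_iter):
        scores = plda_score(y, X)        # one score per vector in X
        knn = np.argsort(scores)[-k:]    # k highest-scoring neighbours
        # The talk only says the weight is "basically the PLDA score";
        # raw PLDA scores can be negative, so clip here to keep the
        # weighted mean well defined (the paper's weighting may differ).
        w = np.maximum(scores[knn], 1e-8)
        y_new = (w[:, None] * X[knn]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y
```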
Here we work with i-vectors, but slightly modified i-vectors; I will explain the small modification we make to them. We apply the mean shift algorithm according to the PLDA score, and obtain the clusters, as I just mentioned. We compare to the previous work, where the threshold was fixed, the cosine similarity was used as the distance, and a random mean shift was used, meaning that we do not start from all the points but only from randomly chosen ones.
So before clustering we of course need to train a UBM and the total variability matrix. But before using the PLDA score, we found that it is better to do an additional PCA on the data: we reduce our i-vectors from dimension four hundred down to two hundred fifty. We tried to compare this with simply extracting i-vectors of size two hundred fifty, and the PCA approach worked better. We are not sure why, but as a matter of fact it was better. Then we apply whitening and compute the PLDA score on these vectors.
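Putting the front end together, here is a minimal sketch under the steps as described (an extra PCA from 400 to 250 dimensions, then whitening, then PLDA scoring); the file name and the PLDA training and scoring calls are placeholders, not a specific library API.

```python
import numpy as np
from sklearn.decomposition import PCA

# ivectors: (n_segments, 400) matrix extracted with the UBM / T-matrix
# (the UBM and total variability matrix were trained on long segments)
ivectors = np.load("ivectors_400.npy")   # hypothetical file name

# Extra PCA step on top of the 400-dim i-vectors: 400 -> 250.
# This worked better than extracting 250-dim i-vectors directly.
# whiten=True folds the whitening step into the same transform.
pca = PCA(n_components=250, whiten=True)
x = pca.fit_transform(ivectors)

# Length normalisation before PLDA is common practice, though it is
# not explicitly stated in the talk (an assumption on my part).
x /= np.linalg.norm(x, axis=1, keepdims=True)

# PLDA is trained on SHORT segments to match the test condition;
# train_plda / plda.score stand in for a real PLDA implementation.
# plda = train_plda(x_dev_short, labels_dev)
# score = plda.score(x[i], x[j])
```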
Okay, so this part was explained before.
The experimental setup was this: we used NIST 2008 data, cut into short segments, on average two and a half seconds long, with an average of five segments per speaker. We evaluate the results according to the average speaker purity, the average cluster purity, and the K parameter; the other important quantity is how many clusters we have at the end compared to the true number of clusters.
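For reference, with $n_{ij}$ the number of segments of speaker $i$ in cluster $j$, $n_{\cdot j}$ and $n_{i\cdot}$ the cluster and speaker totals, and $N$ the total number of segments, these are the standard purity definitions from the clustering literature (the paper may differ in details); the relation $K=\sqrt{acp \cdot asp}$ is also mentioned in the question period below:

$$acp = \frac{1}{N}\sum_{j}\frac{1}{n_{\cdot j}}\sum_{i} n_{ij}^{2},\qquad asp = \frac{1}{N}\sum_{i}\frac{1}{n_{i\cdot}}\sum_{j} n_{ij}^{2},\qquad K = \sqrt{acp \cdot asp}$$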
So, starting from the beginning: we begin with the baseline system of cosine distance with a fixed threshold; this is the red line. We see that the performance varies a lot, so we have to know exactly what the best threshold is to make the clustering work. When we use k-nearest neighbours instead of the threshold, we see that we have a plateau, so it doesn't make much of a difference if we choose, say, thirteen or seventeen; it is much more robust when we use k-nearest neighbours. All of these results are for thirty speakers.
Next, instead of using the random mean shift we use the full mean shift, which is much more expensive computationally, but we see that we get some gain; this is still with the cosine distance. Then we switch from the cosine distance to the PLDA score, and we get a little bit more gain.
I have to say that both the PLDA training and the WCCN training in the cosine system were done on short segments, not on long segments; we will shortly see why we did it on short segments. When we trained the PLDA on long segments, we got much worse results; training on short segments improved the results dramatically. The total variability matrix, however, was trained on long segments only; we didn't use short segments there, because it was very bad.
All the results so far deal only with thirty speakers. This is a summary of those results: we see that it is better to move from a fixed threshold to k-nearest neighbours, to go from the random mean shift to the full mean shift, and to move to the PLDA, which gives the overall best results.
Another important aspect of the problem is how many clusters we have after the clustering process compared to the actual number of clusters. The red line in the figure is again the fixed threshold, and if you look at the result, it is not so nice: the true number of speakers is forty-six, but it was estimated as about one hundred eighty clusters, which means we get many small clusters. They are very pure, but small, and there are far too many of them. When we use k-nearest neighbours, we see that we are within about a factor of two, about sixty-two clusters. So we get a better K, better clustering performance, with many fewer clusters.
These are the results when we use the cosine distance with a fixed threshold on different numbers of speakers, from three to one hundred eighty-eight. When we compare with the proposed algorithm, we see that in this case the cluster purity is better, which is understandable: many small clusters, and they are all pure, with one or two segments each. But the overall K results are better for our algorithm. As for the average number of clusters, you can see that for three speakers it is okay, but when we go up to one hundred eighty-eight speakers it is off by a factor of almost ten: we get many more clusters than the true number of speakers.
When we go to the PLDA with k-nearest neighbours, we get better speaker purity and many fewer clusters, off only by a factor of about one and a half to two.
This summarizes the results. For three and seven speakers we get slightly better results with the cosine distance and a fixed threshold, but from fifteen speakers and up, we prefer the PLDA score with k-nearest neighbours. We see this both in the K results and in the number of clusters.
Okay, so we proposed a new system which gives better clustering performance and a much smaller number of clusters. We pay for this computationally, because we moved from a random mean shift to a full mean shift. And that's all I have to say. Do we have any questions?
Thank you. You made the interesting remark that for short-utterance clustering, you got better results when training the PLDA on short utterances, and that removing the longer utterances from the training actually helped, which is somewhat surprising. Could you explain this result?
I think it is because, if you train it on the long segments, there will be a big mismatch between the training condition and the testing condition: if we train on long segments and then calculate i-vectors from short segments, it is not really appropriate. Maybe with longer segments the speaker subspace would be estimated more correctly, basically much more accurately, but not for our problem. So yes and no: I think there should be some trade-off between the accuracy of the PLDA training and fitting the true problem.
Yes, please, next question.
Thank you for your presentation. Can you please go back to the results section where you showed the values of k versus the number of speakers? Maybe stop here. Here you are increasing the number of speakers while the value of k stays fixed, and the results are going down. Did you try different values of k for different numbers of speakers? And to be clear, I don't mean the capital K, the square root of the product of ASP and ACP; I mean the small k of the k-nearest neighbours.
These are the results with the best k, but as you can see from the rows, there is no big difference if you use fourteen, fifteen, or seventeen for each number of speakers. With the number of speakers fixed, we can use essentially the same values of k; they work almost the same for any number of speakers that we tested. It reaches a plateau and stays there. I assume that if we increased k to fifty or seventy, the results would start to go down at some point, but for any reasonable k size you get almost the same results.
What data did you use to train your PLDA when you used the short segments?
The same data that we used for the UBM; I don't remember exactly, you would have to go to Costa Rica and ask my student. Anyway, it is not from the test set: let's say we took the same development set used for training the UBM, took part of it, cut it into short segments, and trained the PLDA on that.
But to get the short segments, you are taking multiple short segments per telephone call, right?

Right, we take a phone call and make multiple segments out of it.

But chosen randomly, so from different sessions for the same speaker? Okay, so could it be that you used several short segments from the same phone call?
In principle it could be that several of them are from the same call, but we just choose them randomly.

I only asked because, jumping back to the earlier question: what we have seen, not for clustering, so maybe it is a different thing, but in terms of the PLDA parameters, is that you do better training them on the longer utterances even when testing on short durations; that was for speaker recognition, so it may not carry over. That is why I was asking whether you did a random selection of the data; so yes, it is very unlikely that it was concentrated from the same call.
Yes, that was my point. Also, the segments for the clustering on the test set were chosen randomly, and we ran each experiment ten times, except for the last one with one hundred eighty-eight speakers, because there are only one hundred eighty-eight speakers in the dataset, so we couldn't sample randomly.
First of all, one of the things that I liked in the original mean shift algorithm was its probabilistic interpretation, in the fact that the analysis starts with a non-parametric density estimation, meaning that at each point you place a small pdf, either Gaussian or triangular: with the triangular one you end up with the kind of threshold window, which is uniform, and with a Gaussian you end up again with a Gaussian, because of the differentiation. The update rule is then derived by simple differentiation, in order to find the mode to which the points converge.
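For reference, this is the standard derivation being alluded to (in the usual kernel density estimation notation; textbook material, not from the talk). The density estimate with a radially symmetric kernel is

$$\hat f(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x-x_i}{h}\right),\qquad K(u) = c\,k(\lVert u\rVert^{2})$$

and setting its gradient to zero, with the profile derivative $g = -k'$, gives the fixed-point (mean shift) update

$$x \leftarrow \frac{\sum_i g\!\left(\left\lVert\frac{x-x_i}{h}\right\rVert^{2}\right) x_i}{\sum_i g\!\left(\left\lVert\frac{x-x_i}{h}\right\rVert^{2}\right)}$$

so the point moves to the weighted mean, that is, toward a mode of $\hat f$. A truncated quadratic (Epanechnikov) profile makes $g$ a flat window, which is the threshold version, and a Gaussian profile makes $g$ Gaussian again.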
I am wondering: if you choose a PLDA score, so you use neither the cosine distance nor a standard Euclidean distance, which was the original choice, can you tell us whether this update rule still comes naturally from the same mechanism, from a non-parametric kernel density estimation?

As you guessed, the answer is no; we did not derive it that way. It is more pragmatic: what works is useful.
Okay, thank you. So, the next speaker.