Hi. First, I want to thank you all for staying here, because I thought that only me and the cameraman would be here. The work was done by my student, and he is the one who should have presented it, but unfortunately, about ten days ago he got married, so he preferred to go to Costa Rica, or somewhere like that, rather than come here and present. The work also interested me, so I am stuck with presenting it, and you will have to suffer me for about ten minutes.
Also, most of this workshop is on DNNs, and especially after some very good talks I felt that I also wanted to say something about that, so I will briefly talk about it a little bit. Then I will give a motivation, talk about the clustering problem, the basic mean shift algorithm and the modifications we made, and then I will present the clustering system, some experiments, and a summary.
So, that was regarding the DNNs. Okay, next.
So, our problem: say we have a taxi station. There are many taxis, and each taxi has more than one driver; the drivers also change. We have recordings over quite a while, two or three days: the drivers start to talk and stop talking, so we know exactly where the start and the end of each segment is. We collected these recordings with recording devices, and at the end of the day we want to know which segments were said by one speaker and which by another.
Anyone can come and talk, and we don't know when: one speaker can talk now and then the next time speak after two or three hours, maybe from a different car. So what we have is a bag of segments which are unlabeled, and we want to cluster them. Mostly these segments are very short, one and a half to two seconds on average. So we want to cluster short segments, and usually the population will be thirty speakers, forty speakers, and so on.
So this is our problem: given mainly short segments, we want to cluster them into homogeneous groups. That means we want good cluster purity, so that each cluster is occupied mostly by one speaker only, but we also want good speaker purity: we don't want the same speaker to be spread over ten clusters.
The basic mean shift algorithm works like this: we have many vectors; we choose a vector and find its neighbours, that is, take all the vectors whose distance is below some threshold, and then shift the point to the weighted mean of these vectors. We take that mean as the new reference point, again look for the neighbours below the threshold, calculate the mean, and so on, until we converge to some stable point. This is what we do for each vector.
This algorithm has been discussed many times, so for more details please refer to the literature.
After we find the stable point of each vector, we group all the stable points which are close to each other, according to some threshold. The number of groups we get is the number of clusters, and the points in each group are the members of that cluster. But we know that Euclidean distance is not very good for the purpose of speaker clustering.
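As a reference, here is a minimal sketch of this basic version (Euclidean distance, fixed threshold, flat window, so the weighted mean reduces to a plain mean); the function names are mine, not from the paper.

```python
import numpy as np

def mean_shift_point(x, X, threshold, max_iter=100, tol=1e-6):
    """Shift one starting vector x to its stable point (mode)."""
    y = x.copy()
    for _ in range(max_iter):
        # neighbours: all vectors within the fixed Euclidean threshold
        dists = np.linalg.norm(X - y, axis=1)
        neighbours = X[dists < threshold]
        if len(neighbours) == 0:
            break
        y_new = neighbours.mean(axis=0)  # flat kernel: plain mean
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

def cluster(X, threshold, group_threshold):
    """Run mean shift from every vector, then group nearby stable points."""
    modes = np.array([mean_shift_point(x, X, threshold) for x in X])
    labels = -np.ones(len(X), dtype=int)
    n_clusters = 0
    for i, m in enumerate(modes):
        for j in range(i):
            # stable points closer than group_threshold share a cluster
            if np.linalg.norm(m - modes[j]) < group_threshold:
                labels[i] = labels[j]
                break
        if labels[i] < 0:
            labels[i] = n_clusters
            n_clusters += 1
    return labels
```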
So, whereas the previous work used the cosine distance, here we use the PLDA scoring instead of cosine. Instead of looking for the closest vectors in the sense of Euclidean distance, we look for the highest PLDA scores, and calculate the new mean, where the weighting function g is basically the PLDA score.
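Written out (in my notation, not necessarily the paper's), the shift moves the current point $y_t$ to the weighted mean

$$y_{t+1} = \frac{\sum_{i \in \mathcal{N}(y_t)} g(y_t, x_i)\, x_i}{\sum_{i \in \mathcal{N}(y_t)} g(y_t, x_i)}$$

where the sum runs over the selected neighbour i-vectors $x_i$ and the weight $g(y_t, x_i)$ is the PLDA score between them.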
The other difference we made is that we do not use a threshold to select the close vectors below it; instead, we use k-nearest neighbours: we set a k and take the k vectors which have the highest PLDA score.
So basically we run the same iterations, but in this case the threshold is not fixed like in the original algorithm; it depends on the distance of the k-th nearest vector. Otherwise we run the same mean shift algorithm: we calculate the mean according to these k nearest vectors, shift the mean, and continue the process.
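A minimal sketch of this modified update, assuming some function plda_score(y, X) that returns a PLDA score between a point and each row of X (the scorer itself is not shown; plda_score is a placeholder, not a real API). In the full mean shift this procedure is started from every i-vector; the random variant mentioned below starts it only from a randomly chosen subset.

```python
import numpy as np

def knn_plda_mean_shift(x, X, k, plda_score, max_iter=100, tol=1e-6):
    """Mean shift with k-nearest-neighbour selection by PLDA score.

    Instead of a fixed distance threshold, at every iteration we keep
    the k vectors with the highest PLDA score against the current
    point, so the effective window adapts to the data.
    """
    y = x.copy()
    for _ in range(max_iter):
        scores = plda_score(y, X)        # one score per vector in X
        knn = np.argsort(scores)[-k:]    # k highest-scoring neighbours
        # The talk only says the weight is "basically the PLDA score";
        # raw PLDA scores can be negative, so clip here to keep the
        # weighted mean well defined (the paper's weighting may differ).
        w = np.maximum(scores[knn], 1e-8)
        y_new = (w[:, None] * X[knn]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y
```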
Here we work with i-vectors, but slightly modified i-vectors; I will explain the small modification we make to them. We apply the mean shift algorithm according to the PLDA score, and obtain the clusters, as I just mentioned. We compare to the previous work, where the threshold was fixed, the cosine similarity was used as the distance, and a random mean shift was used, meaning that we do not start from all the points but only from randomly chosen ones.
So before clustering we of course need to train a UBM and the total variability matrix. But before using the PLDA score, we found that it is better to do an additional PCA on the data: we reduce our i-vectors from dimension four hundred down to two hundred fifty. We tried to compare this with simply extracting i-vectors of size two hundred fifty, and the PCA approach worked better. We are not sure why, but as a matter of fact it was better. Then we apply whitening and compute the PLDA score on these vectors.
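Putting the front end together, here is a minimal sketch under the steps as described (an extra PCA from 400 to 250 dimensions, then whitening, then PLDA scoring); the file name and the PLDA training and scoring calls are placeholders, not a specific library API.

```python
import numpy as np
from sklearn.decomposition import PCA

# ivectors: (n_segments, 400) matrix extracted with the UBM / T-matrix
# (the UBM and total variability matrix were trained on long segments)
ivectors = np.load("ivectors_400.npy")   # hypothetical file name

# Extra PCA step on top of the 400-dim i-vectors: 400 -> 250.
# This worked better than extracting 250-dim i-vectors directly.
# whiten=True folds the whitening step into the same transform.
pca = PCA(n_components=250, whiten=True)
x = pca.fit_transform(ivectors)

# Length normalisation before PLDA is common practice, though it is
# not explicitly stated in the talk (an assumption on my part).
x /= np.linalg.norm(x, axis=1, keepdims=True)

# PLDA is trained on SHORT segments to match the test condition;
# train_plda / plda.score stand in for a real PLDA implementation.
# plda = train_plda(x_dev_short, labels_dev)
# score = plda.score(x[i], x[j])
```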
Okay, so this part was explained before.
The experimental setup was this: we used NIST 2008 data, cut into short segments, on average two and a half seconds long, with an average of five segments per speaker. We evaluate the results according to the average speaker purity, the average cluster purity, and the K parameter; the other important quantity is how many clusters we have at the end compared to the true number of clusters.
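For reference, with $n_{ij}$ the number of segments of speaker $i$ in cluster $j$, $n_{\cdot j}$ and $n_{i\cdot}$ the cluster and speaker totals, and $N$ the total number of segments, these are the standard purity definitions from the clustering literature (the paper may differ in details); the relation $K=\sqrt{acp \cdot asp}$ is also mentioned in the question period below:

$$acp = \frac{1}{N}\sum_{j}\frac{1}{n_{\cdot j}}\sum_{i} n_{ij}^{2},\qquad asp = \frac{1}{N}\sum_{i}\frac{1}{n_{i\cdot}}\sum_{j} n_{ij}^{2},\qquad K = \sqrt{acp \cdot asp}$$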
So, starting from the beginning: we begin with the baseline system of cosine distance with a fixed threshold; this is the red line. We see that the performance varies a lot, so we have to know exactly what the best threshold is to make the clustering work. When we use k-nearest neighbours instead of the threshold, we see that we have a plateau, so it doesn't make much of a difference if we choose, say, thirteen or seventeen; it is much more robust when we use k-nearest neighbours. All of these results are for thirty speakers.
Next, instead of using the random mean shift we use the full mean shift, which is much more expensive computationally, but we see that we get some gain; this is still with the cosine distance. Then we switch from the cosine distance to the PLDA score, and we get a little bit more gain.
I have to say that both the PLDA training and the WCCN training in the cosine system were done on short segments, not on long segments; we will shortly see why we did it on short segments. When we trained the PLDA on long segments, we got much worse results; training on short segments improved the results dramatically. The total variability matrix, however, was trained on long segments only; we didn't use short segments there, because it was very bad.
All the results so far deal only with thirty speakers. This is a summary of those results: we see that it is better to move from a fixed threshold to k-nearest neighbours, to go from the random mean shift to the full mean shift, and to move to the PLDA, which gives the overall best results.
Another important aspect of the problem is how many clusters we have after the clustering process compared to the actual number of clusters. The red line in the figure is again the fixed threshold, and if you look at the result, it is not so nice: the true number of speakers is forty-six, but it was estimated as about one hundred eighty clusters, which means we get many small clusters. They are very pure, but small, and there are far too many of them. When we use k-nearest neighbours, we see that we are within about a factor of two, about sixty-two clusters. So we get a better K, better clustering performance, with many fewer clusters.
These are the results when we use the cosine distance with a fixed threshold on different numbers of speakers, from three to one hundred eighty-eight. When we compare with the proposed algorithm, we see that in this case the cluster purity is better, which is understandable: many small clusters, and they are all pure, with one or two segments each. But the overall K results are better for our algorithm. As for the average number of clusters, you can see that for three speakers it is okay, but when we go up to one hundred eighty-eight speakers it is off by a factor of almost ten: we get many more clusters than the true number of speakers.
When we go to the PLDA with k-nearest neighbours, we get better speaker purity and many fewer clusters, off only by a factor of about one and a half to two.
This summarizes the results. For three and seven speakers we get slightly better results with the cosine distance and a fixed threshold, but from fifteen speakers and up, we prefer the PLDA score with k-nearest neighbours. We see this both in the K results and in the number of clusters.
Okay, so we proposed a new system which gives better clustering performance and a much smaller number of clusters. We pay for this computationally, because we moved from a random mean shift to a full mean shift. And that's all I have to say. Do we have any questions?
Thank you. You made the interesting remark that for short-utterance clustering, you got better results when training the PLDA on short utterances, and that removing the longer utterances from the training actually helped, which is somewhat surprising. Could you explain this result?
I think it is because, if you train it on the long segments, there will be a big mismatch between the training condition and the testing condition: if we train on long segments and then calculate i-vectors from short segments, it is not really appropriate. Maybe with longer segments the speaker subspace would be estimated more correctly, basically much more accurately, but not for our problem. So yes and no: I think there should be some trade-off between the accuracy of the PLDA training and fitting the true problem.
Yes, please, next question.
Thank you for your presentation. Can you please go back to the results section where you showed the values of k versus the number of speakers? Maybe stop here. Here you are increasing the number of speakers while the value of k stays fixed, and the results are going down. Did you try different values of k for different numbers of speakers? And to be clear, I don't mean the capital K, the square root of the product of ASP and ACP; I mean the small k of the k-nearest neighbours.
These are the results with the best k, but as you can see from the rows, there is no big difference if you use fourteen, fifteen, or seventeen for each number of speakers. With the number of speakers fixed, we can use essentially the same values of k; they work almost the same for any number of speakers that we tested. It reaches a plateau and stays there. I assume that if we increased k to fifty or seventy, the results would start to go down at some point, but for any reasonable k size you get almost the same results.
What data did you use to train your PLDA when you used the short segments?
The same data that we used for the UBM; I don't remember exactly, you would have to go to Costa Rica and ask my student. Anyway, it is not from the test set: let's say we took the same development set used for training the UBM, took part of it, cut it into short segments, and trained the PLDA on that.
But to get the short segments, you are taking multiple short segments per telephone call, right?

Right, we take a phone call and make multiple segments out of it.

But chosen randomly, so from different sessions for the same speaker? Okay, so could it be that you used several short segments from the same phone call?
In principle it could be that several of them are from the same call, but we just choose them randomly.

I only asked because, jumping back to the earlier question: what we have seen, not for clustering, so maybe it is a different thing, but in terms of the PLDA parameters, is that you do better training them on the longer utterances even when testing on short durations; that was for speaker recognition, so it may not carry over. That is why I was asking whether you did a random selection of the data; so yes, it is very unlikely that it was concentrated from the same call.
Yes, that was my point. Also, the segments for the clustering on the test set were chosen randomly, and we ran each experiment ten times, except for the last one with one hundred eighty-eight speakers, because there are only one hundred eighty-eight speakers in the dataset, so we couldn't sample randomly.
First of all, one of the things that I liked in the original mean shift algorithm was its probabilistic interpretation, in the fact that the analysis starts with a non-parametric density estimation, meaning that at each point you place a small pdf, either Gaussian or triangular: with the triangular one you end up with the kind of threshold window, which is uniform, and with a Gaussian you end up again with a Gaussian, because of the differentiation. The update rule is then derived by simple differentiation, in order to find the mode to which the points converge.
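For reference, this is the standard derivation being alluded to (in the usual kernel density estimation notation; textbook material, not from the talk). The density estimate with a radially symmetric kernel is

$$\hat f(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x-x_i}{h}\right),\qquad K(u) = c\,k(\lVert u\rVert^{2})$$

and setting its gradient to zero, with the profile derivative $g = -k'$, gives the fixed-point (mean shift) update

$$x \leftarrow \frac{\sum_i g\!\left(\left\lVert\frac{x-x_i}{h}\right\rVert^{2}\right) x_i}{\sum_i g\!\left(\left\lVert\frac{x-x_i}{h}\right\rVert^{2}\right)}$$

so the point moves to the weighted mean, that is, toward a mode of $\hat f$. A truncated quadratic (Epanechnikov) profile makes $g$ a flat window, which is the threshold version, and a Gaussian profile makes $g$ Gaussian again.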
I am wondering: if you choose a PLDA score, so you use neither the cosine distance nor a standard Euclidean distance, which was the original choice, can you tell us whether this update rule still comes naturally from the same mechanism, from a non-parametric kernel density estimation?

As you guessed, the answer is no; we did not derive it that way. It is more pragmatic: what works is useful.
Okay, thank you. So, the next speaker.