Hi everyone. Today I am going to present our work on over-clustering for speaker diarization. The outline of this talk is as follows: in the beginning I will give a brief introduction to the task of speaker diarization; then I will describe the proposed method and the experimental results.
As we all know, speaker diarization is a task in speaker recognition, together with speaker identification and speaker verification. The figure at the bottom of this slide shows the scenario of speaker diarization: two speakers are talking with each other, and based on the recording, the task of speaker diarization is to find out when each speaker is speaking.
Technically, speaker diarization can be decomposed into two steps: segmentation and clustering.
Now I will go through the most commonly used framework in speaker diarization, which is agglomerative hierarchical clustering (AHC), shown in this figure. In the segmentation step, voice activity detection first removes the non-speech parts; the speech is then cut into segments based on speaker change point detection. After that, AHC groups the speech segments from the same speaker into the same cluster.
Depending on whether the number of speakers is known beforehand or not, we have two modes of operation. When the number of speakers is given to be n, the clustering stops when the number of clusters reaches n, and each of the n clusters is used as the representation of one speaker in the conversation. When the number of speakers is unknown, we set a threshold on the similarity score of the merging clusters: when the similarity of the two closest clusters drops below the threshold, the clustering stops. This results in n-hat clusters, where n-hat is the estimated number of speakers, and each of the clusters is used to represent a specific speaker in the conversation.
After the clustering, Viterbi re-segmentation is usually applied. We first represent each speaker with a GMM; after that, we build an HMM on top of the GMMs by adding transition probabilities; finally, we align the speech frames to the speaker GMMs by Viterbi decoding.
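As a rough sketch of the re-segmentation idea: a pure-numpy Viterbi pass over per-frame speaker log-likelihoods (which in the real system would come from the per-speaker GMMs), with a constant switch penalty standing in for the HMM transition probabilities:

```python
import numpy as np

def viterbi_resegment(loglik, switch_penalty=2.0):
    """Assign each frame to a speaker by Viterbi decoding.

    loglik: (T, S) per-frame log-likelihoods under each speaker model.
    switch_penalty discourages rapid speaker changes (a stand-in for
    the HMM transition probabilities). Returns one speaker per frame.
    """
    T, S = loglik.shape
    delta = np.zeros((T, S))   # best path score ending in each state
    back = np.zeros((T, S), dtype=int)
    delta[0] = loglik[0]
    for t in range(1, T):
        # trans[i, j] = score of being in state i at t-1 and moving to j
        trans = delta[t - 1][:, None] - switch_penalty * (1 - np.eye(S))
        back[t] = np.argmax(trans, axis=0)
        delta[t] = loglik[t] + np.max(trans, axis=0)
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

The penalty smooths out isolated frames that momentarily favor the wrong speaker, which is exactly what the re-segmentation step is for.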
Although AHC has been widely used and its performance has been acknowledged, it still has some shortcomings. In our work, we cope with one of them: the purity of the clusters, since the speech of different speakers can end up merged into one cluster.
This slide takes the diarization of a two-speaker conversation as an example, with speaker A in blue and speaker B in red. During clustering, we may have a cluster for speaker A and a segment consisting of speech from both speakers A and B. If this impure segment has the highest similarity with the cluster of speaker A, it will be merged into the cluster of speaker A. In another scenario, segments from speaker B may also be merged into the cluster of speaker A, as shown in the second picture. In both cases, the cluster of speaker A will be biased towards speaker B, and as the clustering goes on, the speech of speakers A and B may be merged together.
That means the diarization results will be degraded. Ideally, the cluster of speaker A is composed of segments from A only; as the clustering gets worse, the cluster of A becomes composed of speech segments from both A and B. One straightforward way to avoid this problem is to stop the clustering early, before impure merges happen; in this way we obtain cleaner clusters, but they will be smaller.
For the resulting clusters, there are two desirable properties. First, a cluster should be large enough to provide a robust representation of the speaker; second, it should be clean, not allowing speech from other speakers to be involved. As we have seen, there will be a tradeoff between the two factors.
We propose over-clustering: by setting a strict threshold in AHC, the clustering stops with K clusters, where K is larger than the anticipated number of speakers n. That is the first step of our method.
Depending on whether the number of speakers n is given or not, we have different implementations. When the number of speakers is unknown, we first estimate it as n-hat. Then, based on the given or estimated number of speakers (n or n-hat), we perform cluster selection to select that many clusters from the K resulting clusters; each of the selected clusters represents a specific speaker in the conversation. Finally, we apply Viterbi re-segmentation to align the frames of the whole conversation to the selected clusters.
In this slide and the following, we describe how the number of speakers is estimated. We first compute the cluster similarity score matrix S, in which each element is the speaker similarity score between two clusters; for example, S_jk is the speaker similarity score between the j-th and the k-th clusters. S is therefore a symmetric K-by-K matrix. On the score matrix S, we perform eigenvalue decomposition and sort the eigenvalues in descending order, obtaining u_1 to u_K. After that, we compute the ratio between adjacent eigenvalues. Finally, the number of speakers n-hat is estimated as the point with the maximum eigengap ratio.
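A minimal sketch of this eigengap-style estimate; the variable names and the exact ratio form here are illustrative assumptions, not necessarily the paper's exact formula:

```python
import numpy as np

def estimate_num_speakers(S, max_speakers=None):
    """Estimate the number of speakers from a cluster similarity matrix.

    S: (K, K) symmetric similarity score matrix between clusters.
    Eigenvalues are sorted in descending order, and the estimate is
    the index with the largest ratio between adjacent eigenvalues.
    """
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
    limit = max_speakers or len(eigvals) - 1
    # ratio of adjacent eigenvalues; small constant for stability
    gaps = eigvals[:limit] / (np.abs(eigvals[1 : limit + 1]) + 1e-8)
    return int(np.argmax(gaps)) + 1
```

For a matrix with two clear blocks of mutually similar clusters, the gap after the second eigenvalue dominates, so the estimate is two speakers.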
With the given or estimated number of speakers, in this slide and the following we show how the cluster selection works. The goal is to select the n cleanest clusters out of the K over-clustered ones. To achieve this, we first enumerate all possible combinations of n out of the K clusters, each combination denoted by an index set I. After that, we construct the sub score matrix for each combination by extracting the corresponding rows and columns from S, which gives an n-by-n matrix. On this sub score matrix we then perform eigenvalue decomposition and obtain the eigenvalues. Finally, the combination with the maximum eigenvalue summation is selected, and its clusters are used as the speaker clusters.
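A brute-force sketch of the selection step. Note that for a symmetric sub-matrix the eigenvalue sum equals its trace, so this toy version only discriminates between combinations when the diagonal self-scores differ across clusters; the real system's scoring details may differ:

```python
import itertools
import numpy as np

def select_clusters(S, n):
    """Pick the n of K clusters whose similarity sub-matrix has the
    largest eigenvalue summation.

    S: (K, K) cluster similarity score matrix. Returns the index
    tuple of the selected clusters.
    """
    best_score, best_subset = -np.inf, None
    for subset in itertools.combinations(range(len(S)), n):
        # extract the rows and columns of this combination
        sub = S[np.ix_(subset, subset)]
        score = np.sum(np.linalg.eigvalsh(sub))
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset
```

Enumerating all C(K, n) combinations is feasible here because K is only slightly larger than the number of speakers.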
So much for the description of the algorithm; next we come to the experiments. All experiments were carried out on a dataset of multi-speaker conversations consisting of two parts, a development set and an evaluation set. The durations of the conversations vary on the order of hundreds of seconds, and the number of speakers per conversation ranges from one to nine.
In our evaluation, we used the diarization error rate (DER) as the metric, and we used the ground-truth segmentation as the oracle segmentation. It has to be noted that when the speech of two speakers overlaps in the reference, the overlapped segments are treated as individual segments.
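As a reminder of how DER behaves, here is a toy frame-level computation. It assumes the hypothesis speaker indices have already been mapped to the reference indices; the official metric additionally searches the optimal speaker mapping and applies a forgiveness collar:

```python
import numpy as np

def frame_der(ref, hyp, nonspeech=-1):
    """Toy frame-level diarization error rate.

    ref, hyp: integer speaker labels per frame; `nonspeech` marks
    silence. DER = (missed speech + false alarm + speaker confusion)
    divided by the total reference speech.
    """
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    ref_speech = ref != nonspeech
    miss = np.sum(ref_speech & (hyp == nonspeech))
    fa = np.sum(~ref_speech & (hyp != nonspeech))
    conf = np.sum(ref_speech & (hyp != nonspeech) & (hyp != ref))
    return (miss + fa + conf) / np.sum(ref_speech)
```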
In our experiments we had two speaker models. The first is a TDNN-based bottleneck feature extractor; the second is an x-vector extractor based on a ResNet model. For both models, we used acoustic features together with their dynamic coefficients as the input. In the TDNN model, the input layer splices each frame with its contextual frames on both the left and the right; the hidden layers have on the order of a thousand units, and the output of the bottleneck hidden layer is used as the frame-level speaker representation.
In the ResNet model, there were nine convolutional layers followed by pooling, and the dimension of the resulting speaker embedding was one hundred and twenty-eight.
Both models were trained with a speaker classification objective, and the number of training speakers was over eleven thousand three hundred.
We used the conventional AHC as the baseline. Both the baseline and the proposed over-clustering system used x-vectors combined with cosine distance for speaker modeling and speaker similarity measurement. For the speaker number estimation and cluster selection in our over-clustering framework, we used the similarity score matrix between the speaker clusters, as described above. In the re-segmentation phase, we used a GMM for each speaker.
We began our experiments in the scenario where the number of speakers is known. This table shows the performance comparison between the conventional AHC and the proposed over-clustering on the development and the evaluation sets, respectively. From the comparison, we can see that the over-clustering provides better performance than the conventional AHC. To understand the reason for the improvement, we computed the cluster purity after the whole clustering process of the two systems; the cluster purity is given by the formula on the slide, in which each speech segment in a cluster is checked against the speaker labels in the reference. From this comparison, we can see the superiority of over-clustering in keeping the speaker clusters pure, so that it provides a better initialization for the re-segmentation phase.
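The purity measure used in this comparison can be sketched at the frame level roughly as follows (a standard definition; the paper's exact formula may weight segments differently):

```python
import numpy as np

def cluster_purity(cluster_labels, ref_labels):
    """Frame-level cluster purity: for each cluster, count the frames
    of its dominant reference speaker, then divide by total frames."""
    cluster_labels = np.asarray(cluster_labels)
    ref_labels = np.asarray(ref_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        ref_in_c = ref_labels[cluster_labels == c]
        # frames belonging to the dominant speaker inside cluster c
        correct += np.bincount(ref_in_c).max()
    return correct / len(ref_labels)
```

A purity of 1.0 means every cluster contains speech from a single speaker only, which is exactly what a good initialization for re-segmentation needs.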
Then we continued our experiments in the scenario where the number of speakers is unknown. This table shows the performance comparison between the conventional AHC and the proposed over-clustering on the development and the evaluation sets, respectively. From the comparison, we can see that the over-clustering again achieves better performance than AHC.
Moreover, we report the numbers of speakers estimated by the two systems. To further examine the advantage of over-clustering when the number of speakers is unknown, note that on the development set the estimated numbers of speakers were more than or equal to the ground truth. This shows that the over-clustering not only estimates the number of speakers more accurately, but also, when the estimate is not equal to the ground truth, it tends to over-estimate; combined with the cluster selection, correct speaker clusters can still be obtained. This helps us to understand the advantage of the over-clustering strategy.
In the last experiment, we varied the threshold in both systems. From the results, we can see that across the range of thresholds the over-clustering consistently performed better than AHC; in particular, at the threshold of zero point one, corresponding to the row of Table 1 in the paper, the over-clustering clearly outperformed AHC. Moreover, the curve of the over-clustering system is flatter, which means that over-clustering is less sensitive to the threshold and more robust.
Finally, we come to the conclusion. In this paper, we proposed an over-clustering alternative to AHC-based speaker diarization, consisting of two steps: first, the number of initial clusters is made larger than the anticipated number of speakers; then, we combine speaker number estimation and cluster selection to reduce them to the given or estimated number of speakers. The advantage of the proposed method was demonstrated from two aspects. The first is that it performs better than AHC-based speaker diarization, whether the number of speakers is known or not. The second is that the proposed similarity-matrix-based estimation of the number of speakers and selection of the speaker clusters makes the threshold setting relatively simple and robust. That is all of my presentation. Thank you.