0:00:13come come from us
0:00:14and today i would like to present our work entitled linguistic influences on
0:00:20bottom-up and top-down clustering
0:00:23for speaker diarization
0:00:25so let me
0:00:26give a short overview of the work
0:00:29so first
0:00:31we give a short introduction giving the motivation of this work
0:00:34and continue with the
0:00:36formulation of the problem
0:00:38to then move on to compare these two clustering systems
0:00:44and finally illustrate our ideas with
0:00:46some experimental work
0:00:53so
0:00:53as we have seen during the last RT'09 evaluation there are basically two main approaches for
0:01:00speaker diarization one is bottom-up also called agglomerative hierarchical clustering
0:01:06and the other is top-down
0:01:08or divisive hierarchical clustering
0:01:12we presented actually last year a paper
0:01:18giving a purification process
0:01:21for
0:01:22speaker diarization systems
0:01:24and we saw that it gives some consistent improvement for the top-down system
0:01:29but trying to apply it to the bottom-up
0:01:32we saw that the results were totally inconsistent
0:01:36so that's
0:01:38the motivation of this work to know why it does not work and it leads us to
0:01:43try to have a look at what the influence of the linguistic nuisance is on bottom-up and
0:01:49top-down
0:01:52so let's start with the formulation of the problem
0:01:56so here you have an audio stream
0:01:59and we want to solve the problem of who spoke when so we propose to call G the segmentation
0:02:04so
0:02:06the group of boundaries at each speaker turn
0:02:09and
0:02:11S
0:02:12which is the
0:02:14speaker sequence so the list of the successive speakers
0:02:17so who is S and when is G in this case
0:02:21and so we can
0:02:23summarise this
0:02:25setting by the following equation so finding the optimum S and the optimum G as the argument of the
0:02:31maximum
0:02:32of
0:02:33the probability of S and G given a set of observations so in this case O the audio stream
0:02:40so just using Bayes' rule on that
0:02:44equation
0:02:46we can get the second line you see on the screen
0:02:49and the denominator can be dropped as it does not depend on S or
0:02:54G so giving the equation number one there
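The slide equations are not part of the transcript. As a sketch only, writing O for the audio observations, S for the speaker sequence and G for the segmentation as named in the talk, the derivation described here could be written as follows (the tilde notation for the optimum is an assumption):

```latex
\begin{align}
(\tilde{S}, \tilde{G})
  &= \arg\max_{S,G} P(S, G \mid O) \notag \\
  &= \arg\max_{S,G} \frac{P(O \mid S, G)\, P(S, G)}{P(O)} \notag \\
  &= \arg\max_{S,G} P(O \mid S, G)\, P(S, G) \tag{1}
\end{align}
```

The last step holds because P(O) does not depend on S or G, so it does not change which pair maximises the expression.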
0:02:57so with this equation we can see that
0:03:00there are two models which are required to solve this optimization task
0:03:05the first one to compute
0:03:07P of O given S and G
0:03:09is the acoustic speaker models
0:03:11so in this case it's often a GMM in the main approaches we use
0:03:16currently in the
0:03:18state of the art
0:03:19and the second
0:03:20model so P
0:03:22of S and G
0:03:23which is often omitted
0:03:25maybe except in the
0:03:27previous work which has just been presented now
0:03:30and
0:03:32so looking at this equation we see that we have two main difficulties
0:03:36first
0:03:37of course we do not know what the speaker inventory
0:03:40is
0:03:41and secondly the acoustic model defined
0:03:45in a perfect world
0:03:46depends only on the speaker but it can depend as well
0:03:50on other nuisances like the linguistic content
0:03:54so for the next part of this presentation we make the following assumption
0:03:59that the major nuisance
0:04:01variation is due only
0:04:03to the linguistic content
0:04:05so that's what we are going to look at
0:04:08the influence of the different phonemes
0:04:10and they are going to be written Q
0:04:14so
0:04:15considering this assumption we can just reformulate the equation
0:04:21to take into account the
0:04:22speaker inventory delta that is all the possible speakers
0:04:28so now we are looking for the optimum S and G plus the optimal speaker inventory delta
0:04:35so considering now the influence of the phonemes we can move on to the second line on
0:04:41the screen
0:04:42which corresponds to marginalising the probability over all the different phonemes Q
0:04:48and the third line is as i just explained obtained with Bayes' rule
0:04:56and next
0:04:57we propose to make two assumptions first
0:05:00in speaker diarization we make the following assumption that all the speakers are equiprobable
0:05:05so we can just drop the speaker sequence model so P
0:05:09of S and G can just disappear
0:05:11and the second assumption is
0:05:13that
0:05:14we can expect
0:05:15the phonemes to be independent of the speaker and independent of G as well
0:05:19so that's why we can just keep the prior of Q
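A possible reconstruction of this step, assuming the symbol Δ for the speaker inventory and a sum over the phoneme classes Q. The slides are not in the transcript, so the exact form shown in the talk may differ:

```latex
\begin{align}
(\tilde{S}, \tilde{G}, \tilde{\Delta})
  &= \arg\max_{S,G,\Delta} P(S, G \mid O) \notag \\
  &= \arg\max_{S,G,\Delta} \sum_{Q} P(S, G, Q \mid O) \notag \\
  &= \arg\max_{S,G,\Delta} \sum_{Q} P(O \mid S, G, Q)\, P(Q \mid S, G)\, P(S, G) \notag \\
  &= \arg\max_{S,G,\Delta} \sum_{Q} P(O \mid S, G, Q)\, P(Q)
\end{align}
```

The third line applies Bayes' rule and drops P(O). The last line uses the two assumptions stated above: equiprobable speakers, so P(S, G) is constant, and phonemes independent of S and G, so P(Q | S, G) reduces to the prior P(Q).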
0:05:24so finally
0:05:26we get two equations the first for a simple approach
0:05:29the second line for a maybe more complete approach
0:05:32and comparing both of them which should lead to the same results in a perfect world
0:05:37we see that
0:05:39in the second equation the phonemes are normalized
0:05:42and
0:05:44that in the first one
0:05:46we should have a normalization as well
0:05:48it means that P of O given S and G has to be trained
0:05:52so that it is
0:05:53normalized against the different phonemes
0:05:57so
0:05:58to summarise we
0:06:00see from this equation that
0:06:01the speaker inventory delta has to be optimised together with S and G
0:06:07and so there is no direct solution for that
0:06:10so that is the reason why it is usual to turn to a hierarchical search
0:06:17which is mainly the bottom-up and top-down approaches
0:06:22so if we move on now to comparing these two approaches
0:06:29top-down starts with
0:06:30just one cluster and divides it
0:06:33iteratively in order to get the optimum number of clusters while bottom-up is the opposite scenario we start
0:06:39with plenty of clusters and merge them iteratively
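To make the two search directions concrete, here is a toy sketch on one-dimensional points. It is not the diarization systems from the talk, which work on speech segments with acoustic models. The function names, the mean-gap merge criterion and the widest-gap split criterion are all illustrative assumptions.

```python
from statistics import mean


def bottom_up(points, n_clusters):
    """Agglomerative: start with one cluster per point, merge the closest pair."""
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > max(n_clusters, 1):
        # In sorted 1-D data the closest pair of cluster means is always adjacent.
        gaps = [mean(clusters[i + 1]) - mean(clusters[i])
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def top_down(points, n_clusters):
    """Divisive: start with a single cluster, split at the widest gap."""
    clusters = [sorted(points)]
    while len(clusters) < n_clusters:
        best = None  # (gap, cluster index, split position)
        for ci, c in enumerate(clusters):
            for k in range(1, len(c)):
                gap = c[k] - c[k - 1]
                if best is None or gap > best[0]:
                    best = (gap, ci, k)
        if best is None:  # every cluster is a single point, nothing to split
            break
        _, ci, k = best
        c = clusters[ci]
        clusters[ci:ci + 1] = [c[:k], c[k:]]
    return clusters


points = [1, 2, 3, 10, 11, 20]
print(bottom_up(points, 3))
print(top_down(points, 3))
```

On well separated data like this both directions reach the same partition, which mirrors the "perfect world" argument made later in the talk.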
0:06:45so bottom-up is by far the more popular approach
0:06:48it got the best results at the last
0:06:52RT'09 evaluation
0:06:53top-down is
0:06:54maybe a bit less popular but achieves competitive results
0:06:58our work for instance
0:07:00shows that for single distant microphone it can lead to comparable results
0:07:05but the question is okay we start and the system will converge to some clusters
0:07:10and how can we be sure that these clusters correspond to speakers
0:07:14or to another acoustic nuisance like the phonemes
0:07:19so
0:07:20we require that this approach converges to a local maximum
0:07:25in a perfect world
0:07:29inter-speaker variation dominates over the intra-speaker variation
0:07:33and if
0:07:34i mean
0:07:35this holds we could say okay bottom-up and top-down should lead to exactly
0:07:41the same
0:07:42results
0:07:44but
0:07:45of course nothing is perfect
0:07:47there is as well the influence of the linguistic content which can be very significant
0:07:53and since the speaker models are not well normalized
0:07:59the system can converge to a local maximum and we can
0:08:02get not
0:08:04speaker units but other acoustic units like the phonemes Q
0:08:09so in the case of top-down
0:08:11the new speaker models are derived from and
0:08:15normalized by a general model
0:08:17so this model is trained with all of the available speech so we can expect this model
0:08:22to be well normalized
0:08:24and then the
0:08:26speaker models are iteratively introduced with a large amount of data so
0:08:30we can expect
0:08:33to have these new models quite well normalized as well
0:08:38so there is a huge risk as well
0:08:41that in normalising
0:08:44the linguistic nuisance
0:08:48we normalise
0:08:49the speaker variation as well that's of course what we don't want to get we want to
0:08:53get a highly speaker-discriminative system
0:08:57by comparison with the bottom-up
0:08:59we start the system with some very small clusters
0:09:02which can
0:09:05correspond i mean to a local maximum and are highly discriminative
0:09:09so that is from this point of view
0:09:11an
0:09:13advantage of bottom-up
0:09:15but the problem is
0:09:16as the clusters are very small at the beginning
0:09:19there is a huge risk that the system converges to some
0:09:24other acoustic units and is not well
0:09:27normalized
0:09:29so finally just to sum up
0:09:31i think both of the systems have their own drawbacks and their own advantages
0:09:37according to the speaker discrimination and the normalization against linguistic nuisances
0:09:44so let's illustrate this now
0:09:47with some
0:09:49experimental work
0:09:50so
0:09:51here is our experimental setup so
0:09:53we have a speech activity detector which is the same for both of the systems
0:10:01on the left i think it's on the left for you as well yeah
0:10:04you have the bottom-up system so it's a classical
0:10:09state-of-the-art system
0:10:10with the following reference you can see i am not going
0:10:13to spend too much time
0:10:16talking about this and on the right side you have the top-down system
0:10:20so a typical top-down system as well
0:10:23so these are the two core
0:10:26systems
0:10:27and next we use the purification
0:10:30as in the following paper shown here
0:10:32so this is an optional step we will see the difference it makes
0:10:36and then a MAP-based resegmentation and after this a normalisation of
0:10:41the features and a final resegmentation
0:10:45for the datasets so
0:10:47the training is done on conference meetings
0:10:50from the NIST RT'04 '05 and '06 evaluations
0:10:53and for the evaluation sets we propose to use the RT'07 and RT'09
0:10:59datasets and a set
0:11:02of TV shows the Grand Échiquier which is a French TV shows corpus
0:11:07here are the results with
0:11:09no additional purification
0:11:10so the first column is the score with overlapped speech
0:11:14and
0:11:14the second is the score without
0:11:16overlapped speech
0:11:18of course as our system
0:11:21does not process the overlapped speech we just focus on the second one
0:11:27so first
0:11:28we can see
0:11:30just looking at the results without applying
0:11:34purification
0:11:36that okay first for both of the systems
0:11:40for bottom-up and for
0:11:44top-down as well
0:11:47the results are much worse
0:11:51for the TV shows
0:11:54secondly
0:11:55it is not always the same system which provides the best results
0:12:00according to the dataset we see for example for RT'07 top-down gives better
0:12:04results while for RT'09 that
0:12:07is the bottom-up
0:12:08and
0:12:09we can also consider now
0:12:11the results with purification
0:12:14so as we can see
0:12:16purification
0:12:18just
0:12:19brings a degradation in performance for the bottom-up
0:12:24while for the top-down
0:12:26it always improves the system
0:12:30so
0:12:31the question is okay maybe purification improves the discrimination between clusters
0:12:38for the top-down
0:12:43while the bottom-up
0:12:44is less well normalized against phonemes
0:12:49so in this case the purification is useless
0:12:53so let's explain this a bit more clearly with the cluster purity
0:12:57so we propose to look at all the clusters output by one of the systems and compute the purity
0:13:03for all of these clusters
0:13:06so the purity is computed
0:13:08this way
0:13:10we take the dominant speaker time in seconds
0:13:13and we divide by the total number
0:13:16of seconds of the cluster
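The purity measure just described can be sketched in a few lines. The input format (a mapping from reference speaker label to seconds inside one cluster), the function names and the unweighted average over clusters are assumptions made for illustration, not details given in the talk.

```python
def cluster_purity(speaker_seconds):
    """Purity of one cluster: dominant-speaker time divided by total cluster time.

    speaker_seconds maps a reference speaker label to the number of seconds
    of that speaker assigned to the cluster (hypothetical input format).
    """
    total = sum(speaker_seconds.values())
    if total == 0:
        raise ValueError("empty cluster")
    return max(speaker_seconds.values()) / total


def average_purity(clusters):
    """Unweighted mean purity over all clusters output by one system."""
    return sum(cluster_purity(c) for c in clusters) / len(clusters)


# Made-up example: two clusters with speaker durations in seconds.
clusters = [
    {"spk_a": 90.0, "spk_b": 10.0},  # dominated by spk_a
    {"spk_b": 30.0, "spk_c": 30.0},  # evenly split between two speakers
]
print(cluster_purity(clusters[0]), average_purity(clusters))
```

A duration-weighted average would be an equally reasonable choice; the transcript does not say which one was used.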
0:13:19so there are different situations
0:13:21if we have a high purity and a small number
0:13:24of
0:13:25clusters
0:13:26well i mean when one has a high purity of the
0:13:30clusters
0:13:31we can expect the system
0:13:32to be likely to converge to some speaker units
0:13:38and with a very high number of clusters
0:13:41it is likely that the system converges to other acoustic units
0:13:45otherwise
0:13:51we do not know what happened different scenarios are possible and the same for the last case
0:13:56so we are looking at the purity and the number of clusters
0:13:59for the top-down and the bottom-up so we see that
0:14:02when we do not use the purification process
0:14:07top-down has a comparable purity
0:14:10more or less with the bottom-up
0:14:12but
0:14:13the top-down has fewer clusters
0:14:16and we see
0:14:18on the right the number of clusters
0:14:20we have the number of clusters for the bottom-up and for the top-down and the
0:14:25number of clusters for the ground truth
0:14:27so we can expect top-down to be in the first situation so converged
0:14:32to some speakers
0:14:34whereas bottom-up is probably the fourth case
0:14:37so
0:14:39to see what happens
0:14:41with
0:14:43the purification
0:14:44we see that
0:14:45first for the top-down with the purification
0:14:48the purity is improved
0:14:54so for sure it helps the system to converge to speakers more than without purification
0:15:02for the bottom-up there is a consistent improvement in purity
0:15:05but we have more clusters than for the top-down so
0:15:09it is not obvious to say in which situation we are
0:15:16the last part of this experimental work is
0:15:18looking at the phoneme normalization
0:15:20for this case
0:15:22we take the different clusters
0:15:24we take all the clusters
0:15:27for a system and for each of these clusters
0:15:30we build a
0:15:31histogram of the different phonemes
0:15:34we do this for all the clusters generated by the top-down and the same for the bottom-up
0:15:39and for
0:15:40all of the
0:15:41pairs of clusters we
0:15:44compute the inter-cluster distance between the histograms
0:15:48which is the Kullback-Leibler distance
0:15:52so
0:15:53and
0:15:54the result is the average of all
0:15:57of these distances for each of the systems
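As a rough illustration of this measurement, the sketch below averages a Kullback-Leibler based distance over all pairs of per-cluster phoneme histograms. The talk only names the Kullback-Leibler distance, so the symmetric form, the additive smoothing constant and the function names are assumptions.

```python
import math
from itertools import combinations


def normalise(hist, phonemes, eps=1e-6):
    """Turn phoneme counts into a smoothed probability distribution."""
    total = sum(hist.get(p, 0) for p in phonemes) + eps * len(phonemes)
    return [(hist.get(p, 0) + eps) / total for p in phonemes]


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for two aligned distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetrised divergence, assumed here so that the measure acts like a distance."""
    return kl(p, q) + kl(q, p)


def average_inter_cluster_distance(histograms):
    """Mean symmetric KL over all pairs of cluster phoneme histograms."""
    if len(histograms) < 2:
        raise ValueError("need at least two clusters")
    phonemes = sorted({ph for h in histograms for ph in h})
    dists = [normalise(h, phonemes) for h in histograms]
    pairs = list(combinations(dists, 2))
    return sum(symmetric_kl(p, q) for p, q in pairs) / len(pairs)


# Made-up phoneme counts for two pairs of clusters.
similar = [{"a": 5, "e": 5}, {"a": 6, "e": 4}]
skewed = [{"a": 9, "e": 1}, {"a": 1, "e": 9}]
print(average_inter_cluster_distance(similar))
print(average_inter_cluster_distance(skewed))
```

Clusters with similar phoneme distributions give a small average, while clusters that have specialised on different phonemes give a larger one.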
0:16:00so
0:16:02we can expect this average distance to be small
0:16:06when there is a similar
0:16:08distribution of the different phonemes in the different clusters
0:16:12which means that the system is well normalized
0:16:16and we can expect it to be high when there is a higher degree of convergence towards phonemes
0:16:21and so
0:16:22the distribution is not equal in the different clusters
0:16:27here are exposed the results first
0:16:30without the purification step
0:16:35the highest distance
0:16:36is for the bottom-up
0:16:38which shows really that with the top-down the clusters are better normalized
0:16:43applying now the purification
0:16:45we see that there is an improvement for both of the systems
0:16:48but the distance for bottom-up plus purification
0:16:52is higher than for the top-down
0:16:55plus purification
0:16:57which helps explain the behaviour of the purification
0:17:01with the bottom-up so to conclude
0:17:05we have seen in these slides that
0:17:07both approaches bottom-up and top-down
0:17:09give some comparable results but
0:17:12have different behaviours
0:17:15bottom-up is more discriminative
0:17:18but
0:17:20often
0:17:20ends up with some clusters which are less normalized against linguistic content
0:17:26whereas the top-down
0:17:28often ends up with some clusters which are better normalized but less
0:17:33speaker discriminative
0:17:36so
0:17:37i think
0:17:39one of the conclusions of this work is that there is a good thing to do in the
0:17:44combination of these two approaches
0:17:46so we recently published a bit on that but i think that there are a lot of other
0:17:51things to try
0:17:53and as future work
0:17:55we can expect maybe to design a specific purification process
0:17:58for the bottom-up
0:18:00taking into consideration this linguistic influence which is quite particular
0:18:05to this approach
0:18:11and that's it
0:18:13thanks
0:18:14okay
0:18:20any questions
0:18:25okay that and if i one a quick question
0:18:30i the can i think with
0:18:33and are going to take a hard thing and
0:18:37um
0:18:38can i oh oh oh
0:18:41for
0:18:42right
0:18:44as a we stick like to use
0:18:46what we have seen is that it is
0:18:47the core of these two approaches which matters and not the purification which was just a motivation
0:18:53which led to this work
0:18:55but the core of the bottom-up and the core of the top-down act differently
0:19:00with this linguistic influence which is in this case the phoneme content
0:19:04of the speech
0:19:07so uh
0:19:11but it
0:19:11but to a question you
0:19:14i and
0:19:15you
0:19:17i think i think