0:00:13 | come come from us |
---|
0:00:14 | and today I would like to present our work entitled linguistic influences on |
---|
0:00:20 | bottom-up and top-down clustering |
---|
0:00:23 | for speaker diarization |
---|
0:00:25 | so let me |
---|
0:00:26 | give a short overview of the work |
---|
0:00:29 | so first |
---|
0:00:31 | I will give a short introduction giving the motivation of this work |
---|
0:00:34 | and continue with the |
---|
0:00:36 | formulation of the problem |
---|
0:00:38 | to finally move on to comparing these two clustering systems |
---|
0:00:44 | and finally illustrate our ideas with |
---|
0:00:46 | some experimental work |
---|
0:00:53 | so |
---|
0:00:53 | we have seen during the last NIST RT'09 evaluation that there were basically two main approaches for |
---|
0:01:00 | speaker diarization: one is bottom-up, also called agglomerative hierarchical clustering, |
---|
0:01:06 | and the other is top-down, |
---|
0:01:08 | or divisive hierarchical clustering |
---|
0:01:12 | um |
---|
0:01:12 | we presented, actually last year, a paper |
---|
0:01:18 | giving a purification process |
---|
0:01:21 | for |
---|
0:01:22 | speaker diarization systems |
---|
0:01:24 | and we saw that it gave some consistent improvement for the top-down system |
---|
0:01:29 | but when trying to apply it to the bottom-up |
---|
0:01:32 | we saw that the results were totally inconsistent |
---|
0:01:36 | so that's what |
---|
0:01:38 | that's the motivation of this work: to know why it does not work, and it leads us to |
---|
0:01:43 | try to have a look at the influence of linguistic variation on bottom-up and |
---|
0:01:49 | top-down |
---|
0:01:52 | so let's start with the formulation of the problem |
---|
0:01:56 | so here you have an audio stream |
---|
0:01:59 | and we want to solve the problem of who spoke when, so we propose to call G the segmentation, |
---|
0:02:04 | so |
---|
0:02:06 | the group of boundaries of each speaker turn, |
---|
0:02:09 | and |
---|
0:02:11 | S, |
---|
0:02:12 | which is the |
---|
0:02:14 | speaker sequence, so the list of the successive speakers |
---|
0:02:17 | so "who" is S and "when" is G in this case |
---|
0:02:21 | and so we can |
---|
0:02:23 | summarise this |
---|
0:02:25 | setting by the following equation: finding the optimum S and the optimum G as the argument of the |
---|
0:02:31 | maximum |
---|
0:02:32 | of |
---|
0:02:33 | their probability given a set of observations, in this case O, the audio stream |
---|
0:02:40 | so just applying Bayes' rule to this |
---|
0:02:44 | equation |
---|
0:02:46 | we get the second line you see on the screen |
---|
0:02:49 | and the denominator can be dropped, as it does not depend on S or |
---|
0:02:54 | G, so giving equation number one there |
---|
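The formulation described in the rows above can be written out as follows. This is a sketch reconstructed from the spoken description, not a copy of the slide: O, S and G are the symbols used in the talk, while the tilde marking the optimum is an assumed notation.

```latex
\begin{align}
(\tilde{S}, \tilde{G})
  &= \operatorname*{arg\,max}_{S,G} \; P(S, G \mid O) \\
  &= \operatorname*{arg\,max}_{S,G} \; \frac{P(O \mid S, G)\, P(S, G)}{P(O)} \\
  &= \operatorname*{arg\,max}_{S,G} \; P(O \mid S, G)\, P(S, G)
\end{align}
```

The last step holds because P(O) does not depend on S or G, so it does not change where the maximum is reached.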
0:02:57 | so with this equation we can see that |
---|
0:03:00 | there are two models which are required to solve this optimization task |
---|
0:03:05 | the first one, to compute |
---|
0:03:07 | P of O given S and G, |
---|
0:03:09 | is given by the acoustic speaker models |
---|
0:03:11 | so in this case it's often a GMM in the main approaches we use |
---|
0:03:16 | currently in the |
---|
0:03:18 | state of the art |
---|
0:03:19 | and the second |
---|
0:03:20 | model, so P of |
---|
0:03:22 | S and G, |
---|
0:03:23 | which is often omitted, |
---|
0:03:25 | maybe except in the |
---|
0:03:27 | previous work that has just been presented now |
---|
0:03:30 | and |
---|
0:03:32 | so looking at this equation we see that we have two main difficulties |
---|
0:03:36 | first |
---|
0:03:37 | of course we don't know who the speakers |
---|
0:03:40 | are |
---|
0:03:41 | and secondly the acoustic model would depend, |
---|
0:03:45 | in a perfect world, |
---|
0:03:46 | only on the speaker, but it can depend as well |
---|
0:03:50 | on other nuisances, like the linguistic content |
---|
0:03:54 | so for the next part of this presentation we make the following assumption: |
---|
0:03:59 | that the major nuisance |
---|
0:04:01 | variation is due only |
---|
0:04:03 | to the linguistic content |
---|
0:04:05 | so that's what we are going to focus |
---|
0:04:08 | on, the different phonemes, |
---|
0:04:10 | and they are going to be written Q |
---|
0:04:14 | so |
---|
0:04:15 | considering this assumption, we can just reformulate our equation |
---|
0:04:21 | to take |
---|
0:04:22 | the speaker inventory delta out of all the possible speaker sequences |
---|
0:04:28 | so now we are looking for the optimum S and G plus the optimal speaker inventory delta |
---|
0:04:35 | so considering now the influence of the phonemes, we can move on to the second line on |
---|
0:04:41 | the screen |
---|
0:04:42 | which corresponds to marginalising the probability over all the different phonemes Q |
---|
0:04:48 | and the third line is just the same, expanded with Bayes' rule |
---|
0:04:54 | um |
---|
0:04:56 | and next |
---|
0:04:57 | we can make two assumptions: first, |
---|
0:05:00 | in speaker diarization, we make the assumption that all the speakers are equiprobable |
---|
0:05:05 | so the speaker model, P |
---|
0:05:09 | of S and G, can just disappear |
---|
0:05:11 | and the second assumption is |
---|
0:05:13 | that |
---|
0:05:14 | we can expect |
---|
0:05:15 | the phoneme to be independent of the speaker and independent of G as well |
---|
0:05:19 | so that's why we can just keep the prior of Q |
---|
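The reformulation with the phonemes Q can be sketched as below. This is a reconstruction under assumptions, not the slide itself: the notation Γ(Δ) for the set of speaker sequences that can be drawn from the inventory Δ is hypothetical, and the third line both applies Bayes' rule and drops P(O), which is constant with respect to the optimisation variables.

```latex
\begin{align}
(\tilde{S}, \tilde{G}, \tilde{\Delta})
  &= \operatorname*{arg\,max}_{\Delta,\; S \in \Gamma(\Delta),\; G} \; P(S, G \mid O) \\
  &= \operatorname*{arg\,max}_{\Delta,\; S \in \Gamma(\Delta),\; G} \; \sum_{Q} P(S, G, Q \mid O) \\
  &= \operatorname*{arg\,max}_{\Delta,\; S \in \Gamma(\Delta),\; G} \; \sum_{Q} P(O \mid S, G, Q)\, P(Q \mid S, G)\, P(S, G) \\
  &\approx \operatorname*{arg\,max}_{\Delta,\; S \in \Gamma(\Delta),\; G} \; \sum_{Q} P(O \mid S, G, Q)\, P(Q)
\end{align}
```

The final approximation uses the two assumptions stated in the talk: equiprobable speakers, so P(S, G) is constant, and phonemes independent of S and G, so P(Q | S, G) reduces to the prior P(Q).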
0:05:24 | so finally |
---|
0:05:26 | we get two equations: the first for a simple approach, |
---|
0:05:29 | the second line for a maybe more complete approach |
---|
0:05:32 | and in comparing both of them, which should lead to the same results in a perfect world, |
---|
0:05:37 | we see that |
---|
0:05:39 | the second equation is phoneme-normalised |
---|
0:05:42 | and |
---|
0:05:44 | that in the first one |
---|
0:05:46 | we should have a normalised model as well |
---|
0:05:48 | it means that P of O given S and G has to be trained |
---|
0:05:52 | while |
---|
0:05:53 | taking into account the different phonemes |
---|
0:05:57 | so |
---|
0:05:58 | to summarise, we |
---|
0:06:00 | see from this equation that |
---|
0:06:01 | the speaker inventory delta has to be optimised together with S and G |
---|
0:06:07 | and so there is no analytical solution for that |
---|
0:06:10 | so that's the reason why it is solved with iterative search approaches, |
---|
0:06:15 | um |
---|
0:06:17 | which are mainly the bottom-up and top-down approaches |
---|
0:06:22 | so if we move on now to comparing these two approaches |
---|
0:06:26 | we see |
---|
0:06:29 | top-down starts with |
---|
0:06:30 | just one cluster and divides it |
---|
0:06:33 | iteratively in order to get the optimum number of clusters, while bottom-up is the opposite scenario: we start |
---|
0:06:39 | with plenty of clusters and merge them iteratively |
---|
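The two search directions described above can be illustrated with a toy sketch. This is not the speakers' diarization system: it clusters plain 1-D numbers rather than GMM speaker models, and the function names, the centroid-distance merge rule and the split-at-the-mean rule are all assumptions chosen for brevity.

```python
from statistics import mean


def bottom_up(points, n_clusters):
    """Agglomerative toy: start with one cluster per point, merge the closest pair."""
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose centroids are closest.
        _, i, j = min(
            (abs(mean(clusters[a]) - mean(clusters[b])), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(clusters)


def top_down(points, n_clusters):
    """Divisive toy: start with one cluster, split the widest cluster at its mean."""
    clusters = [sorted(points)]
    while len(clusters) < n_clusters:
        # Each cluster is kept sorted, so its width is last minus first.
        widest = max(clusters, key=lambda c: c[-1] - c[0])
        if widest[-1] == widest[0]:
            break  # all remaining clusters hold identical values
        centre = mean(widest)
        clusters.remove(widest)
        clusters.append([p for p in widest if p <= centre])
        clusters.append([p for p in widest if p > centre])
    return sorted(clusters)


points = [0.0, 0.1, 5.0, 5.2, 9.9]
print(bottom_up(points, 3))
print(top_down(points, 3))
```

On this well-separated toy data both directions end with the same three groups, which mirrors the "perfect world" case discussed later in the talk; with real speech the two searches can settle in different local maxima.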
0:06:44 | um |
---|
0:06:45 | so bottom-up is by far the more popular approach |
---|
0:06:48 | it got the best results at the last |
---|
0:06:52 | RT'09 evaluation |
---|
0:06:53 | and top-down is |
---|
0:06:54 | maybe a bit less popular but achieves competitive results |
---|
0:06:58 | our work, for instance, |
---|
0:07:00 | shows that for single distant microphone it can lead to comparable results |
---|
0:07:05 | but the question is: okay, we start with an initialisation and we converge to some clusters |
---|
0:07:10 | and how can we be sure that each cluster corresponds to a speaker |
---|
0:07:14 | or to another acoustic nuisance, like the phoneme |
---|
0:07:19 | so |
---|
0:07:20 | we know that these approaches converge to a local maximum |
---|
0:07:25 | in a perfect world |
---|
0:07:29 | inter-speaker variation dominates over intra-speaker variation |
---|
0:07:33 | and if |
---|
0:07:34 | I mean |
---|
0:07:35 | that were the case, we could say, okay, bottom-up and top-down should lead to exactly |
---|
0:07:41 | the same |
---|
0:07:42 | results |
---|
0:07:44 | but |
---|
0:07:45 | of course, as nothing is perfect, |
---|
0:07:47 | there is as well the influence of linguistic content, which can be very significant |
---|
0:07:53 | and it means that when the speaker models are not well normalised |
---|
0:07:58 | uh |
---|
0:07:59 | the system can converge to a local maximum and we can |
---|
0:08:02 | get not |
---|
0:08:04 | speaker units but other acoustic units, like the phonemes Q |
---|
0:08:09 | so in the case of top-down |
---|
0:08:11 | the new speakers are split off from a |
---|
0:08:15 | normalised general model |
---|
0:08:17 | so this model is trained with all of the available speech, so we can expect this model |
---|
0:08:22 | to be well normalised |
---|
0:08:24 | and then the |
---|
0:08:26 | speakers are iteratively introduced with a large amount of data, so |
---|
0:08:30 | we can expect |
---|
0:08:33 | these new models to be quite well normalised as well |
---|
0:08:38 | so there is a huge risk as well |
---|
0:08:41 | that at the same time |
---|
0:08:44 | as the linguistic variation is normalised |
---|
0:08:48 | uh |
---|
0:08:49 | the speaker variation is normalised as well; that's of course what we don't want, we want to |
---|
0:08:53 | get a highly speaker-discriminative system |
---|
0:08:57 | by comparison, with bottom-up |
---|
0:08:59 | we start the system with some very small clusters |
---|
0:09:02 | which can |
---|
0:09:05 | converge, I mean, to a local maximum and are highly discriminative |
---|
0:09:09 | so from this point of view |
---|
0:09:11 | there is better |
---|
0:09:13 | discrimination with bottom-up |
---|
0:09:15 | but the problem is that, |
---|
0:09:16 | as the clusters are very small at the beginning, |
---|
0:09:19 | there is a huge risk that the system converges to some |
---|
0:09:24 | other acoustic units and is poorly |
---|
0:09:27 | normalised |
---|
0:09:29 | so finally, just to summarise, |
---|
0:09:31 | I think both of the systems have their own drawbacks and their own advantages |
---|
0:09:37 | with respect to speaker discrimination and normalisation against linguistic nuisances |
---|
0:09:44 | so let's illustrate these ideas now |
---|
0:09:47 | with some |
---|
0:09:49 | experimental work |
---|
0:09:50 | so |
---|
0:09:51 | here is our experimental setup, so |
---|
0:09:53 | we have a speech activity detector, which is the same for both of the systems |
---|
0:09:59 | um |
---|
0:10:01 | on the left, I think it's on the left for you as well, yeah, |
---|
0:10:04 | you have the bottom-up system, so it's a classical |
---|
0:10:09 | state-of-the-art system |
---|
0:10:10 | there is the following reference you can see; I am not going |
---|
0:10:13 | to spend too much time |
---|
0:10:16 | talking about this, and on the other side you have the top-down system |
---|
0:10:20 | so a typical top-down system as well |
---|
0:10:23 | so these are the two core |
---|
0:10:26 | systems |
---|
0:10:27 | and next we use a purification |
---|
0:10:30 | as in the following paper shown here |
---|
0:10:32 | so this is an optional step, we will see the difference later |
---|
0:10:36 | followed by a MAP-based resegmentation, a normalisation of |
---|
0:10:41 | the features and a final resegmentation |
---|
0:10:45 | for the datasets, |
---|
0:10:47 | we take the training set from conference meetings |
---|
0:10:50 | from the NIST RT'04, '05 and '06 evaluations |
---|
0:10:53 | and for the evaluation sets we propose to use the RT'07 and RT'09 |
---|
0:10:58 | uh |
---|
0:10:59 | datasets, plus |
---|
0:11:02 | a set of TV shows, Grand Échiquier, which is a French TV show corpus |
---|
0:11:07 | here are |
---|
0:11:09 | the diarization performances |
---|
0:11:10 | so the first column is the DER with |
---|
0:11:14 | and |
---|
0:11:14 | the second is the score without overlapped |
---|
0:11:16 | speech |
---|
0:11:18 | of course, as our system |
---|
0:11:20 | uh |
---|
0:11:21 | does not process the overlapped speech, we just focus on the second one |
---|
0:11:26 | and |
---|
0:11:27 | so first |
---|
0:11:28 | what we can see, |
---|
0:11:30 | just looking at the results without |
---|
0:11:34 | purification, |
---|
0:11:36 | is that, okay, first, both of the systems |
---|
0:11:40 | perform better on the RT datasets; for |
---|
0:11:44 | top-down as well there are |
---|
0:11:47 | results that are much worse |
---|
0:11:50 | for |
---|
0:11:51 | the TV shows |
---|
0:11:54 | secondly, |
---|
0:11:55 | it is not always the same system which provides the best results |
---|
0:12:00 | for all the datasets: we see for example that for RT'07 top-down gives better |
---|
0:12:04 | results, while for RT'09 that |
---|
0:12:07 | is the bottom-up |
---|
0:12:08 | and |
---|
0:12:09 | we can also consider |
---|
0:12:11 | the results with purification |
---|
0:12:14 | so there we can see that |
---|
0:12:16 | purification |
---|
0:12:18 | just |
---|
0:12:19 | brings a degradation in performance for the bottom-up |
---|
0:12:24 | while for the top-down |
---|
0:12:26 | it always improves the system |
---|
0:12:30 | so |
---|
0:12:31 | the question is, okay, maybe purification improves the discrimination between clusters |
---|
0:12:38 | for top-down, whereas |
---|
0:12:41 | with purification |
---|
0:12:43 | bottom-up |
---|
0:12:44 | is still less well normalised against phones, so for sure |
---|
0:12:49 | in this case the purification is useless |
---|
0:12:53 | so let's explain it a bit more clearly with the cluster purity |
---|
0:12:57 | so we propose to look at all the clusters output by one of the systems and compute the purity |
---|
0:13:03 | for all of these clusters |
---|
0:13:06 | so the purity is computed |
---|
0:13:08 | as follows: |
---|
0:13:10 | we take the dominant speaker time in seconds |
---|
0:13:13 | and we divide by the total number |
---|
0:13:16 | of seconds of the cluster |
---|
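The purity measure just described, dominant speaker time divided by total cluster time, can be sketched as below. The input layout (tuples of cluster label, reference speaker and duration in seconds) and the function name are assumptions for illustration, and durations are assumed to be positive.

```python
from collections import defaultdict


def cluster_purity(segments):
    """Return {cluster_id: purity} for (cluster_id, speaker, seconds) tuples."""
    # Accumulate, for each cluster, the speaking time of every reference speaker.
    per_cluster = defaultdict(lambda: defaultdict(float))
    for cluster_id, speaker, seconds in segments:
        per_cluster[cluster_id][speaker] += seconds

    purity = {}
    for cluster_id, speaker_time in per_cluster.items():
        total = sum(speaker_time.values())
        # Dominant speaker time divided by the total time of the cluster.
        purity[cluster_id] = max(speaker_time.values()) / total
    return purity


segments = [
    ("c1", "alice", 30.0),
    ("c1", "bob", 10.0),
    ("c2", "bob", 20.0),
]
print(cluster_purity(segments))
```

Here cluster `c1` holds 30 seconds of one speaker out of 40 seconds in total, so its purity is 0.75, while `c2` contains a single speaker and has purity 1.0.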
0:13:19 | so there are different situations |
---|
0:13:21 | if we have a high purity and a small number |
---|
0:13:24 | of |
---|
0:13:25 | clusters, |
---|
0:13:26 | I mean, when there is a high purity of |
---|
0:13:30 | clusters, |
---|
0:13:31 | we can expect the system |
---|
0:13:32 | to be likely to have converged to speaker units |
---|
0:13:38 | and with a very high number of clusters |
---|
0:13:41 | it is likely that the system converged to other acoustic units |
---|
0:13:45 | we as it to as their have been |
---|
0:13:50 | a |
---|
0:13:51 | we do not know what happened, different scenarios are possible, and the same for the last case |
---|
0:13:56 | so we are looking at the purity and the number of clusters |
---|
0:13:59 | in the table, so we see that when |
---|
0:14:02 | we don't use the purification process |
---|
0:14:07 | top-down has comparable purity |
---|
0:14:10 | more or less with |
---|
0:14:12 | bottom-up |
---|
0:14:13 | but top-down has fewer clusters |
---|
0:14:16 | and we see |
---|
0:14:18 | on the right the number of clusters |
---|
0:14:20 | we have a as uh clusters them the bottom-up up and them about cluster it's clusters none the idea and |
---|
0:14:25 | number of clusters for the ground truth |
---|
0:14:27 | so we can expect top-down to be in the first situation, so to have converged |
---|
0:14:32 | to speakers |
---|
0:14:34 | whereas bottom-up is probably the fourth case |
---|
0:14:37 | so |
---|
0:14:38 | well |
---|
0:14:39 | to see what happens |
---|
0:14:41 | with |
---|
0:14:43 | the purification |
---|
0:14:44 | we see that |
---|
0:14:45 | first, for the top-down, with the purification |
---|
0:14:48 | the purity is improved |
---|
0:14:51 | um |
---|
0:14:53 | i cluster |
---|
0:14:54 | so for sure it helps the system to converge to speakers, more than without purification |
---|
0:15:01 | uh |
---|
0:15:02 | there is a consistent in purity |
---|
0:15:05 | and a higher number of clusters than for the top-down, so |
---|
0:15:09 | it is not obvious to say in which situation we are |
---|
0:15:14 | uh |
---|
0:15:16 | a last note for this experimental part is |
---|
0:15:18 | looking at the phone normalisation |
---|
0:15:20 | for this case, so |
---|
0:15:22 | we take the different clusters |
---|
0:15:24 | we take all the clusters |
---|
0:15:27 | of a system and for each of these clusters |
---|
0:15:30 | we compute the |
---|
0:15:31 | histogram of the different phonemes |
---|
0:15:34 | we do this for all the clusters generated by the top-down and the same for the bottom-up |
---|
0:15:39 | and for |
---|
0:15:40 | all of |
---|
0:15:41 | them |
---|
0:15:44 | we compute the inter-cluster distance between the histograms |
---|
0:15:48 | which is the Kullback-Leibler distance |
---|
0:15:52 | so |
---|
0:15:53 | and |
---|
0:15:54 | this is the average of all |
---|
0:15:57 | of these distances for each of the systems |
---|
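The measure described above, an average Kullback-Leibler distance between the phoneme histograms of the clusters, could be computed along these lines. This is a sketch under assumptions: the talk does not say whether the distance is symmetrised or how empty bins are handled, so the symmetric form KL(p, q) + KL(q, p), the small smoothing constant `eps` and all function names are illustrative choices. At least two clusters are required.

```python
import math
from itertools import combinations


def normalise(counts, phones, eps=1e-6):
    """Turn a {phone: count} histogram into a smoothed probability list."""
    total = sum(counts.get(p, 0) + eps for p in phones)
    return [(counts.get(p, 0) + eps) / total for p in phones]


def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def average_inter_cluster_distance(histograms):
    """Average symmetrised KL distance over all pairs of cluster histograms."""
    phones = sorted({p for h in histograms for p in h})
    dists = [normalise(h, phones) for h in histograms]
    pairs = list(combinations(dists, 2))
    return sum(kl(p, q) + kl(q, p) for p, q in pairs) / len(pairs)


# Two clusters with the same phoneme mix, then two with disjoint phonemes.
similar = [{"a": 10, "e": 10}, {"a": 5, "e": 5}]
different = [{"a": 10}, {"e": 10}]
print(average_inter_cluster_distance(similar))
print(average_inter_cluster_distance(different))
```

A value near zero means the clusters share the same phoneme distribution; a large value means the clusters differ in phonetic content, which is the situation the talk associates with convergence towards phonemes rather than speakers.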
0:16:00 | so |
---|
0:16:02 | we can expect this average distance to be small |
---|
0:16:06 | when there is the same |
---|
0:16:08 | distribution of the different phonemes in the different clusters |
---|
0:16:12 | which means that the system is well normalised |
---|
0:16:16 | and we can expect it to be high when there is a higher degree of convergence towards phonemes |
---|
0:16:21 | and so |
---|
0:16:22 | the distribution is not equal in the different clusters |
---|
0:16:27 | and here are the results; we see first, |
---|
0:16:30 | without the purification step, |
---|
0:16:33 | the |
---|
0:16:35 | highest |
---|
0:16:36 | is the bottom-up |
---|
0:16:38 | which really shows that in this case the top-down clusters are better normalised |
---|
0:16:43 | applying now the purification |
---|
0:16:45 | we see that there is an improvement for both of the systems |
---|
0:16:48 | but bottom-up plus purification |
---|
0:16:52 | is always higher than the top-down |
---|
0:16:55 | plus purification |
---|
0:16:57 | which just explains why the purification does not improve the results |
---|
0:17:01 | of the bottom-up; so to conclude |
---|
0:17:03 | um |
---|
0:17:05 | we have seen in these slides that |
---|
0:17:07 | both approaches, bottom-up and top-down, |
---|
0:17:09 | give some comparable results but |
---|
0:17:12 | they |
---|
0:17:12 | have different behaviours |
---|
0:17:15 | bottom-up is more discriminative |
---|
0:17:18 | but |
---|
0:17:20 | often |
---|
0:17:20 | suffers from some clusters which are less normalised against linguistic content |
---|
0:17:26 | whereas top-down |
---|
0:17:28 | suffers from some clusters which are better normalised but less |
---|
0:17:33 | speaker discriminative |
---|
0:17:36 | so |
---|
0:17:37 | I think |
---|
0:17:39 | one of the conclusions of this work is that a good thing to do would be a combination |
---|
0:17:44 | of these two approaches |
---|
0:17:46 | so we recently published a paper, but I think that there are a lot of other |
---|
0:17:51 | things to try |
---|
0:17:53 | and as future work |
---|
0:17:55 | we can expect maybe to design a specific purification process |
---|
0:17:58 | for bottom-up |
---|
0:18:00 | taking into consideration this linguistic influence, which is quite particular |
---|
0:18:05 | to this approach |
---|
0:18:08 | he france's |
---|
0:18:11 | and that's it |
---|
0:18:13 | thanks |
---|
0:18:14 | okay |
---|
0:18:20 | any questions |
---|
0:18:25 | okay, then, if I can ask one quick question |
---|
0:18:30 | i the can i think with |
---|
0:18:33 | and are going to take a hard thing and |
---|
0:18:37 | um |
---|
0:18:38 | can i oh oh oh |
---|
0:18:41 | for |
---|
0:18:42 | right |
---|
0:18:44 | as a we stick like to use |
---|
0:18:46 | I have seen that it is |
---|
0:18:47 | the core of these two approaches, not the purification; the purification was just a motivation |
---|
0:18:53 | which led to this work |
---|
0:18:55 | but the core of the bottom-up and the core of the top-down act differently |
---|
0:19:00 | on the linguistic influences, in this case the phonetic content |
---|
0:19:04 | of the speech |
---|
0:19:07 | so uh |
---|
0:19:11 | but it |
---|
0:19:11 | but to a question you |
---|
0:19:14 | i and |
---|
0:19:15 | you |
---|
0:19:17 | i think i think |
---|