0:00:15 | Hi. First I want to thank you for staying here, because I thought that only my colleagues and the cameraman would |
---|
0:00:22 | be here. |
---|
0:00:26 | The work was |
---|
0:00:27 | done by |
---|
0:00:29 | Itay, and I liked it very much, |
---|
0:00:33 | and he should |
---|
0:00:36 | present this work, but |
---|
0:00:37 | unfortunately about ten days ago he got married, |
---|
0:00:46 | so |
---|
0:00:48 | he preferred to go to Costa Rica or somewhere else rather than to come here |
---|
0:00:54 | and present |
---|
0:00:55 | the work. So he left it with me, and I am stuck with this work, |
---|
0:01:01 | and |
---|
0:01:02 | so |
---|
0:01:03 | you will have to suffer me for about ten minutes. |
---|
0:01:09 | After that: |
---|
0:01:10 | most of this workshop was also on |
---|
0:01:13 | the DNNs, and especially |
---|
0:01:16 | after a very good morning session, I felt that I also want to say something |
---|
0:01:22 | about it, so |
---|
0:01:24 | I will briefly |
---|
0:01:25 | talk about it a little bit. Then |
---|
0:01:28 | I will give a motivation |
---|
0:01:30 | for the clustering problem, |
---|
0:01:33 | the basic mean shift algorithm and the modification we did, |
---|
0:01:37 | and then I will present the |
---|
0:01:40 | clustering system, some experiments, and a summary. |
---|
0:01:45 | So, this is about the contents. |
---|
0:01:49 | Okay, next. |
---|
0:01:57 | So, |
---|
0:01:59 | our problem is as follows. We have |
---|
0:02:01 | a taxi station. There are many |
---|
0:02:05 | taxi cars; each car |
---|
0:02:09 | has |
---|
0:02:10 | not just one driver; the drivers also |
---|
0:02:12 | change. |
---|
0:02:14 | And we have recordings for |
---|
0:02:16 | quite a while, |
---|
0:02:17 | say two days, three days. |
---|
0:02:20 | They speak push-to-talk, so we know exactly |
---|
0:02:25 | the start and the end |
---|
0:02:27 | of each segment, |
---|
0:02:29 | and we collect |
---|
0:02:32 | these recordings from the devices. |
---|
0:02:36 | And |
---|
0:02:38 | at the end of the day we want to take the segments and know which segments were |
---|
0:02:43 | said by |
---|
0:02:44 | one speaker and which by another speaker. |
---|
0:02:48 | Each time someone talks, |
---|
0:02:50 | we don't know who; |
---|
0:02:51 | it |
---|
0:02:53 | can be that one speaker talks now, |
---|
0:02:57 | and the next time he speaks is after two hours, three hours, |
---|
0:03:01 | maybe tomorrow, |
---|
0:03:02 | from a different car. |
---|
0:03:04 | So what we have |
---|
0:03:08 | is a bag of segments |
---|
0:03:11 | which are unlabeled, |
---|
0:03:13 | and we want to cluster them. |
---|
0:03:16 | Mostly these segments are very short, on average one and a half seconds, two seconds. |
---|
0:03:22 | So |
---|
0:03:24 | we want to |
---|
0:03:25 | cluster |
---|
0:03:28 | short segments, |
---|
0:03:29 | and usually the population is quite wide: thirty speakers, forty speakers, |
---|
0:03:35 | and so on. |
---|
0:03:37 | So this is our problem: |
---|
0:03:43 | given |
---|
0:03:44 | many short segments, we want |
---|
0:03:47 | to cluster them into homogeneous groups, which means that we want to |
---|
0:03:53 | have |
---|
0:03:54 | good |
---|
0:03:57 | cluster purity. It means that each cluster will be occupied mostly by one speaker |
---|
0:04:04 | only. |
---|
0:04:05 | But we also want to have |
---|
0:04:07 | speaker purity, |
---|
0:04:10 | so |
---|
0:04:11 | we don't want the same speaker to be spread between ten clusters. |
---|
0:04:19 | The basic mean shift algorithm |
---|
0:04:21 | is this: we have |
---|
0:04:23 | many vectors. We choose one vector, find |
---|
0:04:29 | its neighborhood, that is, the |
---|
0:04:33 | whole |
---|
0:04:34 | set of vectors close to this |
---|
0:04:37 | vector, |
---|
0:04:39 | and take all the |
---|
0:04:40 | vectors whose distance is below some threshold, |
---|
0:04:44 | and then shift |
---|
0:04:47 | the point to the weighted mean of these |
---|
0:04:52 | vectors. |
---|
0:04:53 | We take the mean as the reference point, and again |
---|
0:04:57 | look for the |
---|
0:05:00 | neighbors that are below the threshold, calculate the mean, and so on until we converge to some |
---|
0:05:06 | point. |
---|
0:05:07 | And this is what we do with |
---|
0:05:09 | each point, |
---|
0:05:11 | each vector. |
---|
0:05:13 | This algorithm has been discussed many times, |
---|
0:05:17 | so for more details please refer to the papers. |
---|
0:05:21 | And |
---|
0:05:23 | after we find the stable point of each vector, we |
---|
0:05:31 | group all the vectors which are close to each other |
---|
0:05:36 | according to some threshold, |
---|
0:05:38 | and the number of groups we have, this is the number of clusters, |
---|
0:05:44 | and the points which are in each group are |
---|
0:05:48 | one cluster. But we know that the Euclidean distance is not very good |
---|
0:05:52 | for the purpose of speaker clustering. |
---|
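To make the procedure concrete, here is a minimal Python sketch of this flat-kernel mean shift and the final grouping step. It is an illustration of the algorithm as described, not the speaker's implementation; the iteration limits and the simple single-pass grouping are placeholder choices.

```python
import numpy as np

def mean_shift_modes(X, threshold, n_iter=100, tol=1e-6):
    """Shift every vector to a stable point (mode) of its neighborhood."""
    modes = X.astype(float).copy()
    for i in range(len(modes)):
        m = modes[i].copy()
        for _ in range(n_iter):
            # neighbors: all vectors whose distance to m is below the threshold
            nbrs = X[np.linalg.norm(X - m, axis=1) < threshold]
            new_m = nbrs.mean(axis=0)      # flat kernel -> plain (unweighted) mean
            if np.linalg.norm(new_m - m) < tol:
                break                      # converged to a stable point
            m = new_m
        modes[i] = m
    return modes

def group_modes(modes, threshold):
    """Group stable points that are close to each other; each group is a cluster."""
    labels = np.full(len(modes), -1, dtype=int)
    k = 0
    for i in range(len(modes)):
        if labels[i] >= 0:
            continue
        close = np.linalg.norm(modes - modes[i], axis=1) < threshold
        labels[close & (labels < 0)] = k
        k += 1
    return labels   # number of groups = number of clusters
```

Usage would be `labels = group_modes(mean_shift_modes(X, t), t)` for vectors `X` and a chosen threshold `t`.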
0:05:57 | So, |
---|
0:06:00 | it was presented before with the cosine distance; now we use the PLDA scoring instead |
---|
0:06:05 | of cosine. |
---|
0:06:08 | So instead of looking for the |
---|
0:06:12 | closest vectors in the sense of the cosine distance, we look for |
---|
0:06:19 | the highest PLDA scores, |
---|
0:06:24 | and then |
---|
0:06:26 | calculate |
---|
0:06:28 | a new mean, |
---|
0:06:29 | where the function g is the weight, and the weight is basically |
---|
0:06:34 | the PLDA score. |
---|
0:06:37 | And the other difference we made: we do not use a threshold |
---|
0:06:44 | to look for |
---|
0:06:47 | close vectors |
---|
0:06:50 | below the threshold; instead we use k-nearest-neighbors. We set |
---|
0:06:55 | a k and take the k nearest |
---|
0:07:00 | vectors, the ones which have the highest |
---|
0:07:03 | PLDA score. |
---|
0:07:07 | So |
---|
0:07:08 | basically we follow |
---|
0:07:11 | these equations. |
---|
0:07:14 | And |
---|
0:07:15 | now, in this case, |
---|
0:07:17 | the threshold is not fixed |
---|
0:07:20 | like in the original algorithm, but it depends on the |
---|
0:07:26 | largest distance of the k-th vector. |
---|
0:07:31 | And we run the same |
---|
0:07:35 | mean shift algorithm: we calculate the mean according to these k |
---|
0:07:39 | nearest vectors, shift the mean, and again continue the process. |
---|
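A sketch of this modified iteration, assuming a `plda_score(m, X)` function that returns the PLDA score of `m` against every row of `X` (a stand-in; no scorer is defined here). Raw PLDA log-likelihood ratios can be negative, and the talk does not say how they are mapped to weights, so using them directly as the weights g(.) is an assumption of this sketch. Note that the effective search radius is the distance to the k-th highest-scoring vector, so the bandwidth adapts per point, as described.

```python
import numpy as np

def knn_plda_mean_shift(x, X, plda_score, k=15, n_iter=100, tol=1e-6):
    """Shift x to a mode, using the k highest-PLDA-scoring vectors as the
    neighborhood and their scores as the weights in the weighted mean."""
    m = x.astype(float).copy()
    for _ in range(n_iter):
        s = plda_score(m, X)          # score m against every vector
        top = np.argsort(s)[-k:]      # 'k nearest' = k highest PLDA scores
        w = s[top]                    # the weight is basically the PLDA score
        new_m = (w[:, None] * X[top]).sum(axis=0) / w.sum()
        if np.linalg.norm(new_m - m) < tol:
            break
        m = new_m
    return m
```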
0:07:48 | Here we have the |
---|
0:07:50 | i-vectors, but not exactly i-vectors, because, |
---|
0:07:54 | as I will explain, we |
---|
0:07:55 | do a small modification of them, |
---|
0:07:58 | and apply the mean shift algorithm according to the PLDA score, |
---|
0:08:03 | and we have the resulting clusters, as we mentioned. |
---|
0:08:06 | And |
---|
0:08:08 | we compared to the previous work; |
---|
0:08:12 | in the previous work |
---|
0:08:13 | the threshold was fixed, |
---|
0:08:15 | the cosine similarity was used as the distance, |
---|
0:08:19 | and random mean shift was used, meaning that we don't |
---|
0:08:23 | go across all the points but only randomly choose them. |
---|
0:08:32 | So before clustering we of course need to train a UBM |
---|
0:08:38 | and a total variability matrix. But before using the |
---|
0:08:42 | PLDA score, we found that it is better to |
---|
0:08:46 | do PCA on the data. |
---|
0:08:52 | After the PCA we reduce our i-vectors from a dimension of four hundred to |
---|
0:08:58 | two hundred fifty, |
---|
0:09:00 | and we tried to compare it to just an i-vector of size two hundred fifty; |
---|
0:09:04 | this worked better. |
---|
0:09:07 | We are not sure why, |
---|
0:09:09 | but the fact is it was better. And next we indeed do whitening |
---|
0:09:16 | and apply the PLDA score on these vectors. |
---|
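A rough sketch of this preprocessing chain (400-dimensional i-vectors, PCA to 250, whitening, then PLDA scoring). The random `ivectors` array is only a placeholder standing in for a real UBM / total-variability front end, which is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA

# placeholder for real 400-dimensional i-vectors
ivectors = np.random.randn(1000, 400)

# PCA: 400 -> 250 dimensions (found to work better than extracting
# 250-dimensional i-vectors directly)
pca = PCA(n_components=250)
reduced = pca.fit_transform(ivectors)

# whitening: rotate to the eigenbasis of the covariance, scale to unit variance
mu = reduced.mean(axis=0)
cov = np.cov(reduced - mu, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
whitened = (reduced - mu) @ vecs @ np.diag(vals ** -0.5)

# `whitened` vectors are then scored with a PLDA model trained on short segments
```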
0:09:26 | Okay. |
---|
0:09:27 | So this was explained before. |
---|
0:09:34 | The experimental setup was that |
---|
0:09:37 | we used NIST 2008 data, which we cut |
---|
0:09:41 | into short-duration segments, |
---|
0:09:45 | on average two and a half seconds, |
---|
0:09:49 | and |
---|
0:09:50 | we have an average number of five segments per |
---|
0:09:54 | speaker. |
---|
0:09:57 | And now, |
---|
0:09:59 | we |
---|
0:10:00 | calculate |
---|
0:10:02 | the results |
---|
0:10:03 | according to |
---|
0:10:05 | the average speaker purity, the average cluster purity, and the K parameter, |
---|
0:10:10 | and the other important parameter is how many clusters we have at the end |
---|
0:10:15 | compared to the true number of clusters. |
---|
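For reference, average cluster purity (ACP), average speaker purity (ASP), and the overall measure K are commonly defined from the contingency counts n_ij (segments of speaker j in cluster i) as ACP = (1/N) Σ_i Σ_j n_ij²/n_i., ASP = (1/N) Σ_j Σ_i n_ij²/n_.j, and K = √(ACP·ASP); the exact variant used in this work may differ slightly. A small sketch:

```python
import numpy as np

def purity_scores(cluster_ids, speaker_ids):
    """Average cluster purity (ACP), average speaker purity (ASP),
    and K = sqrt(ACP * ASP), from per-segment labels."""
    c = np.asarray(cluster_ids)
    s = np.asarray(speaker_ids)
    N = len(c)
    _, ci = np.unique(c, return_inverse=True)
    _, si = np.unique(s, return_inverse=True)
    n = np.zeros((ci.max() + 1, si.max() + 1))
    np.add.at(n, (ci, si), 1)   # n[i, j]: segments of speaker j in cluster i
    acp = ((n ** 2).sum(axis=1) / n.sum(axis=1)).sum() / N
    asp = ((n ** 2).sum(axis=0) / n.sum(axis=0)).sum() / N
    return acp, asp, np.sqrt(acp * asp)

# example: ACP, ASP, K = purity_scores([0, 0, 1, 1, 2], ['a', 'a', 'a', 'b', 'b'])
```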
0:10:22 | So, starting from the beginning: |
---|
0:10:27 | we start with the |
---|
0:10:30 | previous system of the cosine distance with |
---|
0:10:35 | a fixed threshold; this is the red line. |
---|
0:10:39 | And we see that |
---|
0:10:41 | the performance varies a lot; we |
---|
0:10:44 | have to know exactly |
---|
0:10:46 | what is the |
---|
0:10:47 | best threshold |
---|
0:10:49 | to make the clustering. When we use k-nearest-neighbors instead |
---|
0:10:54 | of the threshold, we see that we |
---|
0:10:58 | have a plateau, so it |
---|
0:11:01 | doesn't |
---|
0:11:03 | make a difference if we choose fourteen or seventeen; it's much more robust |
---|
0:11:10 | when we use the k-nearest-neighborhood. |
---|
0:11:14 | All the |
---|
0:11:16 | results are for |
---|
0:11:19 | thirty speakers. |
---|
0:11:22 | Next, instead of using |
---|
0:11:26 | random mean shift, we |
---|
0:11:28 | use full mean shift, which is much more expensive computationally, |
---|
0:11:34 | but we see that we have some gain, |
---|
0:11:36 | still with the cosine distance. |
---|
0:11:40 | And then we switch from the cosine distance to the PLDA score, |
---|
0:11:45 | and we have |
---|
0:11:47 | a little bit |
---|
0:11:48 | more gain. |
---|
0:11:51 | I have to say that |
---|
0:11:53 | both for the PLDA training |
---|
0:11:56 | and for |
---|
0:11:58 | the WCCN training in the cosine system, |
---|
0:12:01 | we trained them on short segments, not on long segments. |
---|
0:12:08 | Shortly we will see why we did it on short segments: |
---|
0:12:13 | this one. |
---|
0:12:15 | When we train |
---|
0:12:18 | the PLDA on long segments we have very bad results, |
---|
0:12:23 | but on short segments |
---|
0:12:26 | we improve the results dramatically. |
---|
0:12:30 | The total variability matrix was trained on long segments only; |
---|
0:12:35 | we didn't use short segments, |
---|
0:12:37 | because it was very bad. |
---|
0:12:41 | These results all |
---|
0:12:43 | deal only with thirty speakers. |
---|
0:12:47 | And this is a summary of all the results. We see it's better to move |
---|
0:12:51 | from |
---|
0:12:52 | a fixed threshold to the k-nearest-neighborhood, to go from random mean |
---|
0:12:58 | shift to full mean shift, and to move to PLDA, |
---|
0:13:04 | and these are the overall results. |
---|
0:13:11 | A totally |
---|
0:13:13 | different problem is how many clusters |
---|
0:13:17 | we have after the clustering process |
---|
0:13:22 | when compared to the actual number of clusters. |
---|
0:13:26 | The red line of the drawing is the one |
---|
0:13:31 | of the |
---|
0:13:32 | fixed threshold, |
---|
0:13:36 | and |
---|
0:13:37 | if you are looking |
---|
0:13:39 | at the result, |
---|
0:13:41 | it's not so nice: we have |
---|
0:13:46 | in truth thirty speakers, but it was estimated |
---|
0:13:51 | as |
---|
0:13:52 | about one hundred eighty clusters, which means that we have many small clusters; they are very |
---|
0:13:58 | pure, but |
---|
0:14:01 | small, and too many. |
---|
0:14:05 | But when we use the k-nearest-neighborhood, |
---|
0:14:08 | we can see that we are off by about a factor of two, about sixty clusters. |
---|
0:14:13 | So we have a better K, better clustering performance, with many fewer |
---|
0:14:19 | clusters. |
---|
0:14:23 | These are the results |
---|
0:14:27 | when we use the cosine distance with a fixed threshold |
---|
0:14:31 | on different numbers of |
---|
0:14:34 | speakers, from three to one hundred eighty-eight. |
---|
0:14:39 | When |
---|
0:14:42 | we compare with the |
---|
0:14:44 | proposed algorithm, we will see that |
---|
0:14:47 | in this case the |
---|
0:14:50 | cluster purity is better; |
---|
0:14:52 | it's understandable: many small clusters, and they are all pure, of one segment, two segments. |
---|
0:14:57 | But the K, the overall |
---|
0:15:00 | result, |
---|
0:15:03 | is better in the case of our algorithm. |
---|
0:15:06 | And the average number of clusters: you can see that for three speakers it's okay, |
---|
0:15:11 | but |
---|
0:15:12 | when we get to one hundred eighty-eight speakers, |
---|
0:15:16 | by a factor of ten almost, we have many more |
---|
0:15:21 | clusters than the true number of speakers. |
---|
0:15:25 | When we go to |
---|
0:15:27 | the PLDA with the k-nearest-neighborhood, |
---|
0:15:32 | we have |
---|
0:15:33 | better speaker purity |
---|
0:15:35 | and |
---|
0:15:37 | many fewer clusters; you can see that |
---|
0:15:41 | it's by a factor of one and a half to two. |
---|
0:15:46 | And this summarizes |
---|
0:15:48 | the results. |
---|
0:15:50 | For three and seven |
---|
0:15:53 | speakers we have |
---|
0:15:54 | a little bit better results with the cosine with a fixed threshold, but when |
---|
0:16:01 | we go to |
---|
0:16:03 | fifteen speakers and more, |
---|
0:16:06 | we prefer the PLDA score with the k-nearest-neighbor. |
---|
0:16:13 | We see it both in the results of K and in the number of clusters. |
---|
0:16:22 | And |
---|
0:16:24 | okay, we proposed a new system which gives |
---|
0:16:28 | better clustering performance and a |
---|
0:16:33 | much smaller number of clusters. We pay for this a little bit |
---|
0:16:39 | computationally, because we moved from a random mean shift to a full mean |
---|
0:16:45 | shift. |
---|
0:16:49 | And that's all I have to say. Thank you. |
---|
0:16:58 | We have a question. |
---|
0:17:11 | Thank you. So you made the remark that, for short utterance clustering, |
---|
0:17:18 | the results with PLDA training on the longer utterances were, well, disappointing. |
---|
0:17:26 | [largely inaudible] Can you explain |
---|
0:17:33 | why training on matched short segments improves the resulting clustering? Is it |
---|
0:17:40 | possible to say? |
---|
0:17:43 | The thing is, if you train it on the long segments, there would be a |
---|
0:17:49 | big mismatch between the training condition and the testing: if we train on long |
---|
0:17:56 | segments and calculate the scores |
---|
0:18:00 | on i-vectors from short segments, it would be somehow |
---|
0:18:06 | not appropriate. |
---|
0:18:07 | But maybe, with respect to the speaker subspace, it is supposed maybe |
---|
0:18:12 | to be estimated more correctly using longer segments? |
---|
0:18:17 | Basically it would be much more accurate, but not for our problem. So, yes, okay. |
---|
0:18:24 | The most important thing is the answer, so, yes and no: I think that there should |
---|
0:18:27 | be some trade-off between the accuracy of the PLDA score, the training score, and |
---|
0:18:35 | fitting the true problem. |
---|
0:18:37 | Yes. |
---|
0:18:38 | [inaudible] |
---|
0:18:42 | ... extension of the proposed ... |
---|
0:18:46 | Right. |
---|
0:18:52 | [inaudible] |
---|
0:18:54 | Thank you for your presentation. Can you please go back to the |
---|
0:18:59 | results section where you showed the values of k and the number of speakers? |
---|
0:19:04 | Stop here. |
---|
0:19:07 | Maybe it's okay, let us know. |
---|
0:19:11 | Go, for here: you are increasing the number of speakers and the value |
---|
0:19:17 | of k is fixed, |
---|
0:19:20 | and the results are going down. I mean, did you |
---|
0:19:24 | try with different values of k |
---|
0:19:27 | for different numbers of speakers, at least? |
---|
0:19:34 | K is the square root of the multiplication of ASP and ACP. |
---|
0:19:38 | The k of the k-nearest-neighbor, I mean, that one. |
---|
0:19:43 | Ah, this k. Okay: it is the best one, the results are with the best |
---|
0:19:48 | k. |
---|
0:19:50 | But as you see, |
---|
0:19:52 | as shown before, |
---|
0:19:53 | there is |
---|
0:19:55 | no big difference |
---|
0:19:57 | if you use fourteen or fifteen or seventeen. For each number of speakers, |
---|
0:20:04 | when the number of speakers is fixed, we can use |
---|
0:20:08 | almost the same value for |
---|
0:20:12 | any, |
---|
0:20:14 | any number of speakers that we tested. |
---|
0:20:17 | It reaches a plateau and stays there. |
---|
0:20:20 | I assume that if we increase k to |
---|
0:20:24 | fifty or seventy, the results would decrease, would begin to go down at |
---|
0:20:29 | some point, |
---|
0:20:30 | but for a reasonable |
---|
0:20:32 | k size |
---|
0:20:34 | you get almost the same results. |
---|
0:20:45 | What data did you use to train your PLDA when you used the short |
---|
0:20:50 | segments? |
---|
0:20:54 | The same data that we used for the UBM; I don't remember exactly, you should go |
---|
0:20:58 | to Costa Rica to ask Itay about this. |
---|
0:21:03 | I |
---|
0:21:04 | see. Anyway, it's not from the test set; let's say it's really |
---|
0:21:10 | the same development set used for training the UBM, and you |
---|
0:21:15 | took part of it, just cut it into short segments, and trained the PLDA, right? |
---|
0:21:21 | But when you made the short segments, you're taking multiple short segments per telephone |
---|
0:21:26 | call, right? |
---|
0:21:28 | You take a phone call and make multiple segments out of it? |
---|
0:21:32 | Yes, but we chose randomly, so it's from different sessions for the same speaker. Okay, so |
---|
0:21:37 | it could be that you used the same |
---|
0:21:40 | short segments from the same phone call? |
---|
0:21:42 | It could be that several of them were, but |
---|
0:21:48 | we just randomly chose. I just asked about it because, |
---|
0:21:53 | and this is jumping back a question, |
---|
0:21:57 | what we've seen, with things not for clustering, so maybe it's a different thing, but in |
---|
0:22:02 | terms of the PLDA parameters, is that |
---|
0:22:05 | you do better training those with the longer ones, even when it's doing |
---|
0:22:08 | short-duration tests. This was for speaker recognition, so it may not carry over; |
---|
0:22:13 | that's why I was asking about the data. You did a random selection, so yes, it's very |
---|
0:22:17 | unlikely that it was concentrated from the same call. |
---|
0:22:20 | So that was my main concern. |
---|
0:22:25 | And the results: also, all the segments for the clustering |
---|
0:22:32 | on the test set were chosen randomly, and we ran the experiment ten times, |
---|
0:22:39 | except for the |
---|
0:22:41 | last one of one hundred eighty-eight speakers, because there are only |
---|
0:22:47 | one hundred eighty-eight speakers in the dataset, so we couldn't choose randomly. |
---|
0:22:55 | [inaudible] |
---|
0:23:02 | Yes, hi. First of all, one of the things that I liked in the original |
---|
0:23:07 | mean shift algorithm |
---|
0:23:09 | was its probabilistic interpretation, in the fact that the analysis starts with a non-parametric |
---|
0:23:19 | density estimation, meaning that at each point you create a small, either Gaussian or, |
---|
0:23:24 | say, triangular, pdf. |
---|
0:23:28 | With a triangular one you end up with the kind of threshold which is the uniform window; with |
---|
0:23:32 | a Gaussian you end up again with a Gaussian, because that's what the differentiation gives. |
---|
0:23:38 | And therefore the update |
---|
0:23:41 | rule is derived |
---|
0:23:44 | by simple differentiation, in order to find the mode, the point where it converges, where there is |
---|
0:23:52 | convergence. |
---|
0:23:53 | I'm wondering, if you |
---|
0:23:55 | choose a PLDA likelihood, let's say, so you don't |
---|
0:23:59 | put either the cosine distance or a standard squared Euclidean distance, which was the initial one, |
---|
0:24:07 | can you tell us, because one question is now |
---|
0:24:12 | whether this update rule |
---|
0:24:14 | comes naturally |
---|
0:24:16 | from the same mechanism, that is, as I explained, from a non-parametric |
---|
0:24:23 | kernel density estimation. |
---|
0:24:28 | As you estimated, the answer is no. |
---|
0:24:31 | We also didn't derive |
---|
0:24:32 | one, so it's more heuristic: what works is useful. |
---|
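For context, the classical kernel-density derivation the questioner refers to (the Fukunaga and Hostetler argument) is, in outline, as follows; this is a sketch of the standard result, not part of the talk. With a kernel profile $k$ and its shadow $g(u) \equiv -k'(u)$, the density estimate is

$$
\hat f(x)=\frac{c}{n h^{d}}\sum_{i=1}^{n} k\!\left(\left\|\frac{x-x_i}{h}\right\|^{2}\right),
$$

and its gradient factors as

$$
\nabla \hat f(x)\;\propto\;\left[\sum_{i} g\!\left(\left\|\tfrac{x-x_i}{h}\right\|^{2}\right)\right]
\left[\frac{\sum_{i} x_i\, g\!\left(\left\|\tfrac{x-x_i}{h}\right\|^{2}\right)}{\sum_{i} g\!\left(\left\|\tfrac{x-x_i}{h}\right\|^{2}\right)} - x\right].
$$

Setting the gradient to zero gives the fixed-point update (x moves to the weighted mean), which is exactly the mean shift rule; an Epanechnikov profile yields the flat thresholded window for g, and a Gaussian profile yields a Gaussian again. Replacing g with a PLDA score has no such generative density behind it, which is what the answer concedes.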
0:24:41 | Any more questions? |
---|
0:24:48 | So, the next speaker. |
---|