0:00:15 | Hello, my name is Anna Silnova, and I am going to present our work on probabilistic embeddings applied to the task of speaker diarization. |
---|
0:00:25 | This work is the result of a collaboration between me, Johan Rohdin, Niko Brümmer and Lukáš Burget, and Themos Stafylakis from Omilia. |
---|
0:00:38 | I want to note that even though the task we are doing here is diarization, the model I am going to present does not necessarily have to be used for it; it can also be applied, for example, to speaker verification. But in my presentation I am considering only diarization. |
---|
0:01:00 | First, I want to start with a short motivational slide explaining why we started working on the proposed model. |
---|
0:01:12 | We are interested in doing speaker diarization by first splitting the utterance into short overlapping segments; in our case these are 1.5 seconds long, or shorter, and the overlap is 0.75 seconds. |
---|
0:01:31 | Then we extract an embedding, for example an x-vector, for each segment and cluster the embeddings, and consequently the segments, to perform the diarization. |
---|
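To make the segmentation step concrete, here is a minimal Python sketch; the 1.5 s window and 0.75 s shift are the values mentioned above, while the frame rate and the function name are just illustrative placeholders, not taken from the talk.

```python
def split_into_segments(num_frames, frames_per_second=100,
                        window_s=1.5, shift_s=0.75):
    """Return (start, end) frame indices of short overlapping segments."""
    window = int(window_s * frames_per_second)
    shift = int(shift_s * frames_per_second)
    segments = []
    start = 0
    while start < num_frames:
        # The last segments are clipped to the utterance end, so they may be shorter.
        segments.append((start, min(start + window, num_frames)))
        start += shift
    return segments
```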
0:01:46 | Note that there is a problem with this approach. The drawback is that all segments are considered equal; however, they are really not: their quality might differ, and we would like to utilize information about how trustworthy each segment is. |
---|
0:02:09 | Our assumption here is that the quality of a segment affects our ability to extract the embedding. We believe that if a segment is short and noisy, we should not be very sure about the embedding we extracted from it; however, if a segment is long and clean, then the embedding can be trusted more. |
---|
0:02:35 | So, in our model we propose to treat embeddings as hidden variables rather than as observed ones, as is usually done. In this case we have to modify the embedding extractor so that it outputs not a point estimate of the embedding but rather the parameters of an embedding distribution, and we also have to have a model that can digest such embedding distributions. |
---|
0:03:06 | Now, let us start with the model. Here we see a graphical model for a single utterance of N speech segments. The segments themselves are observed, and each segment has an assigned speaker label; the labels are observed in the training data. |
---|
0:03:31 | Then we have two sets of hidden variables: x are the hidden embeddings and y are the hidden speaker variables. Note that only one speaker variable is connected to each embedding, and consequently to each segment, at a time, and the speaker label defines which one it is. |
---|
0:04:00 | We are interested in clustering the segments into speaker clusters, and to be able to do so we have to know how to compute the clustering posterior P(L|R), where L denotes the set of all speaker labels and R is the set of all speech segments. |
---|
0:04:23 | Let us look closer at how this posterior looks. It can be expressed as this ratio, where in the numerator we have a product of two terms: the prior of the given clustering and the likelihood of that clustering. In the denominator we have a sum of terms of the same form, and the sum here is over all possible partitions of the segments into clusters. |
---|
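In symbols, the posterior described here can be sketched as follows (the notation is mine, not necessarily the one used on the slides):

$$
P(\mathcal{L}\mid\mathcal{R}) \;=\; \frac{P(\mathcal{L})\,P(\mathcal{R}\mid\mathcal{L})}{\sum_{\mathcal{L}'} P(\mathcal{L}')\,P(\mathcal{R}\mid\mathcal{L}')},
$$

where the sum in the denominator runs over all possible partitions $\mathcal{L}'$ of the segments into clusters.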
0:04:55 | Regarding the prior: in our experiments we are using a Chinese restaurant process prior; however, it is definitely not the only option, and probably not the optimal one either, it was just convenient for us, so we stick to it. I am not going to discuss the prior in this presentation any further. |
---|
0:05:24 | We are going to concentrate on the second term, the likelihood. Looking closer at it, we see that it can be represented as a product over the individual likelihoods of the sets of speech segments assigned to individual speakers. |
---|
0:05:46 | If there are no segments assigned to some specific speaker, then the corresponding term is just one. Another thing to note is that within each term all the segments are assumed to belong to the same speaker, i.e., they share the same speaker variable. |
---|
0:06:06 | So we can represent each of them as the following integral. Here the integration is over the speaker variable, and under the integral we have the product of a prior over the speaker variable and the product of likelihood terms of the individual speech segments given the speaker variable. |
---|
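A sketch of this per-speaker term, again in my own notation:

$$
P(\mathcal{R}_s) \;=\; \int p(y)\,\prod_{r \in \mathcal{R}_s} p(r \mid y)\,\mathrm{d}y,
$$

where $\mathcal{R}_s$ is the set of segments assigned to speaker $s$ and $y$ is the hidden speaker variable.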
0:06:28 | Now we will discuss how to compute this integral and what assumptions and restrictions we had to make in our model to be able to compute it efficiently. |
---|
0:06:39 | As you can see in the graphical model, the speech segment and the speaker variable are not connected directly but through the hidden embedding, so we have to integrate it out to be able to compute this likelihood. |
---|
0:06:57 | And that is exactly what we do here. The integration is over the hidden embedding, and under the integral we have the product of two terms. The first one models the relation between the hidden embedding and the hidden speaker variable, and we propose to model it with a Gaussian PLDA model. The second term models the relation between the speech segment, or rather the features we extract from it, and the hidden embedding. |
---|
0:07:28 | If the first term is Gaussian and the second one is also Gaussian as a function of x, then the whole integral can be computed in closed form. |
---|
0:07:41 | So basically, the first assumption that we make in our model is that this is exactly the case: the likelihood of the speech given the hidden embedding can be represented as this product, which is a Gaussian distribution as a function of x, and the normalizer of the Gaussian is a non-negative function h that depends only on the speech and not on the embedding. |
---|
0:08:11 | Then, plugging this into the formula for the likelihood, we see that it can be expressed by this equation. Note that here the likelihood depends on the parameters of the PLDA, which are the loading matrix and the within-class precision W, and also on the parameters of the embedding distribution, which are x̂ and B, where x̂ is the mean of the embedding distribution and B is its precision matrix. |
---|
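Under these two Gaussian assumptions, the integral over the hidden embedding has the standard Gaussian-convolution form. A sketch, writing the PLDA loading matrix as $V$, the within-class precision as $W$, and the extractor output as mean $\hat{x}$ with precision $B$ (my notation, not necessarily the paper's):

$$
p(r \mid y) \;=\; \int h(r)\,\mathcal{N}\!\left(x \mid \hat{x},\, B^{-1}\right)\mathcal{N}\!\left(x \mid V y,\, W^{-1}\right)\mathrm{d}x
\;=\; h(r)\,\mathcal{N}\!\left(\hat{x} \mid V y,\, B^{-1} + W^{-1}\right).
$$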
0:08:44 | However, even though we now have a closed-form solution for this likelihood, it would be very impractical to use: we would have to perform a matrix inversion for each speech segment separately, and that would be just too expensive in a real application. |
---|
0:09:09 | So we propose to restrict our model to a two-covariance model instead of a general Gaussian PLDA. We do this because we know that the within- and across-class covariances of a PLDA can be simultaneously diagonalized, so if we assume the two-covariance model, we can set the loading matrix to identity and assume that the within-class covariance, and consequently the precision, is diagonal. |
---|
0:09:47 | Since we are free to choose the form of the embedding parameters as we like, we also restrict the embedding precision matrix to be diagonal. Then the whole likelihood expression greatly simplifies, as shown on this slide. |
---|
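With the loading matrix set to identity and both $W$ and $B_r$ diagonal, the Gaussian above factorizes over dimensions, so no matrix inversions are needed; a sketch per dimension $d$, with $w_d$ and $b_{r,d}$ the diagonal entries of $W$ and $B_r$ (again my own notation):

$$
\log \mathcal{N}\!\left(\hat{x}_{r} \mid y,\, B_r^{-1} + W^{-1}\right)
\;=\; \sum_{d} \log \mathcal{N}\!\left(\hat{x}_{r,d} \;\middle|\; y_d,\; \frac{1}{b_{r,d}} + \frac{1}{w_d}\right).
$$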
0:10:11 | Then, getting back to the beginning: we were interested in computing the clustering posterior, and for that we needed the likelihood of a set of speech segments belonging to the same speaker, given the partition. That was computed as this integral, and now we know the expression for the terms under the integral, which are Gaussian. So we have a product of Gaussians under the integral; also, the prior over the speaker variable is a standard normal distribution, as assumed by the PLDA model. Therefore we can compute the whole integral in closed form, and the result is given here on this slide. |
---|
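For completeness, a sketch of the scalar integral that appears for each dimension after the diagonal restriction, writing $p_r = \left(1/b_{r} + 1/w\right)^{-1}$ for the combined per-dimension precision of segment $r$; this is the standard product-of-Gaussians identity, not a formula copied from the slides:

$$
\int \mathcal{N}(y \mid 0,\, 1)\,\prod_{r} \mathcal{N}\!\left(\hat{x}_{r} \mid y,\; p_r^{-1}\right)\mathrm{d}y
\;=\; \frac{\prod_r \sqrt{p_r/2\pi}}{\sqrt{1+\sum_r p_r}}\,
\exp\!\left(-\frac{1}{2}\sum_r p_r \hat{x}_r^{2}
\;+\; \frac{\bigl(\sum_r p_r \hat{x}_r\bigr)^{2}}{2\bigl(1+\sum_r p_r\bigr)}\right).
$$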
0:10:59 | Please note that even though we can compute this likelihood, or log-likelihood, exactly only up to a normalizing constant, it does not really matter: in both our training and test recipes these constants are going to cancel, so we can just ignore them. |
---|
0:11:20 | So, in order to compute the clustering posteriors we need, first, the PLDA within-class precision matrix, and then two sets of vectors: the means of the embeddings and the vectors holding the diagonal precisions of the embeddings. We propose to model them by using a standard pretrained x-vector extractor, which is shown here on the scheme in grey. |
---|
0:11:59 | This is a standard x-vector extractor, which was trained as usual and was not modified in any way. Normally one would use the output of the first affine layer after the statistics pooling layer as the x-vector; here we just throw away the rest of the network, as we do not really need it, and instead add one linear layer whose output will be the mean of the embedding distribution. |
---|
0:12:29 | Also, we add a sub-network which is responsible for extracting the embedding precisions. This is a feed-forward network with two hidden layers, and its inputs are the output of the statistics pooling layer and the length of the segment in frames. Its output is the vector holding the diagonal of the embedding precision. Both the mean and the precision are then fed into the PLDA. |
---|
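A minimal PyTorch-style sketch of such an uncertainty sub-network as described above: a feed-forward net with two hidden layers that maps the statistics-pooling output plus the segment length to a positive diagonal precision. The class name, layer sizes and the use of softplus are my assumptions for illustration, not details taken from the talk.

```python
import torch
import torch.nn as nn

class PrecisionEstimator(nn.Module):
    """Hypothetical sketch of the precision (uncertainty) extractor."""

    def __init__(self, stats_dim=3000, hidden_dim=512, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(stats_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim),
        )

    def forward(self, pooled_stats, num_frames):
        # Append the (log) segment length so the net can discount short segments.
        x = torch.cat([pooled_stats, torch.log(num_frames.unsqueeze(-1))], dim=-1)
        # softplus keeps the diagonal precision strictly positive.
        return nn.functional.softplus(self.net(x))
```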
0:13:00 | All these yellow blocks can then be trained together, i.e., discriminatively. Note that if we just ignore this lower sub-network, then we are back to the standard x-vector Gaussian PLDA recipe: this linear transformation and the within-class precision together just define a PLDA model trained on the x-vectors extracted from the original extractor. |
---|
0:13:38 | Now, how do we train? We propose to use the multiclass cross-entropy criterion to train the model parameters. |
---|
0:13:49 | To this end, we reorganize the training set as a collection of supervised trials, each of which contains a set of eight speech segments and the corresponding speaker labels, which define the true clustering of these eight segments. We use just eight segments for a reason: for a higher number of them it would be just too computationally expensive to compute the posteriors. |
---|
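A sketch of this criterion over a training set of $T$ such trials, where $R_t$ are the eight segments of trial $t$ and $L_t$ is their true clustering (the posterior is the one defined earlier, normalized over all partitions of the eight segments; with eight segments there are only $B_8 = 4140$ possible partitions, which keeps the normalization tractable):

$$
\mathcal{C} \;=\; -\frac{1}{T}\sum_{t=1}^{T} \log P\!\left(L_t \mid R_t\right).
$$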
0:14:20 | Once we have trained the model with this criterion, we can use it for diarization. |
---|
0:14:27 | Next, let me describe our baseline approach and the one that we propose. As the baseline we use the Kaldi diarization recipe, which extracts an x-vector for each short segment and then preprocesses them; the processed x-vectors are fed into PLDA, which provides a matrix of pairwise similarity scores, and these scores are used as input to an agglomerative hierarchical clustering (AHC) algorithm. |
---|
0:15:03 | AHC is a greedy algorithm which starts by considering each segment as a separate speaker and then gradually merges clusters, two at a time. The baseline uses a version of this algorithm which, after each merge, computes the similarity scores of the new cluster against all the rest by simply averaging the scores of the individual parts of the cluster. The merging stops once there are no similarity scores higher than some preset threshold. |
---|
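A minimal Python sketch of this greedy average-linkage AHC, under the assumptions just described (pairwise similarity matrix as input, averaging of the original pairwise scores after each merge, stop below a preset threshold); the function name and details are illustrative, not the baseline's actual implementation.

```python
import numpy as np

def ahc_average_linkage(scores: np.ndarray, threshold: float):
    """Start with one cluster per segment, repeatedly merge the two most similar
    clusters, rescore the merged cluster against the rest by averaging the
    original pairwise scores, and stop when no similarity exceeds the threshold."""
    clusters = [[i] for i in range(len(scores))]
    sim = scores.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)
    while len(clusters) > 1:
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        if sim[i, j] < threshold:
            break
        if i > j:
            i, j = j, i
        clusters[i] += clusters[j]
        # Average-linkage update of the merged cluster against every other cluster.
        for k in range(len(clusters)):
            if k not in (i, j):
                sim[i, k] = sim[k, i] = np.mean(
                    [scores[a, b] for a in clusters[i] for b in clusters[k]])
        del clusters[j]
        sim = np.delete(np.delete(sim, j, axis=0), j, axis=1)
    return clusters
```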
0:15:48 | In our approach we use not only the x-vector but also the output of the statistics pooling layer and the number of frames in the segment. We then center and length-normalize the x-vectors and use the model to obtain the probabilistic embeddings, i.e., the means and precisions. Finally, the PLDA similarity scores are used by AHC. |
---|
0:16:16 | However, in our case, after each merge we compute the log-likelihood ratio scores exactly for the new cluster against all the rest. |
---|
0:16:27 | Now, the experimental setup. We used VoxCeleb 1 and 2 to train the x-vector extractor and the baseline PLDA. Then we used the AMI dataset to train the uncertainty extractor, which is the sub-network extracting the embedding precisions, and also to retrain the PLDA. Finally, we used the DIHARD 2019 development and evaluation sets to measure the diarization performance. |
---|
0:17:03 | And here are the results. First I have to note that the results in the table here are slightly different from those in the paper: after submitting the paper I managed to improve the baseline performance, so this table contains the updated results. |
---|
0:17:28 | For each model here we have two sets of results: one is where a zero threshold stops the agglomerative clustering, and the other one is where the threshold is tuned on the development set. |
---|
0:17:48 | As all these scores are log-likelihood ratios, the zero threshold should be the maximum-likelihood optimal threshold, provided the model produces correctly calibrated log-likelihood ratio scores. If that is not the case, then by tuning the threshold we can still lower the diarization error, which is definitely the case for all the systems tested here. |
---|
0:18:19 | First, if you look at the baseline system, there is quite a large gap between the optimal performance and the performance when using the zero threshold. |
---|
0:18:30 | However, if we just replace the baseline version of AHC with ours, where we compute the log-likelihood ratios exactly after each merge, then we see that the calibration issue becomes more pronounced: the results with the zero threshold degrade substantially, and even the optimal results get quite a bit worse than the baseline. |
---|
0:19:01 | Note that here we did not retrain anything, we just modified the clustering algorithm. If we train the same model, still without using the probabilistic embeddings, i.e., we just train it with the multiclass cross-entropy as discussed before, then this calibration issue is fixed to a large extent: the difference between the zero threshold and the tuned one is not as large any more, and we even managed to slightly improve over the zero-threshold baseline performance. |
---|
0:19:42 | Finally, if we add to this model the embedding precisions, so that we are using the uncertainty information, then we further improve the zero-threshold performance and also the optimal one. |
---|
0:19:58 | Still, only in the zero-threshold setting does this system give us the best performance; when the threshold is tuned on the development data we still cannot get better than the baseline performance, but in this case ours is very close to it, and the difference between the optimal performance and the zero-threshold performance is already not as large as it is for the other models. |
---|
0:20:33 | So, finally, the conclusions. We proposed a scheme to jointly train the PLDA and the embedding extractor with the multiclass cross-entropy, and this discriminative training helps to eliminate the calibration problem of our version of the baseline method. |
---|
0:20:58 | Then we added the uncertainty extractor to the training, and training it together with the PLDA further improves calibration. The main take-away message here would be that even though the model we propose does not necessarily give the best performance, it results in a better calibrated system, which is more robust. |
---|
0:21:26 | So, that was it from me. Thank you, and goodbye. |
---|