0:00:15 | Hi everyone, my name is [inaudible], and I'm working with Orange Labs and LIUM, |
---|
0:00:22 | in France, |
---|
0:00:24 | and I'm going to talk about the concept of self-trained speaker diarization. |
---|
0:00:31 | So the application we are working on is |
---|
0:00:35 | the task of cross-recording speaker diarization, applied to TV archives, French TV |
---|
0:00:41 | archives, |
---|
0:00:42 | and the goal is to index the speakers of collections of multiple recordings, |
---|
0:00:48 | in order, for example, to provide new means of dataset exploration by creating links |
---|
0:00:54 | between different episodes. |
---|
0:00:57 | So our system is based on a two-pass approach: we first |
---|
0:01:04 | process each recording separately, applying speaker segmentation and clustering, |
---|
0:01:10 | and then we perform cross-recording speaker linking and try to link all |
---|
0:01:17 | within-recording clusters |
---|
0:01:19 | across the whole collection. |
---|
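To make the two-pass structure concrete, here is a minimal sketch (the helper functions `diarize_recording`, `score_clusters`, and `link_clusters` are hypothetical placeholders, not the actual system):

```python
def cross_recording_diarization(recordings):
    # Pass 1: within-recording speaker segmentation and clustering.
    all_clusters = []
    for rec in recordings:
        # Returns speaker clusters that are local to this single recording.
        all_clusters.extend(diarize_recording(rec))

    # Pass 2: speaker linking, i.e. clustering the within-recording clusters
    # across the whole collection so that recurring speakers share one label.
    similarity = score_clusters(all_clusters)   # e.g. pairwise PLDA scores
    return link_clusters(all_clusters, similarity)
```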
0:01:22 | So the framework is based on the state-of-the-art speaker recognition |
---|
0:01:28 | framework: |
---|
0:01:30 | we are using i-vector / PLDA modeling, and for clustering we use |
---|
0:01:35 | hierarchical agglomerative clustering. |
---|
0:01:39 | We know that the goal of PLDA is to maximize the |
---|
0:01:44 | between-speaker variability while |
---|
0:01:46 | minimizing the within-speaker variability. |
---|
0:01:50 | So what we want to |
---|
0:01:53 | investigate in our paper is: can we use the target data as training material, |
---|
0:01:58 | and how well |
---|
0:02:01 | can we estimate the speaker variabilities? |
---|
0:02:07 | So first I'm going to present |
---|
0:02:11 | the diarization framework. Let's take an audio file |
---|
0:02:14 | from the target data. |
---|
0:02:17 | Our target data is unlabeled, so we just have audio files. |
---|
0:02:21 | First we extract some features; we are using MFCC features with delta and |
---|
0:02:27 | delta-delta coefficients. |
---|
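As an illustration of this front end, here is a minimal sketch using librosa with an assumed file name and sample rate (the actual system uses SIDEKIT):

```python
import librosa
import numpy as np

# Load one target episode (hypothetical file name and sample rate).
y, sr = librosa.load("episode.wav", sr=16000)

# 13 MFCCs plus first- and second-order derivatives, as described in the talk.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
d1 = librosa.feature.delta(mfcc)                     # delta
d2 = librosa.feature.delta(mfcc, order=2)            # delta-delta
features = np.vstack([mfcc, d1, d2])                 # shape (39, n_frames)
```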
0:02:29 | Then we perform a combination of speech activity detection and BIC clustering to extract |
---|
0:02:36 | speaker segments. |
---|
0:02:38 | On top of those segments we can extract i-vectors, using a pre-trained UBM and total variability |
---|
0:02:45 | matrix. |
---|
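For reference, this is the standard i-vector point estimate given a pre-trained UBM and total variability matrix $T$ (the usual textbook formulation, not a detail taken from the paper):

$$
\hat{w}(s) = \left( I + T^{\top} \Sigma^{-1} N(s)\, T \right)^{-1} T^{\top} \Sigma^{-1} \tilde{F}(s)
$$

where $N(s)$ and $\tilde{F}(s)$ are the zeroth-order and centered first-order Baum-Welch statistics of segment $s$ collected against the UBM, and $\Sigma$ is the UBM covariance.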
0:02:49 | Once we obtain the i-vectors, we are able to score all i-vectors against each other |
---|
0:02:55 | and compute a similarity score matrix, |
---|
0:02:59 | and for that we use the PLDA likelihood ratio; |
---|
0:03:03 | the PLDA parameters are estimated separately. |
---|
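The pairwise score is the PLDA log-likelihood ratio between the same-speaker and different-speaker hypotheses; in the common two-covariance formulation (a standard result, not notation taken from this paper) it reduces to:

$$
s(w_1, w_2) = w_1^{\top} Q\, w_1 + w_2^{\top} Q\, w_2 + 2\, w_1^{\top} P\, w_2 + \mathrm{const},
$$

$$
Q = \Sigma_{tot}^{-1} - \left( \Sigma_{tot} - \Sigma_{ac}\, \Sigma_{tot}^{-1}\, \Sigma_{ac} \right)^{-1}, \qquad
P = \Sigma_{tot}^{-1}\, \Sigma_{ac} \left( \Sigma_{tot} - \Sigma_{ac}\, \Sigma_{tot}^{-1}\, \Sigma_{ac} \right)^{-1},
$$

with $\Sigma_{tot} = \Sigma_{ac} + \Sigma_{wc}$, where $\Sigma_{ac}$ and $\Sigma_{wc}$ are the across-speaker and within-speaker covariances estimated during PLDA training.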
0:03:09 | Once we have our similarity matrix we can apply speaker clustering, |
---|
0:03:15 | and the result of the clustering is a set of speaker clusters. |
---|
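A minimal sketch of agglomerative clustering on a precomputed score matrix (scipy-based illustration with an assumed linkage and threshold convention; the talk uses the toolkit's own HAC implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_from_scores(S, stop_score):
    """S: (n, n) symmetric matrix of PLDA scores, higher = more similar.
    With complete linkage, every pair inside a cluster scores at least stop_score."""
    D = S.max() - S                         # convert similarities to distances
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="complete")
    return fcluster(Z, t=S.max() - stop_score, criterion="distance")
```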
0:03:21 | So we can repeat the process for each of the recordings. |
---|
0:03:27 | Once we've done that, we can compute |
---|
0:03:30 | a collection-wide similarity matrix and repeat the clustering process; this time I |
---|
0:03:36 | call it speaker linking, because the goal is to |
---|
0:03:40 | link the within-recording clusters across the whole collection, |
---|
0:03:45 | and after the linking part, |
---|
0:03:48 | we obtain the cross-recording diarization output. |
---|
0:03:54 | So the usual way of training the UBM and TV matrix and estimating the PLDA |
---|
0:04:00 | parameters is to use |
---|
0:04:03 | a train dataset, which is labeled by speaker, and the training procedure is |
---|
0:04:08 | pretty straightforward. |
---|
0:04:11 | The problem is that when we |
---|
0:04:14 | apply this technique we have some kind of mismatch between the target and train data: |
---|
0:04:20 | first, we don't have the same acoustic conditions, |
---|
0:04:25 | and second, we don't necessarily have the same speakers |
---|
0:04:29 | in the target and train data. So if |
---|
0:04:32 | we could use information about the target data, maybe we could have better |
---|
0:04:36 | results. |
---|
0:04:38 | So what we want to investigate is the concept of a self-trained diarization |
---|
0:04:43 | system, meaning we'd like to use only the target data itself to estimate the parameters, |
---|
0:04:51 | and then we are going to compare the results with a combination of target |
---|
0:04:57 | and train data. |
---|
0:05:00 | So |
---|
0:05:01 | the goal of self-trained diarization is to avoid the acoustic mismatch between the training |
---|
0:05:07 | and target data. |
---|
0:05:09 | So, |
---|
0:05:10 | what do we need to train an i-vector / PLDA system? To train the UBM and |
---|
0:05:15 | the TV matrix we only need clean speech segments; the training is then straightforward. |
---|
0:05:22 | As for the PLDA parameter estimation, we need several sessions per speaker, |
---|
0:05:27 | recorded in various acoustic conditions. So |
---|
0:05:29 | what we need to investigate is: do we have several speakers |
---|
0:05:34 | appearing in different episodes of the target data? |
---|
0:05:37 | And assuming we know how to effectively cluster the target data in terms of |
---|
0:05:41 | speakers, can we estimate PLDA parameters with those clusters? |
---|
0:05:48 | So let's have a look at the data. |
---|
0:05:51 | We have around two hundred hours of French broadcast news data, drawn from |
---|
0:05:56 | previous French evaluation campaigns, |
---|
0:05:59 | so it's a combination of TV and radio data. |
---|
0:06:04 | Out of these two hundred hours we selected two shows as target |
---|
0:06:08 | corpora: we selected an LCP show and BFM Story, |
---|
0:06:15 | and we took all the other available recordings to build what we call the |
---|
0:06:22 | train corpus. |
---|
0:06:24 | So if we take a look at the data, we see that we have |
---|
0:06:30 | more than forty episodes, |
---|
0:06:33 | more than forty episodes for each target show, and what we can note is |
---|
0:06:37 | the speech proportion of what I call the recurring speakers, which is above |
---|
0:06:43 | fifty percent for both |
---|
0:06:45 | corpora. |
---|
0:06:47 | A recurring speaker is a speaker who appears in more than one episode, |
---|
0:06:51 | as opposed to a one-time speaker who only appears in one episode. |
---|
0:06:56 | So, |
---|
0:06:58 | to answer the first question: |
---|
0:07:01 | yes, we have several speakers appearing in different episodes of the target data. |
---|
0:07:07 | So now |
---|
0:07:09 | we decided to |
---|
0:07:11 | train an oracle system, |
---|
0:07:13 | meaning we suppose we know how to |
---|
0:07:18 | cluster the target data. So |
---|
0:07:21 | we used the target data labels; in real life we do not |
---|
0:07:26 | have those labels, but for these |
---|
0:07:29 | experiments |
---|
0:07:30 | we decided to use them. |
---|
0:07:32 | So, |
---|
0:07:33 | to train the UBM and the TV matrix and estimate the PLDA |
---|
0:07:37 | parameters, we proceed the same way as |
---|
0:07:39 | with the train data: we just replace the labeled train data with the labeled |
---|
0:07:44 | target data. |
---|
0:07:46 | What we see here is that for the LCP show we |
---|
0:07:49 | are able to obtain a result; |
---|
0:07:52 | the results are presented in terms of diarization error rate, |
---|
0:07:58 | the cross-recording diarization error rate. |
---|
0:08:02 | So for the LCP show we get results, whereas for the B |
---|
0:08:07 | FM show we were not able to estimate the PLDA parameters, |
---|
0:08:10 | and we suppose we don't have enough data to do so; we're going to |
---|
0:08:14 | investigate that. |
---|
0:08:18 | If we compare with the baseline results, we see that if we use the information |
---|
0:08:23 | about speakers in the target data, we should be able to improve on |
---|
0:08:29 | the baseline system. |
---|
0:08:33 | So what we want |
---|
0:08:35 | to investigate next is |
---|
0:08:38 | the minimum |
---|
0:08:40 | amount of data we need to estimate PLDA parameters, because |
---|
0:08:43 | we saw that for the BFM show we were not able to train |
---|
0:08:46 | the PLDA while for the LCP show we were. So |
---|
0:08:51 | we just decided to find out the minimum number of episodes we could |
---|
0:08:57 | take from the LCP show to estimate suitable PLDA parameters. So |
---|
0:09:01 | the graph that you see here is the DER on |
---|
0:09:08 | the LCP show |
---|
0:09:10 | as a function of the number of episodes taken to estimate the |
---|
0:09:15 | PLDA parameters. |
---|
0:09:16 | The total number of episodes is forty-five, and we started the experiments with |
---|
0:09:21 | thirty episodes, because we saw that below that it did not work. |
---|
0:09:27 | So what's |
---|
0:09:29 | interesting to see is that we need around thirty-seven episodes to be able |
---|
0:09:33 | to improve on the baseline results, |
---|
0:09:37 | and when we have |
---|
0:09:40 | thirty-seven episodes we have forty recurring speakers. |
---|
0:09:44 | What's also interesting to see is that |
---|
0:09:47 | we have the same number of speakers here and here, |
---|
0:09:52 | over |
---|
0:09:53 | different numbers of episodes, but the resulting DER still |
---|
0:09:59 | improves. So |
---|
0:10:05 | we have the same speaker set; |
---|
0:10:08 | what's |
---|
0:10:10 | happening here is just that there is more and more data gathered for |
---|
0:10:14 | each speaker, |
---|
0:10:15 | and we need a minimum amount of data for each speaker. If we take |
---|
0:10:20 | a look at the average number of sessions per speaker, it's around seven |
---|
0:10:27 | when you have thirty-seven episodes. |
---|
0:10:31 | As for the BFM show, |
---|
0:10:34 | when we take all the episodes we only have thirty-five recurring speakers, |
---|
0:10:38 | appearing in five episodes on average, so it's far less than |
---|
0:10:43 | for the LCP corpus, and that's why we are not able to train |
---|
0:10:47 | the PLDA parameters. |
---|
0:10:50 | So now let's place ourselves in the real case: we are now not |
---|
0:10:56 | allowed to use the target data labels. |
---|
0:11:00 | So first, to train the UBM and TV matrix, what we need is |
---|
0:11:04 | clean speech signal, so we just decided to take the output of the speaker segmentation |
---|
0:11:10 | and compute the UBM and TV matrix. |
---|
0:11:14 | But we don't have any information about the speakers, so we are not able to |
---|
0:11:18 | estimate the PLDA parameters; |
---|
0:11:21 | we just replace the PLDA likelihood scoring with cosine-based scoring, |
---|
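A minimal numpy sketch of this cosine scoring (length-normalised i-vectors; not the toolkit's actual code):

```python
import numpy as np

def cosine_score_matrix(ivectors):
    """ivectors: (n, d) array, one i-vector per segment or cluster."""
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    # (n, n) matrix of cosine similarities, usable by the same clustering step.
    return X @ X.T
```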
0:11:28 | and then we have a working system. When we look at the results, they |
---|
0:11:33 | are below those obtained when using PLDA, |
---|
0:11:39 | which is not a surprise, we expected that. |
---|
0:11:43 | Now we obtain speaker clusters, so |
---|
0:11:47 | the idea is to use these speaker clusters and try to estimate the |
---|
0:11:53 | PLDA parameters with those clusters. |
---|
0:11:55 | When we do so, well, the training procedure doesn't succeed. |
---|
0:12:04 | Well, we saw in the oracle experiment that the amount of data was limited, and |
---|
0:12:11 | we also suspect that the purity of the clusters we use is too low to |
---|
0:12:16 | allow us to estimate the PLDA parameters. |
---|
0:12:21 | So, to summarize the self-training setup: |
---|
0:12:25 | for the UBM and TV training we selected segments produced by the speaker segmentation; we |
---|
0:12:31 | only keep the segments with a duration above ten seconds, |
---|
0:12:37 | and we also chose the BIC parameters so that the segments are considered pure, |
---|
0:12:43 | because to train the TV matrix we need clean segments, |
---|
0:12:47 | and we need only one speaker in each segment for training. |
---|
0:12:53 | As for the PLDA, we need several sessions |
---|
0:12:57 | per speaker from various episodes. So first we perform an i-vector clustering-based |
---|
0:13:03 | diarization and use the output speaker clusters to perform i-vector normalization |
---|
0:13:08 | and estimate the PLDA parameters; we just select |
---|
0:13:12 | the output speaker clusters with |
---|
0:13:16 | i-vectors coming from |
---|
0:13:18 | more than three episodes. |
---|
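Putting the two selection rules together, a small sketch (the `Segment` and `Cluster` structures are hypothetical, just carrying a duration and the episode of origin of each i-vector):

```python
def select_ubm_tv_segments(segments, min_duration=10.0):
    # UBM / TV training: keep only long segments, assumed clean and single-speaker.
    return [seg for seg in segments if seg.duration > min_duration]

def select_plda_clusters(clusters, min_episodes=3):
    # PLDA estimation: keep only speaker clusters whose i-vectors come from
    # more than three episodes, so each pseudo-speaker has several sessions.
    return [c for c in clusters
            if len({iv.episode for iv in c.ivectors}) > min_episodes]
```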
0:13:22 | So we saw that we are not able to train a |
---|
0:13:28 | sufficient system with only the target data, so we decided to add some train |
---|
0:13:34 | data into the mix. |
---|
0:13:36 | So it's the classic idea of domain adaptation. |
---|
0:13:41 | The main difference in this system compared with the baseline is that we |
---|
0:13:48 | replace the UBM and TV matrices: |
---|
0:13:51 | in this experiment the UBM and TV matrices are trained |
---|
0:13:55 | on the target data instead of the training data, and then we extract i-vectors |
---|
0:13:59 | from the training data and estimate the PLDA parameters on the training data. |
---|
0:14:05 | So, |
---|
0:14:06 | when replacing the UBM and TV matrix, we are able to improve by around one percent |
---|
0:14:12 | absolute |
---|
0:14:14 | in terms of DER. |
---|
0:14:18 | Now, |
---|
0:14:20 | why not try to apply the same process as we did with the self-training |
---|
0:14:24 | experiments, and take the speaker clusters to estimate new PLDA parameters? |
---|
0:14:30 | As before, the estimation of the PLDA parameters fails; I |
---|
0:14:37 | think we really don't have enough data to do so. |
---|
0:14:40 | And so we just decided to |
---|
0:14:43 | combine the use of training data and |
---|
0:14:47 | target data to update the PLDA parameters, the classic domain adaptation scenario, but |
---|
0:14:54 | we don't use any weighting parameter to balance the influence of train and target |
---|
0:15:00 | data; we just |
---|
0:15:01 | take the i-vectors from the training data and the i-vectors from the |
---|
0:15:07 | output speaker clusters, |
---|
0:15:08 | combine them, and |
---|
0:15:10 | train new PLDA parameters. |
---|
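A sketch of this pooled update (the `train_plda` call is a placeholder for the toolkit's PLDA estimator; the point is simply that train and target i-vectors are concatenated without any explicit weight):

```python
import numpy as np

def pooled_plda_training(train_ivec, train_labels, target_ivec, target_labels):
    # Stack train i-vectors and target-cluster i-vectors into one dataset.
    X = np.vstack([train_ivec, target_ivec])
    # Offset the target pseudo-speaker labels (integer arrays assumed) so they
    # never collide with the training speaker labels.
    y = np.concatenate([train_labels, target_labels + train_labels.max() + 1])
    return train_plda(X, y)   # placeholder for the actual PLDA estimator
```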
0:15:13 | So when we combine the data, we again improve on the baseline |
---|
0:15:18 | system, and gain around one percent |
---|
0:15:23 | in terms of DER overall. |
---|
0:15:28 | And |
---|
0:15:29 | well, now that we've done that, why not try to iterate? |
---|
0:15:35 | As long as we obtain speaker clusters, we can always use them and try |
---|
0:15:38 | to improve the estimation of the PLDA parameters. |
---|
0:15:43 | Well, it doesn't work: |
---|
0:15:46 | if you iterate, it doesn't improve the system. We tried two |
---|
0:15:51 | to four iterations, but |
---|
0:15:53 | it doesn't help. |
---|
0:15:58 | So |
---|
0:16:00 | let's have a look at the system parameters. We use the SIDEKIT for |
---|
0:16:05 | diarization toolkit, which is a package on top of the SIDEKIT toolkit, |
---|
0:16:10 | a Python library. |
---|
0:16:12 | For the front end we use thirteen MFCCs with delta and delta-delta. |
---|
0:16:18 | We use two hundred and fifty-six components to train the UBM; |
---|
0:16:24 | the covariance matrix is diagonal. |
---|
0:16:27 | The dimension of the TV matrix is two hundred, the dimension of the eigenvoice |
---|
0:16:33 | matrix is one hundred, |
---|
0:16:35 | and we don't use any eigenchannel matrix. |
---|
0:16:38 | For the speaker clustering task we use |
---|
0:16:42 | a combination of connected-components clustering and hierarchical agglomerative clustering, |
---|
0:16:48 | and, as I said before, the metric is the diarization error |
---|
0:16:51 | rate, and we use a two hundred and fifty millisecond collar. |
---|
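For convenience, the configuration quoted above collected in one place (a plain summary, not the toolkit's actual configuration format):

```python
config = {
    "toolkit": "SIDEKIT for diarization, on top of SIDEKIT (Python)",
    "features": "13 MFCCs + delta + delta-delta",
    "ubm_components": 256,            # diagonal covariance
    "tv_rank": 200,                   # total variability matrix dimension
    "plda_eigenvoice_rank": 100,      # no eigenchannel matrix
    "clustering": "connected components + hierarchical agglomerative",
    "metric": "cross-recording DER, 250 ms collar",
}
```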
0:17:01 | So, |
---|
0:17:02 | if we summarize, we compared four different systems: first, the baseline, where we |
---|
0:17:08 | performed supervised training using only external data; |
---|
0:17:12 | then we |
---|
0:17:14 | used the same training process but replaced the training data with the labeled target data, |
---|
0:17:19 | this is the oracle experiment. |
---|
0:17:22 | Then we focused on |
---|
0:17:24 | unsupervised training using only the target data, and we saw that it's |
---|
0:17:29 | not good enough when compared with the baseline system. |
---|
0:17:34 | So we decided to bring back |
---|
0:17:36 | some training data, applying some kind of unsupervised domain adaptation, and combined train |
---|
0:17:43 | and target data. |
---|
0:17:46 | So, |
---|
0:17:47 | to conclude, we can say |
---|
0:17:49 | that we saw that if we don't have enough target data, we absolutely need to use external |
---|
0:17:54 | data to bootstrap the system, |
---|
0:17:57 | but even using unlabeled target data, which is imperfectly clustered, |
---|
0:18:04 | with some kind of domain adaptation we are able to improve the system. |
---|
0:18:09 | So in our future work we want to focus on the adaptation framework |
---|
0:18:14 | and, |
---|
0:18:17 | more precisely, |
---|
0:18:19 | we'd like to |
---|
0:18:23 | introduce a weighting parameter between the train and target data. |
---|
0:18:27 | We'd also like to try to work on the iterative procedure, because we think |
---|
0:18:32 | that if we are able to better estimate the PLDA parameters after one |
---|
0:18:38 | adaptation iteration, we should be able to improve the quality of the clusters, and some |
---|
0:18:43 | kind of iteration should be possible. |
---|
0:18:46 | In fact this work has been done already: we submitted a paper to |
---|
0:18:52 | Interspeech; it will be presented there. |
---|
0:18:55 | So I can already say that, using the weighting parameter, |
---|
0:19:01 | the results really do get better, |
---|
0:19:05 | and the iterative procedure also works: with two or three iterations we are able |
---|
0:19:11 | to slowly improve the DER. |
---|
0:19:14 | Another way of improving |
---|
0:19:18 | remains to be seen, but |
---|
0:19:22 | we would like to try to bootstrap the system with any unlabeled data; for example, |
---|
0:19:26 | we could try to take the train data, not use the labels, and apply |
---|
0:19:31 | the cosine-based clustering, because we saw that in our approach maybe we didn't have |
---|
0:19:36 | enough data in the target data to apply this idea, so maybe |
---|
0:19:41 | trying to bootstrap with more unlabeled data could work. |
---|
0:19:47 | Well, thank you, that's all. |
---|
0:19:55 | [inaudible] |
---|
0:20:06 | Thank you for the talk. I think this is more a comment than a question, but |
---|
0:20:09 | I believe that some of your problems with the EM for the PLDA |
---|
0:20:13 | are because your speaker subspace dimension is a high number. |
---|
0:20:20 | I think that's the problem with the dimensions I mentioned for the T |
---|
0:20:24 | V matrix and the PLDA; when we don't have |
---|
0:20:29 | enough target data, the problem is that |
---|
0:20:33 | it is difficult to estimate the one-hundred-dimensional |
---|
0:20:39 | PLDA parameters if you don't have that many speakers. |
---|
0:20:42 | Did you try to reduce the dimension? No, I didn't focus on that. |
---|
0:20:56 | Thanks for the presentation. In previous work |
---|
0:21:01 | on this kind of task, |
---|
0:21:03 | you were presenting |
---|
0:21:08 | ILP-based clustering; |
---|
0:21:10 | why did you use agglomerative clustering here, and how do the two compare? |
---|
0:21:15 | Well, |
---|
0:21:16 | in my experiments |
---|
0:21:18 | the results are not very different between ILP and agglomerative clustering. I just decided |
---|
0:21:26 | to use agglomerative clustering because it's |
---|
0:21:30 | simpler, |
---|
0:21:34 | and better in terms of computation time, |
---|
0:21:37 | but there's not really a big difference between |
---|
0:21:43 | the two, I think. |
---|
0:21:50 | So, |
---|
0:21:51 | dealing with these different in-domain and external data, one thing I |
---|
0:21:56 | have seen in other work is |
---|
0:21:59 | to use a weighting; |
---|
0:22:03 | why did you specifically not use a weight here? |
---|
0:22:07 | No, we didn't weight the data; we just took the target clusters |
---|
0:22:13 | and the training clusters and |
---|
0:22:16 | put them together in the same dataset. |
---|
0:22:20 | So if you look at the equations, |
---|
0:22:25 | it's the same as if you used a weighting parameter |
---|
0:22:33 | whose value is the relative amount of data between the target and |
---|
0:22:38 | train data, so it is almost equal to zero. |
---|
0:22:43 | That's why we need to work on the weighting, because we are not |
---|
0:22:50 | doing anything about that for now. |
---|
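In other words, pooling the two sets without weights behaves like an interpolation with an implicit weight (a standard interpolation view, not the paper's notation):

$$
\alpha = \frac{N_{\text{target}}}{N_{\text{target}} + N_{\text{train}}} \approx 0,
$$

since the target clusters contribute far fewer i-vectors than the training set.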
0:23:03 | On a different topic: in your clustering experiments, how do you decide how many clusters you get? |
---|
0:23:13 | Well, the |
---|
0:23:15 | clustering is a function of a threshold, |
---|
0:23:19 | and we just select the threshold by experiment. That's |
---|
0:23:25 | why we chose two target corpora: this way we are able to do |
---|
0:23:32 | an exhaustive search for the threshold on one corpus, and |
---|
0:23:37 | then |
---|
0:23:38 | we look at whether the same threshold applies to the other corpus. |
---|
0:23:44 | And the clustering threshold is around zero. |
---|
0:24:01 | We still have time for a few questions. |
---|
0:24:07 | Okay, so I was curious: you mentioned in this work that |
---|
0:24:12 | iterating was expected to be helpful but didn't work, and that you were later able to somehow |
---|
0:24:16 | fix that. |
---|
0:24:17 | So what changed? |
---|
0:24:20 | I mean, what do you think was the main problem |
---|
0:24:23 | there? |
---|
0:24:25 | In this work the problem is that we don't use a weighting: |
---|
0:24:30 | we don't balance the influence of training and target data. Also, |
---|
0:24:34 | in the combination of training and target data we have so much training data |
---|
0:24:39 | that the |
---|
0:24:41 | implicit weighting is really in favour of the training data. |
---|
0:24:48 | When we change the balance between training and target data and give more importance to |
---|
0:24:54 | the target data, it seems to give better results, and then you see that while |
---|
0:25:00 | iterating you can improve a bit |
---|
0:25:02 | more, over two or three iterations. |
---|
0:25:05 | And we also did some kind of score normalization, because when you |
---|
0:25:12 | use the target data to |
---|
0:25:17 | obtain the PLDA parameters, the distribution of PLDA scores also tends |
---|
0:25:22 | to shift a lot, |
---|
0:25:24 | so you need to |
---|
0:25:26 | normalize to keep the same clustering threshold; |
---|
0:25:29 | otherwise you don't cluster |
---|
0:25:31 | at the same place at all |
---|
0:25:33 | after the parameter update. |
---|
0:25:40 | Okay, so if there are no further questions, let's thank the speaker. |
---|