0:00:15 | Welcome to my paper, "Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm."
---|
0:00:25 | As a brief overview, I will start with a review of the DOVER algorithm,
---|
0:00:30 | something we developed recently to combine the outputs of multiple diarization systems.
---|
0:00:37 | The natural use of that is for information fusion,
---|
0:00:40 | but in this paper we are going to focus on another application: using it to achieve more robustness
---|
0:00:45 | in diarization.
---|
0:00:48 | We then describe our experiments and results, and conclude with a summary and an outlook.
---|
0:00:55 | I'm sure everybody is familiar with the speaker diarization task: it answers the question "who spoke when."
---|
0:01:02 | Given an input, you label it according to speaker identity, without any prior knowledge of the speakers, so the labels are anonymous, such as "speaker one," "speaker two," and so on.
---|
0:01:16 | Diarization lets us track the interaction among multiple speakers in a conversation or meeting.
---|
0:01:23 | It is also critical for attributing speakers to the output of a speech recognition system, so that we get a readable transcript.
---|
0:01:32 | And you can use it for things like speaker retrieval, where you need to identify all the speech coming from the same speaker source.
---|
0:01:42 | The diarization error rate (DER) is the metric that most of us use.
---|
0:01:46 | It is the ratio of the total duration of missed speech, false-alarm speech, and speaker-confusion speech (speech that is mislabeled with respect to who spoke it),
---|
0:01:57 | normalized by the total duration of speech.
---|
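For reference, the definition just described corresponds to the usual formula; the symbol names below are my own shorthand, not from the talk:

\[ \mathrm{DER} \;=\; \frac{T_{\mathrm{miss}} + T_{\mathrm{false\ alarm}} + T_{\mathrm{speaker\ confusion}}}{T_{\mathrm{total\ speech}}} \]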
0:02:01 | The critical thing in the DER computation, which will be important later on, is actually the mapping between the speaker labels that occur in the reference versus the hypothesis.
---|
0:02:19 | The labels in the reference have nothing to do with the labels of the hypothesized clusters, so we need to construct a mapping that minimizes the error rate.
---|
0:02:29 | So in this example we would map speaker one to speaker A
---|
0:02:33 | and speaker two to speaker B,
---|
0:02:35 | and leave speaker three unmapped, because it is in fact an extra speaker relative to the reference.
---|
0:02:42 | Once we've done the mapping,
---|
0:02:45 | we can compute the false alarm, missed speech, and speaker error.
---|
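To make the mapping step concrete, here is a minimal sketch (my own illustration, not code from the paper) that picks the label mapping maximizing the overlap between reference and hypothesis speakers, which is equivalent to minimizing the speaker error:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_label_mapping(overlap):
    """overlap[i, j] = seconds during which reference speaker i and
    hypothesis speaker j are both talking. Returns a dict mapping
    hypothesis speaker -> reference speaker; hypothesis speakers that
    are not matched stay unmapped (extra speakers)."""
    ref_idx, hyp_idx = linear_sum_assignment(-overlap)  # negate to maximize overlap
    return {int(h): int(r) for r, h in zip(ref_idx, hyp_idx)}

# Example: 2 reference speakers, 3 hypothesis speakers (made-up durations).
overlap = np.array([[30.0,  2.0, 1.0],    # reference speaker 1
                    [ 1.0, 25.0, 3.0]])   # reference speaker 2
print(best_label_mapping(overlap))        # {0: 0, 1: 1}; hypothesis speaker 2 stays unmapped
```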
0:02:52 | Now, system combination, or ensemble methods, or voting methods, are very popular in machine learning applications,
---|
0:03:02 | because it is very powerful to combine multiple classifiers to achieve a better result.
---|
0:03:09 | Voting can be hard voting, which is just letting the majority determine the output, or soft voting, such as combining different scores in some weighted manner,
---|
0:03:20 | or combining posterior outputs by interpolation, for example, in order to achieve a more accurate estimate of the posterior probabilities and therefore better labels.
---|
0:03:32 | This can be done in a weighted or unweighted manner: if you have a reason to attribute more weight to some of the inputs, you can do that in the voting algorithm.
---|
0:03:45 | A popular version of this for speech recognition is the ROVER algorithm,
---|
0:03:51 | and also confusion network combination; these align the word labels from multiple ASR systems
---|
0:04:02 | and perform voting at the different positions in the alignment.
---|
0:04:06 | Usually this gives you a win when the input systems are about equally good but have different error distributions, that is, uncorrelated errors.
---|
0:04:17 | Now, how can we use this idea for diarization?
---|
0:04:20 | There is a problem, because the labels coming from different diarization hypotheses are not inherently related;
---|
0:04:28 | they are anonymous, as we said before,
---|
0:04:31 | so it is not clear how to vote among them.
---|
0:04:35 | We can solve this problem by constructing a mapping between the different label sets and then performing the voting.
---|
0:04:43 | We can regard this mapping of the labels as a kind of alignment in label space, or a label alignment.
---|
0:04:50 | We do this incrementally, as in ROVER: we start with the first hypothesis
---|
0:04:59 | and take it as our initial alignment,
---|
0:05:01 | and as we iterate over all the remaining outputs, we construct a mapping to the previously processed outputs
---|
0:05:08 | such that the diarization error between the labels is minimized.
---|
0:05:14 | Once all the hypotheses are in a common label space, we can simply perform the voting, taking the majority label for each time instant.
---|
0:05:26 | This is what was described in our paper last year, at ASRU.
---|
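To make the procedure concrete, here is a minimal frame-level sketch in Python. It is my own illustration, not the released DOVER code, and it simplifies two things: every hypothesis is mapped directly onto the first one (rather than onto all previously processed outputs), and the mapping is a greedy overlap match rather than the exact error-minimizing assignment.

```python
from collections import Counter, defaultdict

def map_labels(hyp, anchor, tag):
    """Greedily map hyp's speaker labels onto anchor's labels by overlap
    (a simplification of the error-minimizing mapping used in the paper)."""
    overlap = Counter((h, a) for h, a in zip(hyp, anchor)
                      if h is not None and a is not None)
    mapping, used = {}, set()
    for (h, a), _ in overlap.most_common():
        if h not in mapping and a not in used:
            mapping[h] = a
            used.add(a)
    # Hypothesis speakers with no counterpart keep a unique "extra" label.
    return [mapping.get(h, f"{tag}:{h}") if h is not None else None for h in hyp]

def dover_vote(hypotheses, weights=None):
    """hypotheses: equal-length lists of per-frame speaker labels (None = non-speech).
    Returns the voted per-frame labeling."""
    weights = weights or [1.0] * len(hypotheses)
    aligned = [list(hypotheses[0])] + [
        map_labels(h, hypotheses[0], f"sys{i}")
        for i, h in enumerate(hypotheses[1:], start=1)
    ]
    total = sum(weights)
    output = []
    for frame in zip(*aligned):
        speech = sum(w for lab, w in zip(frame, weights) if lab is not None)
        if speech < total / 2:          # output speech only where at least half vote "speech"
            output.append(None)
            continue
        votes = defaultdict(float)
        for lab, w in zip(frame, weights):
            if lab is not None:
                votes[lab] += w
        output.append(max(votes, key=votes.get))  # ties broken arbitrarily (or by weight)
    return output
```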
0:05:32 | Okay, here's an example.
---|
0:05:33 | We have three systems, A, B, and C, and their label sets are disjoint.
---|
0:05:42 | We first start with system A and then compute the best mapping of the second system's labels to the labels of the first system.
---|
0:05:51 | In this case we would map B1 to A1 and B2 to A2, while B3 would be an extra speaker label, so it remains as it is.
---|
0:06:02 | We relabel everything, so now we have system A and system B in the same label space.
---|
0:06:08 | We do the same thing again with system C.
---|
0:06:11 | We can see here that C1 should be mapped to A1 and C3 should be mapped to A2, while C2 remains unmapped and gets a new label, because it doesn't have a correspondence.
---|
0:06:29 | So here we now have all three hypotheses in the same label space,
---|
0:06:35 | and we can perform the voting for each time instant: at first the only winner is A1.
---|
0:06:44 | Then we enter a region where there is actually a tie between A1 and A2.
---|
0:06:52 | With no majority, we can break the tie arbitrarily, for example by taking the first label, or, if there are weights attached to the inputs, by taking the one with the highest weight.
---|
0:07:05 | After that we have A2 as the consensus, and later we transition back to A1.
---|
0:07:11 | The extra speaker label never wins, because it is always in the minority.
---|
0:07:16 | And we can use the same idea to decide on speech versus non-speech:
---|
0:07:23 | we will output speech only in those regions where at least half of the inputs think there is speech.
---|
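A toy input in the spirit of the example just shown can be run through the dover_vote sketch above; the labels and frame boundaries here are invented for illustration, not taken from the slide.

```python
A = ["a1", "a1", "a1", "a2", "a2", "a1", None]
B = ["b1", "b1", "b2", "b2", "b2", "b1", None]
C = ["c1", "c3", "c3", "c3", "c2", "c1", "c2"]

print(dover_vote([A, B, C]))
# ['a1', 'a1', 'a2', 'a2', 'a2', 'a1', None]
# The unmapped extra speaker (C2 here) never wins because it is always in the minority,
# and the last frame is non-speech because only one of the three inputs marks it as speech.
```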
0:07:32 | Now, again, the natural use of this is for information fusion:
---|
0:07:38 | we run diarization on the inputs independently; for example, if we have multiple microphones, we can diarize each channel independently
---|
0:07:46 | and fuse the outputs using DOVER.
---|
0:07:49 | Or we could have a single input with different feature streams; we can diarize these independently and combine them.
---|
0:07:56 | We used DOVER for multiple microphones in the original paper,
---|
0:08:01 | where we had meeting recordings on seven microphones.
---|
0:08:05 | You can see here that a clustering-based diarization gives a wide range of results depending on which channel you choose,
---|
0:08:16 | and DOVER actually gives you a result that is slightly better than the best single channel,
---|
0:08:23 | so you are freed from having to figure out which is the best channel.
---|
0:08:29 | If you do the diarization using speaker ID, because your speakers are actually all enrolled in the system,
---|
0:08:35 | you get the same effect, of course at a much lower diarization error rate overall:
---|
0:08:42 | you have the best single channel and you have the worst single channel,
---|
0:08:46 | and the DOVER combination of all these outputs gives a result that is actually better than the minimum over all the individual channels.
---|
0:08:57 | Now, for this paper, we are going to look into a different application of DOVER.
---|
0:09:02 | It starts with the observation that diarization algorithms are often quite sensitive to the choice of hyperparameters.
---|
0:09:09 | I'll give some examples later, but this is basically because, when you do clustering, you make hard decisions based on comparing real values,
---|
0:09:18 | and small differences in the inputs can actually yield large differences in the output.
---|
0:09:24 | Also, the clustering is often greedy and iterative, so small differences somewhere early on can lead to very large differences later on.
---|
0:09:36 | This can be remedied by essentially averaging over different runs:
---|
0:09:42 | you run with different hyperparameters and average the results using DOVER; that is, you use DOVER to combine the outputs of multiple different clustering solutions.
---|
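As a sketch of what this looks like in practice: the function argument, parameter names, and hyperparameter values below are placeholders of my own, not taken from the paper.

```python
from typing import Callable, List, Optional

Labels = List[Optional[str]]  # per-frame speaker labels, None = non-speech

def robust_diarize(run_diarization: Callable[..., Labels],
                   stream_weights=(0.70, 0.75, 0.80, 0.85, 0.90),
                   init_clusters=(10, 12, 14, 16)) -> Labels:
    """Run a (user-supplied) diarization system under several perturbed
    hyperparameter settings and combine the outputs with DOVER, instead of
    committing to one tuned setting. The grids above are illustrative."""
    hypotheses = []
    for w in stream_weights:
        hypotheses.append(run_diarization(mfcc_weight=w, tdoa_weight=1.0 - w))
    for k in init_clusters:
        hypotheses.append(run_diarization(init_num_clusters=k))
    return dover_vote(hypotheses)   # dover_vote from the earlier sketch, equal weights
```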
0:09:58 | To experiment with this, we used an older speaker clustering algorithm for diarization, developed at ICSI.
---|
0:10:05 | You start with an equal-length segmentation of the recording into short segments;
---|
0:10:11 | then each segment is modeled by a mixture of Gaussians,
---|
0:10:16 | and the similarity between different segments can be evaluated by asking whether merging two GMMs yields a higher overall likelihood or not.
---|
0:10:31 | Each iteration merges the two best clusters, then resegments and re-estimates the GMMs.
---|
0:10:42 | You do this until the Bayesian information criterion tells you to stop the clustering.
---|
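A rough schematic of this agglomerative loop, under my own simplifications (single Gaussians instead of GMMs, no resegmentation or re-estimation between merges, and an arbitrary BIC penalty weight), might look like this:

```python
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of frames X under one full-covariance Gaussian fit to X
    (a stand-in for the GMMs used in the actual system)."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    diff = X - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(cov), diff)
    return -0.5 * (X.shape[0] * (X.shape[1] * np.log(2 * np.pi) + logdet) + quad)

def merge_score(a, b, lam=1.0):
    """BIC-style criterion: likelihood loss from modeling a and b jointly,
    offset by the parameter penalty saved by dropping one Gaussian.
    Positive means the merge is preferred."""
    d = a.shape[1]
    n_params = d + d * (d + 1) / 2
    penalty = lam * 0.5 * n_params * np.log(len(a) + len(b))
    return (gauss_loglik(np.vstack([a, b]))
            - gauss_loglik(a) - gauss_loglik(b) + penalty)

def agglomerative_diarize(frames, seg_len=100, init_clusters=16):
    """Cut the recording into equal-length segments, assign them round-robin
    to initial clusters, then greedily merge the best-scoring pair of clusters
    until no merge improves the score. Returns a cluster id per segment."""
    segs = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
    labels = [i % init_clusters for i in range(len(segs))]
    ids = sorted(set(labels))
    data = {c: np.vstack([s for s, l in zip(segs, labels) if l == c]) for c in ids}
    while len(ids) > 1:
        scores = [(merge_score(data[a], data[b]), a, b)
                  for i, a in enumerate(ids) for b in ids[i + 1:]]
        best, a, b = max(scores)
        if best <= 0:                      # BIC says: stop clustering
            break
        data[a] = np.vstack([data[a], data.pop(b)])
        labels = [a if l == b else l for l in labels]
        ids.remove(b)
    return labels
```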
0:10:52 | We applied this algorithm to a collection of meeting recordings,
---|
0:10:58 | from which we extracted two feature streams: MFCCs and time delays of arrival, after beamforming. So we had multiple
---|
0:11:06 | microphone channels, but we merged them with beamforming at the signal level
---|
0:11:09 | and then extracted MFCCs;
---|
0:11:11 | the beamformer would also give us the time delays of arrival, which are an important feature
---|
0:11:16 | because they indicate where the speakers are situated.
---|
0:11:21 | Now, there are two ways to generate more hypotheses from a single input in this case.
---|
0:11:28 | One is what I call hyperparameter diversification, meaning that I vary a hyperparameter over some range around a single nominal value.
---|
0:11:39 | For example, I can vary the relative weight of the feature streams, or I can vary the initial number of clusters in the clustering algorithm.
---|
0:11:50 | We discuss the first of these here; further variations are not given here in the interest of time.
---|
0:11:55 | The other way is to randomize: I can manipulate the clustering algorithm so that it will not always pick the first-best pair of clusters to merge, but sometimes takes the second-best pair of clusters.
---|
0:12:09 | By flipping a coin to make these decisions, we can generate multiple clusterings,
---|
0:12:15 | and of course I use DOVER to find the consensus, with equal weights.
---|
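The randomized variant only needs to change how the next merge is chosen in the loop above. A sketch, with the probability treated as an assumption on my part:

```python
import random

def randomized_merge_choice(scored_pairs, p_second=0.3, rng=random):
    """scored_pairs: (score, cluster_a, cluster_b) candidates for the next merge,
    as produced inside the agglomerative loop above. With probability p_second
    we take the second-best pair instead of the best one. p_second = 0.3 is my
    reading of the value mentioned in the talk; treat it as illustrative."""
    ranked = sorted(scored_pairs, reverse=True)
    if len(ranked) > 1 and rng.random() < p_second:
        return ranked[1]
    return ranked[0]

# Hypothetical use: generate several randomized clusterings with different seeds
# and let DOVER vote them into a consensus (function names are placeholders).
#   hypotheses = [randomized_agglomerative_diarize(frames, seed=s) for s in range(5)]
#   consensus  = dover_vote(hypotheses)
```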
0:12:21 | All of the outputs use the same speech/non-speech classifier, so they differ only in their speaker labels, not in the speech/non-speech decisions,
---|
0:12:30 | and the only difference in the diarization error is in fact in the speaker error rate.
---|
0:12:38 | The test data was from the NIST meeting rich transcription evaluations of 2007 and 2009,
---|
0:12:46 | and we used all of the microphone channels, but combined them with beamforming.
---|
0:12:53 | The variety is actually quite considerable in this data:
---|
0:12:58 | there are different recording sites and different numbers of speakers, from small meetings with three or four participants, and sixteen and twenty-one distinct speakers in the two sets respectively.
---|
0:13:04 | So it was quite heterogeneous, and that is why it is a challenge
---|
0:13:11 | to actually optimize the hyperparameters and carry them over from the dev set to the eval set.
---|
0:13:23 | Here is what happens when you vary the feature stream weight, one of the hyperparameters.
---|
0:13:28 | You can see that varying it along this range gives a considerable variation in the output, which here is the speaker error rate.
---|
0:13:40 | More importantly, the best value on the dev set is not the best value on the eval set,
---|
0:13:48 | and conversely, the best value on the eval set is a worse choice for the dev set.
---|
0:13:54 | So this is what I mean by robustness problems.
---|
0:13:59 | However, when we do the DOVER combination over all the different results,
---|
0:14:04 | we actually get a nice result: it is either better than the best single result, for the dev set, or very close to the single best result, on the eval set.
---|
0:14:18 | Similarly, when we vary the initial number of clusters of the algorithm,
---|
0:14:23 | we also get a variation in the speaker error rate according to the cluster number,
---|
0:14:36 | and the best choice for the dev set is not the best choice for the eval set.
---|
0:14:42 | Again, when you do the DOVER combination you get a good result; in fact it is always better than the second-best choice, on the dev data and on the eval set also.
---|
0:14:53 | Finally, we do the randomization of the clustering: specifically, we flip a coin such that with probability 0.3 we use the second-best cluster pair at each iteration of merging,
---|
0:15:06 | and the result, surprisingly, is sometimes better than the best-first clustering.
---|
0:15:14 | You see here that with different random seeds we get a range of results,
---|
0:15:19 | sometimes worse, but often better than the best-first clustering,
---|
0:15:25 | and the same is true for the eval set.
---|
0:15:27 | Of course, we cannot expect the best seed on the dev data to also be the best on the eval set; instead we need to do the combination in order to get a robust result.
---|
0:15:37 | So we actually improve on the best-first clustering consistently by doing the DOVER combination over the different randomized results.
---|
0:15:48 | To summarize:
---|
0:15:49 | the DOVER algorithm allows us to perform voting among multiple diarization hypotheses.
---|
0:15:56 | We can use this to achieve robustness in diarization
---|
0:16:04 | by combining multiple hypotheses obtained from a single input.
---|
0:16:08 | The two ways that we do this are by varying hyperparameters or by randomizing, to introduce diversity, if you will, into the results.
---|
0:16:17 | We find that hyperparameter perturbation plus DOVER essentially frees us from the need to do hyperparameter optimization,
---|
0:16:26 | and adds robustness that way.
---|
0:16:28 | The clustering can also be randomized to overcome the limitations of best-first clustering,
---|
0:16:37 | and the combination of the randomized results actually gives higher accuracy than any single clustering run.
---|
0:16:50 | Finally, there are many more things we can do with this. We can try to combine the different techniques, for example hyperparameter variation along multiple dimensions, or combining that with randomization, all in one big combination.
---|
0:17:09 | We can also try this with different diarization algorithms, since DOVER is agnostic to the actual form of the diarization algorithm,
---|
0:17:19 | so we can try it with x-vector based spectral clustering or neural systems,
---|
0:17:26 | provided, of course, that we can produce multiple hypotheses for DOVER to work with.
---|
0:17:35 | Two other things we are currently working on are combining different diarization algorithms, as well as generalizing DOVER to handle overlapping speech.
---|
0:17:47 | Thank you very much for your time.
---|
0:17:50 | If you have any questions, please send them to me via the conference website,
---|
0:17:54 | and enjoy the rest of the conference.
---|