0:00:15 | Welcome to my paper, "Improving Diarization Robustness using Diversification, Randomization and the DOVER Algorithm."
---|
0:00:25 | As a brief overview, I will start with a review of the DOVER algorithm,
---|
0:00:30 | something we developed recently to combine the outputs of multiple diarization systems.
---|
0:00:37 | The natural use of that is for information fusion,
---|
0:00:40 | but in this paper we are going to focus on another application: using it to achieve more robustness
---|
0:00:45 | in diarization.
---|
0:00:48 | We then describe our experiments and results, and conclude with a summary and an outlook.
---|
0:00:55 | I'm sure everybody is familiar with the speaker diarization task: it answers the question "who spoke when."
---|
0:01:02 | Given an input, you label it according to speaker identity, without any prior knowledge of the speakers, so the labels are anonymous, such as "speaker one," "speaker two," and so on.
---|
0:01:16 | Diarization lets us track the interaction among multiple speakers in a conversation or meeting.
---|
0:01:23 | It is also critical for attributing speakers to the output of a speech recognition system, so that we get a readable transcript.
---|
0:01:32 | And you can use it for things like speaker retrieval, where you need to identify all the speech coming from the same speaker source.
---|
0:01:42 | The diarization error rate (DER) is the metric that most of us use.
---|
0:01:46 | It is the ratio of the total duration of missed speech, false-alarm speech, and speaker-confusion speech (speech that is mislabeled with respect to who spoke it),
---|
0:01:57 | normalized by the total duration of speech.
---|
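For reference, the definition just described corresponds to the usual formula; the symbol names below are my own shorthand, not from the talk:

\[ \mathrm{DER} \;=\; \frac{T_{\mathrm{miss}} + T_{\mathrm{false\ alarm}} + T_{\mathrm{speaker\ confusion}}}{T_{\mathrm{total\ speech}}} \]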
0:02:01 | The critical thing in the DER computation, which will be important later on, is actually the mapping between the speaker labels that occur in the reference versus the hypothesis.
---|
0:02:19 | The labels in the reference have nothing to do with the labels of the hypothesized clusters, so we need to construct a mapping that minimizes the error rate.
---|
0:02:29 | So in this example we would map speaker one to speaker A
---|
0:02:33 | and speaker two to speaker B,
---|
0:02:35 | and leave speaker three unmapped, because it is in fact an extra speaker relative to the reference.
---|
0:02:42 | Once we've done the mapping,
---|
0:02:45 | we can compute the false alarm, missed speech, and speaker error.
---|
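To make the mapping step concrete, here is a minimal sketch (my own illustration, not code from the paper) that picks the label mapping maximizing the overlap between reference and hypothesis speakers, which is equivalent to minimizing the speaker error:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_label_mapping(overlap):
    """overlap[i, j] = seconds during which reference speaker i and
    hypothesis speaker j are both talking. Returns a dict mapping
    hypothesis speaker -> reference speaker; hypothesis speakers that
    are not matched stay unmapped (extra speakers)."""
    ref_idx, hyp_idx = linear_sum_assignment(-overlap)  # negate to maximize overlap
    return {int(h): int(r) for r, h in zip(ref_idx, hyp_idx)}

# Example: 2 reference speakers, 3 hypothesis speakers (made-up durations).
overlap = np.array([[30.0,  2.0, 1.0],    # reference speaker 1
                    [ 1.0, 25.0, 3.0]])   # reference speaker 2
print(best_label_mapping(overlap))        # {0: 0, 1: 1}; hypothesis speaker 2 stays unmapped
```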
0:02:52 | Now, system combination, or ensemble methods, or voting methods, are very popular in machine learning applications,
---|
0:03:02 | because it is very powerful to combine multiple classifiers to achieve a better result.
---|
0:03:09 | Voting can be hard voting, which is just letting the majority determine the output, or soft voting, such as combining different scores in some weighted manner,
---|
0:03:20 | or combining posterior outputs by interpolation, for example, in order to achieve a more accurate estimate of the posterior probabilities and therefore better labels.
---|
0:03:32 | This can be done in a weighted or unweighted manner: if you have a reason to attribute more weight to some of the inputs, you can do that in the voting algorithm.
---|
0:03:45 | A popular version of this for speech recognition is the ROVER algorithm,
---|
0:03:51 | and also confusion network combination; these align the word labels from multiple ASR systems
---|
0:04:02 | and perform voting at the different positions in the alignment.
---|
0:04:06 | Usually this gives you a win when the input systems are about equally good but have different error distributions, that is, uncorrelated errors.
---|
0:04:17 | Now, how can we use this idea for diarization?
---|
0:04:20 | There is a problem, because the labels coming from different diarization hypotheses are not inherently related;
---|
0:04:28 | they are anonymous, as we said before,
---|
0:04:31 | so it is not clear how to vote among them.
---|
0:04:35 | We can solve this problem by constructing a mapping between the different label sets and then performing the voting.
---|
0:04:43 | We can regard this mapping of the labels as a kind of alignment in label space, or a label alignment.
---|
0:04:50 | We do this incrementally, as in ROVER: we start with the first hypothesis
---|
0:04:59 | and take it as our initial alignment,
---|
0:05:01 | and as we iterate over all the remaining outputs, we construct a mapping to the previously processed outputs
---|
0:05:08 | such that the diarization error between the labels is minimized.
---|
0:05:14 | Once all the hypotheses are in a common label space, we can simply perform the voting, taking the majority label for each time instant.
---|
0:05:26 | This is what was described in our paper last year, at ASRU.
---|
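To make the procedure concrete, here is a minimal frame-level sketch in Python. It is my own illustration, not the released DOVER code, and it simplifies two things: every hypothesis is mapped directly onto the first one (rather than onto all previously processed outputs), and the mapping is a greedy overlap match rather than the exact error-minimizing assignment.

```python
from collections import Counter, defaultdict

def map_labels(hyp, anchor, tag):
    """Greedily map hyp's speaker labels onto anchor's labels by overlap
    (a simplification of the error-minimizing mapping used in the paper)."""
    overlap = Counter((h, a) for h, a in zip(hyp, anchor)
                      if h is not None and a is not None)
    mapping, used = {}, set()
    for (h, a), _ in overlap.most_common():
        if h not in mapping and a not in used:
            mapping[h] = a
            used.add(a)
    # Hypothesis speakers with no counterpart keep a unique "extra" label.
    return [mapping.get(h, f"{tag}:{h}") if h is not None else None for h in hyp]

def dover_vote(hypotheses, weights=None):
    """hypotheses: equal-length lists of per-frame speaker labels (None = non-speech).
    Returns the voted per-frame labeling."""
    weights = weights or [1.0] * len(hypotheses)
    aligned = [list(hypotheses[0])] + [
        map_labels(h, hypotheses[0], f"sys{i}")
        for i, h in enumerate(hypotheses[1:], start=1)
    ]
    total = sum(weights)
    output = []
    for frame in zip(*aligned):
        speech = sum(w for lab, w in zip(frame, weights) if lab is not None)
        if speech < total / 2:          # output speech only where at least half vote "speech"
            output.append(None)
            continue
        votes = defaultdict(float)
        for lab, w in zip(frame, weights):
            if lab is not None:
                votes[lab] += w
        output.append(max(votes, key=votes.get))  # ties broken arbitrarily (or by weight)
    return output
```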
0:05:32 | Okay, here's an example.
---|
0:05:33 | We have three systems, A, B, and C, and their label sets are disjoint.
---|
0:05:42 | We first start with system A and then compute the best mapping of the second system's labels to the labels of the first system.
---|
0:05:51 | In this case we would map B1 to A1 and B2 to A2, while B3 would be an extra speaker label, so it remains as it is.
---|
0:06:02 | We relabel everything, so now we have system A and system B in the same label space.
---|
0:06:08 | We do the same thing again with system C.
---|
0:06:11 | We can see here that C1 should be mapped to A1 and C3 should be mapped to A2, while C2 remains unmapped and gets a new label, because it doesn't have a correspondence.
---|
0:06:29 | So here we now have all three hypotheses in the same label space,
---|
0:06:35 | and we can perform the voting for each time instant: at first the only winner is A1.
---|
0:06:44 | Then we enter a region where there is actually a tie between A1 and A2.
---|
0:06:52 | With no majority, we can break the tie arbitrarily, for example by taking the first label, or, if there are weights attached to the inputs, by taking the one with the highest weight.
---|
0:07:05 | After that we have A2 as the consensus, and later we transition back to A1.
---|
0:07:11 | The extra speaker label never wins, because it is always in the minority.
---|
0:07:16 | And we can use the same idea to decide on speech versus non-speech:
---|
0:07:23 | we will output speech only in those regions where at least half of the inputs think there is speech.
---|
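A toy input in the spirit of the example just shown can be run through the dover_vote sketch above; the labels and frame boundaries here are invented for illustration, not taken from the slide.

```python
A = ["a1", "a1", "a1", "a2", "a2", "a1", None]
B = ["b1", "b1", "b2", "b2", "b2", "b1", None]
C = ["c1", "c3", "c3", "c3", "c2", "c1", "c2"]

print(dover_vote([A, B, C]))
# ['a1', 'a1', 'a2', 'a2', 'a2', 'a1', None]
# The unmapped extra speaker (C2 here) never wins because it is always in the minority,
# and the last frame is non-speech because only one of the three inputs marks it as speech.
```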
0:07:32 | Now, again, the natural use of this is for information fusion:
---|
0:07:38 | we run diarization on the inputs independently; for example, if we have multiple microphones, we can diarize each channel independently
---|
0:07:46 | and fuse the outputs using DOVER.
---|
0:07:49 | Or we could have a single input with different feature streams; we can diarize these independently and combine them.
---|
0:07:56 | We used DOVER for multiple microphones in the original paper,
---|
0:08:01 | where we had meeting recordings on seven microphones.
---|
0:08:05 | You can see here that a clustering-based diarization gives a wide range of results depending on which channel you choose,
---|
0:08:16 | and DOVER actually gives you a result that is slightly better than the best single channel,
---|
0:08:23 | so you are freed from having to figure out which is the best channel.
---|
0:08:29 | If you do the diarization using speaker ID, because your speakers are actually all enrolled in the system,
---|
0:08:35 | you get the same effect, of course at a much lower diarization error rate overall:
---|
0:08:42 | you have the best single channel and you have the worst single channel,
---|
0:08:46 | and the DOVER combination of all these outputs gives a result that is actually better than the minimum over all the individual channels.
---|
0:08:57 | Now, for this paper, we are going to look into a different application of DOVER.
---|
0:09:02 | It starts with the observation that diarization algorithms are often quite sensitive to the choice of hyperparameters.
---|
0:09:09 | I'll give some examples later, but this is basically because, when you do clustering, you make hard decisions based on comparing real values,
---|
0:09:18 | and small differences in the inputs can actually yield large differences in the output.
---|
0:09:24 | Also, the clustering is often greedy and iterative, so small differences somewhere early on can lead to very large differences later on.
---|
0:09:36 | This can be remedied by essentially averaging over different runs:
---|
0:09:42 | you run with different hyperparameters and average the results using DOVER; that is, you use DOVER to combine the outputs of multiple different clustering solutions.
---|
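As a sketch of what this looks like in practice: the function argument, parameter names, and hyperparameter values below are placeholders of my own, not taken from the paper.

```python
from typing import Callable, List, Optional

Labels = List[Optional[str]]  # per-frame speaker labels, None = non-speech

def robust_diarize(run_diarization: Callable[..., Labels],
                   stream_weights=(0.70, 0.75, 0.80, 0.85, 0.90),
                   init_clusters=(10, 12, 14, 16)) -> Labels:
    """Run a (user-supplied) diarization system under several perturbed
    hyperparameter settings and combine the outputs with DOVER, instead of
    committing to one tuned setting. The grids above are illustrative."""
    hypotheses = []
    for w in stream_weights:
        hypotheses.append(run_diarization(mfcc_weight=w, tdoa_weight=1.0 - w))
    for k in init_clusters:
        hypotheses.append(run_diarization(init_num_clusters=k))
    return dover_vote(hypotheses)   # dover_vote from the earlier sketch, equal weights
```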
0:09:58 | To experiment with this, we used an older speaker clustering algorithm for diarization, developed at ICSI.
---|
0:10:05 | You start with an equal-length segmentation of the recording into short segments;
---|
0:10:11 | then each segment is modeled by a mixture of Gaussians,
---|
0:10:16 | and the similarity between different segments can be evaluated by asking whether merging two GMMs yields a higher overall likelihood or not.
---|
0:10:31 | Each iteration merges the two best clusters, then resegments and re-estimates the GMMs.
---|
0:10:42 | You do this until the Bayesian information criterion tells you to stop the clustering.
---|
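A rough schematic of this agglomerative loop, under my own simplifications (single Gaussians instead of GMMs, no resegmentation or re-estimation between merges, and an arbitrary BIC penalty weight), might look like this:

```python
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of frames X under one full-covariance Gaussian fit to X
    (a stand-in for the GMMs used in the actual system)."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    diff = X - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(cov), diff)
    return -0.5 * (X.shape[0] * (X.shape[1] * np.log(2 * np.pi) + logdet) + quad)

def merge_score(a, b, lam=1.0):
    """BIC-style criterion: likelihood loss from modeling a and b jointly,
    offset by the parameter penalty saved by dropping one Gaussian.
    Positive means the merge is preferred."""
    d = a.shape[1]
    n_params = d + d * (d + 1) / 2
    penalty = lam * 0.5 * n_params * np.log(len(a) + len(b))
    return (gauss_loglik(np.vstack([a, b]))
            - gauss_loglik(a) - gauss_loglik(b) + penalty)

def agglomerative_diarize(frames, seg_len=100, init_clusters=16):
    """Cut the recording into equal-length segments, assign them round-robin
    to initial clusters, then greedily merge the best-scoring pair of clusters
    until no merge improves the score. Returns a cluster id per segment."""
    segs = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
    labels = [i % init_clusters for i in range(len(segs))]
    ids = sorted(set(labels))
    data = {c: np.vstack([s for s, l in zip(segs, labels) if l == c]) for c in ids}
    while len(ids) > 1:
        scores = [(merge_score(data[a], data[b]), a, b)
                  for i, a in enumerate(ids) for b in ids[i + 1:]]
        best, a, b = max(scores)
        if best <= 0:                      # BIC says: stop clustering
            break
        data[a] = np.vstack([data[a], data.pop(b)])
        labels = [a if l == b else l for l in labels]
        ids.remove(b)
    return labels
```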
0:10:52 | We applied this algorithm to a collection of meeting recordings,
---|
0:10:58 | from which we extracted two feature streams: MFCCs and time delays of arrival, after beamforming. So we had multiple
---|
0:11:06 | microphone channels, but we merged them with beamforming at the signal level
---|
0:11:09 | and then extracted MFCCs;
---|
0:11:11 | the beamformer would also give us the time delays of arrival, which are an important feature
---|
0:11:16 | because they indicate where the speakers are situated.
---|
0:11:21 | Now, there are two ways to generate more hypotheses from a single input in this case.
---|
0:11:28 | One is what I call hyperparameter diversification, meaning that I vary a hyperparameter over some range around a single nominal value.
---|
0:11:39 | For example, I can vary the relative weight of the feature streams, or I can vary the initial number of clusters in the clustering algorithm.
---|
0:11:50 | We discuss the first of these here; further variations are not given here in the interest of time.
---|
0:11:55 | The other way is to randomize: I can manipulate the clustering algorithm so that it will not always pick the first-best pair of clusters to merge, but sometimes takes the second-best pair of clusters.
---|
0:12:09 | By flipping a coin to make these decisions, we can generate multiple clusterings,
---|
0:12:15 | and of course I use DOVER to find the consensus, with equal weights.
---|
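The randomized variant only needs to change how the next merge is chosen in the loop above. A sketch, with the probability treated as an assumption on my part:

```python
import random

def randomized_merge_choice(scored_pairs, p_second=0.3, rng=random):
    """scored_pairs: (score, cluster_a, cluster_b) candidates for the next merge,
    as produced inside the agglomerative loop above. With probability p_second
    we take the second-best pair instead of the best one. p_second = 0.3 is my
    reading of the value mentioned in the talk; treat it as illustrative."""
    ranked = sorted(scored_pairs, reverse=True)
    if len(ranked) > 1 and rng.random() < p_second:
        return ranked[1]
    return ranked[0]

# Hypothetical use: generate several randomized clusterings with different seeds
# and let DOVER vote them into a consensus (function names are placeholders).
#   hypotheses = [randomized_agglomerative_diarize(frames, seed=s) for s in range(5)]
#   consensus  = dover_vote(hypotheses)
```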
0:12:21 | All of the outputs use the same speech/non-speech classifier, so they differ only in their speaker labels, not in the speech/non-speech decisions,
---|
0:12:30 | and the only difference in the diarization error is in fact in the speaker error rate.
---|
0:12:38 | The test data was from the NIST meeting rich transcription evaluations of 2007 and 2009,
---|
0:12:46 | and we used all of the microphone channels, but combined them with beamforming.
---|
0:12:53 | The variety is actually quite considerable in this data:
---|
0:12:58 | there are different recording sites and different numbers of speakers, from small meetings with three or four participants, and sixteen and twenty-one distinct speakers in the two sets respectively.
---|
0:13:04 | So it was quite heterogeneous, and that is why it is a challenge
---|
0:13:11 | to actually optimize the hyperparameters and carry them over from the dev set to the eval set.
---|
0:13:23 | Here is what happens when you vary the feature stream weight, one of the hyperparameters.
---|
0:13:28 | You can see that varying it along this range gives a considerable variation in the output, which here is the speaker error rate.
---|
0:13:40 | More importantly, the best value on the dev set is not the best value on the eval set,
---|
0:13:48 | and conversely, the best value on the eval set is a worse choice for the dev set.
---|
0:13:54 | So this is what I mean by robustness problems.
---|
0:13:59 | However, when we do the DOVER combination over all the different results,
---|
0:14:04 | we actually get a nice result: it is either better than the best single result, for the dev set, or very close to the single best result, on the eval set.
---|
0:14:18 | Similarly, when we vary the initial number of clusters of the algorithm,
---|
0:14:23 | we also get a variation in the speaker error rate according to the cluster number,
---|
0:14:36 | and the best choice for the dev set is not the best choice for the eval set.
---|
0:14:42 | Again, when you do the DOVER combination you get a good result; in fact it is always better than the second-best choice, on the dev data and on the eval set also.
---|
0:14:53 | Finally, we do the randomization of the clustering: specifically, we flip a coin such that with probability 0.3 we use the second-best cluster pair at each iteration of merging,
---|
0:15:06 | and the result, surprisingly, is sometimes better than the best-first clustering.
---|
0:15:14 | You see here that with different random seeds we get a range of results,
---|
0:15:19 | sometimes worse, but often better than the best-first clustering,
---|
0:15:25 | and the same is true for the eval set.
---|
0:15:27 | Of course, we cannot expect the best seed on the dev data to also be the best on the eval set; instead we need to do the combination in order to get a robust result.
---|
0:15:37 | So we actually improve on the best-first clustering consistently by doing the DOVER combination over the different randomized results.
---|
0:15:48 | To summarize:
---|
0:15:49 | the DOVER algorithm allows us to perform voting among multiple diarization hypotheses.
---|
0:15:56 | We can use this to achieve robustness in diarization
---|
0:16:04 | by combining multiple hypotheses obtained from a single input.
---|
0:16:08 | The two ways that we do this are by varying hyperparameters or by randomizing, to introduce diversity, if you will, into the results.
---|
0:16:17 | We find that hyperparameter perturbation plus DOVER essentially frees us from the need to do hyperparameter optimization,
---|
0:16:26 | and adds robustness that way.
---|
0:16:28 | The clustering can also be randomized to overcome the limitations of best-first clustering,
---|
0:16:37 | and the combination of the randomized results actually gives higher accuracy than any single clustering run.
---|
0:16:50 | Finally, there are many more things we can do with this. We can try to combine the different techniques, for example hyperparameter variation along multiple dimensions, or combining that with randomization, all in one big combination.
---|
0:17:09 | We can also try this with different diarization algorithms, since DOVER is agnostic to the actual form of the diarization algorithm,
---|
0:17:19 | so we can try it with x-vector based spectral clustering or neural systems,
---|
0:17:26 | provided, of course, that we can produce multiple hypotheses for DOVER to work with.
---|
0:17:35 | Two other things we are currently working on are combining different diarization algorithms, as well as generalizing DOVER to handle overlapping speech.
---|
0:17:47 | Thank you very much for your time.
---|
0:17:50 | If you have any questions, please send them to me via the conference website,
---|
0:17:54 | and enjoy the rest of the conference.
---|