0:00:21 | I work at Radboud University in the Netherlands, normally on speaker diarisation, but I also do a little bit of work on speech recognition for spoken document retrieval. |
0:00:35 | So I'm very glad to be here at the diarisation session, talking about ASR, sorry. |
0:00:44 | A while ago we got a question from the Dutch Veterans Institute, asking if we could process about two hundred interviews that they had done with war veterans. |
0:00:57 | These interviews took place at their homes, with tabletop microphones, background noise, and not very clear speech every now and then. |
0:01:06 | We tried to do this, and the first thing that we did was supervised adaptation of the acoustic models. |
0:01:12 | For about half of the interviews I think we did a pretty good job: we had word error rates of around thirty to forty percent, so that was good enough to build a search system. |
0:01:21 | But for the other half it was terrible. |
0:01:25 | I think the word error rate, on average over all two hundred interviews, was sixty-three percent. |
0:01:32 | Well, I don't think it was a surprise, but this was probably because of the acoustic mismatch between our training data and our evaluation data. |
0:01:44 | We had trained our decoder on broadcast news, and now we tried to evaluate it on interviews recorded with tabletop microphones and so on. |
0:01:54 | This is an issue that we try to solve in diarisation, where most systems train their models on the evaluation data itself, unsupervised, and don't use any training data at all. |
0:02:10 | So I thought: if we can do this for diarisation, would it be possible to do a similar thing for speech recognition? |
0:02:17 | That is, skip all the training data and try to train all your models only on the evaluation data itself. |
0:02:24 | Of course, this is quite a task, which I'm not going to solve today, so I thought maybe I should look at the acoustic models first. |
0:02:32 | Is it possible to train acoustic models unsupervised, on the evaluation data itself, and maybe can we do it the same way as we do it for diarisation? |
0:02:44 | So the goal of the research that I would like to talk about today is to create a system that is able to automatically segment and cluster an audio recording into small clusters that we call sub-word units, such that these sub-word units can be used to perform ASR. |
0:03:04 | Even this turned out to be a very difficult task, because if you have unsupervised trained sub-word units that might represent phones, you still need a dictionary as well. |
0:03:16 | So the first step, which I'm going to talk about today, is: can we evaluate these sub-word units in a query-by-example spoken term detection experiment? |
0:03:36 | So, our diarisation system. I don't want to say too much about it. |
0:03:41 | In diarisation we typically try to prevent the system from training on short-term characteristics, such as from phone-like units, by enforcing a minimum duration constraint and by making sure that we don't use delta features. Especially the minimum duration, of course, is important. |
0:03:59 | These two pictures below show how my diarisation system works. |
0:04:05 | It is agglomerative clustering: start with speech/nonspeech detection and create initial models, basically on randomly chosen data; by re-aligning and retraining those models you get very good initial models. |
0:04:20 | Then we start the agglomerative clustering by picking the two models that are most similar, based on the Bayesian Information Criterion. We merge these two models, we retrain again, we pick the next two best models, and we go on and on until a stopping criterion is met, which is also the Bayesian Information Criterion. |
0:04:42 | Below you see the HMM topology, where there is a number of strings of states and each of the strings represents one speaker. They all contain only one single GMM, so the string is mainly there to enforce the minimum duration. |
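To make the merge-and-retrain loop concrete, here is a minimal Python sketch, not the actual system: the diagonal-covariance GMMs, the component count, and the penalty-free delta-BIC (valid when the merged model has as many parameters as the two models it replaces) are assumptions, and the re-alignment with the minimum-duration HMM is only indicated by a comment.

```python
# Minimal sketch of BIC-based agglomerative clustering:
# repeatedly merge the two most similar clusters.
import numpy as np
from itertools import combinations
from sklearn.mixture import GaussianMixture

def delta_bic(x_i, x_j, n_components=5):
    """Positive when one GMM models the pooled data better than two GMMs.
    The BIC penalty term is omitted, assuming the merged model has as
    many parameters as the two models it replaces."""
    pooled = np.vstack((x_i, x_j))
    ll_merged = GaussianMixture(n_components, covariance_type='diag') \
        .fit(pooled).score(pooled) * len(pooled)
    ll_split = sum(GaussianMixture(n_components, covariance_type='diag')
                   .fit(x).score(x) * len(x) for x in (x_i, x_j))
    return ll_merged - ll_split

def cluster(segments):
    """segments: list of (n_frames, n_dims) feature arrays, one per
    initial cluster (chosen randomly in the real system)."""
    clusters = list(segments)
    while len(clusters) > 1:
        pairs = {(i, j): delta_bic(clusters[i], clusters[j])
                 for i, j in combinations(range(len(clusters)), 2)}
        (i, j), best = max(pairs.items(), key=lambda kv: kv[1])
        if best < 0:      # stopping criterion: no pair is similar enough
            break
        merged = np.vstack((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # The real system re-aligns all data with the minimum-duration
        # HMM and retrains the models after every merge.
    return clusters
```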
0:05:06 | So, obtaining these sub-word units unsupervised: well, we had to choose a name, so we called it Unsupervised Acoustic Sub-word Unit Detection, UASUD. |
0:05:19 | Here I list the differences between our diarisation system and the UASUD system. |
0:05:24 | In diarisation we typically have multiple speakers. In this experiment we had each time only one speaker, the veteran, and that one speaker was speaking for about two hours, so we had quite some data for the one speaker. |
0:05:40 | The minimum duration in diarisation, for our system, is two and a half seconds; the minimum duration in the UASUD system was forty milliseconds. |
0:05:50 | I guess the ideal would have been thirty milliseconds, because phone models have thirty milliseconds, but that was technically difficult, so it is forty milliseconds. |
0:05:59 | In diarisation we don't use deltas; in UASUD we do. |
0:06:03 | In diarisation the initial number of clusters varies, because we use more initial clusters if the recording is longer. In UASUD we just start with a large, fixed number of initial clusters. |
0:06:17 | And we didn't actually stop using the Bayesian Information Criterion; we just merged until we had fifty-seven clusters left. I will come back to that later. |
0:06:29 | So that was how we automatically generate the units, but we need to evaluate this system. We decided to do a spoken term detection experiment, because we don't have a dictionary or language model available. |
0:06:47 | What we are going to do is use a spoken example from the audio itself, and the system should be able to provide a list of the other places in the audio where the same term is said. |
0:07:05 | So that is how we are going to evaluate it. |
0:07:07 | Well, how did we create the system? Because until now we only have the features. |
0:07:12 | We do it the same way as Hazen et al. in their paper "Query-by-example spoken term detection using phonetic posteriorgram templates", which I think was presented at ASRU in 2009. |
0:07:25 | What they do is first create a posteriorgram of the entire recording. I tried to draw it here on the left: on the x-axis you have time, and on the y-axis you have the posteriors, for each time frame, of all the phones that are in the system; in our case these are the sub-word units. |
0:07:47 | When you have this posteriorgram, you can calculate a similarity matrix between the query and the actual recording; that is the drawing on the right. |
0:07:58 | As a similarity measure we just took the log of the inner product of the posterior vectors of the query and the posterior vectors of the recording. |
0:08:09 | Once you have done this, you can do dynamic time warping to find the parts that are very similar to your example query. |
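As a rough illustration of this matching step, here is a minimal Python sketch, assuming posteriorgrams stored as (frames x units) matrices; the unconstrained start frame and the simple three-way DTW recursion are assumptions, not the exact search used in the paper.

```python
# Build a frame-by-frame distance matrix from the log of the inner
# product of posterior vectors, then run a simple DTW over it.
import numpy as np

def distance_matrix(query_post, utt_post, eps=1e-10):
    """query_post: (Tq, U) posteriors; utt_post: (Tu, U) posteriors."""
    # -log of the inner product: small when two frames agree on a unit.
    return -np.log(np.clip(query_post @ utt_post.T, eps, None))

def dtw_cost(dist):
    """Lowest cumulative cost of aligning the full query somewhere
    in the recording."""
    Tq, Tu = dist.shape
    acc = np.full((Tq, Tu), np.inf)
    acc[0, :] = dist[0, :]            # the query may start at any frame
    for i in range(1, Tq):
        for j in range(1, Tu):
            acc[i, j] = dist[i, j] + min(acc[i-1, j],
                                         acc[i, j-1],
                                         acc[i-1, j-1])
    return acc[-1].min()              # best point where the query ends
```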
0:08:24 | We actually implemented four different systems. |
0:08:27 | The first one is the UASUD system, where we automatically find our clusters. The second one is a phone system, similar to that of Hazen. For the third one we just use the MFCC features directly. |
0:08:42 | And the fourth one is a GMM system that was presented here last year by Yaodong Zhang, I hope I pronounced that correctly. |
0:08:52 | Basically it is a variant of the same idea: you take the entire audio recording, you train a single GMM on it, and you use the posterior probability of each Gaussian as one dimension of the posteriorgram. |
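A minimal sketch of such a Gaussian posteriorgram follows; the component count and the use of MFCC input features are assumptions for illustration.

```python
# One GMM is trained on the whole recording, and each frame is then
# represented by the posterior probability of each Gaussian component.
from sklearn.mixture import GaussianMixture

def gaussian_posteriorgram(features, n_components=50):
    """features: (n_frames, n_dims) array, e.g. MFCCs of the recording."""
    gmm = GaussianMixture(n_components, covariance_type='diag').fit(features)
    return gmm.predict_proba(features)  # shape: (n_frames, n_components)
```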
0:09:15 | These are the results. We did two experiments: one on broadcast news and one on the interviews with war veterans. |
0:09:22 | We calculated the mean average precision for each system. |
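Mean average precision here is presumably the usual ranked-retrieval metric; a minimal sketch, assuming each query returns a ranked list of hits marked relevant or not:

```python
# Average precision of one ranked result list, then the mean over queries.
def average_precision(hits):
    """hits: ranked list of booleans, True where the result is relevant."""
    precisions, n_rel = [], 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            n_rel += 1
            precisions.append(n_rel / rank)
    return sum(precisions) / n_rel if n_rel else 0.0

def mean_average_precision(results_per_query):
    return (sum(average_precision(h) for h in results_per_query)
            / len(results_per_query))
```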
0:09:26 | As you can see, the MFCC system performs the worst on both experiments. |
0:09:33 | The phone system, and actually the other systems too, did pretty well on the broadcast news experiment; they are all very similar. But if you go to the interviews, you can see that especially the phone system failed drastically. I think that is because of the acoustic mismatch. |
0:09:52 | The UASUD system is a little bit better than the GMM system. That might be because of the effect, mentioned in the third talk today, that if you do agglomerative clustering you are not as well normalized for linguistic variance, which is exactly what we try to find here. |
0:10:09 | But I am not sure if it is actually significant, so we have to test more, on more data, to find out. |
0:10:19 | The thing that I would like to do next is to try to generate speaker-independent models, because these models are specific to each war veteran. |
0:10:33 | So that is the acoustic step. After that, maybe we can try to find some kind of dictionary, so try to find recurrent sequences of sub-word units. |
0:10:44 | We also have a few minutes of annotated data for each interview, which we used to adapt our phone models on. We might be able to use this annotated data to get a little bit more information on the words that were spoken, and to learn how to map our sub-word units to these words. |
0:11:00 | So that is it for me. Thank you. |
0:11:08 | [inaudible audience question] |
---|