0:00:21 | I work at Radboud University in the Netherlands, normally on speaker diarisation, but I also do a little bit of work on speech recognition for spoken document retrieval. |
0:00:35 | So I'm very glad to be here at the diarisation session, talking about ASR, sorry. |
0:00:44 | A while ago we got a question from the Dutch Veterans Institute, asking if we could process about two hundred interviews that they had done with war veterans. |
0:00:57 | These interviews took place at their homes, with tabletop microphones, background noise, and not very clear speech every now and then. |
0:01:06 | We tried to do this, and the first thing that we did was supervised adaptation of the acoustic models. |
0:01:12 | For about half of the interviews I think we did a pretty good job: we had word error rates of around thirty to forty percent, so that was good enough to build a search system. |
0:01:21 | But for the other half it was terrible. |
0:01:25 | I think the word error rate, on average over all two hundred interviews, was sixty-three percent. |
0:01:32 | Well, I don't think it was a surprise, but this was probably because of the acoustic mismatch between our training data and our evaluation data. |
0:01:44 | We had trained our decoder on broadcast news, and now we tried to evaluate it on interviews recorded with tabletop microphones and so on. |
0:01:54 | This is an issue that we try to solve in diarisation, where most systems train their models on the evaluation data itself, unsupervised, and don't use any training data at all. |
0:02:10 | So I thought: if we can do this for diarisation, would it be possible to do a similar thing for speech recognition? |
0:02:17 | That is, skip all the training data and try to train all your models only on the evaluation data itself. |
0:02:24 | Of course, this is quite a task, which I'm not going to solve today, so I thought maybe I should look at the acoustic models first. |
0:02:32 | Is it possible to train acoustic models unsupervised, on the evaluation data itself, and maybe can we do it the same way as we do it for diarisation? |
0:02:44 | So the goal of the research that I would like to talk about today is to create a system that is able to automatically segment and cluster an audio recording into small clusters that we call sub-word units, such that these sub-word units can be used to perform ASR. |
0:03:04 | Even this turned out to be a very difficult task, because if you have unsupervised trained sub-word units that might represent phones, you still need a dictionary as well. |
0:03:16 | So the first step, which I'm going to talk about today, is: can we evaluate these sub-word units in a query-by-example spoken term detection experiment? |
0:03:36 | So, our diarisation system. I don't want to say too much about it. |
0:03:41 | In diarisation we typically try to prevent the system from training on short-term characteristics, such as from phone-like units, by enforcing a minimum duration constraint and by making sure that we don't use delta features. Especially the minimum duration, of course, is important. |
0:03:59 | These two pictures below show how my diarisation system works. |
0:04:05 | It is agglomerative clustering: start with speech/nonspeech detection and create initial models, basically on randomly chosen data; by re-aligning and retraining those models you get very good initial models. |
0:04:20 | Then we start the agglomerative clustering by picking the two models that are most similar, based on the Bayesian Information Criterion. We merge these two models, we retrain again, we pick the next two best models, and we go on and on until a stopping criterion is met, which is also the Bayesian Information Criterion. |
0:04:42 | Below you see the HMM topology, where there is a number of strings of states and each of the strings represents one speaker. They all contain only one single GMM, so the string is mainly there to enforce the minimum duration. |
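To make the merge-and-retrain loop concrete, here is a minimal Python sketch, not the actual system: the diagonal-covariance GMMs, the component count, and the penalty-free delta-BIC (valid when the merged model has as many parameters as the two models it replaces) are assumptions, and the re-alignment with the minimum-duration HMM is only indicated by a comment.

```python
# Minimal sketch of BIC-based agglomerative clustering:
# repeatedly merge the two most similar clusters.
import numpy as np
from itertools import combinations
from sklearn.mixture import GaussianMixture

def delta_bic(x_i, x_j, n_components=5):
    """Positive when one GMM models the pooled data better than two GMMs.
    The BIC penalty term is omitted, assuming the merged model has as
    many parameters as the two models it replaces."""
    pooled = np.vstack((x_i, x_j))
    ll_merged = GaussianMixture(n_components, covariance_type='diag') \
        .fit(pooled).score(pooled) * len(pooled)
    ll_split = sum(GaussianMixture(n_components, covariance_type='diag')
                   .fit(x).score(x) * len(x) for x in (x_i, x_j))
    return ll_merged - ll_split

def cluster(segments):
    """segments: list of (n_frames, n_dims) feature arrays, one per
    initial cluster (chosen randomly in the real system)."""
    clusters = list(segments)
    while len(clusters) > 1:
        pairs = {(i, j): delta_bic(clusters[i], clusters[j])
                 for i, j in combinations(range(len(clusters)), 2)}
        (i, j), best = max(pairs.items(), key=lambda kv: kv[1])
        if best < 0:      # stopping criterion: no pair is similar enough
            break
        merged = np.vstack((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # The real system re-aligns all data with the minimum-duration
        # HMM and retrains the models after every merge.
    return clusters
```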
0:05:06 | So, obtaining these sub-word units unsupervised: well, we had to choose a name, so we called it Unsupervised Acoustic Sub-word Unit Detection, UASUD. |
0:05:19 | Here I list the differences between our diarisation system and the UASUD system. |
0:05:24 | In diarisation we typically have multiple speakers. In this experiment we had each time only one speaker, the veteran, and that one speaker was speaking for about two hours, so we had quite some data for the one speaker. |
0:05:40 | The minimum duration in diarisation, for our system, is two and a half seconds; the minimum duration in the UASUD system was forty milliseconds. |
0:05:50 | I guess the ideal would have been thirty milliseconds, because phone models have thirty milliseconds, but that was technically difficult, so it is forty milliseconds. |
0:05:59 | In diarisation we don't use deltas; in UASUD we do. |
0:06:03 | In diarisation the initial number of clusters varies, because we use more initial clusters if the recording is longer. In UASUD we just start with a large, fixed number of initial clusters. |
0:06:17 | And we didn't actually stop using the Bayesian Information Criterion; we just merged until we had fifty-seven clusters left. I will come back to that later. |
0:06:29 | So that was how we automatically generate the units, but we need to evaluate this system. We decided to do a spoken term detection experiment, because we don't have a dictionary or language model available. |
0:06:47 | What we are going to do is use a spoken example from the audio itself, and the system should be able to provide a list of the other places in the audio where the same term is said. |
0:07:05 | So that is how we are going to evaluate it. |
0:07:07 | Well, how did we create the system? Because until now we only have the features. |
0:07:12 | We do it the same way as Hazen et al. in their paper "Query-by-example spoken term detection using phonetic posteriorgram templates", which I think was presented at ASRU in 2009. |
0:07:25 | What they do is first create a posteriorgram of the entire recording. I tried to draw it here on the left: on the x-axis you have time, and on the y-axis you have the posteriors, for each time frame, of all the phones that are in the system; in our case these are the sub-word units. |
0:07:47 | When you have this posteriorgram, you can calculate a similarity matrix between the query and the actual recording; that is the drawing on the right. |
0:07:58 | As a similarity measure we just took the log of the inner product of the posterior vectors of the query and the posterior vectors of the recording. |
0:08:09 | Once you have done this, you can do dynamic time warping to find the parts that are very similar to your example query. |
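As a rough illustration of this matching step, here is a minimal Python sketch, assuming posteriorgrams stored as (frames x units) matrices; the unconstrained start frame and the simple three-way DTW recursion are assumptions, not the exact search used in the paper.

```python
# Build a frame-by-frame distance matrix from the log of the inner
# product of posterior vectors, then run a simple DTW over it.
import numpy as np

def distance_matrix(query_post, utt_post, eps=1e-10):
    """query_post: (Tq, U) posteriors; utt_post: (Tu, U) posteriors."""
    # -log of the inner product: small when two frames agree on a unit.
    return -np.log(np.clip(query_post @ utt_post.T, eps, None))

def dtw_cost(dist):
    """Lowest cumulative cost of aligning the full query somewhere
    in the recording."""
    Tq, Tu = dist.shape
    acc = np.full((Tq, Tu), np.inf)
    acc[0, :] = dist[0, :]            # the query may start at any frame
    for i in range(1, Tq):
        for j in range(1, Tu):
            acc[i, j] = dist[i, j] + min(acc[i-1, j],
                                         acc[i, j-1],
                                         acc[i-1, j-1])
    return acc[-1].min()              # best point where the query ends
```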
0:08:24 | We actually implemented four different systems. |
0:08:27 | The first one is the UASUD system, where we automatically find our clusters. The second one is a phone system, similar to that of Hazen. For the third one we just use the MFCC features directly. |
0:08:42 | And the fourth one is a GMM system that was presented here last year by Yaodong Zhang, I hope I pronounced that correctly. |
0:08:52 | Basically it is a variant of the same idea: you take the entire audio recording, you train a single GMM on it, and you use the posterior probability of each Gaussian as one dimension of the posteriorgram. |
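A minimal sketch of such a Gaussian posteriorgram follows; the component count and the use of MFCC input features are assumptions for illustration.

```python
# One GMM is trained on the whole recording, and each frame is then
# represented by the posterior probability of each Gaussian component.
from sklearn.mixture import GaussianMixture

def gaussian_posteriorgram(features, n_components=50):
    """features: (n_frames, n_dims) array, e.g. MFCCs of the recording."""
    gmm = GaussianMixture(n_components, covariance_type='diag').fit(features)
    return gmm.predict_proba(features)  # shape: (n_frames, n_components)
```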
0:09:15 | These are the results. We did two experiments: one on broadcast news and one on the interviews with war veterans. |
0:09:22 | We calculated the mean average precision for each system. |
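Mean average precision here is presumably the usual ranked-retrieval metric; a minimal sketch, assuming each query returns a ranked list of hits marked relevant or not:

```python
# Average precision of one ranked result list, then the mean over queries.
def average_precision(hits):
    """hits: ranked list of booleans, True where the result is relevant."""
    precisions, n_rel = [], 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            n_rel += 1
            precisions.append(n_rel / rank)
    return sum(precisions) / n_rel if n_rel else 0.0

def mean_average_precision(results_per_query):
    return (sum(average_precision(h) for h in results_per_query)
            / len(results_per_query))
```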
0:09:26 | As you can see, the MFCC system performs the worst on both experiments. |
0:09:33 | The phone system, and actually the other systems too, did pretty well on the broadcast news experiment; they are all very similar. But if you go to the interviews, you can see that especially the phone system failed drastically. I think that is because of the acoustic mismatch. |
0:09:52 | The UASUD system is a little bit better than the GMM system. That might be because of the effect, mentioned in the third talk today, that if you do agglomerative clustering you are not as well normalized for linguistic variance, which is exactly what we try to find here. |
0:10:09 | But I am not sure if it is actually significant, so we have to test more, on more data, to find out. |
0:10:19 | The thing that I would like to do next is to try to generate speaker-independent models, because these models are specific to each war veteran. |
0:10:33 | So that is the acoustic step. After that, maybe we can try to find some kind of dictionary, so try to find recurrent sequences of sub-word units. |
0:10:44 | We also have a few minutes of annotated data for each interview, which we used to adapt our phone models on. We might be able to use this annotated data to get a little bit more information on the words that were spoken, and to learn how to map our sub-word units to these words. |
0:11:00 | So that is it for me. Thank you. |
0:11:08 | [inaudible audience question] |
---|