0:00:06 | So this is work that we did at SRI, and it's in fact our foray into language recognition. |
0:00:17 | And it's great, because most of the background techniques were already explained in plenty of detail in the previous three talks, so I don't need to go into that again. |
0:00:29 | I will start with some preliminaries, and then tell you about our experimental setup. |
0:00:35 | The focus here will be on phonotactic, or I should say not phonotactic but phone-based, recognition. I say that because it encompasses both phonotactic modelling, such as what we've heard about before, as well as MLLR-based modelling, which of course is also based on phone models, hence the commonality. |
0:00:57 | I will also look at two different techniques, comparing two different ways of doing the phonotactic modelling, and then conclude with some pointers to future work and conclusions. |
0:01:08 | So, because we haven't participated in past language recognition evaluations, we actually didn't have access to any data after LRE-05. |
0:01:19 | Since this work was done, the LDC has actually released LRE-07, but we didn't have a chance to process that yet. |
0:01:25 | So we're dealing with (and I apologize for dealing with a rather outdated task) the seven-language task of LRE-05. |
0:01:33 | It's conversational speech, so none of that Voice of America stuff. |
0:01:38 | The test data consists of about thirty-six hundred test segments, and we only look at the thirty-second condition. |
0:01:45 | The training data is from the same seven languages, and the duration, after we perform our automatic segmentation into speech and nonspeech, boils down to about fifty-six hours for the first three languages and less for the remaining ones that you see here. |
0:02:05 | I'll be reporting the equal error rate averaged over all languages; that's just what we chose. Maybe it's not the best choice, but that's what you'll see. |
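(For reference, a minimal sketch of the metric reported here: per-language detection equal error rate, averaged over languages. The score arrays and the language loop are illustrative, not from the talk.)

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the miss rate and
    the false-alarm rate cross."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([np.mean(target_scores < t) for t in thresholds])
    fa = np.array([np.mean(nontarget_scores >= t) for t in thresholds])
    i = int(np.argmin(np.abs(miss - fa)))  # closest crossing of the two curves
    return (miss[i] + fa[i]) / 2.0

# The reported figure would then be the mean of per-language EERs:
# avg_eer = np.mean([eer(tgt[lang], non[lang]) for lang in languages])
```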
0:02:16 | We performed two-fold cross-validation: to do calibration and fusion, we split the data in half, estimate on the first half and test on the second, and vice versa, and then combine the results. This is again because we didn't have an independent development set; we were working only with LRE-05. |
0:02:39 | Okay, you've heard all this: the two mainstream techniques are really the cepstral GMM and, lately, the incorporation of session variability compensation with JFA. We implemented that using a fairly standard framework, and just for reference it gives you something like 2.87 percent average equal error rate on our data. |
0:03:08 | The alternative popular technique, of course, is the PRLM technique, and you've heard all about that already, so I won't repeat it here. What's popular about it is that you can combine multiple language-specific decoders (PPRLM) via fusion at the score level and get much better results, so we do that as well. |
0:03:29 | For calibration and fusion we also didn't attempt anything out of the ordinary; in fact, we haven't even tried to incorporate Gaussian backend modelling yet, so we just use the multiclass FoCal toolkit, which is based on logistic regression, for both fusion and calibration. |
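(The talk names the multiclass FoCal toolkit; the sketch below is not FoCal itself, just a minimal illustration of the same idea, namely score-level fusion and calibration via multinomial logistic regression, with hypothetical array names and shapes.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_and_calibrate(system_scores, labels):
    """system_scores: list of (n_trials, n_languages) score matrices,
    one per subsystem; labels: true language index for each trial."""
    X = np.hstack(system_scores)  # concatenate per-system score vectors
    # Multinomial logistic regression jointly learns fusion weights and
    # per-language offsets, yielding calibrated class posteriors.
    return LogisticRegression(max_iter=1000).fit(X, labels)

# Two-fold use, as described in the talk: fit on one half of LRE-05,
# score the other half, swap the halves, and pool the calibrated scores.
```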
0:03:51 | So the first section is about phonotactic language modelling. This again is a standard technique by now, as you saw before: instead of doing one-best phone decoding, we do lattice decoding. |
0:04:06 | This was actually adopted twice: LIMSI proposed it for language ID, and it was proposed at ICSI for speaker ID, and in both cases it showed pretty dramatic improvements. |
0:04:19 | Actually, I wanted to respond to something said in a previous talk: I do not think that lattice decoding increases your variability. In fact, I think it reduces it, because you're not making hard decisions. |
0:04:40 | Whereas with a one-best hypothesis the recognizer might arbitrarily decide between a frequency of one and, say, 0.999, in the lattice approach you represent both the one-best and the later hypotheses, so you have all the hypotheses represented, and they just differ by small numerical values. I think that actually gives you a more robust feature, and that was demonstrated by Andy Hatch in the original paper. |
0:05:12 | And the other reason why it works, of course, is that it gives you more granularity in the feature. |
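(A minimal sketch of the lattice idea just described: soft, posterior-weighted n-gram counts instead of hard one-best counts. It assumes the lattice has already been expanded into scored paths; real systems accumulate such counts directly over lattice arcs, for example with SRILM's lattice-tool.)

```python
from collections import defaultdict

def expected_ngram_counts(paths, n=3):
    """paths: iterable of (phones, posterior) pairs, where phones is the
    phone sequence of one lattice path and posterior its path posterior."""
    counts = defaultdict(float)
    for phones, posterior in paths:
        padded = ["<s>"] * (n - 1) + list(phones) + ["</s>"]
        for i in range(len(padded) - n + 1):
            # Soft counts: each occurrence is weighted by the path
            # posterior, so competing hypotheses contribute fractional
            # counts instead of the 0/1 decisions of one-best decoding.
            counts[tuple(padded[i:i + n])] += posterior
    return counts
```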
0:05:20 | So much for how we do the feature extraction. We didn't have time to really develop new phone recognizers for this, so we just took three phone recognition systems that we had lying around: one for American English, one for Spanish, and a third for Levantine Arabic. |
0:05:37 | You can see here that the phone sets differ in their sizes, and furthermore the amount of training data is vastly different: for English we have basically as much data as we want, and for that reason we do gender-dependent modelling; for the other two languages, with less training data, we do gender-independent modelling. |
0:05:54 | Other than that, they all use the same kind of standard ASR recipe: a PLP front end, vocal tract length normalization, HLDA for dimensionality reduction, and crossword triphones for the acoustic model training. |
0:06:10 | The decoding, of course, is done without phonotactic constraints, but, following the results from LIMSI, we do use the context-dependent triphones in decoding. |
0:06:23 | And also, again following a very nice result from a couple of years ago, we use CMLLR adaptation in decoding. By the way, we tried regular MLLR as well, and it didn't perform as well; I guess that's in agreement with another of the previous talks. |
0:06:41 | Okay. So the first thing we would like to propose is to get rid of, or largely get rid of, all these different parallel phone decoders. Instead, we can define a universal phone set that covers several languages; in our case we made up such a set of fifty-two phones. |
0:07:02 | What you do is map your individual language-specific dictionaries to a common shared phone set, and then you retrain your acoustic models using the mapped dictionaries. And of course the language models, if you perform decoding with phonotactic models, should also be retrained. |
0:07:23 | The phone recognition accuracies, as measured on the individual languages, are very close to what you get with the universal phone set, so you're not really sacrificing much in terms of accuracy. |
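(An illustrative sketch of the dictionary-mapping step. The mapping entries here are hypothetical placeholders; as noted later in the Q&A, the actual 52-phone mapping was designed by a phonetician.)

```python
# Hypothetical mapping from (language, language-specific phone) to the
# shared universal phone set.
PHONE_MAP = {
    ("spanish", "B"): "b",
    ("spanish", "rr"): "r",
    ("arabic", "q"): "k",
}

def map_dictionary(lang, lexicon):
    """lexicon: word -> list of language-specific phones. Phones without
    a mapping entry are assumed to already be in the shared set."""
    return {word: [PHONE_MAP.get((lang, p), p) for p in phones]
            for word, phones in lexicon.items()}

# Acoustic models (and any phonotactic LMs used during decoding) are then
# retrained on transcripts rewritten with the mapped dictionaries.
```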
0:07:37 | And these are the language-specific sets that we combined and mapped in this fashion. We took American English data comprising both native and nonnative speakers; because we know that in much of what we do we hear both native and nonnative speakers, not in their natural frequencies but with more of a nonnative focus, we actually weighted them so that they contribute roughly equal amounts of data. |
0:08:05 | And then Spanish and Egyptian Arabic. Note that we use Egyptian Arabic here because that happened to be a dataset that we have phonetically transcribed, so we can actually perform this phone mapping in a pretty straightforward way. Also, the two datasets with very little data, Spanish and Egyptian Arabic, are weighted more heavily to make things more balanced in terms of the overall model. |
0:08:33 | Then, this might be a detail known to everybody, but since we're new to this I thought I'd point it out: when we do the log-likelihood-ratio scoring, we actually do not use all the languages in the denominator, but only the languages that are not the target language, and that gives you a slightly better result. |
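(A minimal sketch of the scoring detail just described: the denominator of the log-likelihood ratio pools only the non-target languages. The dict interface and the uniform averaging over non-target languages are assumptions.)

```python
import numpy as np
from scipy.special import logsumexp

def detection_llr(loglik, target):
    """loglik: dict mapping language -> log p(x | language)."""
    others = [ll for lang, ll in loglik.items() if lang != target]
    # Denominator: mean of the non-target likelihoods in the probability
    # domain, deliberately excluding the target language itself.
    return loglik[target] - (logsumexp(others) - np.log(len(others)))
```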
0:08:57 | Okay, so here are the results using the PRLM approach. We have the three individual PRLM systems, based on the American English, Levantine Arabic, and Spanish recognizers. |
0:09:13 | The American English one, because it has the most training data, gives you the best individual result. It might also do well because American English has a relatively high number of distinct phones, which gives you a lot of resolution in your decoding. |
0:09:30 | And then, when you do the standard PPRLM with first two recognizers and then three recognizers, you get progressive improvements, overall about thirty-four percent relative from the single best (the American English) to the three-way PPRLM. |
0:09:50 | And then the single decoder that uses only the multilingual phone set gives you 3.01, which is very close to the combined result, and of course vastly simpler and faster. |
0:10:03 | And if you combine all of these, so you have a four-way PPRLM now, you get another nice improvement: you go from the previous result with the three language-specific systems to a four-way PPRLM, with a pretty significant twenty-four percent additional reduction. |
0:10:22 | Usually, when you add more and more of these language-specific systems, the improvements kind of peter out, as you might expect; but if you add the multilingual system, you get another big gain. |
0:10:35 | Just some details: again, this might all be common knowledge, but we did find that there was actually no gain from 4-grams; trigrams were the best in terms of overall accuracy. So somehow the 4-grams are too sparse, or the models are not adequate to capture the information in 4-grams. |
0:10:57 | And it's actually good to do a fairly suboptimal smoothing, in terms of language model performance: the simple Witten-Bell smoothing works best, better than doing fancier things. |
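(A sketch of Witten-Bell-interpolated trigram probabilities, the simple smoothing just recommended. The class interface is illustrative, and `lower_order_prob` stands in for an analogous bigram estimate; the counts could be the fractional lattice counts sketched earlier.)

```python
from collections import defaultdict

class WittenBellTrigram:
    """Witten-Bell interpolated trigram over phone tokens."""
    def __init__(self, trigram_counts):
        self.c3 = trigram_counts          # (w1, w2, w3) -> count
        self.c2 = defaultdict(float)      # (w1, w2) -> total count
        self.types = defaultdict(set)     # (w1, w2) -> distinct successors
        for (w1, w2, w3), c in trigram_counts.items():
            self.c2[(w1, w2)] += c
            self.types[(w1, w2)].add(w3)

    def prob(self, w1, w2, w3, lower_order_prob):
        h = self.c2[(w1, w2)]
        t = len(self.types[(w1, w2)])
        if h == 0:
            return lower_order_prob       # unseen history: back off entirely
        lam = h / (h + t)                 # Witten-Bell interpolation weight
        ml = self.c3.get((w1, w2, w3), 0.0) / h
        return lam * ml + (1.0 - lam) * lower_order_prob
```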
0:11:12 | Okay. So now we add something which, to my understanding, is very easily done in a system like BUT's: we augment the standard cepstral front end with MLP features, that is, multilayer perceptron neural-network features, which work very well when we do word recognition and other tasks. |
0:11:36 | And we had also shown that a front end trained on, say, English to perform English phone discrimination actually generalizes to other languages: you can train the neural net to discriminate English phones at the frame level, then use that trained front end to train, say, a Mandarin recognizer, and you will see a nice gain. |
0:11:58 | So this gives us confidence that the front end, although it is trained on only one language, will actually generalize to other languages, which is exactly what we want for the language recognition task. |
0:12:09 | And indeed we find that, across the board for all languages, we get a small but consistent improvement in recognition accuracy at the phone level. |
0:12:22 | And now we're going to throw this at the multilingual PRLM: we augment the multilingual PRLM with this MLP feature front end, and you see an improvement here from 3.01 to 2.81. |
0:12:45 | And if you do the fusion with the other language-specific PRLM systems, you get an improvement from 2.09 to just under two. So there is a nice improvement from adding the MLP front end, as others have seen, but we wanted to verify it for this framework with the multilingual phone set. |
0:13:07 | Okay, so now we're going to try something different. Another thing that we have used with some success in speaker identification, of course, is the MLLR transforms. So why shouldn't we be able to do this for language recognition? |
0:13:19 | The idea (you've seen it; there was a talk about it earlier at the workshop) is that you have a language-independent set of phone models, and you use MLLR adaptation: you estimate a transform to move certain phone classes from their language-independent locations, or speaker-independent, or whatever the independence is that you care about, to a location that is specific to a subset of your data, such as a language or a speaker. |
0:13:47 | Then you use the transform coefficients as features, and you model them with an SVM. In our case we have eight phone classes; each feature vector has thirty-nine components, and with the affine transform, a thirty-nine by forty matrix per class, we get about twelve and a half thousand raw features. |
0:14:09 | We perform rank normalization, as we do in our speaker-ID systems, and that's our feature vector. Then we do support vector machine training with linear kernels; the hyperplane is really the model, the model for the language in this case, and the LID scores are the distances of your test sample from the hyperplane. |
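(A minimal sketch of the MLLR-SVM pipeline just described: stack the eight per-class 39x40 transforms into one roughly 12,480-dimensional supervector, rank-normalize, and train one linear SVM per language. Function names and the SVM hyperparameter are illustrative.)

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.svm import LinearSVC

def mllr_supervector(transforms):
    """transforms: eight per-phone-class affine MLLR transforms, each a
    39x40 matrix, flattened and stacked: 8 * 39 * 40 = 12,480 features."""
    return np.concatenate([np.asarray(t).ravel() for t in transforms])

def rank_normalize(X):
    """Replace each feature by its rank within the training set, scaled
    into (0, 1); test vectors would be mapped onto these training ranks."""
    return np.apply_along_axis(
        lambda col: rankdata(col) / (len(col) + 1), 0, X)

def train_language_svm(X_train, is_target):
    # One linear SVM per target language; the signed distance of a test
    # supervector from the hyperplane serves as the language score.
    return LinearSVC(C=1.0).fit(X_train, is_target)
```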
0:14:31 | Here are the results. This is a very crude system, but bear with me. We tried this first with an English MLLR reference model: we used female English speakers only in our reference model, and we get some initial result. |
0:14:49 | We can play this game where we actually combine male and female transforms, and we get a better result, consistent with what we see in our speaker-ID work. |
0:14:58 | But when we use a single gender-independent multilingual MLLR reference model, we do much better. So this just goes to show, first, that it works in principle, and secondly, again, that the multilingual phone models work better than the language-specific ones. |
0:15:17 | Now we want to get this result down to be more competitive with our standard cepstral GMM. First of all, we can use a little trick: the training conversations are actually pretty long compared to the test segments, so we can split our training conversations into thirty-second segments and get many more data points for the SVM training. |
0:15:42 | We can also optimize the number of Gaussians in our reference models to be smaller; that forces the MLLR to do more adaptation work in the transform, as opposed to just using different regions of your Gaussian mixture. |
0:15:55 | And finally, we can add NAP to try to project out within-language variability. |
0:16:03 | So that's all done incrementally, and you see that the average equal error rate goes down from about seven to just below four. So we're not quite there yet as far as the baseline of the cepstral GMM goes, but it's much more competitive. |
0:16:26 | Okay, now another incremental improvement: we augment the PLP front end of the MLLR system with the twenty-five MLP features. |
0:16:37 | So the number of features goes from thirty-nine times forty per class to that plus another block-diagonal component that accounts for adapting the MLP features, which is twenty-five by twenty-six. Overall, the feature dimension increases from about twelve and a half thousand to just under eighteen thousand. |
0:17:00 | And the performance improves as well: a thirteen percent relative reduction. |
0:17:12 | Okay, so now I want to go back to phonotactic modelling. As we've seen, hardly anybody uses language models anymore; people use SVMs for the phonotactic modelling. So we wanted to do the same, and see if what we saw before still works. |
0:17:33 | As we had also found many years ago in speaker ID, SVM models applied to phone n-gram features work better than language models. |
0:17:47 | So here we want to apply this to the multilingual phone recognizer output; we use the TFLLR kernel from Campbell, and we do not perform any rank normalization. |
0:18:01 | Again, we play this game where we split our training conversation sides into segments that match the length of the test data, and that gives us more training samples for the SVM. |
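(A sketch of TFLLR-style n-gram weighting for the SVM, following Campbell's kernel as named above: relative n-gram frequencies scaled by the inverse square root of their background probabilities, so rare n-grams are not drowned out. The input structures are assumptions.)

```python
import numpy as np

def tfllr_vector(ngram_freqs, background_probs, vocab):
    """ngram_freqs: dict ngram -> relative frequency in this segment.
    background_probs: dict ngram -> probability over all training data.
    vocab: fixed ordering of the n-grams used as SVM dimensions."""
    return np.array([
        ngram_freqs.get(g, 0.0) / np.sqrt(background_probs[g])
        for g in vocab
    ])
```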
0:18:16 | So this was our baseline using a language model over the phone n-grams; that was the result from before. When we do an SVM with the same feature space, trigrams, we do slightly worse. |
0:18:29 | But whereas previously we did not get a gain with 4-grams, we now do get a gain with 4-grams; with the additional features, the result actually gets better than the language model. |
0:18:42 | Finally, we can fuse the two phonotactic systems, the LM-based and the SVM-based, and we get another gain. |
0:18:54 | So apparently the SVM is the better tool when it comes to modelling very sparse features, and that's why we see a gain from going from trigrams to 4-grams. |
0:19:05 | We also tried using NAP here, but got no gain from that, replicating something we had tried in speaker ID, where it also didn't work. However, we haven't tried the dimensionality reduction techniques proposed in the previous talk, so that's certainly something to do. |
0:19:22 | Okay, and just as the grand finale, we put everything together. This is our single best system, the multilingual phonotactic SVM, with the result you saw over there. And this is our other baseline, the cepstral GMM. |
0:19:40 | Then we can incrementally add the phonotactic or phone-based systems; all the combinations start with the cepstral GMM. |
0:19:53 | First of all, we see that doing MLLR-type modelling on cepstral features does combine with the cepstral GMM. |
0:20:03 | The multilingual PRLM system is the best thing to combine with the baseline; we see a whopping reduction there. |
0:20:14 | And then, adding all the others on top of these two, you go down by another twenty percent relative. So we essentially halved the error rate from the 2.87, which looks like a pretty nice reduction. |
0:20:33 | Well, I really told you the highlights only: the fact that the two different kinds of phonotactic modelling actually combine; the fact that the two types of cepstral modelling combine; and, interestingly, that once you have the multilingual PPRLM, adding multiple language-specific phonotactic models does not help anymore. With all these other things, it's no longer useful to actually have the language-specific phone recognizers. |
0:21:05 | Okay, just a quick rundown of some of our future directions. Obviously, we want to verify these results with more recent LRE datasets; in particular, we want to try the language-pair and dialect tasks. |
0:21:21 | And the SVM approach, as already seen in some of the previous talks, can be pursued in parallel with multiple language-specific phone sets. |
0:21:31 | But more interestingly, I think we should retrain the MLP features to actually be well matched to the multilingual phone set that we're using at the back end; that should buy an additional improvement. |
0:21:46 | We could also, though it's not very interesting, use MLP features for all the language-specific phone recognizers, but we might not really pursue that, because we're trying to get rid of the language-specific phone recognizers. |
0:21:58 | And we can of course then go to more high-level features that we have tried and that worked well in speaker ID, such as prosodic features and constrained cepstral features. |
0:22:09 | Okay, so here are the take-home messages. We tried various phone-based systems for language ID, using techniques that we had previously seen work well in ASR and also in speaker ID. |
0:22:25 | For the first time, to our knowledge, we tried using MLLR-SVM modelling for the language recognition task. |
0:22:32 | The biggest takeaway is that the multilingual phone model approach works better, and is simpler, than using a combination of language-dependent decoders in parallel; and it still gives you some gains if you combine it with language-specific phone recognition. |
0:22:54 | The MLP front end can improve things here too: what others found, namely that MLP front ends give you a gain in word recognition, carries over to the two techniques that we explored here. |
0:23:06 | And the MLLR and the cepstral GMM approaches for cepstral modelling also combine quite well. Well, the rest I've said already, so that's it. |
0:23:24 | Any questions? |
0:23:30 | Q: Thank you very much for a nice talk. At the beginning you said that the multilingual phone recognizer works approximately the same as the language-dependent ones. Do you have numbers on that? |
0:23:43 | A: Not here; I have them somewhere at home, but I didn't think they were really relevant, because when we measure phone recognition accuracy we usually apply a phonotactic model, whereas for language ID purposes we throw away the phonotactic model, because we want to be very sensitive to the particulars of the language. |
0:24:12 | Q: Maybe this was at the very beginning of the talk, but could you give more details about these discriminative MLP features that you're feeding into the phone recognizer? Do you take the posteriors and do some postprocessing? |
0:24:26 | A: Yes, they are actually quite complex. By the way, we didn't train anything in particular for this language task; we just used something that we had used in word recognition. In fact, it was basically optimized for word recognition on conversational English telephone speech. |
0:24:49 | So we take PLP features over a nine-frame window and then perform the usual kind of MLP training with those input features. We also use the HATS features, which are kind of a derivative of the TRAP features going back to Hermansky's work; those capture more long-term critical-band energies. |
0:25:17 | And then we combine the posteriors from these two MLPs into a single set of posterior vectors, and then we reduce it to twenty-five dimensions. |
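(A sketch of the tandem pipeline described in this answer: combine frame-level posteriors from the PLP-window MLP and the HATS MLP, take logs, and project to 25 dimensions. The answer is cut off before naming the projection, so PCA is an assumption here, as is the combination weight.)

```python
import numpy as np
from sklearn.decomposition import PCA

def tandem_features(post_plp, post_hats, weight=0.5, dim=25):
    """post_*: (n_frames, n_phones) frame-level posterior matrices from
    the two MLPs; returns (n_frames, dim) tandem feature vectors."""
    combined = weight * post_plp + (1 - weight) * post_hats
    # Logs make the skewed posterior distributions more Gaussian-like
    # before the linear projection.
    logpost = np.log(np.clip(combined, 1e-10, 1.0))
    return PCA(n_components=dim).fit_transform(logpost)
```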
0:25:32 | Any other questions? |
0:25:38 | Q: Following on from that: in the MLP setup, what are you training it to predict? What are the output classes? |
0:25:53 | A: It's an English phone set, so it has on the order of forty-five categories, and the net is performing frame-level classification: you're trying to predict the English phone at each frame, regardless of the language. So, as I said, we did not train language-specific or even multilingual MLPs; we were just using the English-specific MLP that we had. |
0:26:18 | Q: (inaudible follow-up, apparently about retraining the MLP) |
0:26:23 | A: That's what I put in the future work as one of the obvious improvements: you could actually generalize the phone set to cover all languages, and then retrain the MLP. |
0:26:40 | Q: (inaudible, apparently about how the universal phone-set mapping was derived) |
0:26:53 | A: It's a mapping designed by a phonetician. |
0:27:01 | We plan to do that as well, but we haven't yet. |
0:27:10 | Q: I see. |
0:27:13 | A: Right. |