0:00:15 | i'll also be presenting the work of a whole team at sri, yun lei in particular, with a |
---|
0:00:19 | little from me on this one |
---|
0:00:22 | and this is looking at applying convolutional deep neural networks to language id in noisy conditions |
---|
0:00:26 | in particular the conditions of the darpa rats program |
---|
0:00:31 | so i'll start with a bit of background on why we might want to do this, and |
---|
0:00:35 | our main |
---|
0:00:36 | motivation to use the dnn i-vector framework that we recently proposed for speaker id |
---|
0:00:41 | then, to handle the noisy conditions, why we started looking at convolutional neural |
---|
0:00:46 | networks for that purpose |
---|
0:00:47 | and then we also present a simpler system called the cnn senone posterior system |
---|
0:00:52 | for language id, and show results on that |
---|
0:00:54 | then the experimental setup, and i'll walk through some results |
---|
0:00:58 | so first a bit of background on language id. the ubm i-vector framework is |
---|
0:01:02 | pretty widely used in language id, and phone recognizers are also a good option |
---|
0:01:07 | when you use these two together, that's when you get a really nice |
---|
0:01:12 | improvement; they're quite complementary. so in our books one of the challenges has always |
---|
0:01:18 | been how do we get phonetic information, the way someone pronounces something, into a single |
---|
0:01:23 | system that outperforms the individual systems. that's what we call the challenge |
---|
0:01:28 | so we want one phonetically aware system that can produce scores that outperform fused |
---|
0:01:33 | scores |
---|
0:01:34 | so we recently solved this for speaker id |
---|
0:01:37 | at least in the telephone case |
---|
0:01:40 | so |
---|
0:01:41 | here's just a bit of background on the dnn i-vector framework |
---|
0:01:44 | so what we're doing here is combining a deep neural network that's trained for automatic speech |
---|
0:01:49 | recognition with the popular i-vector model |
---|
0:01:52 | the way we use it is to generate our first-order stats and zeroth-order stats |
---|
0:01:56 | in particular, we're using the dnn in place of the ubm |
---|
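As a rough illustration of the step just described, the following is a minimal sketch, not SRI's implementation, of collecting the zeroth- and first-order statistics when DNN senone posteriors take the place of the UBM's component posteriors. The array shapes and the random toy data are assumptions for illustration only.

```python
import numpy as np

def collect_stats(posteriors, features):
    """posteriors: (T, K) senone posteriors per frame (each row sums to 1).
    features:   (T, D) acoustic features used for the i-vector model.
    Returns zeroth-order stats N of shape (K,) and first-order stats F of shape (K, D)."""
    N = posteriors.sum(axis=0)          # soft frame count per senone
    F = posteriors.T @ features         # posterior-weighted feature sums
    return N, F

# toy example: 300 frames, 3000 senones, 40-dimensional features
rng = np.random.default_rng(0)
post = rng.random((300, 3000))
post /= post.sum(axis=1, keepdims=True)   # normalize rows into valid posteriors
feats = rng.standard_normal((300, 40))
N, F = collect_stats(post, feats)
print(N.shape, F.shape)                   # (3000,) (3000, 40)
```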
0:02:01 | and what we're doing here, if you look at the comparison down the bottom here |
---|
0:02:04 | between the ubm and the dnn: the ubm is trained in an unsupervised manner |
---|
0:02:08 | it's trying to |
---|
0:02:10 | represent classes with gaussians |
---|
0:02:12 | and it's assumed generally |
---|
0:02:14 | to map to different phonetic classes. however, if someone pronounces a phone one way and someone else |
---|
0:02:20 | pronounces it |
---|
0:02:20 | phonetically completely differently |
---|
0:02:22 | the ubm is going to model that in different components |
---|
0:02:26 | so what the dnn does |
---|
0:02:27 | is, it's trained in a supervised manner |
---|
0:02:30 | that means it's trying to map those same phones to what we call senones |
---|
0:02:34 | that is |
---|
0:02:35 | a tied triphone state. so two different people pronouncing in different ways the phone |
---|
0:02:42 | "a" |
---|
0:02:43 | would be activating the same senone |
---|
0:02:46 | and that |
---|
0:02:47 | hopefully should capture the different speaker traits |
---|
0:02:52 | so how powerful is it for speaker id |
---|
0:02:54 | it's very powerful for speaker id. in the initial publication at icassp this |
---|
0:02:59 | year |
---|
0:03:01 | we got thirty percent relative improvement on telephone conditions, particularly c two and c five |
---|
0:03:06 | of nist sre twelve |
---|
0:03:09 | what i'm showing on this |
---|
0:03:11 | slide here is actually three different systems: the sri sre twelve submission, which is |
---|
0:03:17 | the fusion of six different features plus side information, sort of a conglomeration |
---|
0:03:23 | and then we show some recent work that replaces the mfccs and deltas and |
---|
0:03:27 | double deltas with what we're calling the pca dct features |
---|
0:03:30 | we have a publication on that at |
---|
0:03:33 | interspeech, i mean icassp, next year |
---|
0:03:36 | just to give you a reference, that gives about twenty percent relative improvement over mfccs on |
---|
0:03:41 | all conditions of sre twelve |
---|
0:03:44 | but what's really notable is that the dnn i-vector can still bring twenty percent improvement |
---|
0:03:48 | on these two conditions c two and c five |
---|
0:03:51 | so it's very powerful. there's still work to be done on microphone trials; there's mismatch |
---|
0:03:55 | happening there. we have made progress on that and in fact should be able to publish |
---|
0:03:59 | on that very soon |
---|
0:04:01 | so what i want to conclude here is that we've now got a single system |
---|
0:04:05 | that beats the sre twelve submission |
---|
0:04:08 | so how |
---|
0:04:09 | can this be useful for language id |
---|
0:04:11 | that's the question that got us here |
---|
0:04:13 | so the context here is that the output of the dnn should include language-related information |
---|
0:04:19 | and ideally be more robust to speaker and noise variations |
---|
0:04:22 | the reason i say that is, when you're training the dnn for |
---|
0:04:26 | asr, you want to remove speaker variability |
---|
0:04:32 | but is it suitable for channel-degraded language id |
---|
0:04:36 | in the rats program we actually, and i think it was ibm or |
---|
0:04:39 | bbn |
---|
0:04:41 | found the cnn was particularly good for the rats noisy conditions. we validated that |
---|
0:04:46 | on our keyword spotting trials; you can see the dramatic difference it makes on the channel |
---|
0:04:50 | degraded speech. so we said let's use the cnn here rather |
---|
0:04:54 | than the dnn |
---|
0:04:56 | and so that's something that's still open, and we got a few review comments on |
---|
0:04:59 | that actually: we need to validate the difference between the two to show the actual improvement |
---|
0:05:04 | in lid performance |
---|
0:05:06 | so we'll do that in future work |
---|
0:05:08 | so moving along to the cnn |
---|
0:05:11 | the training is essentially the same as training the dnn |
---|
0:05:15 | you can see we've got acoustic features that go into an hmm-gmm |
---|
0:05:18 | that provides alignments for the dnn training. once you've trained the |
---|
0:05:21 | dnn you no longer need the alignments, so you don't need to generate those for test |
---|
0:05:25 | files coming in for inference |
---|
0:05:27 | and we've got acoustic features, the forty dimensional log mel filterbank energies, that are used |
---|
0:05:31 | for training the neural net |
---|
0:05:34 | here, in our work, we're stacking fifteen frames together as the input for |
---|
0:05:38 | the training |
---|
0:05:39 | and we use a decision tree to define the senones |
---|
0:05:43 | and as i said, we generate training alignments with the pre-trained |
---|
0:05:47 | hmm-gmm, which we don't need afterwards |
---|
0:05:51 | just as an illustration |
---|
0:05:53 | the cnn |
---|
0:05:54 | basically, in front of the dnn, prepends this layer, all this process here, where |
---|
0:06:00 | you've got your |
---|
0:06:03 | filterbank energies within a fifteen frame context |
---|
0:06:05 | you |
---|
0:06:07 | pass those through a convolutional filter |
---|
0:06:09 | i think we're using |
---|
0:06:11 | a size of eight, which is in the paper |
---|
0:06:14 | and then what we're doing is the max pooling option of the cnn |
---|
0:06:17 | that means that, of each three |
---|
0:06:19 | for each three blocks that come out, we take the maximum one |
---|
0:06:22 | and that just helps with the noise robustness |
---|
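A hedged sketch of the convolution and max-pooling step described above, not the actual network configuration: a single convolutional filter slides along the mel bands of the stacked input, and each group of three consecutive outputs is reduced to its maximum. The filter size of eight and the pooling size of three follow the description above and should be treated as illustrative.

```python
import numpy as np

def conv_maxpool(filterbanks, filt, pool=3):
    """filterbanks: (context, n_bands) stacked log mel energies.
    filt: (context, filt_size) one convolutional filter sliding over frequency."""
    n_bands = filterbanks.shape[1]
    filt_size = filt.shape[1]
    conv = np.array([np.sum(filterbanks[:, i:i + filt_size] * filt)
                     for i in range(n_bands - filt_size + 1)])
    usable = (len(conv) // pool) * pool
    pooled = conv[:usable].reshape(-1, pool).max(axis=1)  # keep the max of each group of three
    return pooled

x = np.random.randn(15, 40)   # 15-frame context, 40 mel bands
w = np.random.randn(15, 8)    # one filter spanning 8 bands
print(conv_maxpool(x, w, pool=3).shape)   # (11,) pooled activations for this filter
```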
0:06:27 | so how does the cnn i-vector system go with this |
---|
0:06:30 | we can see that we simply plug in the cnn instead of the ubm |
---|
0:06:34 | quite straightforward |
---|
0:06:35 | what's interesting here is that we've got two different acoustic features. the first is used for |
---|
0:06:40 | the cnn to get the posteriors for each of the senones |
---|
0:06:43 | and then we're multiplying those posteriors with the |
---|
0:06:46 | acoustic features for language id, that is, the features used to discriminate languages |
---|
0:06:51 | that is the second set of features. so the nice benefit of keeping the two apart |
---|
0:06:55 | is that the posteriors you've got extracted from one set of features; if you choose to, you |
---|
0:06:58 | can use the same set for both |
---|
0:07:00 | but if you want to use multiple features, as in a fusion system |
---|
0:07:04 | you only need to extract posteriors using that one set of features |
---|
0:07:07 | this is in contrast to wanting multiple features for fusion with the ubm systems |
---|
0:07:11 | where you've got to extract, for instance, five different sets of posteriors, one per feature, if you |
---|
0:07:16 | had a five way fusion |
---|
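To make that point concrete, an illustrative sketch under the same toy assumptions as before: one set of CNN senone posteriors is reused to accumulate statistics over several different LID feature streams, whereas a UBM-based fusion would need posteriors recomputed for every feature set. The feature names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
post = rng.random((300, 3000))
post /= post.sum(axis=1, keepdims=True)        # one set of cnn senone posteriors

feature_streams = {                             # hypothetical lid feature sets
    "mfcc": rng.standard_normal((300, 40)),
    "pca_dct": rng.standard_normal((300, 60)),
}
# the same posteriors are reused for every stream; only the features change
stats = {name: (post.sum(axis=0), post.T @ feats)
         for name, feats in feature_streams.items()}
print({name: F.shape for name, (N, F) in stats.items()})
```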
0:07:18 | another aspect is you're able to tailor the language id features independently |
---|
0:07:22 | of those that are providing the posteriors |
---|
0:07:26 | whereas currently with the ubm systems it's a bit of a balancing act: you want |
---|
0:07:29 | stable posteriors, but you also want good discriminability on the other side of |
---|
0:07:35 | it |
---|
0:07:37 | in the statistics |
---|
0:07:40 | so can we go easier, with a simpler alternative system? here's a simple system |
---|
0:07:44 | in which we take mostly the cnn and we get the frame posteriors |
---|
0:07:48 | we forget about first order statistics |
---|
0:07:51 | what we're doing here is normalizing the zeroth order statistics in the log domain |
---|
0:07:55 | and then we just use a simple backend. for instance, here we're using |
---|
0:07:58 | a neural network; you can use a gaussian backend as well. one thing that distinguishes it from |
---|
0:08:02 | a phonotactic system |
---|
0:08:04 | is that you can use standard language id backends, which is nice |
---|
0:08:08 | so here we're using counts, but of tied context dependent states, the senones |
---|
0:08:14 | and that's at the state level instead of phone labels |
---|
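Here is a minimal sketch of the senone posterior system as described above: average the frame-level senone posteriors over an utterance, drop the silence senones, move to the log domain and normalize, reduce the dimensionality, and classify with a small neural-network backend. scikit-learn's PCA and MLPClassifier stand in for the probabilistic PCA and MLP mentioned in the talk, and the toy sizes are far smaller than the real 3000-senone, 400-dimension setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def utterance_vector(frame_posteriors, silence_idx):
    """frame_posteriors: (T, K). Returns one log-domain vector for the utterance."""
    avg = frame_posteriors.mean(axis=0)      # average (zeroth-order) senone posteriors
    avg = np.delete(avg, silence_idx)        # remove the silence senone indexes
    logp = np.log(avg + 1e-10)               # move to the log domain
    return logp - logp.mean()                # simple log-domain normalization

rng = np.random.default_rng(2)
K, n_utt, silence = 300, 100, [0, 1, 2]      # toy sizes; three assumed silence senones
X = np.stack([utterance_vector(rng.dirichlet(np.ones(K), size=50), silence)
              for _ in range(n_utt)])
y = rng.integers(0, 5, size=n_utt)           # five target languages

X_red = PCA(n_components=40).fit_transform(X)   # stand-in for probabilistic pca
backend = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X_red, y)
print(backend.predict(X_red[:5]))
```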
0:08:18 | so let's look at the experimental setup |
---|
0:08:21 | the darpa rats program: i'm sure many of you have heard how noisy these samples are |
---|
0:08:26 | i think john was talking about them earlier anyway. there are five target languages |
---|
0:08:29 | and ten non-target |
---|
0:08:31 | you can see those on the screen |
---|
0:08:33 | there's a fair bit of channel degradation: seven different channels, snrs between zero and thirty |
---|
0:08:39 | the transcriptions that we used to train the cnn are from the keyword spotting task and |
---|
0:08:43 | cover only two languages. that's an unusual aspect: we're trying to distinguish five |
---|
0:08:47 | languages but we're training the cnn on two of them |
---|
0:08:51 | test durations are three, ten, thirty, and one twenty seconds, and the metric we use |
---|
0:08:55 | here is the average equal error rate across the target languages |
---|
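For reference, a simple sketch of that metric: a per-language equal error rate, averaged over the target languages. The threshold sweep below is a plain approximation, not any particular evaluation toolkit's implementation, and the language names are illustrative.

```python
import numpy as np

def eer(scores, labels):
    """scores: detection scores; labels: 1 for target-language trials, 0 otherwise."""
    best = 1.0
    for t in np.unique(scores):
        fa = np.mean(scores[labels == 0] >= t)    # false alarm rate at threshold t
        miss = np.mean(scores[labels == 1] < t)   # miss rate at threshold t
        best = min(best, max(fa, miss))           # approximate the fa == miss crossing
    return best

def average_eer(score_matrix, true_lang, target_langs):
    """score_matrix: (n_trials, n_langs); true_lang: true language of each trial."""
    return float(np.mean([eer(score_matrix[:, i], (true_lang == lang).astype(int))
                          for i, lang in enumerate(target_langs)]))

# toy usage with random scores over five target languages
rng = np.random.default_rng(3)
langs = np.array(["lang1", "lang2", "lang3", "lang4", "lang5"])
truth = rng.choice(langs, size=500)
scores = rng.standard_normal((500, 5))
print(average_eer(scores, truth, langs))
```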
0:08:59 | in terms of the model used to generate the training alignments for the |
---|
0:09:03 | dnn |
---|
0:09:04 | the hmm-gmm setup here was producing around three thousand senones |
---|
0:09:10 | with two hundred thousand gaussians |
---|
0:09:13 | and this was multilingual training on both levantine arabic and farsi |
---|
0:09:17 | the cnn model was also trained the same way with the multilingual training set |
---|
0:09:22 | we've got hidden layers with twelve hundred |
---|
0:09:25 | nodes each, and we've got forty filterbanks and fifteen stacked frames |
---|
0:09:29 | you can see the pooling size and the filter size there |
---|
0:09:33 | for the cnn convolutional part |
---|
0:09:37 | for the ubm model, for comparison |
---|
0:09:39 | we're training a two thousand and forty eight component ubm |
---|
0:09:42 | and the features were directly optimized for the sid task, the speaker id task, but they tended to |
---|
0:09:47 | carry over well to language id. they're actually forty dimensional two d |
---|
0:09:51 | dct log mel spectral features, and this is similar to the zigzag dct |
---|
0:09:58 | work that we proposed at icassp; the pca dct i showed for |
---|
0:10:02 | speaker id earlier, that's an extension that further improves that |
---|
0:10:08 | what about the i-vectors and the backend. so both the cnn and the ubm |
---|
0:10:13 | i-vectors are trained on the same data for the i-vector subspace |
---|
0:10:17 | and they're both four hundred dimensional |
---|
0:10:18 | for the posterior system we're collecting the three thousand average posteriors, removing the silence indexes |
---|
0:10:25 | three of those, and reducing to four hundred dimensions, the same as the i-vector subspace |
---|
0:10:31 | using probabilistic pca |
---|
0:10:33 | for the backend we're training a simple neural network, an mlp |
---|
0:10:36 | with cross entropy |
---|
0:10:38 | what we do with the data to enlarge our training dataset is to chunk |
---|
0:10:42 | the data into thirty second chunks and eight second chunks with fifty percent overlap. i |
---|
0:10:46 | think we end up with around two million |
---|
0:10:49 | i-vectors to train on |
---|
0:10:51 | the output is the five target languages and the one out-of-set class as well |
---|
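A small sketch of the chunking step just described: each training utterance is cut into fixed-length chunks with fifty percent overlap before i-vector extraction. The thirty-second length comes from the talk; the frame rate and the two-minute toy utterance are assumptions.

```python
import numpy as np

def chunk_indices(n_frames, chunk_frames, overlap=0.5):
    """Start and end frame indices of fixed-length chunks with the given overlap."""
    step = max(1, int(chunk_frames * (1 - overlap)))
    return [(s, s + chunk_frames)
            for s in range(0, n_frames - chunk_frames + 1, step)]

frames_per_sec = 100                                  # assumed 10 ms frame shift
utt = np.random.randn(120 * frames_per_sec, 40)       # one two-minute toy utterance
chunks = [utt[s:e] for s, e in chunk_indices(len(utt), 30 * frames_per_sec)]
print(len(chunks), chunks[0].shape)                   # 7 overlapping 30-second chunks
```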
0:10:55 | so how was performance |
---|
0:10:58 | well, first of all, the ubm i-vector approach: the ubm isn't being trained in a |
---|
0:11:02 | supervised way, so what we said was, well, let's take the senones from the |
---|
0:11:06 | cnn, where we know we've got three thousand senones |
---|
0:11:09 | and let's align the frames for each of those senones and train each of |
---|
0:11:13 | the ubm components with that |
---|
0:11:14 | so the idea here was to try to give a fair comparison between the ubm and cnn |
---|
0:11:18 | systems |
---|
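As a rough sketch of that "supervised UBM" comparison point, under illustrative assumptions only: each frame is assigned to its most likely senone using the CNN posteriors, and one diagonal-covariance Gaussian component is then estimated from the frames assigned to each senone.

```python
import numpy as np

def supervised_ubm(features, posteriors):
    """features: (T, D) frames; posteriors: (T, K) cnn senone posteriors."""
    assign = posteriors.argmax(axis=1)               # hard senone alignment per frame
    K, D = posteriors.shape[1], features.shape[1]
    weights, means, variances = np.zeros(K), np.zeros((K, D)), np.ones((K, D))
    for k in range(K):
        frames = features[assign == k]
        if len(frames) == 0:
            continue                                 # unseen senone: keep defaults
        weights[k] = len(frames) / len(features)
        means[k] = frames.mean(axis=0)
        if len(frames) > 1:
            variances[k] = frames.var(axis=0) + 1e-3 # small variance floor
    return weights, means, variances

rng = np.random.default_rng(4)
w, m, v = supervised_ubm(rng.standard_normal((5000, 40)),
                         rng.dirichlet(np.ones(50), size=5000))
print(w.sum(), m.shape)   # weights sum to 1.0; one mean per senone
```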
0:11:20 | we got a nice improvement across all of those with |
---|
0:11:24 | the cnn approaches |
---|
0:11:25 | and it's nice to see that for ten seconds or more |
---|
0:11:28 | we're getting a thirty percent or more relative improvement over the ubm approach. for the three |
---|
0:11:34 | second |
---|
0:11:35 | timeframe testing |
---|
0:11:36 | it's a twenty percent relative improvement |
---|
0:11:39 | what was interesting is that between the posterior system and the i-vector system for the cnn |
---|
0:11:43 | lid performance is actually quite similar |
---|
0:11:48 | but if we fuse those two |
---|
0:11:49 | we're getting a nice gain, again a twenty percent relative improvement |
---|
0:11:52 | when we just combine the two different cnn approaches |
---|
0:11:56 | particularly for less than one twenty seconds is where we see it; for one twenty it wasn't |
---|
0:12:00 | as pronounced |
---|
0:12:02 | when we then added the ubm i-vector system to that, so a different modeling approach |
---|
0:12:06 | we actually get no |
---|
0:12:08 | gain from the fusion except in the one twenty case |
---|
0:12:10 | which is another interesting point |
---|
0:12:14 | in conclusion, we compared the robustness of the cnn in the noisy conditions |
---|
0:12:18 | in particular taking the dnn i-vector framework and making it effective on the rats language |
---|
0:12:23 | id task |
---|
0:12:25 | we also proposed what you could call a phonotactic-style system, the cnn senone posterior |
---|
0:12:29 | system |
---|
0:12:30 | which is quite a simple system, and showed high complementarity between these two proposed systems |
---|
0:12:36 | in terms of extensions, where do we go from here: we can improve performance |
---|
0:12:39 | a little when not doing probabilistic pca before the backend classification |
---|
0:12:45 | and fusion of different language dependent cnns |
---|
0:12:49 | the scores from those also provide a gain |
---|
0:12:54 | and the bottleneck features, which i think a later talk might be covering |
---|
0:12:58 | are also good alternatives |
---|
0:13:00 | to the direct usage of the dnn or cnn for language id |
---|
0:13:04 | thank you |
---|
0:13:12 | we have time for some questions |
---|
0:13:21 | like what |
---|
0:13:26 | thanks for a nice talk; we're expecting great results from you, right |
---|
0:13:30 | possibly |
---|
0:13:31 | so for the posterior system, the cnn posterior system, you said you use pca to four |
---|
0:13:36 | hundred prior to your neural net, right? yep. did you try also to put the |
---|
0:13:41 | data in directly, the full vector |
---|
0:13:42 | i imagine it is |
---|
0:13:44 | yes, the first point on the extensions slide says if we don't do that |
---|
0:13:48 | we do get a slight improvement |
---|
0:13:50 | i think the motivation for reducing it to four hundred dimensions was for comparability |
---|
0:13:54 | with the i-vector space, to see what we can get in that same four hundred dimensions |
---|
0:14:16 | but |
---|
0:14:18 | so my question is to do with the data that was used to train the asr |
---|
0:14:22 | and the cnn. was it the multiple channels of the arabic and |
---|
0:14:27 | the farsi data, all of it? so you trained in channel |
---|
0:14:31 | not just the clean channel |
---|
0:14:32 | i believe that the ubm |
---|
0:14:35 | the ubm, yes, so we used the channel degraded data for the ubm, the |
---|
0:14:39 | same data, i believe |
---|
0:14:41 | we used the arabic and the farsi data to get the alignments, but the |
---|
0:14:46 | ubm was exposed to all five languages across all channel conditions |
---|
0:14:50 | but that was one thing you said: you take the alignment of the senones and |
---|
0:14:53 | then train the states of the ubm |
---|
0:14:56 | the second one there, i guess, that's what it was used for |
---|
0:15:01 | the supervised ubm gets the alignments with the senone labels coming through the c |
---|
0:15:05 | n n, and that was trained with keyword spotting data, but the ubm itself, i |
---|
0:15:09 | believe, and i'd have to check this, was trained with the lid data |
---|
0:15:13 | which is across five languages |
---|
0:15:15 | so what do you think, how much do you think is the impact of having |
---|
0:15:19 | changed datasets and |
---|
0:15:21 | classifiers? to your first question, i would have to say you would |
---|
0:15:26 | think that having five languages in the ubm |
---|
0:15:30 | plus more, plus the out-of-set languages, would give it an advantage to some degree |
---|
0:15:34 | but as you said, the datasets are changing |
---|
0:15:37 | so i think that would be a good point |
---|
0:15:45 | so, mitch, if you're gonna do a very wide set of languages, do you |
---|
0:15:50 | have a hope of having sort of a single master universal one, like the hungarian traps |
---|
0:15:55 | has been so successful in the past, or do you think you're gonna have to |
---|
0:15:57 | build many different language dnns |
---|
0:16:00 | so what we were seeing so far is basically |
---|
0:16:04 | the more language dependent dnns that you put together, that you fuse together |
---|
0:16:08 | the improvement reduces. so perhaps if you had five that cover a good space |
---|
0:16:15 | of the phones across different languages, that might be what you might call a universal collection, that's |
---|
0:16:20 | appealing |
---|
0:16:27 | right |
---|