I'll also be presenting the work of a whole team at SRI, so you may be familiar with a little of this already. This is looking at applying convolutional deep neural networks to language ID in noisy conditions, in particular the conditions of the RATS program.
I'll start with a bit of background on why we might want to do this in this domain, and the motivation to use the DNN i-vector framework that we recently proposed for speaker ID. Then, to handle the noisy conditions, we start looking at convolutional neural networks for that purpose. We also present a simpler system, called the CNN posterior system, for language ID and show results on that. Then I'll describe the experimental setup and walk through some results.
First, a bit of background on language ID. The UBM i-vector framework is pretty widely used in language ID, and phone recognizers are also a good option. When you use these two together, that's when you get a really nice improvement; they're quite complementary. So in our books, one of the challenges has always been: how do we get phonetic information, the way someone pronounces something, into a single system that outperforms the fusion of individual systems? That's what we call the challenge: we want one phonetically aware system that can produce scores that outperform fused scores. We recently solved this for speaker ID, at least in the telephone case.
Here's a bit of background on the DNN i-vector framework. What we're doing is combining a deep neural network that's trained for automatic speech recognition with the popular i-vector model. We use the network to generate our zeroth-order and first-order statistics; in particular, we use the DNN in place of the UBM. If you look at the comparison down the bottom here: the UBM is trained in an unsupervised manner. It's trying to represent classes with Gaussians, and those are generally assumed to map to different phonetic classes. However, if someone pronounces a given phone one way, while someone else pronounces it in a phonetically completely different way, the UBM is going to model that in different components.
The DNN, on the other hand, is trained in a supervised manner. That means it's trying to map those same phones to what we call senones, that is, tied tri-phone states. So two different people pronouncing a phone in different ways would be activating the same senone, and that should hopefully better capture the differences between speakers.
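As a rough illustration (not the talk's own code), here is a minimal numpy sketch of the idea that the zeroth- and first-order statistics only need per-frame posteriors, so a DNN's senone posteriors can slot in where the UBM component posteriors used to be; the names and shapes are just for the example.

```python
import numpy as np

def accumulate_stats(posteriors, features):
    """Accumulate i-vector sufficient statistics from frame-level posteriors.

    posteriors: (num_frames, num_classes) soft alignments per frame, which can
                come from UBM components or from DNN/CNN senone outputs.
    features:   (num_frames, feat_dim) acoustic features for the utterance.
    """
    zeroth = posteriors.sum(axis=0)        # soft frame counts per class
    first = posteriors.T @ features        # posterior-weighted feature sums
    return zeroth, first

# Toy example: 300 frames, 3000 senones, 40-dimensional features.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3000), size=300)   # rows sum to one, like a softmax
feats = rng.standard_normal((300, 40))
N, F = accumulate_stats(post, feats)            # N: (3000,), F: (3000, 40)
```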
It's very powerful for speaker ID. In the initial publication, at ICASSP this year, we got a thirty percent relative improvement on telephone conditions, particularly C2 and C5 of NIST SRE'12.
What I'm showing on this slide is actually three different systems: the SRI SRE'12 submission, which is a fusion of six different features, side information, a whole conglomeration; and then some recent work using MFCCs with deltas and double deltas, and what we're calling PCA-DCT features. We have a publication on that at Interspeech, I mean ICASSP, next year. Just to give you a reference, PCA-DCT gives about a twenty percent relative improvement over MFCCs on all conditions of SRE'12. But what's really notable is that the DNN i-vector can still bring a twenty percent improvement on these two conditions, C2 and C5. So it's very powerful. There's still work to be done on microphone trials, where there's mismatch happening; we have made progress on that and in fact should be able to publish on it very soon.
What I want to conclude here is that we've now got a single system that beats the SRE'12 submission. So how can this be useful for language ID? That's the question, so let's get there.
The hypothesis is that the output of the DNN should include language-related information and ideally be more robust to speaker and noise variations. The reason I say that is that when you're training the DNN for ASR, you want to remove speaker variability. But is it suitable for channel-degraded language ID? In the RATS program, and I think it was IBM who proposed it, the CNN was particularly good for the RATS noisy conditions. We validated that on keyword spotting trials; you can see the dramatic difference it makes on channel-degraded speech. So we said, let's use the CNN rather than the DNN. That's something that's still open; we got a few review comments on that, actually, that we need to validate the difference between the two to show the actual improvement in LID performance. We'll do that in future work.
Moving along to the CNN: the training process is essentially the same as for the DNN. You can see we've got acoustic features that go into an HMM-GMM, which provides alignments for training the DNN; once you've trained the DNN you no longer need the HMM-GMM, so you don't need to generate those alignments at test time. The acoustic features are forty-dimensional log mel filterbank energies used for training the neural net. In our work we're stacking fifteen frames together as the input for training, and we use a decision tree to define the senones. As I said, we generate the training alignments with the pre-trained HMM-GMM, which we don't need afterwards; it's shown just as an illustration.
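As a rough sketch of the input layout, assuming the forty-dimensional filterbanks and fifteen-frame stacking mentioned above (the helper itself is hypothetical), each training example ends up as a 600-dimensional spliced vector.

```python
import numpy as np

def stack_frames(fbank, context=15):
    """Splice a symmetric window of frames around each centre frame.

    fbank: (num_frames, 40) log mel filterbank energies.
    Returns (num_frames, 40 * context); edges are padded by repeating the
    first and last frame, one common convention.
    """
    half = context // 2
    padded = np.pad(fbank, ((half, half), (0, 0)), mode="edge")
    windows = [padded[i:i + len(fbank)] for i in range(context)]
    return np.concatenate(windows, axis=1)

fbank = np.random.default_rng(5).standard_normal((1000, 40))
nn_input = stack_frames(fbank)                # shape (1000, 600)
```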
The CNN basically, in front of the DNN, appends this layer, all this processing here, where you've got your filterbank energies within the fifteen-frame context. You pass those through a convolutional filter; I think we're using a size of eight, which is in the paper. And then we do max pooling on the output: for each block of three outputs that comes out, we take the maximum one, and that just helps with the noise robustness.
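Here is a minimal sketch of the convolution-plus-max-pooling mechanics along the frequency axis, assuming a single filter of size eight over the forty filterbank channels and non-overlapping pooling over blocks of three; the real CNN learns many such filters and works over the full fifteen-frame context.

```python
import numpy as np

def conv_max_pool(bands, filt, pool=3):
    """Convolve one frame's filterbank vector along frequency, then max-pool.

    bands: (40,) log mel filterbank energies for one frame.
    filt:  (8,) a single convolutional filter spanning eight frequency bands.
    """
    num_positions = len(bands) - len(filt) + 1          # 33 filter positions
    conv = np.array([bands[i:i + len(filt)] @ filt
                     for i in range(num_positions)])
    # Max pooling: keep the largest activation in each block of `pool`
    # consecutive outputs, which gives some robustness to narrow-band noise.
    usable = (len(conv) // pool) * pool
    return conv[:usable].reshape(-1, pool).max(axis=1)

rng = np.random.default_rng(6)
bands = rng.standard_normal(40)
filt = rng.standard_normal(8)
pooled = conv_max_pool(bands, filt)                     # shape (11,)
```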
Now to the CNN i-vector system that goes with this. You can see that we simply plug the CNN in instead of the UBM; it's quite straightforward. What's interesting here is that we've got two different sets of acoustic features. The first is used by the CNN to get the posteriors for each of the senones, and then we multiply those posteriors with the acoustic features for language ID, the ones used to discriminate languages; that is the second set of features. The benefit of keeping the two apart is that the posteriors only need to be extracted once, regardless of which features you accumulate; if you choose to, you can use the same features for both, but if you want to use multiple features in a fusion system, you still only need to extract posteriors using that one set of features. This is in contrast to using multiple features for fusion with UBM systems, where you've got to extract, for instance, five different sets of posteriors, one per feature, if you had a five-way fusion. Another aspect is that you're able to tailor the language ID features independently of those providing the posteriors. Currently, with UBM systems, it's a bit of a balancing act: you want stable posteriors, but you also want good discriminability from the features on the other side, in the statistics.
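To make the fusion point concrete, here is a small sketch (illustrative only, reusing the hypothetical stats helper from before) of pairing one set of CNN senone posteriors with several language ID feature streams; a UBM-based fusion would instead need one posterior extraction per feature stream.

```python
import numpy as np

def accumulate_stats(posteriors, features):
    """Zeroth- and first-order statistics from shared frame posteriors."""
    return posteriors.sum(axis=0), posteriors.T @ features

rng = np.random.default_rng(1)
senone_posteriors = rng.dirichlet(np.ones(3000), size=300)   # one CNN pass

# Several LID feature streams can all reuse the same posteriors.
feature_streams = {
    "stream_a": rng.standard_normal((300, 40)),
    "stream_b": rng.standard_normal((300, 60)),
}
stats = {name: accumulate_stats(senone_posteriors, feats)
         for name, feats in feature_streams.items()}
```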
So can we go with an even simpler, alternative system? Here's a simple system where we take the CNN and get the frame posteriors, and we forget about the first-order statistics. What we're doing here is normalizing the zeroth-order statistics in the log domain, and then we just use a simple backend; for instance, here we're using a neural network, but you could use a Gaussian backend as well. That's one thing that distinguishes it from a phonotactic system: you can use standard language ID backends, which is nice. So here we're using counts of tied context-dependent states, that is, at the state level instead of phone labels.
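A minimal sketch of the posterior system as just described, with made-up silence indexes: average the senone posteriors over the utterance, drop the silence classes, and move to the log domain before handing the vector to a standard backend.

```python
import numpy as np

def posterior_vector(frame_posteriors, silence_indexes, eps=1e-10):
    """Utterance-level log-normalised zeroth-order statistics.

    frame_posteriors: (num_frames, num_senones) CNN outputs for one utterance.
    silence_indexes:  senone indexes treated as silence and discarded.
    """
    avg = frame_posteriors.mean(axis=0)       # average posterior per senone
    kept = np.delete(avg, silence_indexes)    # drop the silence senones
    kept = kept / kept.sum()                  # renormalise over speech senones
    return np.log(kept + eps)                 # log-domain vector for the backend

post = np.random.default_rng(4).dirichlet(np.ones(3000), size=500)
vec = posterior_vector(post, silence_indexes=[0, 1, 2])   # silence indexes assumed
```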
Let's look at the experimental setup: the DARPA RATS program. I'm sure many of you know how noisy these samples are; I think John was talking about them earlier. The target languages, you can see those on the screen. There are a few channel degradations, seven different channels, with SNRs between zero and thirty. The transcriptions that we used to train the CNN come from the keyword spotting task and cover only two languages; that's an unusual aspect: we're trying to distinguish five languages, but we're training on two of them. Test durations are three, ten, thirty and one hundred twenty seconds, and the metric we use here is the average equal error rate across the target languages.
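For reference, a sketch of the metric under the assumption that each target language defines a binary detection task (that language versus everything else) and the equal error rates are averaged over languages; the scores here are synthetic.

```python
import numpy as np

def eer(scores, labels):
    """Approximate equal error rate; labels are 1 for target-language trials."""
    thresholds = np.unique(scores)
    miss = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    fa = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    i = np.argmin(np.abs(miss - fa))       # point where the two error rates cross
    return (miss[i] + fa[i]) / 2

def average_eer(scores_by_language):
    """Mean of the per-language equal error rates."""
    return float(np.mean([eer(s, y) for s, y in scores_by_language.values()]))

rng = np.random.default_rng(2)
scores_by_language = {}
for lang in ["lang_a", "lang_b", "lang_c", "lang_d", "lang_e"]:
    target = rng.normal(1.0, 1.0, 100)      # scores for target-language trials
    nontarget = rng.normal(-1.0, 1.0, 400)  # scores for all other trials
    scores = np.concatenate([target, nontarget])
    labels = np.concatenate([np.ones(100, dtype=int), np.zeros(400, dtype=int)])
    scores_by_language[lang] = (scores, labels)

print(average_eer(scores_by_language))
```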
In terms of the models, the one used to generate the training alignments for the DNN, the HMM-GMM setup, was producing around three thousand senones with two hundred thousand Gaussians, and this was multilingual training on both Levantine Arabic and Farsi. The CNN model was also trained on senones with the multilingual training set. We've got hidden layers with twelve hundred nodes each, and we've got forty filterbanks over fifteen stacked frames; you can see the pooling size and the filter size used for the CNN convolutional part.
For the UBM model, for comparison, we trained a two-thousand-forty-eight-component UBM, and the features were directly optimized for the LID task rather than reusing the speaker ID setup; they tend to perform well for language ID. They're forty-dimensional 2D-DCT log mel spectral features, and this is similar to the zigzag DCT work that we proposed at ICASSP and to the PCA-DCT I showed earlier for speaker ID; it's an extension that further improves on those.
What about the i-vectors and the backend? The DNN and UBM i-vector systems were all trained on the same data for the i-vector subspace, and they're four hundred dimensional. For the posterior system, we're collecting the three thousand average posteriors, removing the silence indexes, three of those, and reducing to four hundred dimensions, the same as the i-vector subspace, using probabilistic PCA.
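A small sketch of that reduction step, using scikit-learn's PCA as a stand-in for probabilistic PCA (the maximum-likelihood subspace is the same); the 400 comes from the talk, the matrix here is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical training matrix: utterance-level log-posterior vectors
# (about 3000 senones with the silence indexes already removed).
X = rng.standard_normal((5000, 2997))
reducer = PCA(n_components=400).fit(X)
X_reduced = reducer.transform(X)      # 400-dimensional inputs for the MLP backend
```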
For the backend, we train a simple neural network, an MLP, with cross entropy.
What we do with the data to enlarge our training dataset is to chunk it into thirty-second chunks, and shorter chunks, with fifty percent overlap; I think we end up with around two million i-vectors to train on.
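A sketch of the chunking scheme, assuming frame-level indexing at 100 frames per second; the thirty-second length and fifty percent overlap are from the talk, the rest is illustrative.

```python
def chunk_indices(num_frames, chunk_frames, overlap=0.5):
    """Start/end frame indexes for overlapping chunks of one utterance."""
    step = int(chunk_frames * (1 - overlap))
    last_start = max(num_frames - chunk_frames, 0)
    return [(s, s + chunk_frames) for s in range(0, last_start + 1, step)]

frames_per_second = 100
utterance_frames = 120 * frames_per_second          # a two-minute utterance
chunks = chunk_indices(utterance_frames, 30 * frames_per_second)
# Each (start, end) pair would yield one extra i-vector (or posterior vector)
# for backend training; shorter chunk lengths were handled the same way.
```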
The output is the five target languages, plus the out-of-set class as well.
How did performance go? Well, first of all, for the UBM i-vector approach, the UBM isn't normally trained in a supervised manner, so what we said was: let's take the senones from the CNN, where we know we've got three thousand of them, let's align the frames for each of those senones, and train each of the UBM components with that. The idea here was to try to give a fair comparison between the UBM and CNN systems. We see a nice improvement across all of those durations for the CNN approaches. You can see that for ten seconds or more we're getting a thirty percent or greater relative improvement over the UBM approach; for the three-second test durations, a twenty percent relative improvement.
What was interesting is that the LID performance of the posterior system and the i-vector system for the CNN is actually quite similar. But if we fuse the two, we get a nice gain, again around a twenty percent relative improvement, just from combining the two different approaches, particularly for durations of less than one hundred twenty seconds; for one twenty the gain wasn't as pronounced. When we added the UBM i-vector system to that, so a different modeling approach, we actually get no gain from the fusion except in the one-twenty case, which is another interesting finding.
In conclusion, we compared the robustness of the CNN in the noisy conditions, in particular taking the DNN i-vector framework and making it effective on the RATS language ID task. We also proposed, in that vein, an alternative to the phonotactic system, the CNN posterior system, which is quite a simple system, and showed high complementarity between these two proposals. In terms of extensions, where do we go from here? We can improve performance a little by not doing probabilistic PCA before the backend classification, and fusing the scores of different language-dependent senone sets also provides a gain. And bottleneck features, which I think will be talked about later, are also a good alternative to the direct usage of the DNN or CNN for language ID. Thank you.
We have time for some questions.
Thanks. We'll take the first question from the right.
So for the posterior system, the CNN posterior system, you said you use PCA to go to four hundred dimensions prior to the neural network, right?
Yep.
Did you also try putting in the full vector? I imagine it would help.
Yes. On the extensions slide, the first point says that if we don't do that reduction, we do get a slight improvement. I think the motivation for reducing it to four hundred dimensions was comparability with the i-vector space, to see what we could get in those same four hundred dimensions.
So my question has to do with the data that was used to train the ASR and the CNN. Was it the multiple channels of the Arabic and the Farsi data? So you trained in-channel?
The channel conditions, yes.
And I believe that goes for the UBM too?
The UBM, yes, so we used the channel-degraded data for the UBM as well, the same data.
Did you use the Arabic and the Farsi data to train the UBM alignments, or was the UBM exposed to all five languages across all channel conditions? Because that was one thing you said: you took the alignment of the senones and then trained the states of the UBM.
For the second one there, I guess that's what it was used for: the supervised UBM got the alignments from the senones coming through the CNN, and that was trained with the keyword spotting data. But the UBM itself, I believe, and I'd have to check this with my colleagues, was trained with data across the five languages.
So how much of an impact do you think that has, having the datasets change between the classifiers?
To that question I would have to say, you would think that having five languages in the UBM, plus the other out-of-set languages, would give it an advantage to some degree. But as you said, the datasets are changing, so I think that would be a good point to look into.
So if you're going to do a very wide set of languages, do you have a hope of having sort of a single master universal one, like the Hungarian TRAPs that have been so successful in the past, or do you think you're going to have to build many different language-dependent DNNs?
What we're seeing so far is basically that the more language-dependent DNNs you put together, that you fuse together, the smaller the improvement becomes. So perhaps if you had five that cover a good space of the phones across different languages, that might be what you might call a universal collection. That's appealing.
Right.