My name is Pavel Matejka and I'll be talking about neural network bottleneck features for language identification. I did this work during my postdoc at BBN.
First I will talk about the DARPA RATS program, whose data we tested on, so it is a noisy condition. Then I will talk about the neural network bottleneck features, and then about their application to language identification.
So, the DARPA RATS program. I think it has already been introduced, so I would just like to give you some taste of the rats. Unfortunately there are not enough rats to taste for all of you in here, so instead I will play some audio samples.
[audio samples played]
So you get some impression of how noisy it is.
So, the bottleneck features. The term bottleneck refers to a neural network topology where one hidden layer has a significantly lower dimension than the surrounding layers. In my case it was 80 dimensions for the bottleneck and 1500 for the surrounding layers. What it actually does is a kind of compression, and this compressed information can then be used in other ways than just inside the neural network.
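As a minimal sketch of this topology (not the exact BBN setup), assuming PyTorch and placeholder input and target sizes, it could look like the following; the sigmoid hidden activations are an assumption, while the linear bottleneck is mentioned in the Q&A at the end:

```python
import torch.nn as nn

def bottleneck_net(n_inputs: int, n_targets: int,
                   hidden: int = 1500, bn_dim: int = 80) -> nn.Sequential:
    """Hypothetical bottleneck MLP: wide layers around a narrow 80-dim one.

    After training (e.g. with nn.CrossEntropyLoss on phone-state targets),
    the 80-dim bottleneck activations serve as frame-level features."""
    return nn.Sequential(
        nn.Linear(n_inputs, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, bn_dim),        # linear bottleneck layer
        nn.Linear(bn_dim, hidden), nn.Sigmoid(),
        nn.Linear(hidden, n_targets),     # softmax applied inside the loss
    )
```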
This comes from speech recognition, where these features are usually used either alone or in conjunction with the baseline features, typically MFCCs.
What I actually used is the stacked bottleneck, where I have two neural networks in tandem, both with bottlenecks. The second neural network takes as input the bottleneck output of the first one, stacked in time: five frames with a five-frame shift. This setup was shown by the BUT guys to be very good for speech recognition; they tried different numbers of frames, different shifts and so on, so we just took their configuration and did not tune this.
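Here is a small sketch, assuming NumPy, of how that temporal stacking between the two nets could be done; the offsets (-10, -5, 0, +5, +10) are one reading of "five frames with a five-frame shift":

```python
import numpy as np

def stack_context(bn: np.ndarray,
                  offsets=(-10, -5, 0, 5, 10)) -> np.ndarray:
    """Stack first-net bottleneck outputs around each frame.

    bn: (T, 80) bottleneck features; returns (T, 400), i.e. five stacked
    80-dim frames, matching the second net's input size given later.
    Frame indices beyond the utterance edges are clamped."""
    T = bn.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets), 0, T - 1)
    return bn[idx].reshape(T, -1)
```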
Here is the topology of the bottleneck feature extraction. For the first net I used frequency-domain linear prediction coefficients plus fundamental frequency as input; actually, if we use a block of log mel filterbank outputs instead, it gives about the same results. Then I have 1500, 1500, the 80-dimensional bottleneck, 1500, and then the targets. The targets for me were states of context-dependent clustered quinphones. Usually people use triphones; I used quinphones because BBN is using quinphones.
The second net has about the same topology, just the input is different: it is five frames of the 80-dimensional bottleneck outputs stacked in time, so five times 80, which is 400 dimensions, but otherwise it is the same. For training we had two transcribed languages, Farsi and Levantine Arabic; you can see the number of hours the nets were trained on and the number of targets we used for each system.
So let's move on to language recognition.
The task definition was already presented: the RATS task has five target languages plus an out-of-set class, with different test durations, and as you heard it is quite noisy, so I will skip this slide.
Now the baseline system description. I use PLPs, nine PLP coefficients, with short-time Gaussianization. Usually you do not see a benefit from this for language ID, but for these noisy conditions it actually helps. We take a block of eleven frames, stack them together, and project them to sixty dimensions with HLDA.
As you will see on the next slide, I tried different coefficients for comparison. I used a UBM with 1024 Gaussians, the i-vector was 400-dimensional, and the final classifier was a neural network, which we found to be the best for this kind of task, but you have to do a few extra things which are described in the paper.
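Since the short-time Gaussianization is the less usual piece of this pipeline, here is a rough NumPy sketch of one common way to do it (rank-based feature warping; the 300-frame sliding window is an assumption, as the talk does not give a window size):

```python
import numpy as np
from scipy.stats import norm

def short_time_gaussianize(feats: np.ndarray, win: int = 300) -> np.ndarray:
    """Warp each coefficient to a standard normal within a sliding window."""
    T, D = feats.shape
    out = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half)
        chunk = feats[lo:hi]
        # rank of the current frame inside the window, pushed through the
        # inverse normal CDF
        rank = (chunk < feats[t]).sum(axis=0) + 0.5
        out[t] = norm.ppf(rank / (hi - lo))
    return out
```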
Here is the slide with the first results, the baseline results, with four different feature extractions. We focused on the three-second and ten-second conditions, because the 120-second condition was so good that it did not make sense to look at it, and the thirty-second condition was also very good after the fusion. So we mainly focus on these two conditions. As you see, the MHEC coefficients from UT Dallas are the best for the ten-second condition, and PLPs are the best for the three-second condition. The last ones are the BUT MFCC features, which we were using for the NIST evaluations, and these were the best features for us in the 2013 RATS evaluation. So these are the baseline, conventional acoustic features.
Before presenting the results with the bottleneck features, let me talk about the prior work. It mainly used context-independent phonemes, which makes quite a lot of difference, as we will see later. In the 2013 RATS evaluation, Jeff Ma from BBN used context-independent phonemes, trained on Levantine Arabic, with dimensionality 39. He took the log of these posteriors and simply stacked it onto the block of PLPs from the baseline, and all of this was projected back to sixty dimensions with HLDA. He got pretty good results; it is essentially a feature-level fusion.
Mireia Diez is doing the so-called phone log-likelihood ratio (PLLR) posterior features: she takes the phone posteriors, takes the log, and then computes the log-likelihood ratio between them. She usually appends deltas, sometimes uses PCA to reduce the dimensionality, and then fuses them with the PLPs at the feature level.
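A minimal NumPy sketch of the core PLLR transform as just described (deltas and the optional PCA are omitted):

```python
import numpy as np

def pllr(post: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Phone log-likelihood ratio features: per-frame log odds of each
    phone posterior. post: (T, n_phones) with rows summing to one."""
    p = np.clip(post, eps, 1.0 - eps)
    return np.log(p) - np.log(1.0 - p)
```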
She was at BUT before Christmas and was working on RATS as well, so we could compare these features. These features were also better than the baseline features, and they are better than the phonotactic system: they also built a conventional phonotactic system, and the PLLR features were much better, so the conventional phonotactic system did not make it to the fusion, while these features did.
During the review process, one of the reviewers told us that there was very similar work submitted to Electronics Letters at the end of 2013. It was by Mr. Song, and he applied it on quite clean data, the NIST LRE 2009 data. Then, during his presentation in 2014, I guess (it is actually not in the paper, just in the presentation), Ignacio Lopez-Moreno from Google presented bottleneck features, but his neural network is trained to produce the posterior probabilities of the target languages, not of phonemes. So it might open a new field of data-driven features.
So let's go to the results. Here again are the four baseline feature sets. Then, if I take the log posteriors which come out of the neural network, just one frame this time, and build the i-vector system on them, you can see that it is already better than any of the baselines.
Then what I did is take a block of eleven frames of these posteriors, stack them together, and project them with HLDA to sixty dimensions, and you can see that this is quite a bit better than just one frame, so the context is very important.
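A rough NumPy sketch of this feature extraction; PCA stands in here for the HLDA projection, an assumption made only because HLDA needs class labels that this snippet does not have:

```python
import numpy as np

def stacked_log_posteriors(post: np.ndarray, context: int = 5,
                           out_dim: int = 60) -> np.ndarray:
    """Log phone posteriors, an 11-frame block around each frame,
    projected down to 60 dimensions."""
    logp = np.log(np.clip(post, 1e-8, None))
    T = logp.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.arange(-context, context + 1),
                  0, T - 1)
    blocks = logp[idx].reshape(T, -1)        # (T, 11 * n_phones)
    blocks = blocks - blocks.mean(axis=0)
    # project onto the top out_dim principal directions (PCA)
    _, _, vt = np.linalg.svd(blocks, full_matrices=False)
    return blocks @ vt[:out_dim].T
```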
And this is what Jeff Ma did: the baseline features plus one frame of the log posteriors, projected with HLDA to sixty dimensions. You can see that this is also good, but it is really more like a fusion of two systems.
So how do the bottleneck features look then? Again, it is just one frame; I tried more frames as well, but it did not help for me. So, one frame of bottleneck features, the dimensionality is 80. One row is where I take the bottleneck from the first neural network, and the other is the stacked bottleneck, the bottleneck from the second neural network. You can see that both of these systems are quite a bit better than any of the baselines, and it actually makes sense to use the stacked bottleneck architecture because you get something out of it. Why am I taking just one frame? It might be that for the stacked bottleneck features, where the stacking in time is already done between the nets, the context is already there.
Then I have some analysis slides. The first thing was obviously to try to tune the bottleneck size. For speech recognition they usually use 80, so I took 80 as the baseline and then tried to vary the bottleneck size, but as you can see, 80 was the best. If you go to 60 or in the other direction, it starts to degrade on both sides, so I stuck with 80, which was the baseline configuration anyway.
The other thing I was interested in was what the targets for the neural network should be. We did it with context-dependent phonemes, but how does it work with context-independent ones? It is much easier to train the system with context-independent phonemes than with context-dependent ones, because we do not need to build the LVCSR system; the training of the neural network is much faster, and so on. But if you look at the results, they are clearly better with the context-dependent phones. I think it is because of the finer modelling of the phonetic structure in this feature space.
Then, as I said at the beginning, we have two languages we have transcriptions for, Farsi and Levantine. So I trained two sets of features, one on Farsi and one on Levantine, and you can see that they perform about the same. For the final system, which you will see on the last slide, we needed to choose just one, and I chose the Levantine one because it is just slightly better. You would not see it in the test results, but in training, Farsi has a much higher number of targets (many more context-dependent phones), so the training took more time. So also for training convenience, the Levantine one was the better choice.
Then, in 2013, for the RATS evaluation, Jeff did a kind of fusion of several systems, language-dependent systems, and it is explained in this picture. What is the language dependency? Usually we have just one UBM and one i-vector extractor, trained on the same data, which is usually all the data we have. What we did instead is train the GMM on a single language, say just one big language like Dari or Farsi or Pashto or Urdu, while the i-vector extractor was trained on all of them. Then, at the end, we took just a simple average of the scores of these systems. We did not want to train a fusion here because it adds more parameters, and a fusion was then trained over this and the other systems anyway.
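That final combination step is just this, assuming each language-dependent subsystem outputs a score matrix of the same shape:

```python
import numpy as np

def average_scores(subsystem_scores: list) -> np.ndarray:
    """Unweighted average over language-dependent subsystems.

    Each entry: an (n_utterances, n_classes) score matrix. A plain mean
    replaces a trained fusion, so no extra fusion parameters are needed."""
    return np.mean(np.stack(subsystem_scores), axis=0)
```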
Personally, I do not like this structure that much, because the complexity of the system grows quite a lot, but I think it takes advantage of the different alignments of the different UBMs. So how do the results look? The first line is the baseline, where we train the UBM on all languages, only one UBM. The next six lines are the separate systems where we train the UBM only on the particular language. If you look at the results, none of these beats the baseline, which is kind of expected. But if you then take the average of these scores and evaluate it, you can see that there is a very nice benefit in doing this: it is about fifteen percent for the three-second condition and twenty-five percent for the ten-second one.
We also had channels: in RATS there are eight (it should be nine) channels plus the source channel. So I did the same thing on the channel level, and it performs about the same. And then I took the average over all of them, and that is also about the same. So there is some separation there, and up to some point it improves. It would be good to try what was mentioned earlier, the DNN alignment, which might be a similar thing, another different alignment to look at.
So let's look at the final slide, with the fusion. The first line is the PLP system. Then I have a fusion of three systems, which is the stacked bottleneck systems trained on Farsi and Levantine, plus the feature-level fusion with the acoustic system, and you can see that there is about a thirty percent improvement. Then the same comparison when we make all the systems language-dependent: we saw something like a twenty-five to thirty percent improvement from the language-dependent systems, and here we can see that the fusion still gains the same as when we do not do the language dependency, which is very nice. So it is about thirty percent from the fusion over the single best system, whether you do the language dependency or not.
Then one of the reviewers, maybe it was a reviewer from SRI, suggested using the posteriors directly for scoring. Also, after the RATS evaluation I actually exchanged some messages with Ignacio, and he suggested the same. So what we had is the blue stream in the diagram, and it was very easy for me to try: I did not use the bottleneck here, I just took the entire network, used the posteriors, fed them to another MLP to produce the scores, and then I could fuse it. You can see that, for me, the posterior system was worse than the stacked bottleneck with i-vectors. But yesterday we compared the results with MIT, and actually their system, the CNN posterior system, is a little bit better than my system here. We talked about it a little; it might be because the CNN is behaving much better than the DNN in noisy conditions, which we would need to try. But the fusion of these two approaches is very nice.
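For what that posterior-stream backend might look like, here is a tiny hypothetical sketch in PyTorch; the hidden size and the use of pooled posteriors as input are placeholders, since the talk does not give those details:

```python
import torch.nn as nn

def posterior_backend(n_inputs: int, n_languages: int,
                      hidden: int = 256) -> nn.Sequential:
    """Hypothetical MLP that maps pooled first-net posteriors to
    language scores, skipping the bottleneck/i-vector pipeline."""
    return nn.Sequential(
        nn.Linear(n_inputs, hidden), nn.Sigmoid(),
        nn.Linear(hidden, n_languages),
    )
```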
To conclude: the bottleneck features provide a very nice gain. They compete very nicely with the conventional phonotactic system which we built before; actually, they are much better. As I said, for the RATS evaluation this year we also had phonotactic systems, and none of them made it to the final fusion. And there are much bigger gains for longer audio files.
And as I said, what Ignacio Lopez-Moreno did, training the net not for a bottleneck but directly for the task, with the languages as the targets, might open a new space for data-driven feature extraction. Thank you.
Thank you, Pavel. Do we have any questions?
How did you train the neural networks for this?
For this task I used the BBN training tool; it is stochastic gradient descent on GPUs, and each net took about three days to train, so with two nets it was about a week of training.
What activation function was it?
I do not remember exactly the activation function in the hidden layers, but I know that the bottleneck layer is the linear one; there was a linear activation function for the bottleneck. It was also shown for speech recognition that this gives better results. The rest is actually in the paper.
So, the same question that MIT got: can you tell us what data was used to train your ASR and your DNN?
The same data.
All of the channels?
Yes, all channels. And the DNN for the bottleneck features was trained on the keyword spotting data, so it is different data from what the UBM and the i-vector extractor were trained on.
Okay, so you also had different datasets there. So my question is: what is your sense of the sensitivity of this? To build these DNNs, it seems the starting point is that a good ASR system labels your data and then you train the DNN on those labels. Maybe people in other places have had experiences with this. What do people think the sensitivity is to starting off with a very good alignment for the net you then train? Do you have a sense of that, maybe not from this work, but otherwise?
That is hard to say. What has been published is that you really need the LVCSR system to be good. What I like is what Ignacio Moreno is doing: it does not need the ASR subsystem at all, so you can use the language ID data directly and train the neural network on the language posteriors, so you use the same data as the normal system.
I played with that a little bit as well. If you do what he was doing, actually at the JHU workshop I believe, training the DNN to produce the languages as the targets, then you have a posterior probability of each language for every frame, so you need to do some aggregation over time. What he did is simply average them, which is good for three seconds but not good for ten seconds.
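That aggregation, as I understand it, is just an average over frames; whether it is done on posteriors or on log posteriors is not stated, so the log here is an assumption:

```python
import numpy as np

def utterance_scores(frame_post: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Average per-frame log language posteriors over the utterance.

    frame_post: (T, n_languages). Per the talk, this works for 3-second
    cuts but degrades on 10-second ones."""
    return np.log(np.clip(frame_post, eps, None)).mean(axis=0)
```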
So what I did then: I took exactly these posteriors as features, or the output of the layer before them, and then it helps, because you can build the i-vector system on top of them, and it helps to do something more than just averaging. And for that I would have a much smaller i-vector system. So that might be one way to avoid building the LVCSR.
Just to follow on, in response to Doug's question about the keyword spotting data: that was transmitted at a different time to the language ID data. One thing I observed in the speaker ID data was that retransmission at a different time, as the atmosphere and the transmission effects change, means the channel is varying over time. So in one regard this keyword spotting data in a sense has different channels from the language ID data, even though it is theoretically the same equipment sending it; there is a different effect coming through. So it is nice to see that it still works despite that. A similar question: for instance, in the clean SRE data we have seen a difference, a problem, trying to classify microphone trials when most of what your network is trained on is telephone speech. One of your last statements on the slide was that the bottleneck features are great even in noisy conditions, but of course you have very matched data here. Do you have any theories on how the bottleneck features might cope in mismatched conditions? I ask because our system appears sensitive to it, and I wonder if the bottlenecks might be a little more resistant, just because of the compression factor.
I think it would depend on the training data for the DNN. For example, what we did for the BEST evaluation, together with speech recognition: we had only clean data for training the DNN, so we said, okay, what do we do if the test data is noisy? We took thirty percent of the training data and artificially added noise to it, and that helped a lot. So then the DNN sort of knows the noisy conditions.
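A sketch, assuming NumPy waveforms, of that augmentation idea; the white noise and the 15 dB SNR are placeholders, since the talk only says noise was added artificially to thirty percent of the data:

```python
import numpy as np

def add_noise_to_subset(waveforms: list, rng: np.random.Generator,
                        fraction: float = 0.3, snr_db: float = 15.0) -> list:
    """Corrupt a random subset of training utterances with additive noise."""
    pick = rng.choice(len(waveforms), size=int(fraction * len(waveforms)),
                      replace=False)
    for i in pick:
        x = waveforms[i]
        noise = rng.standard_normal(x.shape[0])
        # scale the noise so the signal-to-noise ratio equals snr_db
        snr_lin = 10.0 ** (snr_db / 10.0)
        gain = np.sqrt((x ** 2).mean() / (snr_lin * (noise ** 2).mean()))
        waveforms[i] = x + gain * noise
    return waveforms

# usage: augmented = add_noise_to_subset(train_waves, np.random.default_rng(0))
```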
A somewhat related question: if you had to handle very many languages, could you imagine having one universal recognizer for the DNN, or do you think you would have to build many?
I think people would need to build at least a few DNNs. I think, Mitch, you said that you tried the Farsi one, the Levantine one, and then a universal one, right? So you might comment: was it much better than the two separate ones, or than the fusion of those two?
So we had someone in our lab construct a multilingual dictionary between these two languages, and that was the best of the three systems that we tried, but we also found that the fusion of all three was best. In fact, our primary system was the fusion of those CNN systems, but also three CNN i-vector systems for the individual languages, all with one language ID feature set. If you remember the distinction, the DNN has a certain set of targets and the language ID feature is a separate thing; we just maintained one language ID feature while the CNN changed, and that was a very good fusion.
In terms of the SRE, I would add that we found that with multiple languages, if you get good coverage across the different phones, that is enough for it to converge.
So then I think you would need a few systems, not many, maybe three or four, and that would be better than having one universal one.
Okay.