My name is Pavel Matejka and I'll be talking about neural network bottleneck features for language identification. I did this work during my postdoc at BBN.
First I will talk about the DARPA RATS program, whose data we tested on, so it is a noisy condition. Then I will talk about the neural network bottleneck features, and then about their application to language identification.
So, the DARPA RATS program. I think it has already been introduced, so I would just like to give you some taste of the rats. Unfortunately there are not enough rats to taste for all of you in here, so instead I will play some audio samples.
[audio samples played]
So you get some impression of how noisy it is.
So, the bottleneck features. The term bottleneck refers to a neural network topology where one hidden layer has a significantly lower dimension than the surrounding layers. In my case it was 80 dimensions for the bottleneck and 1500 for the surrounding layers. What it actually does is a kind of compression, and this compressed information can then be used in other ways than just inside the neural network.
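As a minimal sketch of this topology (not the exact BBN setup), assuming PyTorch and placeholder input and target sizes, it could look like the following; the sigmoid hidden activations are an assumption, while the linear bottleneck is mentioned in the Q&A at the end:

```python
import torch.nn as nn

def bottleneck_net(n_inputs: int, n_targets: int,
                   hidden: int = 1500, bn_dim: int = 80) -> nn.Sequential:
    """Hypothetical bottleneck MLP: wide layers around a narrow 80-dim one.

    After training (e.g. with nn.CrossEntropyLoss on phone-state targets),
    the 80-dim bottleneck activations serve as frame-level features."""
    return nn.Sequential(
        nn.Linear(n_inputs, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, bn_dim),        # linear bottleneck layer
        nn.Linear(bn_dim, hidden), nn.Sigmoid(),
        nn.Linear(hidden, n_targets),     # softmax applied inside the loss
    )
```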
This comes from speech recognition, where these features are usually used either alone or in conjunction with the baseline features, typically MFCCs.
What I actually used is the stacked bottleneck, where I have two neural networks in tandem, both with bottlenecks. The second neural network takes as input the bottleneck output of the first one, stacked in time: five frames with a five-frame shift. This setup was shown by the BUT guys to be very good for speech recognition; they tried different numbers of frames, different shifts and so on, so we just took their configuration and did not tune this.
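Here is a small sketch, assuming NumPy, of how that temporal stacking between the two nets could be done; the offsets (-10, -5, 0, +5, +10) are one reading of "five frames with a five-frame shift":

```python
import numpy as np

def stack_context(bn: np.ndarray,
                  offsets=(-10, -5, 0, 5, 10)) -> np.ndarray:
    """Stack first-net bottleneck outputs around each frame.

    bn: (T, 80) bottleneck features; returns (T, 400), i.e. five stacked
    80-dim frames, matching the second net's input size given later.
    Frame indices beyond the utterance edges are clamped."""
    T = bn.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets), 0, T - 1)
    return bn[idx].reshape(T, -1)
```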
Here is the topology of the bottleneck feature extraction. For the first net I used frequency-domain linear prediction coefficients plus fundamental frequency as input; actually, if we use a block of log mel filterbank outputs instead, it gives about the same results. Then I have 1500, 1500, the 80-dimensional bottleneck, 1500, and then the targets. The targets for me were states of context-dependent clustered quinphones. Usually people use triphones; I used quinphones because BBN is using quinphones.
The second net has about the same topology, just the input is different: it is five frames of the 80-dimensional bottleneck outputs stacked in time, so five times 80, which is 400 dimensions, but otherwise it is the same. For training we had two transcribed languages, Farsi and Levantine Arabic; you can see the number of hours the nets were trained on and the number of targets we used for each system.
So let's move on to language recognition.
The task definition was already presented: the RATS task has five target languages plus an out-of-set class, with different test durations, and as you heard it is quite noisy, so I will skip this slide.
Now the baseline system description. I use PLPs, nine PLP coefficients, with short-time Gaussianization. Usually you do not see a benefit from this for language ID, but for these noisy conditions it actually helps. We take a block of eleven frames, stack them together, and project them to sixty dimensions with HLDA.
As you will see on the next slide, I tried different coefficients for comparison. I used a UBM with 1024 Gaussians, the i-vector was 400-dimensional, and the final classifier was a neural network, which we found to be the best for this kind of task, but you have to do a few extra things which are described in the paper.
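Since the short-time Gaussianization is the less usual piece of this pipeline, here is a rough NumPy sketch of one common way to do it (rank-based feature warping; the 300-frame sliding window is an assumption, as the talk does not give a window size):

```python
import numpy as np
from scipy.stats import norm

def short_time_gaussianize(feats: np.ndarray, win: int = 300) -> np.ndarray:
    """Warp each coefficient to a standard normal within a sliding window."""
    T, D = feats.shape
    out = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half)
        chunk = feats[lo:hi]
        # rank of the current frame inside the window, pushed through the
        # inverse normal CDF
        rank = (chunk < feats[t]).sum(axis=0) + 0.5
        out[t] = norm.ppf(rank / (hi - lo))
    return out
```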
Here is the slide with the first results, the baseline results, with four different feature extractions. We focused on the three-second and ten-second conditions, because the 120-second condition was so good that it did not make sense to look at it, and the thirty-second condition was also very good after the fusion. So we mainly focus on these two conditions. As you see, the MHEC coefficients from UT Dallas are the best for the ten-second condition, and PLPs are the best for the three-second condition. The last ones are the BUT MFCC features, which we were using for the NIST evaluations, and these were the best features for us in the 2013 RATS evaluation. So these are the baseline, conventional acoustic features.
Before presenting the results with the bottleneck features, let me talk about the prior work. It mainly used context-independent phonemes, which makes quite a lot of difference, as we will see later. In the 2013 RATS evaluation, Jeff Ma from BBN used context-independent phonemes, trained on Levantine Arabic, with dimensionality 39. He took the log of these posteriors and simply stacked it onto the block of PLPs from the baseline, and all of this was projected back to sixty dimensions with HLDA. He got pretty good results; it is essentially a feature-level fusion.
Mireia Diez is doing the so-called phone log-likelihood ratio (PLLR) posterior features: she takes the phone posteriors, takes the log, and then computes the log-likelihood ratio between them. She usually appends deltas, sometimes uses PCA to reduce the dimensionality, and then fuses them with the PLPs at the feature level.
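A minimal NumPy sketch of the core PLLR transform as just described (deltas and the optional PCA are omitted):

```python
import numpy as np

def pllr(post: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Phone log-likelihood ratio features: per-frame log odds of each
    phone posterior. post: (T, n_phones) with rows summing to one."""
    p = np.clip(post, eps, 1.0 - eps)
    return np.log(p) - np.log(1.0 - p)
```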
She was at BUT before Christmas and was working on RATS as well, so we could compare these features. These features were also better than the baseline features, and they are better than the phonotactic system: they also built a conventional phonotactic system, and the PLLR features were much better, so the conventional phonotactic system did not make it to the fusion, while these features did.
During the review process, one of the reviewers told us that there was very similar work submitted to Electronics Letters at the end of 2013. It was by Mr. Song, and he applied it on quite clean data, the NIST LRE 2009 data. Then, during his presentation in 2014, I guess (it is actually not in the paper, just in the presentation), Ignacio Lopez-Moreno from Google presented bottleneck features, but his neural network is trained to produce the posterior probabilities of the target languages, not of phonemes. So it might open a new field of data-driven features.
So let's go to the results. Here again are the four baseline feature sets. Then, if I take the log posteriors which come out of the neural network, just one frame this time, and build the i-vector system on them, you can see that it is already better than any of the baselines.
Then what I did is take a block of eleven frames of these posteriors, stack them together, and project them with HLDA to sixty dimensions, and you can see that this is quite a bit better than just one frame, so the context is very important.
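A rough NumPy sketch of this feature extraction; PCA stands in here for the HLDA projection, an assumption made only because HLDA needs class labels that this snippet does not have:

```python
import numpy as np

def stacked_log_posteriors(post: np.ndarray, context: int = 5,
                           out_dim: int = 60) -> np.ndarray:
    """Log phone posteriors, an 11-frame block around each frame,
    projected down to 60 dimensions."""
    logp = np.log(np.clip(post, 1e-8, None))
    T = logp.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.arange(-context, context + 1),
                  0, T - 1)
    blocks = logp[idx].reshape(T, -1)        # (T, 11 * n_phones)
    blocks = blocks - blocks.mean(axis=0)
    # project onto the top out_dim principal directions (PCA)
    _, _, vt = np.linalg.svd(blocks, full_matrices=False)
    return blocks @ vt[:out_dim].T
```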
And this is what Jeff Ma did: the baseline features plus one frame of the log posteriors, projected with HLDA to sixty dimensions. You can see that this is also good, but it is really more like a fusion of two systems.
So how do the bottleneck features look then? Again, it is just one frame; I tried more frames as well, but it did not help for me. So, one frame of bottleneck features, the dimensionality is 80. One row is where I take the bottleneck from the first neural network, and the other is the stacked bottleneck, the bottleneck from the second neural network. You can see that both of these systems are quite a bit better than any of the baselines, and it actually makes sense to use the stacked bottleneck architecture because you get something out of it. Why am I taking just one frame? It might be that for the stacked bottleneck features, where the stacking in time is already done between the nets, the context is already there.
Then I have some analysis slides. The first thing was obviously to try to tune the bottleneck size. For speech recognition they usually use 80, so I took 80 as the baseline and then tried to vary the bottleneck size, but as you can see, 80 was the best. If you go to 60 or in the other direction, it starts to degrade on both sides, so I stuck with 80, which was the baseline configuration anyway.
The other thing I was interested in was what the targets for the neural network should be. We did it with context-dependent phonemes, but how does it work with context-independent ones? It is much easier to train the system with context-independent phonemes than with context-dependent ones, because we do not need to build the LVCSR system; the training of the neural network is much faster, and so on. But if you look at the results, they are clearly better with the context-dependent phones. I think it is because of the finer modelling of the phonetic structure in this feature space.
Then, as I said at the beginning, we have two languages we have transcriptions for, Farsi and Levantine. So I trained two sets of features, one on Farsi and one on Levantine, and you can see that they perform about the same. For the final system, which you will see on the last slide, we needed to choose just one, and I chose the Levantine one because it is just slightly better. You would not see it in the test results, but in training, Farsi has a much higher number of targets (many more context-dependent phones), so the training took more time. So also for training convenience, the Levantine one was the better choice.
Then, in 2013, for the RATS evaluation, Jeff did a kind of fusion of several systems, language-dependent systems, and it is explained in this picture. What is the language dependency? Usually we have just one UBM and one i-vector extractor, trained on the same data, which is usually all the data we have. What we did instead is train the GMM on a single language, say just one big language like Dari or Farsi or Pashto or Urdu, while the i-vector extractor was trained on all of them. Then, at the end, we took just a simple average of the scores of these systems. We did not want to train a fusion here because it adds more parameters, and a fusion was then trained over this and the other systems anyway.
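That final combination step is just this, assuming each language-dependent subsystem outputs a score matrix of the same shape:

```python
import numpy as np

def average_scores(subsystem_scores: list) -> np.ndarray:
    """Unweighted average over language-dependent subsystems.

    Each entry: an (n_utterances, n_classes) score matrix. A plain mean
    replaces a trained fusion, so no extra fusion parameters are needed."""
    return np.mean(np.stack(subsystem_scores), axis=0)
```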
Personally, I do not like this structure that much, because the complexity of the system grows quite a lot, but I think it takes advantage of the different alignments of the different UBMs. So how do the results look? The first line is the baseline, where we train the UBM on all languages, only one UBM. The next six lines are the separate systems where we train the UBM only on the particular language. If you look at the results, none of these beats the baseline, which is kind of expected. But if you then take the average of these scores and evaluate it, you can see that there is a very nice benefit in doing this: it is about fifteen percent for the three-second condition and twenty-five percent for the ten-second one.
We also had channels: in RATS there are eight (it should be nine) channels plus the source channel. So I did the same thing on the channel level, and it performs about the same. And then I took the average over all of them, and that is also about the same. So there is some separation there, and up to some point it improves. It would be good to try what was mentioned earlier, the DNN alignment, which might be a similar thing, another different alignment to look at.
So let's look at the final slide, with the fusion. The first line is the PLP system. Then I have a fusion of three systems, which is the stacked bottleneck systems trained on Farsi and Levantine, plus the feature-level fusion with the acoustic system, and you can see that there is about a thirty percent improvement. Then the same comparison when we make all the systems language-dependent: we saw something like a twenty-five to thirty percent improvement from the language-dependent systems, and here we can see that the fusion still gains the same as when we do not do the language dependency, which is very nice. So it is about thirty percent from the fusion over the single best system, whether you do the language dependency or not.
Then one of the reviewers, maybe it was a reviewer from SRI, suggested using the posteriors directly for scoring. Also, after the RATS evaluation I actually exchanged some messages with Ignacio, and he suggested the same. So what we had is the blue stream in the diagram, and it was very easy for me to try: I did not use the bottleneck here, I just took the entire network, used the posteriors, fed them to another MLP to produce the scores, and then I could fuse it. You can see that, for me, the posterior system was worse than the stacked bottleneck with i-vectors. But yesterday we compared the results with MIT, and actually their system, the CNN posterior system, is a little bit better than my system here. We talked about it a little; it might be because the CNN is behaving much better than the DNN in noisy conditions, which we would need to try. But the fusion of these two approaches is very nice.
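For what that posterior-stream backend might look like, here is a tiny hypothetical sketch in PyTorch; the hidden size and the use of pooled posteriors as input are placeholders, since the talk does not give those details:

```python
import torch.nn as nn

def posterior_backend(n_inputs: int, n_languages: int,
                      hidden: int = 256) -> nn.Sequential:
    """Hypothetical MLP that maps pooled first-net posteriors to
    language scores, skipping the bottleneck/i-vector pipeline."""
    return nn.Sequential(
        nn.Linear(n_inputs, hidden), nn.Sigmoid(),
        nn.Linear(hidden, n_languages),
    )
```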
To conclude: the bottleneck features provide a very nice gain. They compete very nicely with the conventional phonotactic system which we built before; actually, they are much better. As I said, for the RATS evaluation this year we also had phonotactic systems, and none of them made it to the final fusion. And there are much bigger gains for longer audio files.
And as I said, what Ignacio Lopez-Moreno did, training the net not for a bottleneck but directly for the task, with the languages as the targets, might open a new space for data-driven feature extraction. Thank you.
Thank you, Pavel. Do we have any questions?
How did you train the neural networks for this?
For this task I used the BBN training tool; it is stochastic gradient descent on GPUs, and each net took about three days to train, so with two nets it was about a week of training.
What activation function was it?
I do not remember exactly the activation function in the hidden layers, but I know that the bottleneck layer is the linear one; there was a linear activation function for the bottleneck. It was also shown for speech recognition that this gives better results. The rest is actually in the paper.
So, the same question that MIT got: can you tell us what data was used to train your ASR and your DNN?
The same data.
All of the channels?
Yes, all channels. And the DNN for the bottleneck features was trained on the keyword spotting data, so it is different data from what the UBM and the i-vector extractor were trained on.
Okay, so you also had different datasets there. So my question is: what is your sense of the sensitivity of this? To build these DNNs, it seems the starting point is that a good ASR system labels your data and then you train the DNN on those labels. Maybe people in other places have had experiences with this. What do people think the sensitivity is to starting off with a very good alignment for the net you then train? Do you have a sense of that, maybe not from this work, but otherwise?
That is hard to say. What has been published is that you really need the LVCSR system to be good. What I like is what Ignacio Moreno is doing: it does not need the ASR subsystem at all, so you can use the language ID data directly and train the neural network on the language posteriors, so you use the same data as the normal system.
I played with that a little bit as well. If you do what he was doing, actually at the JHU workshop I believe, training the DNN to produce the languages as the targets, then you have a posterior probability of each language for every frame, so you need to do some aggregation over time. What he did is simply average them, which is good for three seconds but not good for ten seconds.
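That aggregation, as I understand it, is just an average over frames; whether it is done on posteriors or on log posteriors is not stated, so the log here is an assumption:

```python
import numpy as np

def utterance_scores(frame_post: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Average per-frame log language posteriors over the utterance.

    frame_post: (T, n_languages). Per the talk, this works for 3-second
    cuts but degrades on 10-second ones."""
    return np.log(np.clip(frame_post, eps, None)).mean(axis=0)
```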
So what I did then: I took exactly these posteriors as features, or the output of the layer before them, and then it helps, because you can build the i-vector system on top of them, and it helps to do something more than just averaging. And for that I would have a much smaller i-vector system. So that might be one way to avoid building the LVCSR.
Just to follow on, in response to Doug's question about the keyword spotting data: that was transmitted at a different time to the language ID data. One thing I observed in the speaker ID data was that retransmission at a different time, as the atmosphere and the transmission effects change, means the channel is varying over time. So in one regard this keyword spotting data in a sense has different channels from the language ID data, even though it is theoretically the same equipment sending it; there is a different effect coming through. So it is nice to see that it still works despite that. A similar question: for instance, in the clean SRE data we have seen a difference, a problem, trying to classify microphone trials when most of what your network is trained on is telephone speech. One of your last statements on the slide was that the bottleneck features are great even in noisy conditions, but of course you have very matched data here. Do you have any theories on how the bottleneck features might cope in mismatched conditions? I ask because our system appears sensitive to it, and I wonder if the bottlenecks might be a little more resistant, just because of the compression factor.
I think it would depend on the training data for the DNN. For example, what we did for the BEST evaluation, together with speech recognition: we had only clean data for training the DNN, so we said, okay, what do we do if the test data is noisy? We took thirty percent of the training data and artificially added noise to it, and that helped a lot. So then the DNN sort of knows the noisy conditions.
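A sketch, assuming NumPy waveforms, of that augmentation idea; the white noise and the 15 dB SNR are placeholders, since the talk only says noise was added artificially to thirty percent of the data:

```python
import numpy as np

def add_noise_to_subset(waveforms: list, rng: np.random.Generator,
                        fraction: float = 0.3, snr_db: float = 15.0) -> list:
    """Corrupt a random subset of training utterances with additive noise."""
    pick = rng.choice(len(waveforms), size=int(fraction * len(waveforms)),
                      replace=False)
    for i in pick:
        x = waveforms[i]
        noise = rng.standard_normal(x.shape[0])
        # scale the noise so the signal-to-noise ratio equals snr_db
        snr_lin = 10.0 ** (snr_db / 10.0)
        gain = np.sqrt((x ** 2).mean() / (snr_lin * (noise ** 2).mean()))
        waveforms[i] = x + gain * noise
    return waveforms

# usage: augmented = add_noise_to_subset(train_waves, np.random.default_rng(0))
```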
A somewhat related question: if you had to handle very many languages, could you imagine having one universal recognizer for the DNN, or do you think you would have to build many?
I think people would need to build at least a few DNNs. I think, Mitch, you said that you tried the Farsi one, the Levantine one, and then a universal one, right? So you might comment: was it much better than the two separate ones, or than the fusion of those two?
So we had someone in our lab construct a multilingual dictionary between these two languages, and that was the best of the three systems that we tried, but we also found that the fusion of all three was best. In fact, our primary system was the fusion of those CNN systems, but also three CNN i-vector systems for the individual languages, all with one language ID feature set. If you remember the distinction, the DNN has a certain set of targets and the language ID feature is a separate thing; we just maintained one language ID feature while the CNN changed, and that was a very good fusion.
In terms of the SRE, I would add that we found that with multiple languages, if you get good coverage across the different phones, that is enough for it to converge.
So then I think you would need a few systems, not many, maybe three or four, and that would be better than having one universal one.
Okay.