I'll also be presenting the work of a whole team at SRI, so you may be familiar with a little of this already. This is looking at applying convolutional deep neural networks to language ID in noisy conditions, in particular the conditions of the RATS program.
I'll start with a bit of background on why we might want to do this in this domain, and the motivation to use the DNN i-vector framework that we recently proposed for speaker ID. Then, to handle the noisy conditions, we start looking at convolutional neural networks for that purpose. We also present a simpler system, called the CNN posterior system, for language ID and show results on that. Then I'll describe the experimental setup and walk through some results.
First, a bit of background on language ID. The UBM i-vector framework is pretty widely used in language ID, and phone recognizers are also a good option. When you use these two together, that's when you get a really nice improvement; they're quite complementary. So in our books, one of the challenges has always been: how do we get phonetic information, the way someone pronounces something, into a single system that outperforms the fusion of individual systems? That's what we call the challenge: we want one phonetically aware system that can produce scores that outperform fused scores. We recently solved this for speaker ID, at least in the telephone case.
Here's a bit of background on the DNN i-vector framework. What we're doing is combining a deep neural network that's trained for automatic speech recognition with the popular i-vector model. We use the network to generate our zeroth-order and first-order statistics; in particular, we use the DNN in place of the UBM. If you look at the comparison down the bottom here: the UBM is trained in an unsupervised manner. It's trying to represent classes with Gaussians, and those are generally assumed to map to different phonetic classes. However, if someone pronounces a given phone one way, while someone else pronounces it in a phonetically completely different way, the UBM is going to model that in different components.
The DNN, on the other hand, is trained in a supervised manner. That means it's trying to map those same phones to what we call senones, that is, tied tri-phone states. So two different people pronouncing a phone in different ways would be activating the same senone, and that should hopefully better capture the differences between speakers.
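As a rough illustration (not the talk's own code), here is a minimal numpy sketch of the idea that the zeroth- and first-order statistics only need per-frame posteriors, so a DNN's senone posteriors can slot in where the UBM component posteriors used to be; the names and shapes are just for the example.

```python
import numpy as np

def accumulate_stats(posteriors, features):
    """Accumulate i-vector sufficient statistics from frame-level posteriors.

    posteriors: (num_frames, num_classes) soft alignments per frame, which can
                come from UBM components or from DNN/CNN senone outputs.
    features:   (num_frames, feat_dim) acoustic features for the utterance.
    """
    zeroth = posteriors.sum(axis=0)        # soft frame counts per class
    first = posteriors.T @ features        # posterior-weighted feature sums
    return zeroth, first

# Toy example: 300 frames, 3000 senones, 40-dimensional features.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3000), size=300)   # rows sum to one, like a softmax
feats = rng.standard_normal((300, 40))
N, F = accumulate_stats(post, feats)            # N: (3000,), F: (3000, 40)
```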
It's very powerful for speaker ID. In the initial publication, at ICASSP this year, we got a thirty percent relative improvement on telephone conditions, particularly C2 and C5 of NIST SRE'12.
What I'm showing on this slide is actually three different systems: the SRI SRE'12 submission, which is a fusion of six different features, side information, a whole conglomeration; and then some recent work using MFCCs with deltas and double deltas, and what we're calling PCA-DCT features. We have a publication on that at Interspeech, I mean ICASSP, next year. Just to give you a reference, PCA-DCT gives about a twenty percent relative improvement over MFCCs on all conditions of SRE'12. But what's really notable is that the DNN i-vector can still bring a twenty percent improvement on these two conditions, C2 and C5. So it's very powerful. There's still work to be done on microphone trials, where there's mismatch happening; we have made progress on that and in fact should be able to publish on it very soon.
What I want to conclude here is that we've now got a single system that beats the SRE'12 submission. So how can this be useful for language ID? That's the question, so let's get there.
The hypothesis is that the output of the DNN should include language-related information and ideally be more robust to speaker and noise variations. The reason I say that is that when you're training the DNN for ASR, you want to remove speaker variability. But is it suitable for channel-degraded language ID? In the RATS program, and I think it was IBM who proposed it, the CNN was particularly good for the RATS noisy conditions. We validated that on keyword spotting trials; you can see the dramatic difference it makes on channel-degraded speech. So we said, let's use the CNN rather than the DNN. That's something that's still open; we got a few review comments on that, actually, that we need to validate the difference between the two to show the actual improvement in LID performance. We'll do that in future work.
Moving along to the CNN: the training process is essentially the same as for the DNN. You can see we've got acoustic features that go into an HMM-GMM, which provides alignments for training the DNN; once you've trained the DNN you no longer need the HMM-GMM, so you don't need to generate those alignments at test time. The acoustic features are forty-dimensional log mel filterbank energies used for training the neural net. In our work we're stacking fifteen frames together as the input for training, and we use a decision tree to define the senones. As I said, we generate the training alignments with the pre-trained HMM-GMM, which we don't need afterwards; it's shown just as an illustration.
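As a rough sketch of the input layout, assuming the forty-dimensional filterbanks and fifteen-frame stacking mentioned above (the helper itself is hypothetical), each training example ends up as a 600-dimensional spliced vector.

```python
import numpy as np

def stack_frames(fbank, context=15):
    """Splice a symmetric window of frames around each centre frame.

    fbank: (num_frames, 40) log mel filterbank energies.
    Returns (num_frames, 40 * context); edges are padded by repeating the
    first and last frame, one common convention.
    """
    half = context // 2
    padded = np.pad(fbank, ((half, half), (0, 0)), mode="edge")
    windows = [padded[i:i + len(fbank)] for i in range(context)]
    return np.concatenate(windows, axis=1)

fbank = np.random.default_rng(5).standard_normal((1000, 40))
nn_input = stack_frames(fbank)                # shape (1000, 600)
```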
The CNN basically, in front of the DNN, appends this layer, all this processing here, where you've got your filterbank energies within the fifteen-frame context. You pass those through a convolutional filter; I think we're using a size of eight, which is in the paper. And then we do max pooling on the output: for each block of three outputs that comes out, we take the maximum one, and that just helps with the noise robustness.
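Here is a minimal sketch of the convolution-plus-max-pooling mechanics along the frequency axis, assuming a single filter of size eight over the forty filterbank channels and non-overlapping pooling over blocks of three; the real CNN learns many such filters and works over the full fifteen-frame context.

```python
import numpy as np

def conv_max_pool(bands, filt, pool=3):
    """Convolve one frame's filterbank vector along frequency, then max-pool.

    bands: (40,) log mel filterbank energies for one frame.
    filt:  (8,) a single convolutional filter spanning eight frequency bands.
    """
    num_positions = len(bands) - len(filt) + 1          # 33 filter positions
    conv = np.array([bands[i:i + len(filt)] @ filt
                     for i in range(num_positions)])
    # Max pooling: keep the largest activation in each block of `pool`
    # consecutive outputs, which gives some robustness to narrow-band noise.
    usable = (len(conv) // pool) * pool
    return conv[:usable].reshape(-1, pool).max(axis=1)

rng = np.random.default_rng(6)
bands = rng.standard_normal(40)
filt = rng.standard_normal(8)
pooled = conv_max_pool(bands, filt)                     # shape (11,)
```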
Now to the CNN i-vector system that goes with this. You can see that we simply plug the CNN in instead of the UBM; it's quite straightforward. What's interesting here is that we've got two different sets of acoustic features. The first is used by the CNN to get the posteriors for each of the senones, and then we multiply those posteriors with the acoustic features for language ID, the ones used to discriminate languages; that is the second set of features. The benefit of keeping the two apart is that the posteriors only need to be extracted once, regardless of which features you accumulate; if you choose to, you can use the same features for both, but if you want to use multiple features in a fusion system, you still only need to extract posteriors using that one set of features. This is in contrast to using multiple features for fusion with UBM systems, where you've got to extract, for instance, five different sets of posteriors, one per feature, if you had a five-way fusion. Another aspect is that you're able to tailor the language ID features independently of those providing the posteriors. Currently, with UBM systems, it's a bit of a balancing act: you want stable posteriors, but you also want good discriminability from the features on the other side, in the statistics.
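To make the fusion point concrete, here is a small sketch (illustrative only, reusing the hypothetical stats helper from before) of pairing one set of CNN senone posteriors with several language ID feature streams; a UBM-based fusion would instead need one posterior extraction per feature stream.

```python
import numpy as np

def accumulate_stats(posteriors, features):
    """Zeroth- and first-order statistics from shared frame posteriors."""
    return posteriors.sum(axis=0), posteriors.T @ features

rng = np.random.default_rng(1)
senone_posteriors = rng.dirichlet(np.ones(3000), size=300)   # one CNN pass

# Several LID feature streams can all reuse the same posteriors.
feature_streams = {
    "stream_a": rng.standard_normal((300, 40)),
    "stream_b": rng.standard_normal((300, 60)),
}
stats = {name: accumulate_stats(senone_posteriors, feats)
         for name, feats in feature_streams.items()}
```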
So can we go with an even simpler, alternative system? Here's a simple system where we take the CNN and get the frame posteriors, and we forget about the first-order statistics. What we're doing here is normalizing the zeroth-order statistics in the log domain, and then we just use a simple backend; for instance, here we're using a neural network, but you could use a Gaussian backend as well. That's one thing that distinguishes it from a phonotactic system: you can use standard language ID backends, which is nice. So here we're using counts of tied context-dependent states, that is, at the state level instead of phone labels.
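A minimal sketch of the posterior system as just described, with made-up silence indexes: average the senone posteriors over the utterance, drop the silence classes, and move to the log domain before handing the vector to a standard backend.

```python
import numpy as np

def posterior_vector(frame_posteriors, silence_indexes, eps=1e-10):
    """Utterance-level log-normalised zeroth-order statistics.

    frame_posteriors: (num_frames, num_senones) CNN outputs for one utterance.
    silence_indexes:  senone indexes treated as silence and discarded.
    """
    avg = frame_posteriors.mean(axis=0)       # average posterior per senone
    kept = np.delete(avg, silence_indexes)    # drop the silence senones
    kept = kept / kept.sum()                  # renormalise over speech senones
    return np.log(kept + eps)                 # log-domain vector for the backend

post = np.random.default_rng(4).dirichlet(np.ones(3000), size=500)
vec = posterior_vector(post, silence_indexes=[0, 1, 2])   # silence indexes assumed
```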
Let's look at the experimental setup: the DARPA RATS program. I'm sure many of you know how noisy these samples are; I think John was talking about them earlier. The target languages, you can see those on the screen. There are a few channel degradations, seven different channels, with SNRs between zero and thirty. The transcriptions that we used to train the CNN come from the keyword spotting task and cover only two languages; that's an unusual aspect: we're trying to distinguish five languages, but we're training on two of them. Test durations are three, ten, thirty and one hundred twenty seconds, and the metric we use here is the average equal error rate across the target languages.
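For reference, a sketch of the metric under the assumption that each target language defines a binary detection task (that language versus everything else) and the equal error rates are averaged over languages; the scores here are synthetic.

```python
import numpy as np

def eer(scores, labels):
    """Approximate equal error rate; labels are 1 for target-language trials."""
    thresholds = np.unique(scores)
    miss = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    fa = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    i = np.argmin(np.abs(miss - fa))       # point where the two error rates cross
    return (miss[i] + fa[i]) / 2

def average_eer(scores_by_language):
    """Mean of the per-language equal error rates."""
    return float(np.mean([eer(s, y) for s, y in scores_by_language.values()]))

rng = np.random.default_rng(2)
scores_by_language = {}
for lang in ["lang_a", "lang_b", "lang_c", "lang_d", "lang_e"]:
    target = rng.normal(1.0, 1.0, 100)      # scores for target-language trials
    nontarget = rng.normal(-1.0, 1.0, 400)  # scores for all other trials
    scores = np.concatenate([target, nontarget])
    labels = np.concatenate([np.ones(100, dtype=int), np.zeros(400, dtype=int)])
    scores_by_language[lang] = (scores, labels)

print(average_eer(scores_by_language))
```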
In terms of the models, the one used to generate the training alignments for the DNN, the HMM-GMM setup, was producing around three thousand senones with two hundred thousand Gaussians, and this was multilingual training on both Levantine Arabic and Farsi. The CNN model was also trained on senones with the multilingual training set. We've got hidden layers with twelve hundred nodes each, and we've got forty filterbanks over fifteen stacked frames; you can see the pooling size and the filter size used for the CNN convolutional part.
For the UBM model, for comparison, we trained a two-thousand-forty-eight-component UBM, and the features were directly optimized for the LID task rather than reusing the speaker ID setup; they tend to perform well for language ID. They're forty-dimensional 2D-DCT log mel spectral features, and this is similar to the zigzag DCT work that we proposed at ICASSP and to the PCA-DCT I showed earlier for speaker ID; it's an extension that further improves on those.
What about the i-vectors and the backend? The DNN and UBM i-vector systems were all trained on the same data for the i-vector subspace, and they're four hundred dimensional. For the posterior system, we're collecting the three thousand average posteriors, removing the silence indexes, three of those, and reducing to four hundred dimensions, the same as the i-vector subspace, using probabilistic PCA.
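A small sketch of that reduction step, using scikit-learn's PCA as a stand-in for probabilistic PCA (the maximum-likelihood subspace is the same); the 400 comes from the talk, the matrix here is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical training matrix: utterance-level log-posterior vectors
# (about 3000 senones with the silence indexes already removed).
X = rng.standard_normal((5000, 2997))
reducer = PCA(n_components=400).fit(X)
X_reduced = reducer.transform(X)      # 400-dimensional inputs for the MLP backend
```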
For the backend, we train a simple neural network, an MLP, with cross entropy.
What we do with the data to enlarge our training dataset is to chunk it into thirty-second chunks, and shorter chunks, with fifty percent overlap; I think we end up with around two million i-vectors to train on.
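A sketch of the chunking scheme, assuming frame-level indexing at 100 frames per second; the thirty-second length and fifty percent overlap are from the talk, the rest is illustrative.

```python
def chunk_indices(num_frames, chunk_frames, overlap=0.5):
    """Start/end frame indexes for overlapping chunks of one utterance."""
    step = int(chunk_frames * (1 - overlap))
    last_start = max(num_frames - chunk_frames, 0)
    return [(s, s + chunk_frames) for s in range(0, last_start + 1, step)]

frames_per_second = 100
utterance_frames = 120 * frames_per_second          # a two-minute utterance
chunks = chunk_indices(utterance_frames, 30 * frames_per_second)
# Each (start, end) pair would yield one extra i-vector (or posterior vector)
# for backend training; shorter chunk lengths were handled the same way.
```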
The output is the five target languages, plus the out-of-set class as well.
How did performance go? Well, first of all, for the UBM i-vector approach, the UBM isn't normally trained in a supervised manner, so what we said was: let's take the senones from the CNN, where we know we've got three thousand of them, let's align the frames for each of those senones, and train each of the UBM components with that. The idea here was to try to give a fair comparison between the UBM and CNN systems. We see a nice improvement across all of those durations for the CNN approaches. You can see that for ten seconds or more we're getting a thirty percent or greater relative improvement over the UBM approach; for the three-second test durations, a twenty percent relative improvement.
What was interesting is that the LID performance of the posterior system and the i-vector system for the CNN is actually quite similar. But if we fuse the two, we get a nice gain, again around a twenty percent relative improvement, just from combining the two different approaches, particularly for durations of less than one hundred twenty seconds; for one twenty the gain wasn't as pronounced. When we added the UBM i-vector system to that, so a different modeling approach, we actually get no gain from the fusion except in the one-twenty case, which is another interesting finding.
In conclusion, we compared the robustness of the CNN in the noisy conditions, in particular taking the DNN i-vector framework and making it effective on the RATS language ID task. We also proposed, in that vein, an alternative to the phonotactic system, the CNN posterior system, which is quite a simple system, and showed high complementarity between these two proposals. In terms of extensions, where do we go from here? We can improve performance a little by not doing probabilistic PCA before the backend classification, and fusing the scores of different language-dependent senone sets also provides a gain. And bottleneck features, which I think will be talked about later, are also a good alternative to the direct usage of the DNN or CNN for language ID. Thank you.
We have time for some questions.
Thanks. We'll take the first question from the right.
So for the posterior system, the CNN posterior system, you said you use PCA to go to four hundred dimensions prior to the neural network, right?
Yep.
Did you also try putting in the full vector? I imagine it would help.
Yes. On the extensions slide, the first point says that if we don't do that reduction, we do get a slight improvement. I think the motivation for reducing it to four hundred dimensions was comparability with the i-vector space, to see what we could get in those same four hundred dimensions.
So my question has to do with the data that was used to train the ASR and the CNN. Was it the multiple channels of the Arabic and the Farsi data? So you trained in-channel?
The channel conditions, yes.
And I believe that goes for the UBM too?
The UBM, yes, so we used the channel-degraded data for the UBM as well, the same data.
Did you use the Arabic and the Farsi data to train the UBM alignments, or was the UBM exposed to all five languages across all channel conditions? Because that was one thing you said: you took the alignment of the senones and then trained the states of the UBM.
For the second one there, I guess that's what it was used for: the supervised UBM got the alignments from the senones coming through the CNN, and that was trained with the keyword spotting data. But the UBM itself, I believe, and I'd have to check this with my colleagues, was trained with data across the five languages.
So how much of an impact do you think that has, having the datasets change between the classifiers?
To that question I would have to say, you would think that having five languages in the UBM, plus the other out-of-set languages, would give it an advantage to some degree. But as you said, the datasets are changing, so I think that would be a good point to look into.
So if you're going to do a very wide set of languages, do you have a hope of having sort of a single master universal one, like the Hungarian TRAPs that have been so successful in the past, or do you think you're going to have to build many different language-dependent DNNs?
What we're seeing so far is basically that the more language-dependent DNNs you put together, that you fuse together, the smaller the improvement becomes. So perhaps if you had five that cover a good space of the phones across different languages, that might be what you might call a universal collection. That's appealing.
Right.