So in this work we basically test and evaluate our LSTM-based language recognition system in different scenarios. This is a system that we had already presented, and the objective is to test how it is affected by different scenarios.
So first, the motivation: why we started using this architecture and how we started using it. Then we will do a brief overview of the LSTM; probably you are already quite aware of this, but I guess it's nice to have a little refresher. Then we will go over all the details of the experiments: we will detail the system description, the reference i-vector system that we will compare our proposed system against, the different scenarios we are going to test it on, and the results. And finally we will conclude the work.
So, we all know what LID is already: the process of automatically identifying the language of a given spoken utterance. Typically, for many years, this has been done relying on acoustic models, so these systems basically have two stages: first some i-vector extraction, and then some classification stage.
In the last years we are seeing a really strong new line, which is deep neural networks, and it can be more or less divided into three different approaches. One is the end-to-end systems; we have seen that it is a very nice solution, but we are not achieving the best results with it so far. Then we have the bottleneck features, where after computing the bottlenecks we go to the i-vector extraction and we keep the full pipeline. And then we have the senone-based approach. Sorry for the typo there.
In this paper we want to focus on the end-to-end approach; we want to improve the end-to-end approach for this task. This would be a very standard DNN for language recognition when we try to use an end-to-end approach: basically we have some features as input, then we have one or several hidden layers with some nonlinearity, and in the last layer we try to compute the probability of each of the languages we are going to test; for this we use a softmax, which gives us probabilities.
One of the main drawbacks of this system is that we need some context: if we try to get an output frame by frame, we are not going to get a good result, so this system relies on stacking several acoustic frames in order to add the time context. And that has problems: for one, we have a fixed window length that probably will not work best for all the different conditions, and it is a rather brute-force way of modeling time.
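As a minimal sketch of what this frame stacking looks like (the window of plus and minus ten frames is just an illustrative assumption, not a value from the talk):

```python
import numpy as np

def stack_frames(feats, left=10, right=10):
    """Splice each frame with `left` previous and `right` following frames.

    feats: (T, D) array of acoustic features (e.g. MFCCs).
    Returns a (T, (left + right + 1) * D) array of stacked frames;
    edges are handled by repeating the first/last frame.
    """
    T, _ = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.concatenate([padded[t:t + T] for t in range(left + right + 1)],
                          axis=1)
```

The fixed window, here 21 frames, is exactly the limitation mentioned above: it has to be chosen in advance and stays the same for all conditions.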
So how can we model this in a better way? The theoretical answer is recurrent neural networks: basically we have the same structure as before, but this time we have recurrent connections; all the rest is the same. What is the problem with these ones? We have the vanishing gradient problem. In theory it is a very nice model, but when we try to train these networks, because of these recurrent connections we end up having all the weights going either to zero or to something really high. There are ways to avoid this, but usually it is very tricky; it depends a lot on the task and on the data, so it is not really practical.
And here is where the LSTM comes in. Basically, for the LSTM we take first a standard DNN and we replace all the hidden nodes with this LSTM block that we have here.
So let's go through the theory of this block. It seems kind of scary when you first see it, but it is pretty simple after you look at it for a while. We have a flow of information that goes from the bottom to the top, and as in any standard neuron we have a nonlinear function, this one here. The special thing about the LSTM is that it has a memory cell, this one here. All the other stuff that we have there are three different gates; what they do is they let, or they don't let, the information go through. Here we have the input gate: if it is activated, we will let the input at a given time step go forward; if it is not, it won't. We have the forget gate: what it does is basically reset the memory, so if it is deactivated it will reset the cell to zero; otherwise it will keep the state of the previous time step. And the output gate: that gate will let the computed output here go to the rest of the network, or not.
And then what we have, of course, are recurrent connections, so the output at one time step goes into the input at the next time step. So it is basically trying to mimic the RNN model, but with these gates we avoid that problem, because the gates act not only on the current input but also on the recurrence in time. So when we are doing the backpropagation and we have some error that is going to modify the weights, the forget gate or the input gate will also block that error from propagating back many time steps, so we avoid the vanishing gradient problem.
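As a minimal numpy sketch of one forward step of such an LSTM block (the standard formulation with sigmoid gates and tanh nonlinearities; the weight names are hypothetical and the peephole terms are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM block.

    x: input frame; h_prev, c_prev: previous output and cell state.
    W, U, b: dicts of input weights, recurrent weights and biases for
    the gates 'i' (input), 'f' (forget), 'o' (output) and the cell
    candidate 'g'.
    """
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate: let the input in or not
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate: keep or reset the memory
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate: expose the cell or not
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate cell update
    c = f * c_prev + i * g  # new memory: gated mix of old state and new input
    h = o * np.tanh(c)      # output passed to the rest of the network
    return h, c
```

The additive update `c = f * c_prev + i * g` is what lets the error flow back through time without vanishing: the gates decide where the gradient is allowed to pass.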
The system that we used for language recognition, then, does not rely on stacking acoustic frames: we feed only one frame at a time. We will have one or two hidden layers, and the layers will be unidirectional LSTMs. We also add the peephole connections that we have here, which basically allow the network to make decisions depending on the state of the cell, so they are supposed to improve the performance of the memory cell. At the output we will use a softmax, just like in the DNN, with a cross-entropy error function.
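A minimal sketch of this topology in Keras (my illustrative reconstruction, not the authors' actual code; note that the stock Keras LSTM layer omits the peephole connections mentioned above, and the feature dimension, number of languages, and layer size of 512 are placeholder assumptions):

```python
import tensorflow as tf

n_feats, n_langs = 39, 8  # placeholder feature and language counts

# One frame at a time over the whole sequence, softmax output per frame.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, n_feats)),                 # variable-length utterances
    tf.keras.layers.LSTM(512, return_sequences=True),      # unidirectional LSTM layer
    tf.keras.layers.LSTM(512, return_sequences=True),      # optional second hidden layer
    tf.keras.layers.Dense(n_langs, activation="softmax"),  # per-frame language posteriors
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
```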
For training, what we do is the following. In the first scenario we will have a very balanced, nice dataset, so we don't need to do anything special. But in the more difficult scenarios we will have some unbalanced data, so what we do in order to avoid problems with the unbalanced data is just oversampling: we take random crops of two seconds so that we have six hours of each language in every iteration, and the data that each iteration sees is different.
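A sketch of how that oversampling could look (the two-second crops and the six-hour quota come from the talk; the function and variable names, and the 10 ms frame shift, are my assumptions):

```python
import random

FRAMES_PER_SEC = 100               # assuming a 10 ms frame shift
CROP = 2 * FRAMES_PER_SEC          # random crops of two seconds
QUOTA = 6 * 3600 * FRAMES_PER_SEC  # six hours per language per iteration

def sample_iteration(utts_by_lang):
    """Draw a fresh balanced training set: random 2 s crops until every
    language reaches the same quota of frames."""
    batch = []
    for lang, utts in utts_by_lang.items():
        drawn = 0
        while drawn < QUOTA:
            feats = random.choice(utts)  # pick a random utterance of this language
            if len(feats) <= CROP:
                crop = feats             # short utterance: take it whole
            else:
                start = random.randrange(len(feats) - CROP)
                crop = feats[start:start + CROP]  # random 2-second window
            batch.append((crop, lang))
            drawn += len(crop)
    return batch
```

Because the crops are redrawn every time, each iteration sees a different balanced sample even for the languages with little data.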
Then, to compute the final score of an utterance, we will average the softmax outputs, but taking into account only the last ten percent of the scores; I will explain why in more detail later.
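A minimal sketch of this scoring rule, assuming the per-frame softmax outputs come as a `(T, n_langs)` matrix:

```python
import numpy as np

def utterance_score(posteriors):
    """Average the per-frame softmax outputs over the last 10% of frames.

    posteriors: (T, n_langs) array with one softmax vector per frame.
    Returns an (n_langs,) utterance-level score vector.
    """
    T = posteriors.shape[0]
    start = min(int(0.9 * T), T - 1)  # keep at least the very last frame
    return posteriors[start:].mean(axis=0)
```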
And then, finally, we will use a multiclass linear logistic regression for calibration; we keep it simple.
We will compare the system to a reference i-vector system. It is a very straightforward one, using MFCC-SDC features, exactly the same features that we use for the LSTM. We use one thousand twenty-four Gaussian components for the UBM, and the i-vectors are of size four hundred. It is based on cosine distance scoring: we tried it with and without LDA and, depending on how many languages we have, one or the other was working better, so that is why we decided to take plain cosine distance scoring. If we had more languages it would be better to use LDA, but the difference was small enough not to matter too much here. It is a standard implementation, and it is always trained with exactly the same data as the LSTM.
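A sketch of the cosine distance scoring (assuming, as an illustration, one model i-vector per language, for example the average of that language's training i-vectors):

```python
import numpy as np

def cosine_scores(test_ivec, lang_ivecs):
    """Score a test i-vector against one model i-vector per language.

    test_ivec: (D,) array (D = 400 here); lang_ivecs: (n_langs, D) array.
    Returns the (n_langs,) vector of cosine similarities.
    """
    t = test_ivec / np.linalg.norm(test_ivec)
    m = lang_ivecs / np.linalg.norm(lang_ivecs, axis=1, keepdims=True)
    return m @ t
```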
So these are the three scenarios on which we are going to compare and test these networks. The first one is a subset of the NIST 2009 Language Recognition Evaluation; the data that we use comes from the three-second task. This is a subset prepared so that the LSTM will work at its best, so it is a rather easy subset of the 2009 evaluation. What we did first: the evaluation has an imbalanced mix of CTS and Voice of America data, so we dropped all the CTS data; with that we avoid the imbalanced mix, and we also avoid a mismatch in training, since we have only one dataset.
For the languages, we also wanted to have a high amount of data, so we took only those languages that had at least two hundred or more hours. And we didn't want to have unbalanced data either, so we cut the datasets so that all of them have two hundred hours available for training. The subset of languages we have here is not hand-picked; it is not the most difficult one, as we said before; it is just those languages that happened to have these two hundred hours of Voice of America data. And we use only the three-second task because, historically, we saw that short utterances are where the neural networks outperform the i-vector the most, so we wanted to be in that scenario.
The second scenario that we want to test is the dev set of the NIST Language Recognition Evaluation 2015. Here we don't avoid any of the difficulties, so we have a mix of CTS and broadcast narrowband speech, and we keep everything. We have seen the details of this before: it is twenty languages grouped in six clusters according to similarity, so it is supposed to be more challenging, because the languages are closer within a cluster. The amount of training data is also going to be quite heterogeneous: we have some languages with less than an hour and some languages with more than a hundred hours. The split that we did is eighty-five percent for training and fifteen percent for testing. That is something we would not do again if we ran the experiments again; this is what we did at the time, before the eval set and everything, and we thought it would be nice to have more data for training. But afterwards we ran some experiments and found that having a little less training data but more dev data would have helped. Still, we keep exactly what we used in the evaluation. And as the test, what we did with that fifteen percent is take chunks of three seconds, ten seconds, and thirty seconds, to mimic a little bit the conditions of the LRE 2015 evaluation.
And then the third scenario will be the test set of the NIST Language Recognition Evaluation 2015, which covers a broad range of speech durations (it is not fixed bins anymore) and has big mismatches between training and evaluation, as we saw before.
So, the results. The first one is kind of a side result; it is not that important, but since we are using a unidirectional LSTM, the output at a given time step depends not only on the input at that time step but also on all the previous inputs. So the last output is always more reliable than the ones before. And we thought that maybe we were hurting the performance by taking the first outputs, which are less reliable, so we just started dropping the first outputs and seeing how that affected the performance. For this analysis we don't really care about the absolute numbers we have here, only about how the performance improves; the absolute equal error rate doesn't matter, only the relative difference. And we found that taking into account only the last ten percent is a near-optimal point. We also saw that taking into account only the very last score, only one output of the softmax, was as good as taking the last ten percent, but we kept the average over the last ten percent.
So these are the results on the first scenario. Remember that this is the one with only Voice of America data, eight languages, two hundred hours per language for training. What we have here are the different architectures that we used: we have both one hidden layer and two hidden layers, and then different sizes of the hidden layer, from the smallest with two hundred fifty-six units to the biggest with one thousand twenty-four. This column is the size, in terms of number of parameters, of all the models, and these are the results that we obtained. The reference i-vector system has almost seventeen percent equal error rate and a Cavg of point sixty-nine, and we see that pretty much all the LSTM approaches clearly outperform it, many of them with a much smaller number of parameters. So those are really good results, but we are in this balanced, easy scenario. As we can see, the best system gets around fifteen percent better equal error rate with around eighty-five percent fewer parameters. We also wanted to check how complementary the information extracted by the LSTM and the i-vector was, so we fused the best LSTM system with the reference i-vector system, and the result improves even more, to twelve percent, which is about fifteen percent better than the best single system. This is the confusion matrix; it does not carry much information, but we can see, not only in terms of accuracy but also in the confusions with the other languages, how the system performed on this subset.
These are the results on the dev set of the Language Recognition Evaluation 2015. For this one we did not experiment with different architectures; we were in a bit of a hurry, so we used only the best system from the previous scenario, which was two layers of size five hundred twelve. What we can see here is that the LSTM performs much better than the i-vector on three seconds, while on thirty seconds, in this scenario where we have these mismatches between the databases and the imbalance in the datasets, this end-to-end system is not that good. So we still see the pattern from before: this end-to-end approach is able to extract more information from short utterances, but not that much from longer ones. But one thing that we see here is that even though the results for longer utterances are way worse than those of the i-vector, the fusion is pretty much always better than any of the single systems. So even when the LSTM is working worse than the i-vector, we are able to extract different information that helps in the final fused system. So we were also quite happy with these results.
This is the DET curve that we have for three seconds, where we can see that the LSTM achieves over a twenty percent relative improvement over the i-vector, and we also see that the fusion always works better than any of the single systems.
And now we go to the results on the test set of the Language Recognition Evaluation 2015, and here things get much harder. First of all: the first column is the LSTM, the second column is the i-vector, and the third one is the fusion of both, the non-cheating one, the one we used for the submission. The fourth one is exactly the same but using a cheating fusion: we use a two-fold cross-validation, so we use one half of the test set for training the fusion applied to the other half. Of course that is not allowed in the evaluation, but we wanted to know whether the systems were learning complementary information or not; maybe we were just not fusing in a good way, so this lets us extract how much complementary information there is.
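A sketch of that two-fold cheating fusion with a multiclass linear logistic regression backend (scikit-learn is used here as an illustrative stand-in for whatever fusion tool was actually used, and it assumes every language appears in each half of the test set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cheating_fusion(scores_lstm, scores_ivec, labels):
    """Two-fold cross-validated fusion over the test set itself.

    scores_*: (N, n_langs) score matrices from each system; labels: (N,).
    Each half of the test set trains the fusion applied to the other half.
    """
    X = np.hstack([scores_lstm, scores_ivec])  # stack the two systems' scores
    fused = np.zeros_like(scores_lstm, dtype=float)
    half = len(X) // 2
    folds = [(slice(0, half), slice(half, None)),
             (slice(half, None), slice(0, half))]
    for train, test in folds:
        clf = LogisticRegression(max_iter=1000).fit(X[train], labels[train])
        fused[test] = clf.predict_proba(X[test])
    return fused
```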
So the message to take from here is that, first of all, end-to-end learning in this very hard scenario is able to get results that are comparable with the i-vector, but it gets relatively worse as the duration increases, because the i-vector is able to extract better results from longer utterances while the LSTM stays at the same performance. The good thing is that we do not have such a big mismatch that we cannot do a good fusion: even on the unseen test set, we can still improve on the performance of the i-vector with the fusion.
So, to conclude the work, the main take-away messages are these. First of all, on a controlled, balanced scenario we have very promising results: it is a very simple system with eighty-five percent fewer parameters, and it is able to get a fifteen percent relative improvement. The problem is that once it goes to an imbalanced, more realistic scenario, the results are not as good. And finally, we saw that under strong mismatches and harder scenarios we are not able to extract the information as well, so there is a need for variability compensation. But we still think that it is a really promising approach that leads to simpler systems that can get quite good results.
Okay, time for questions.
Just a small comment: you say that you are averaging the outputs over the last ten percent of frames. Are you always using ten percent? For a three-second test or a thirty-second test, you always use ten percent? Did you try to just average over the thirty last frames, independently of the duration of the test utterance?
We did, actually; not for this analysis, but for the LRE ones we tried a lot of things, not only averaging: also taking the median, or selecting outputs based on their values, or just dropping the ones that are outliers. And we found that it did not really make a noticeable difference there. But maybe in a more challenging scenario it would be worthwhile; we have not tried that yet.
Is it possible to go back to slide twenty-four for a second? Sorry. I noticed you are almost always getting an improvement with the LSTMs versus the i-vector, but when you look at the English cluster, when you went to the fusion, the fusion actually did worse than the i-vector system, three point two, while the i-vector had one point nine. That is the only one where you did not get an improvement. Is there a reason why? Do you see why it happened, maybe why the LSTM actually had worse performance there?
So, I am not completely sure, but we have some idea of why that happened. The idea is that for training the systems, what we did is this oversampling, and in the English cluster there was one language, I think it was British English, that had only half an hour of data. Of course that hurt the LSTM, so it has worse results, but I think it also hurt the fusion. When you have one language with less data, for the DNN, for the LSTM, you can more or less fix it with oversampling. But for the fusion you usually need much less data in general, so in all the other clusters that is not a problem: even if they are imbalanced, for calibration you still have enough of all of them. But for the English one, I think what happened is that we did not have enough data for calibrating, for training the fusion. So I think that is the reason: the fusion is not well trained because we do not have enough data for one of the languages.
I have a question. I found it quite interesting that your LSTM has fewer parameters than the i-vector system, and I am wondering about the time complexity: how long does it take to train, and what is the test time, compared to the i-vector system?
The training time is much longer, because we had a lot of iterations. I think it is also because of the way we trained it: since we use a different subset per iteration, we need a lot of iterations. Actually, I think the numbers we have are not the best we could get, because this was done for an evaluation, so some of the networks were still improving when we had to stop them, and we report them as they were. So training time is longer, even though the model has much fewer parameters. But testing time is way faster, and of course, once you have the network trained you only need to do the forward pass, while with the i-vector, whenever you have new data you always have to extract the i-vector first, before doing the scoring.
Any more questions? So then there is lots of time for coffee, I guess; we will be back at five o'clock.