this is kind of the transition from the systems in the previous session into the dnns
we all could have presented in both, but i think this is a good transition 'cause we did have some new things that i want to talk about
this is work with my colleagues greg sell and daniel from johns hopkins, both of whom unfortunately were unable to get spousal permission to attend, but they have good excuses: greg's wife had their second child two weeks ago and daniel's is due in about two
so they have a reason
so i'm going to present an overview of the dnn i-vector system that we submitted to lre fifteen
i want to give a shout out here to nist for introducing this fixed training data condition, which actually allowed us to make a very competitive system with only three people, which is not very common in lre historically
algorithmically, i'll go into more detail, but we used dnns; unlike some of the previous presentations you've seen, we were able to get good performance not just with the bottleneck features but also with the dnn state labels, and i'll talk about that
we used three different kinds of i-vectors, and i'll explain that more: everyone had acoustic systems and those are very good; we were able to do quite well with the phonotactic i-vector system as well; and here we're trying, for the first time, a joint i-vector which does both things at once
because we had a fairly powerful system that we were comfortable with, and we didn't trust that we had enough development data, we used i think the simplest and most naive fusion of anybody, and it seemed to work for us, because we actually got a gain from fusion, which i think also made us one of the few
and that was just to sum the scores together and then scale them with the duration model that i'll talk about
and lastly, as i think has been mentioned, but i want to go into it a little bit more because this was a limited data task: data augmentation turned out to be very helpful for us
so in the talk i'll go through our basic i-vector system design, talk about the two ways that we use the dnns, which have both been touched on previously today, and talk about the alternate i-vectors we experimented with
i'll talk more specifically about the lre fifteen task, how we used the data, and what we learned later about how we could have used the data
and after that i'll talk about the results that we had in the submission, and some interesting things that we've learned since, both about what other systems could have done and also about how we could have done better with the systems that we used
so here's a block diagram of our lid system
it's a little i-vector system, and it can be split into two parts: the first uses the unlabeled data to do the ubm and the t matrix learning, and then the supervised part is basically the two-covariance model (within-class and across-class covariance), which is first used in lda to reduce the dimension, and then the same matrices are used for the gaussian scoring that follows
as we've done for a while, rather than having a separate back end to do the work, we do a discriminative refinement of these gaussian parameters to produce a system that not only performs a little bit better but also produces naturally calibrated scores
and we do that in a two-step process: first we learn a scale factor on the within-class covariance, and then we go into all the class means and adjust them to better provide the discriminative power; for that we're using the mmi algorithm from gmm training in a really simplified mode
and of course that's the same criterion as the multiclass cross entropy that everybody uses every day
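the gaussian scoring step above can be sketched in a few lines; this is a minimal numpy illustration of a shared within-class covariance classifier with a learnable scale factor, with made-up function names, and it omits the discriminative mmi refinement of the means:

```python
import numpy as np

def gaussian_backend_scores(ivecs, class_means, within_cov, scale=1.0):
    """Score i-vectors against class means under a shared within-class
    covariance; `scale` stands in for the discriminatively learned scale
    factor on the within-class covariance mentioned in the talk."""
    prec = np.linalg.inv(scale * within_cov)
    # log-likelihood of each i-vector under each class gaussian, up to a
    # class-independent constant (which cancels in the softmax below)
    return np.stack([
        -0.5 * np.einsum('nd,dk,nk->n', ivecs - m, prec, ivecs - m)
        for m in class_means
    ], axis=1)

def posteriors(scores):
    # multiclass softmax over languages = class posteriors
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

the multiclass cross entropy of these posteriors against the true labels is then the criterion the mmi refinement optimizes.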
so let me lay out and talk more about how we use the dnns; other people have mentioned it, but let me show some pictures so you can see better what we're doing: splitting up the normal use of the gmm to do the alignment and then compute the stats after that
we're splitting it out in two ways using the dnns: the first is simply to replace the mfccs with bottleneck features from the dnn, and we're just using a straightforward bottleneck, not anything fancy
and then the second system is a little bit more complicated: we use the dnn to generate the frame posteriors for the senones, the clustered states; those are used to label the data and do the alignment, and then you use the ubm after that
i didn't have time to draw a dnn, so this is daniel's best rendition of a probable dnn
a couple of things that are perhaps particular about our system, or about the kaldi way of doing things (which by the way we do highly recommend): it uses this p-norm, which is kind of like max pooling, so there is an expansion and a contraction made at each layer, and that's how the nonlinearity comes in
also, i think probably nobody mentions this these days, but we're not using fmllr, which i think is common, for our purposes
you can see we basically use the same architecture either for the senone posteriors or, when we introduce the bottleneck, the one that's just going to be the bottleneck gets a little linear layer before the one in the middle there
we have about nine thousand output states, so it is a pretty big ubm that we get out of this
and of course it's trained using switchboard one, 'cause that's what we were given for the fixed data condition
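as an aside, the p-norm nonlinearity mentioned above works roughly like this; a sketch of the kaldi-style group p-norm, with an illustrative group size:

```python
import numpy as np

def pnorm(x, group_size, p=2.0):
    """Kaldi-style p-norm nonlinearity: the layer expands to many units,
    then contracts by taking the p-norm over non-overlapping groups of
    `group_size` units; the pooling is where the nonlinearity comes from
    (a generalized max pooling; p -> inf recovers max of |x|)."""
    groups = x.reshape(-1, group_size)
    return (np.abs(groups) ** p).sum(axis=1) ** (1.0 / p)
```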
so let me talk about the i-vectors a little bit
the one that we're all familiar with i'm going to call the acoustic i-vector; this is based on a gaussian probability model, and i've put in a little parenthesis that this is given that the alignments are already known, otherwise it would be much more complicated
because of that it's a big gaussian supervector problem: there's a closed-form solution for the map estimate of the i-vector, and there's an em algorithm for the t matrix estimation
the second approach is the phonotactic one; i think it was mentioned that it's been used in a number of places before
i'll talk about the details of it later, but the key is that we can still have sort of a gaussian model for an i-vector, but the output of the latent model is now the weights of the gmm instead of the means
and those things are naturally count based, so we need a multinomial probability model, not a gaussian probability model
and the way we do that is to go from log space through the softmax into the probability domain
even though it's a fairly simple formula, unfortunately there's not a closed-form solution for the optimal i-vector, so it needs newton's method iteration
and similarly there's not an em algorithm for the t matrix that we know of yet, so there's an alternating maximization algorithm
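to make the newton iteration concrete, here is a small sketch in our own illustrative notation (not the paper's exact algorithm) of the map i-vector estimate under the multinomial weight model w = softmax(m + T x), with a standard normal prior on x:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def phonotactic_ivector(counts, T, m, n_iter=25):
    """MAP i-vector x for a multinomial weight model w = softmax(m + T @ x),
    given observed state `counts` and a standard normal prior on x.
    No closed form exists, so we run Newton's method on the posterior."""
    K, D = T.shape
    x = np.zeros(D)
    n = counts.sum()
    for _ in range(n_iter):
        w = softmax(m + T @ x)
        grad = T.T @ (counts - n * w) - x        # d/dx [loglik + log prior]
        # Hessian: multinomial-logit curvature plus the prior's -I
        H = -(T.T @ (n * (np.diag(w) - np.outer(w, w))) @ T + np.eye(D))
        x = x - np.linalg.solve(H, grad)         # Newton step
    return x
```

in the joint i-vector described later, the same iteration is initialized from the acoustic closed-form solution and the hessian is simplified by pretending the dimensions are independent (i.e. keeping only its diagonal).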
so we had presented this phonotactic subspace thing for lid before
and in the meantime we thought, okay, we have two systems, an acoustic and a phonotactic; how are we going to combine them?
actually, the first thing we tried was score fusion, and yes, we did that, and yes, that works
and then we got a little more ambitious about the two i-vector systems: they're doing the same thing, so why don't i stack the i-vectors together, get one big i-vector, and then run one i-vector system; does that work?
and yes, that works too
and then we thought about it some more and said, well, why do i want two independent i-vector extractors? why can't i make one latent variable that drives both models: the means of the latent gmm that generated the cut, and the weights of the gmm that generated the cut
the fact is, the math says that you can; i'll go into a little more detail, but basically this is a permutation of the subspace gmm that dan povey was talking about in two thousand eight, two thousand nine, at the clsp workshop and since
so there are algorithms for doing this; we had to manipulate them a little bit for our purposes
so, a couple of details on how to do this; we have some references in the paper
some things in particular that we're doing differently than if you took it straight out of the subspace gmm work
the first is that there everything was done with sort of ml estimates, so there was no prior and no backoff
obviously for the acoustic we don't want to use ml i-vectors, we want to use map i-vectors
we've actually shown previously that for a phonotactic system map is also beneficial, and if we're going to do it jointly it's critical that it be the same criterion for both things, because it is a joint optimization: map of the overall likelihood plus the prior
a nice trick we can do with this joint i-vector is, since there's a closed-form solution for the acoustic part, we can initialize newton's method with the acoustic solution and then just refine it using the phonotactic part as well
that gets us to a starting point pretty easily, where we can then do a greatly simplified newton's descent, in particular by pretending everything is independent of everything else, which is a huge speed improvement, because doing the full hessian in this update, as anybody who's ever looked at it knows, is pretty tedious
so once we do that, rather than being much slower than an acoustic i-vector system, it's essentially the same order; it's very simple
so that's the method
okay, the lre fifteen task, which has been discussed; i guess this isn't news here: there is telephone and broadcast narrowband speech, with twenty languages in six confusable clusters
but the limited training condition is a very important element of what we were able to get away with
and of course that means both that you have limited data for only twenty languages, but it also means that you can only train your supervised dnn on the switchboard english, because that's the only thing that had transcripts
which is not our favourite thing to do, it was kind of limiting, but it allows nist to exercise the technology
and because the languages didn't have much data, that was also key
so, all of our systems: basically, because we had a small team, we didn't build too much complicated stuff; i've described really everything that we did
we had two different ways of using the dnn, and we had three different kinds of i-vectors that we could have built out of each of the two dnn systems
out of that we could have done six things; i'll talk about a few that were interesting and the ones that we actually submitted
but everything used the same classifier
and as i mentioned, because the systems are already calibrated by this mmi process, we didn't have to use a complicated back end
the one thing we did introduce, because we knew there was this range of durations that had to be exercised, and i think the simplest way we could get there, was to reuse some work that we had done previously on making a duration-dependent backend, where there's a continuous function which maps duration into a scale factor between the raw score and the true log likelihood estimate that you're trying to make
and there's a justification for that function, but for our purposes the important thing is that it's very simply trainable, because it's just got two free parameters
so you can use the cross entropy criterion and figure out the best parameters
and then, because we have a very simple system, we just add all the scores together, assume that they were independent estimates, and then rescale the whole thing to bring it back in
and we found that to be helpful for us
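the fuse-then-scale step described above can be sketched like this; the two-parameter form a * d / (d + b) is an illustrative assumption on my part, not necessarily the exact function from the paper:

```python
import numpy as np

def fuse_and_scale(system_scores, duration, a, b):
    """Sum per-system scores (treated as independent estimates), then
    multiply by a duration-dependent scale factor.  The two free
    parameters a, b would be trained with the cross-entropy criterion;
    the functional form here is a hypothetical example."""
    fused = np.sum(system_scores, axis=0)    # simplest possible fusion
    scale = a * duration / (duration + b)    # continuous map: duration -> scale
    return scale * fused
```

the intuition is that short cuts get their scores shrunk toward zero (less confident log likelihoods), while long cuts keep nearly the full fused score.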
another thing about lre fifteen, which was mentioned, but for people who may be less familiar with the task and could interpret it incorrectly, is very important
so nist proposed this somewhat odd task of closed-set detection within each of the clusters
what we did is generate for each cluster a lid score, which means that each cluster had lid posteriors summing to one; and since there are six clusters, that means we gave nist scores from the six separately, which means that if nist wanted to evaluate across-cluster performance it was meaningless
and we had to convert these lid posteriors to detection log likelihood ratios, which is something we all know how to do here
but one thing i want to mention about our system is that we didn't do anything cluster specific anywhere; we just trained a twenty-language lid system and then just spun out the scores for each of the clusters, because that's what nist wanted
i think we would like that in the future to become a more generic lid task
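the posterior-to-llr conversion is standard; assuming a flat prior over the classes in a cluster, a minimal version looks like this:

```python
import numpy as np

def posteriors_to_detection_llr(post):
    """Convert closed-set class posteriors (summing to 1, flat prior
    assumed) into per-class detection log-likelihood ratios:
    llr_i = log p_i - log(average posterior of the other classes)."""
    post = np.asarray(post, dtype=float)
    n = post.shape[-1]
    rest = (1.0 - post) / (n - 1)   # mean posterior over the other classes
    return np.log(post) - np.log(rest)
```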
now, the other key element that i mentioned is the limited training data, so we had to figure out what to do with that
as i mentioned, we have the unsupervised and supervised parts; we took the theory, which was later proven not quite right, that we would use everything we could for the unsupervised data, which included switchboard, which is english only, and english was not one of the languages
it turns out we could have done better than that, and i'll talk about it
and then for the classifier design we did find it helpful to do augmentation and to do duration modeling of the cuts: we used all sides, we used segments that were duration-appropriate for the lid task, and we used augmentation to vary the limited clean data and try to give us more examples of what the i-vectors would look like
to go into the augmentation a little bit more: many of these are standard things, and the big thing in dnns now is to do augmentation, so sample rate perturbation and additive noise (we made a few kinds of additive noise, some maybe more interesting than others); we did throw in reverb; and multi-band compression, which is kind of a signal processing thing you might see applied to an audio signal
but the thing i want to mention, and the thing that we actually don't have in the slides, but you can look in the paper: the most effective single augmentation for us in this task was to run a gsm encoder-decoder against the data
which kind of makes sense as a thing to do
and to a former speech coding person it's fairly attractive
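as one concrete example of the additive-noise style of augmentation mentioned above, here is an illustrative sketch (not our actual pipeline) of mixing noise into clean speech at a chosen snr:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR in dB.
    The noise is looped/trimmed to the speech length, then scaled so the
    resulting speech-to-noise power ratio matches `snr_db`."""
    noise = np.resize(noise, speech.shape)       # loop/trim to length
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```

each clean cut would be passed through several such perturbations (noise, reverb, codec, rate change) to multiply the limited training data.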
so, our submission performance
these are the four things that we submitted; our primary was in fact the one at the bottom, which looks like it was a pretty good choice out of what was available to us
so we did a joint i-vector on the bottleneck features; as i'll show later we have something more stable, but i guess we were worried about the dimensionality in this submission
our senone-based system was actually slightly better than our bottleneck system, and again, that makes it the best sort of phonotactic system i think anybody saw, because everyone else found the bottlenecks to be the only really good thing to do
and fusion provided a gain, partly because we have simple fusion and partly because we have two systems which are pretty good
so we learned a couple of things post-eval that we found somewhat educational
the first one i won't go into in much detail, it's in the paper, but within the family of gaussian scoring there's a question of whether you count trials as independent or not, which in speaker recognition you typically do, since you only have one trial per enrolment
the choice we made in what we submitted, which we usually see as slightly better, turned out for this evaluation to be slightly worse
i have no idea why
the other thing that might be a little bit more interesting is the data usage
we spent quite a bit of time, even with the metadata, trying to decide what to do with the ubm and t
but the thing that turned out to work best we didn't try, because we thought it was a dumb idea: just use only the lid data, and only the full cuts, which, i forget exactly, but i think is only three or four thousand cuts or something
that ought to be nowhere near enough to train a t matrix, we thought, but it won out
so here again there are more numbers splitting things out
the first thing, which was kind of interesting for us: we went and ran this acoustic baseline, what we would have done with previous technology, and we are definitely better with all the stuff we have; i don't know if we're astoundingly better, but we're better
next we split out, with the senone system, the three different kinds of i-vectors, and the first thing is that the phonotactic system by itself is actually better than the acoustic system, which is what we had seen before
a linguist might argue about whether it's really a phonotactic system when it looks at the counts of frame posteriors, but that aside, i think it's the best performing phonotactic system that's out there for lid right now
and then you see also that the joint i-vector does give a noticeable gain over the acoustic
so that's good, and the fusion still works; let me just move on
so then, in conclusion: we were able to get pretty good performance in this evaluation with a small team and a relatively straightforward system
we think that there is still value in the senone count system; it doesn't have to be just bottlenecks, and we were able to show that
we think that the phonotactic and the joint i-vectors, the joint i-vector especially, are a nice simple way to capture that information, and that's one of the things that enables the senone system to be competitive
we think it is helpful to use a really simple fusion if you have this discriminatively trained classifier to start with
and we find that data augmentation can be a very valuable thing for the management of limited data
thank you
we have time for some questions
thank you for the talk; you propose tools able to model the counts; do you use the same tools, the classical gaussian tools, for those too?
yes, we always use the same gaussian classifier, no matter what kind of i-vectors
but the distribution is not gaussian?
no, the intention is that the i-vector should still live in a gaussian space; that's why we like this kind of subspace
there are other count subspace algorithms, like latent dirichlet allocation or non-negative matrix factorization (i think others have compared some of those, for example), where the subspace is in the linear probability space, and i don't think that would be well modeled by a gaussian; in fact i know it wouldn't be, i'm pretty comfortable with that, 'cause it's positive
but by going into the log space i think it really does become amenable to lda and those standard tools
i very much liked the additional processing that you're doing to augment the data; you had things like sample rate perturbation, speech coders, noise versions; if you had to go back again, which ones do you think actually would help?
i think you mean which augmentations; there is a table in the paper; many of them are helpful, but the speech coder is the most helpful on its own
so with the sample rate conversion, did you use really big variations?
we did things like plus or minus ten percent and plus or minus five percent, but i wouldn't say that's big
was there maybe a big difference between the cts and the broadcast news data, which would typically be downsampled?
we didn't break them apart
did you try other nonlinearities than just the p-norm?
we have since, a little bit
it seems like for this particular task the sigmoids that some other people use are a little bit better; i'm not sure we think that's a universal statement
excuse me, the sigmoids are better for training the bottlenecks; i think for the senones maybe not
so we have looked a little bit; there is more to explore
so if there are no more questions, we'll assume everybody here knows everything about language recognition
so, the same speaker again