All right — this talk is about multiclass discriminative training of i-vector language recognition. I'm Alan McCree, from Johns Hopkins University. I'd like to acknowledge some interesting discussions during this work with my current colleague Daniel, my previous colleagues Doug, Elliot, and Pedro, and more recently with Niko.
So, as an introduction: you all know — I think we had a discussion about this this morning — that language ID using i-vectors is the state-of-the-art approach. What I want to talk about is a particular aspect of how it is typically done, as a two-stage process: even after we've got the i-vectors, first we build a classifier, and then we separately build a backend which does the calibration, and perhaps fusion as well. I want to talk about two aspects that are a little different from that. First, what if we try to have one system that does the discrimination — the classification — and the calibration at once, using discriminative training? Nobody ever said we have to use two systems back to back; why not do it all together? And then, secondly, I want to talk about an open-set extension to what is usually a closed-set language recognition task.
So in the talk I will start with a description of the Gaussian model in the i-vector space. It's something many of you have seen before, but I need to cover some particular aspects of it to get into the details here. I'll also talk about how it relates to the open-set case; there I'll go into some of the Bayesian machinery that we use in speaker recognition, and how it could or couldn't be relevant in language recognition — what the differences are. Then I will talk about the two key things here: the discriminative training that I'm using, which is based on MMI, and how I build the out-of-set model.
So, as a signal processing guy, I like to think of this as an additive Gaussian noise model — in signal processing this is one of the most basic things we see. In this context, the idea is that the observed i-vector was generated from a language, so it should look like that language's mean vector, but it's corrupted by additive Gaussian noise, which we typically call a channel, for lack of a better word. From a pattern recognition point of view, this model has an unknown mean for each of our classes and a Gaussian channel that looks the same for all of the classes. That means our classifier is a shared-covariance Gaussian model: each language model is described by its mean, and the shared covariance is the channel, or within-class, covariance.
So to build a language recognition system we then need a training process and a scoring process. Training means we need to learn this shared within-class covariance, and then for each language we need to learn what its mean looks like. Testing is then just Gaussian scoring.
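The training and scoring just described can be sketched as follows — a minimal illustration of the shared-covariance Gaussian classifier, not the talk's actual code; the function names are my own:

```python
import numpy as np

def train(ivectors, labels):
    """ML training: per-language means plus one shared within-class covariance.

    ivectors: (N, D) array; labels: (N,) ints in 0..L-1.
    """
    classes = np.unique(labels)
    means = np.stack([ivectors[labels == c].mean(axis=0) for c in classes])
    centered = ivectors - means[labels]          # remove each class mean
    W = centered.T @ centered / len(ivectors)    # shared "channel" covariance
    return means, W

def log_likelihoods(x, means, W):
    """Per-language Gaussian log-likelihoods of one test i-vector x."""
    Winv = np.linalg.inv(W)
    _, logdet = np.linalg.slogdet(W)
    diffs = means - x                            # (L, D)
    mahal = np.einsum('ld,de,le->l', diffs, Winv, diffs)
    D = len(x)
    return -0.5 * (mahal + logdet + D * np.log(2 * np.pi))
```

Because the covariance is shared across languages, the resulting scores are linear in the i-vector up to a per-class constant.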
And, I guess unlike some people in this room, I am not particularly uncomfortable with closed-set detection. It gives you a sort of funny-looking form of Bayes' rule: if the target is this class, the numerator is just the likelihood of this class — that's easy. But the non-target hypothesis means it's one of the other classes, so you need some implicit prior over the distribution of the other classes, which for the LRE design is a flat prior, given that it is not the target.
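The closed-set detection rule just described can be written down directly: the target score is the class log-likelihood, and the non-target score is a flat-prior average over the remaining classes. A hedged sketch (my own helper, taking per-language log-likelihoods as input):

```python
import numpy as np

def detection_llrs(loglikes):
    """loglikes: (L,) per-language log-likelihoods -> (L,) detection LLRs."""
    L = len(loglikes)
    llrs = np.empty(L)
    for l in range(L):
        others = np.delete(loglikes, l)
        # log of the flat-prior mixture over the L-1 non-target languages
        m = others.max()
        log_nontarget = m + np.log(np.exp(others - m).sum()) - np.log(L - 1)
        llrs[l] = loglikes[l] - log_nontarget
    return llrs
```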
So the key question for building a language model is how to estimate the mean. Estimating the mean of a Gaussian is not one of the most complicated things in statistics, but there are of course multiple ways to do it. The simplest thing is just to take the sample mean — maximum likelihood — and that's mainly what I'm going to end up using in this work. But I want to emphasize that there are other things you could do; in speaker recognition we do not do that, we do something more complicated.

The next, more sophisticated, option is MAP adaptation, which we all know from GMM-UBMs and Doug's work. You can do that in this context as well; it's a very simple formula, but it requires a second covariance matrix, which we can call the across-class covariance: the prior distribution of what all models could look like — in this case, the distribution that the means are drawn from.

And then finally, instead of taking a point estimate, you can go to a fully Bayesian approach, where you don't actually estimate the mean for each class; you estimate the posterior distribution of the mean of each class given the training data for that class. In that case, you keep that posterior distribution, and you score with what's called the predictive distribution, which is a bigger, fatter Gaussian: it includes the within-class covariance, but also an additional term, which is the uncertainty about how much you still don't know about that particular class.
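The Bayesian version just described can be sketched with the standard conjugate-Gaussian formulas — a minimal illustration under the two-covariance model (W within-class, B across-class, m0 the prior mean), with my own function names:

```python
import numpy as np

def posterior_of_mean(xbar, n, m0, W, B):
    """Posterior of a class mean given n examples with sample mean xbar."""
    Winv, Binv = np.linalg.inv(W), np.linalg.inv(B)
    post_cov = np.linalg.inv(Binv + n * Winv)            # shrinks as n grows
    post_mean = post_cov @ (Binv @ m0 + n * Winv @ xbar) # shrunk toward m0
    return post_mean, post_cov

def predictive_cov(W, post_cov):
    # The "fatter Gaussian": channel covariance plus remaining mean uncertainty.
    return W + post_cov
```

With n = 0 this degenerates to the prior (the across-class Gaussian itself); with many examples per class — the usual language-recognition situation — the posterior covariance shrinks and scoring nearly reduces to the ML case, as noted on the next slide.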
One little trick that I only learned recently — I wish I had learned it a lot sooner; it was developed many years ago, and I have a reference in the paper — is really handy for all these kinds of systems. Everybody knows you can diagonalize one covariance matrix: you can apply a linear transform to the data such that the covariance becomes the identity. A lesser-known fact is that you can do it for two at once, and since we have two covariances, this is really helpful. I have the formulas in the paper; it's actually not very hard. You end up with a linear transform where the within-class covariance is the identity — which we're often used to; WCCN, for example, accomplishes that — but the across-class covariance is also diagonal, sorted so that the most important dimensions come first. And it's one global linear transformation. It means you can do linear discriminant analysis — you can do dimension reduction easily in this space just by keeping the most interesting dimensions, the first ones. It's also a reminder to be a little careful when you say you do LDA in your system, because there are a number of ways to formulate LDA; they all give the same subspace, but they don't give the same transformation within that subspace, because that's not part of the criterion. This transform likewise gives the same subspace, but not the same linear transformation.
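The joint-diagonalization trick can be sketched in a few lines — this follows the standard two-step construction (whiten one covariance, then rotate to diagonalize the other), which may differ in detail from the paper's formulas:

```python
import numpy as np

def joint_diagonalize(W, B):
    """One transform T with T @ W @ T.T = I and T @ B @ T.T diagonal, sorted."""
    # 1) Whiten the within-class covariance W
    evals, evecs = np.linalg.eigh(W)
    whiten = evecs / np.sqrt(evals)          # a W^(-1/2)-style transform
    # 2) Rotate to diagonalize B in the whitened space (rotation keeps W = I)
    Bw = whiten.T @ B @ whiten
    d, V = np.linalg.eigh(Bw)
    order = np.argsort(d)[::-1]              # largest across-class variance first
    return (whiten @ V[:, order]).T
```

Dimension reduction then amounts to keeping the first rows of T, since the dimensions are sorted by across-class variance.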
So I'm going to show some experiments here; I'll start with some simple ones before I get to the discriminative training. We're using acoustic i-vectors — I think it was mentioned here already: the main thing for a LID system is that you need shifted delta cepstra and you need vocal tract length normalization, which you might not do for speaker recognition. I'm going to present LRE11 results, because it's the most recent LRE, but as I kind of hinted, I'm not going to use pair detection, because I'm not a big fan of pair detection. So the metric is Cavg over the target languages, but you get similar performance rankings with pair detection as well. And within LRE you build your own training set; these are the Lincoln training data sets that are commonly used.
So, for the generative Gaussian models, I mentioned that you can do ML and these other things — ML, MAP, and Bayesian. I have a nice slide here with three things, but it's actually not those three things, so you have to pay attention while I describe what each one is. With ML — what I'm doing here is there is no backend; there's just Bayes' rule, with the formula I showed you, applied to the generative Gaussian model. And these numbers, for people who do LREs, are not very good numbers, but this is what happens straight out of the generative model. What I'm showing is Cavg and min Cavg — Cavg means you made hard decisions on the detection. So the ML system is the baseline. This one is the Bayesian system, where you make the Bayesian estimate of the mean; in the end you don't actually have the same covariance for every class, because the classes have different counts, and that gives different predictive uncertainty. But in fact the two are very similar, because in language recognition you have many instances per class, so it almost degenerates to the same thing. The reason I didn't show MAP is that it's in between those two, and there's not much space in between, so it's not very interesting.
This last one is kind of interesting in that it's not right, but it actually works better — from a calibration standpoint, that is: it works better in Bayes' rule. What I've done here is what we typically do in speaker recognition, where instead of the correct MAP you pretend there's only one cut, instead of keeping the correct count of the number of cuts. In the predictive distribution that gives you a greater uncertainty, and a wider covariance, and it so happens that this actually works a little better in this case. But once you put a backend into the system, which is what everybody usually shows, these differences really disappear. So I'm going to use ML systems for the rest of the discriminative training work.
As I said, these numbers are not very good; they're about three times as bad as the state of the art. What's usually done is an additionally trained backend. The simplest one, I think, is the scalar multiclass calibration that was described before — that's logistic regression. You can do a full logistic regression with a matrix instead of a scalar; you can put a Gaussian backend in front of the logistic regression, which is something we tried; or you can use a discriminatively trained Gaussian as the backend, which is something we were doing at Lincoln for quite a while. These systems all work much better, and pretty similarly to each other. You can also build the classifier itself to be discriminative: one of the more common things to do is an SVM, one-versus-rest. That still doesn't solve the final task, but it can help — and if you do one-versus-rest logistic regression you still need a backend. Or, as Niko has been doing recently, you can do multiclass training of the classifier itself followed by a multiclass backend. But what I want to talk about is doing everything together: one training of the multiclass system, so that it won't need its own separate backend and is ready to apply Bayes' rule straight out.
MMI is not commonly used for backends, but in our field it is a very common thing from the GMM world, in speech recognition. The criterion, if you're not familiar with it, is another name for cross-entropy, which is the same metric that logistic regression uses — a multiclass "are your probabilities correct" kind of metric — and it is a closed-set discriminative training of the classes against each other.

The update equations, if you haven't seen them, are kind of cool, and they're kind of different — it's a little bit of a weird derivation compared to the gradient descent everybody's used to. It can be interpreted as a gradient descent with a kind of magical step size, but it's quite effective. And the way it's always done in speech recognition is: since you're applying this to a Gaussian system, you start with an ML version of the Gaussian and then discriminatively update it, so to speak. That makes the convergence much easier. It gives a natural regularization, because you're starting from something that is already a reasonable solution — in fact, the simplest form of regularization is just to not let it run very long, which is also a lot cheaper. And it gives you something to tie back to: you can put in a penalty function that says don't be too different from the ML solution. So regularization is a straightforward thing to do in MMI. The diagonal covariance transformation I was talking about is really helpful here, because then we only discriminatively update these diagonal covariances instead of full covariances. So we have fewer parameters than a full-matrix logistic regression, but more parameters than the scalar logistic regression.
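As a concrete, hypothetical sketch of the kind of update just described — this is one extended-Baum-Welch-style MMI update of the language means, not the exact recipe from the paper; `loglike_fn` and the smoothing constant `D` are my own names:

```python
import numpy as np

def mmi_update_means(X, labels, means, loglike_fn, D=10.0):
    """One MMI (EBW-style) update of per-language means.

    X: (N, d) i-vectors; labels: (N,) true classes;
    loglike_fn(x) -> (L,) per-class log-likelihoods of the current model.
    """
    L, d = means.shape
    num_g = np.zeros(L); num_x = np.zeros((L, d))
    den_g = np.zeros(L); den_x = np.zeros((L, d))
    for x, l in zip(X, labels):
        ll = loglike_fn(x)
        post = np.exp(ll - ll.max())
        post /= post.sum()                       # closed-set class posteriors
        num_g[l] += 1.0;  num_x[l] += x          # numerator: true label
        den_g += post;    den_x += np.outer(post, x)  # denominator: posteriors
    # EBW update: move each mean along (num - den) stats, damped by D,
    # which acts like the "magical step size" mentioned above
    return (num_x - den_x + D * means) / (num_g - den_g + D)[:, None]
```

Starting from the ML means and running only a few such iterations is itself a simple regularizer, as noted above.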
So now these are pretty much state-of-the-art numbers — remember, the previous numbers would be way up here, essentially. This is the ML Gaussian followed by an MMI Gaussian backend in score space, which was kind of our default way of doing things when I was at Lincoln. This next score is kind of a disappointment: it's what you get if you take the training set, discriminatively train with MMI, and don't use a backend. It is in fact considerably better than the equivalent ML system I started with, but it's nowhere near where we want to be, obviously.

So, why not? One of the quirks of LRE — which I think is more data-dependent than realistic — is that the dev set actually looks different from the training set. This system was trained only on the training set; it's not using any dev set at all. The most obvious difference is that the dev set and the test set are all approximately thirty seconds, while the training set is whatever sizes the conversations happen to be. That's an obvious mismatch, so I took the training set and truncated everything to thirty seconds instead of using the entire conversation. Throwing away data in that way turned out to be very helpful, because it's now a much better match to what the test data looks like. But it wasn't everything I wanted, so then I took the thirty-second training set, concatenated it together with the dev set — which is a thirty-second set — and used the entire set at once for training the system. That in fact works as well as, and slightly better than, the two separate stages of a system followed by a discriminatively trained backend.
So I looked at a number of different permutations of this MMI system; anybody who's done GMM MMI knows you can train this, that, or the other, in various combinations. The simplest thing to do is just the means only, and that is fairly effective. You can train the means and the within-class covariance — and of course in the closed-set system the across-class covariance doesn't come into play; it's only the within-class covariance that matters. One thing I found kind of interesting is, instead of training the entire covariance matrix, to train a scale factor that scales the covariance; that's a little bit simpler system with fewer parameters. And you can also do a sequential system; in particular I found it interesting to do the scale factor first and then the means. In the end these would give the same solution, but when you only do a limited number of iterations, the starting point in the sequence does affect what you get.
So again, these are the same sorts of plots. This is now purely the discriminatively trained classifier itself, with no backend. If you train the means only, your actual Cavg is not terribly good, but your min Cavg is pretty close. That is an indication of calibration trouble: what calibration means in a multiclass detection task is kind of controversial, but one thing I think I can say comfortably is that whenever you see this gap, it means you're not calibrated. The absence of a gap doesn't necessarily mean you are calibrated, because Bayes' rule is more complicated than that, but a gap means it is clearly not calibrated.

Once we do something to the variance — this one trains the mean and the entire variance, this one the mean and the scale factor at the same time, and this one is the two-stage process of the scale factor of the variance followed by the means — all of those work much better. So in order to get calibration you need to actually adjust the covariance matrix, which kind of makes sense: you need a scale factor or something. And once you fine-tune the numbers, as we typically do when we're actually working on these kinds of tasks, you can see that the two-stage process is in fact the best one, and it is better than our old two-step approach of a separate system followed by a backend.
OK, so that's the discriminative training part. The other thing I want to talk about is the out-of-set problem that was mentioned in a question earlier, because oftentimes we're interested in a task where there could be another language that is not one of the closed set. The nice thing about the two-covariance mathematics we've been using for speaker recognition is that it has built into it a model for what out-of-set is supposed to be. I already mentioned that, essentially, if you have a Gaussian distribution of what all models look like, then an out-of-set language is a randomly drawn language from that pool, and that's represented by the across-class Gaussian distribution. At test time you then have an even bigger Gaussian, because the uncertainty is the channel plus which language it was. So the out-of-set class is also a Gaussian, but it has the bigger covariance, while all the others share a covariance which is smaller — so you no longer have a linear system when you make a comparison.

This is the most general formula for an open-set problem, which is both out-of-set and closed-set, and this is how you would combine them: this part is what I had before, the sort of Bayes' rule flat-prior combination of all the other closed-set classes, and this part is the new distribution, the out-of-set distribution. If you want a pure out-of-set problem, which is what I'm going to talk about here, you just take the probability of out-of-set to be one, but in fact you could make a mixed distribution as well.
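The open-set scoring just described can be sketched as follows — a hypothetical illustration of the general formula, with the out-of-set class modeled as the fatter Gaussian N(m0, W + B) and `p_oos` controlling the mixture (p_oos = 1 is the pure out-of-set problem):

```python
import numpy as np

def gauss_loglike(x, m, C):
    """Log-density of a multivariate Gaussian N(m, C) at x."""
    d = len(x); diff = x - m
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (diff @ np.linalg.solve(C, diff) + logdet + d * np.log(2 * np.pi))

def openset_llr(x, l, means, W, m0, B, p_oos=1.0):
    """LLR of language l vs. a mixture of out-of-set and closed-set alternatives."""
    target = gauss_loglike(x, means[l], W)
    denom = gauss_loglike(x, m0, W + B)          # the bigger out-of-set Gaussian
    if p_oos < 1.0:
        # mix in the flat-prior closed-set alternatives
        others = np.array([gauss_loglike(x, means[k], W)
                           for k in range(len(means)) if k != l])
        m = others.max()
        closed = m + np.log(np.exp(others - m).mean())
        denom = np.logaddexp(np.log(p_oos) + denom, np.log1p(-p_oos) + closed)
    return target - denom
```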
OK, so I want to touch on how this relates to what I have now. If I were to use the Bayesian numerator for each class that I mentioned before, together with this denominator, I would have what we like to call Bayesian speaker comparison — John has written a paper about that. It gives the same answer as PLDA, or the two-covariance model, and I'd like to emphasize that: they're set up differently, so the numerator and denominator are different in the two formulations, but the ratio is the same thing, because it's the same model and the same correct answer. I find the formalism I'm talking about here much easier to understand — the philosophy of it. Daniel and I have spent a lot of time on this, and he's come to accept that I can only see it from this perspective. In this terminology we say that we have a model for each class, and the covariances are hyperparameters; in the PLDA terminology you like to say that there is no model per class, and the parameters of the system are the covariance matrices. Again, it's the same system and the same answer from a different perspective; but when we're talking about closed-set and ML models, I know how to say that in this context, and I don't know so well how to say it in the PLDA one.
So, discriminative training of the out-of-set model I described: as I've said, I now have this MMI hammer in my toolbox, and this is just one more covariance that I can train — I've got an across-class mean and covariance. The ML out-of-set system just takes the sample covariance matrices for all of these. But I can do an MMI update of this out-of-set class as well. The simplest way for me to do that is to take the closed-set system that I already presented, freeze the closed-set models, and then separately update the out-of-set model given the closed-set models. I can do that by scoring one-versus-rest instead of scoring with Bayes' rule, doing a round robin on the same training set. The advantage of this is that I can actually build a system without ever having any out-of-set data. I'd probably do better if I really did have out-of-set data, but in this case I don't, and I can still build a perfectly legitimate system.
So, the performance of this system: what I've done here is score this LRE, even though there is no out-of-set data, without Bayes' rule — the system is not allowed to know what the other classes were — as a simulation of an open-set scoring function. For the ML version of this, the actual Cavg is off the chart; it's worse than the bad numbers I started with. With MMI training of the closed-set system, but still the ML version of the across-class covariance, things are in fact already a lot better — so whatever's happening in the closed-set discriminative training is helping the open-set scoring as well. But explicitly retraining the out-of-set covariance matrix with the same mechanism — a scale factor and then the mean — in fact works pretty reasonably, and gives a system which is not obviously uncalibrated, with pretty reasonable performance. The closed-set scoring performance is still down here, but this has gotten a lot better, and it's perfectly feasible.
So, the two contributions here were, first, the single-system concept: we don't have to do system design and then a backend; we can discriminatively train the system to already be calibrated. And second, we can model out-of-set using the same mathematics that we have in speaker recognition, but a simpler version, because we don't need to be Bayesian in this case — and it too can be discriminatively updated, so that we can be reasonably calibrated for the open-set task as well. So, thanks.
Q: Very nice to see that you unified those two parts of the system — I wish we could do that in speaker recognition. My question is about your maximum-likelihood across-class covariance: you've got twenty-four languages to work with in a six-hundred-dimensional i-vector space, so how did you estimate, or assign, that parameter?

A: It is the sample covariance — everything here was done with a dimension reduction at the front, down to twenty-three dimensions.
Sorry, I should have said that earlier: the dimension reduction already specified that there would be twenty-three dimensions, and anything that has a prior is limited to those twenty-three dimensions. And since I just took the sample covariance matrix — if I regularized it somehow, I could make it appear to be bigger, to be full size.
Q: OK, so those formulas you showed with the covariances — that all happens in twenty-three-dimensional space?

A: Yes.
Q: So in this case you're doing LDA, and then — isn't that the same as doing a Gaussian backend and then a calibration?

A: Well — the sample covariances were computed once in the full space, but the across-class covariance is only rank twenty-three, so you take the six-hundred-dimensional within-class covariance and map it down to twenty-three, yes.

Q: So if you do LDA, and then a Gaussian backend in the same subspace as the LDA —

A: Yes, if you do LDA first —

Q: — in twenty-three dimensions, and you get the scores — twenty-four scores — it's almost the same thing, so you're still doing two steps.

A: It's still just two steps in my view: the ML estimation, which in this case forces you to be twenty-three dimensional, and then the update of those equations.

Q: But LDA and a Gaussian backend — there is a similarity; it's very close.

A: Well, the way we would have done a system before would be LDA, then a Gaussian in that space, and then MMI training in the score space, on the likelihood ratios of the first stage. This is MMI training in the i-vector space directly. But these are not very complicated mathematics, so the two are pretty closely related, yes.
Q: When you did the joint diagonalization, you then work with diagonal covariance matrices — but you're also updating the covariance matrices in training. Is that diagonalization still valid then? I mean, you did a static, one-shot projection; what does it mean when you force things to stay diagonal?

A: The entire thing can be mapped back, by undoing the diagonalization, into a full covariance. So in some sense you are still updating a full covariance; you're just updating it in a constrained way. The matrices are still full size, but the number of parameters that you discriminatively update is not the full set.
Q: If I remember correctly, you're doing closed-set with twenty-three or twenty-four languages — is that correct? Twenty-four languages, right. So is it possible — I don't want to change your problem, but if you were to look at a subset, say you pick twelve languages and treat the others as completely open-set data, so you do your training on only a portion — we don't have out-of-set data, but you'd have held-out data. Do you have some sense of how strong your solution would be if you didn't have access to those similar-sounding languages that you want to reject?

A: I think that's an interesting thought: you could test this out-of-set hypothesis more extensively by doing a hold-some-out, round-robin experiment like that. I think it's an interesting idea, but I haven't done it.