Would you just bear with me for a couple of minutes while I set out some background, and then I will try to explain in some detail what the technical problem is that we're trying to solve.
So for the JFA model, I have formulated it here in terms of GMM mean vectors, or supervectors. The first term is the mean vector that comes from the universal background model. The second term involves a hidden variable x which is independent of the mixture component and is intended to model the channel effects across recordings. And the third term in that formulation has a local hidden variable to characterize the speaker-phrase variability within a particular mixture component.
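For reference, here is a schematic version of that formulation (the notation is mine, not necessarily the slide's):

\[
  M_c \;=\; m_c \;+\; U_c\,x \;+\; D_c\,z_c, \qquad c = 1,\dots,C,
\]

where \(m_c\) is the UBM mean of mixture component \(c\), \(x\) is shared across components and models the channel effects, and \(z_c\) is the local hidden variable for component \(c\), with \(D_c\) diagonal.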
The typical approach would be to estimate the factor loading matrix U using the maximum likelihood criterion, which is exactly the criterion that is used to train an i-vector extractor. In practice, rather than use maximum likelihood, you usually end up using relevance MAP as an empirical estimate of the matrix D. The relation between the two you can find explained in a paper by Robbie Vogt going back to 2008.
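Roughly speaking, as I recall that line of work, relevance MAP with relevance factor \(r\) behaves like the speaker term of JFA with a diagonal D satisfying

\[
  D^2 \;=\; \tfrac{1}{r}\,\Sigma,
\]

where \(\Sigma\) is the UBM covariance; see Vogt's paper for the precise statement.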
The point I want to stress here is that the z vector is high dimensional; we're not trying to explain the speaker-phrase variability by a low-dimensional vector of hidden variables. It's a factorial prior in the sense that the explanations for the different mixture components are statistically independent, which really is a weakness: with a prior like this we're not actually in a position to exploit the correlations between mixture components.
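Schematically, the factorial prior factorizes over mixture components,

\[
  p(z) \;=\; \prod_{c=1}^{C} \mathcal{N}(z_c \mid 0,\, I),
\]

so nothing in the prior ties the displacements \(D_c z_c\) of different components together.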
To do calculations with this type of model, the standard method is an algorithm by Robbie Vogt which alternates between updating the two hidden variables x and z. It wasn't presented this way originally, but it's actually a variational Bayes algorithm, which means that it comes with variational lower bounds that you can use for likelihood or evidence calculations.
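As a rough illustration, and not the author's actual implementation, here is a minimal numpy sketch of this kind of alternating update, assuming diagonal UBM covariances and using the current point estimate of the other hidden variable at each step; the argument names (N, F, m, Sigma, U, d) are mine.

```python
import numpy as np

def alternating_updates(N, F, m, Sigma, U, d, n_iter=5):
    """Alternate updates of the channel factors x and the local variables z.

    N     : (CF,) zeroth-order stats, each component's count repeated across features
    F     : (CF,) first-order stats
    m     : (CF,) UBM mean supervector
    Sigma : (CF,) diagonal UBM covariances, flattened
    U     : (CF, R) channel loading matrix
    d     : (CF,) diagonal of the D matrix
    """
    Fc = F - N * m                       # centre first-order stats on the UBM means
    x = np.zeros(U.shape[1])
    z = np.zeros_like(d)
    for _ in range(n_iter):
        # x given the current z: posterior precision I + U' Sigma^-1 N U
        Lx = np.eye(U.shape[1]) + U.T @ (U * (N / Sigma)[:, None])
        x = np.linalg.solve(Lx, U.T @ ((Fc - N * d * z) / Sigma))
        # z given the current x: D is diagonal, so this decouples per dimension
        Lz = 1.0 + d * d * N / Sigma
        z = (d / Sigma) * (Fc - N * (U @ x)) / Lz
    return x, z
```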
That means that you can, for example, formulate the speaker recognition problem in exactly the same way as it's done in PLDA, as a Bayesian model selection problem. The question, if you're given enrollment utterances and test utterances and you want to account for the whole ensemble of data, is whether you are better off positing a single z vector, or two z vectors, one for the enrollment data and one for the test data.
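Schematically, the verification score is then a difference of variational lower bounds on the evidence under the two hypotheses, something like

\[
  \text{score} \;\approx\; \mathcal{L}(\mathcal{E} \cup \mathcal{T} \mid \text{one } z) \;-\; \big[\,\mathcal{L}(\mathcal{E}) + \mathcal{L}(\mathcal{T})\,\big],
\]

with \(\mathcal{E}\) the enrollment data and \(\mathcal{T}\) the test data.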
There's something basically unsatisfactory about this, namely that it doesn't take account of the fact that what JFA is, is a model for how the UBM moves under speaker and channel effects. Traditionally, when we do these calculations, we use the universal background model to collect the Baum-Welch statistics, and we ignore the fact that, according to our model, the UBM actually shifts as a result of these hidden variables.
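For concreteness, here is a minimal sketch of the traditional statistics collection against a fixed diagonal-covariance GMM; the function and argument names are mine.

```python
import numpy as np

def baum_welch_stats(X, weights, means, variances):
    """Zeroth- and first-order Baum-Welch statistics of frames X under a diagonal GMM.

    X         : (T, F) frames
    weights   : (C,)   mixture weights
    means     : (C, F) component means
    variances : (C, F) diagonal covariances
    """
    # log N(x_t | mu_c, Sigma_c) for every frame/component pair, shape (T, C)
    log_det = np.sum(np.log(2 * np.pi * variances), axis=1)            # (C,)
    diff = X[:, None, :] - means[None, :, :]                           # (T, C, F)
    log_like = -0.5 * (log_det + np.sum(diff**2 / variances, axis=2))  # (T, C)
    log_post = np.log(weights) + log_like
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)   # normalise
    gamma = np.exp(log_post)                                           # responsibilities
    N = gamma.sum(axis=0)                                              # zeroth-order stats (C,)
    F_stats = gamma.T @ X                                              # first-order stats (C, F)
    return N, F_stats
```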
There is an important paper by Zhao and Dong that attempts to remedy this, and I was particularly interested in looking into it for the reason that I mentioned at the beginning: I believe that the UBM does have to be adapted in text-dependent speaker recognition, and this is a principled way of doing that. It introduces an extra set of hidden variables, indicators which show how the frames are aligned with mixture components, and that can be interleaved into the variational Bayes updates in Vogt's algorithm, so that you get a coherent framework for handling the adaptation problem.
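A rough sketch of that kind of scheme, reusing the hypothetical helpers above (baum_welch_stats, alternating_updates); the point is only the structure of the loop, in which the frame-to-component alignment is recomputed against the adapted GMM instead of being fixed once and for all by the UBM.

```python
import numpy as np

def vb_with_ubm_adaptation(X, weights, means, variances, U, d, n_iter=5):
    C, Fdim = means.shape
    x = np.zeros(U.shape[1])
    z = np.zeros(C * Fdim)
    for _ in range(n_iter):
        # shift the GMM means with the current estimates of the hidden variables
        shift = (U @ x + d * z).reshape(C, Fdim)
        N, F_stats = baum_welch_stats(X, weights, means + shift, variances)
        # update x and z from statistics aligned with the adapted model
        Ncf = np.repeat(N, Fdim)
        x, z = alternating_updates(Ncf, F_stats.ravel(), means.ravel(),
                                   variances.ravel(), U, d, n_iter=1)
    return x, z
```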
There's just one caveat that I think is worth pointing out about this algorithm: it requires that you take account of all of the hidden variables when you're doing UBM adaptation and the evidence calculations. Of course, that's what you should do if the model is to be believed; if you take the model at face value, you should take account of all of the hidden variables. However, what's going on here is that this factorial prior is actually so weak that doing things by the book does lead you into problems. That's why I've flagged this here as a kind of warning.
In the paper I presented results on the RSR data using three types of classifier that come out of these calculations. The first one is simply to use the z vectors, which can come either from Vogt's calculation or from Zhao and Dong's calculation, as features which, if they're extracted properly, should be purged of channel effects, and then just feed those into a simple backend like a cosine distance classifier.
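The backend really is that simple; a minimal sketch of cosine scoring, with names of my choosing:

```python
import numpy as np

def cosine_score(z_enroll, z_test):
    """Cosine similarity between an enrollment z vector and a test z vector."""
    return float(z_enroll @ z_test /
                 (np.linalg.norm(z_enroll) * np.linalg.norm(z_test)))
```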
JFA as it was originally construed was intended not only to be a feature extractor but also to act as a classifier in its own right; that accounts for the other two types. However, in order to understand this problem of UBM adaptation, it's necessary also to look into what's going on with those Bayesian model selection algorithms: what happens when you apply them without UBM adaptation, as in Vogt's algorithm, or with UBM adaptation, as in Zhao and Dong's, and also to compare them with the likelihood ratio calculation that was traditional around 2008.
It turns out, when you look into these questions, that there's a whole bunch of anomalies that arise. If you're using JFA as a feature extractor, UBM adaptation hurts. This is true for the z vectors; it's not true for i-vectors, and it's not true for speaker factors. Those behave reasonably, but not the z vectors. That's in this year's ICASSP paper.
on the other hand
if you look at the problem of maximum likelihood estimation
all the jfa model parameters maximum likelihood so
what you find is that it doesn't work at all
without ubm adaptation you do need
ubm adaptation order to get that to behave
sensibly
If you look at Bayesian model selection, you find that there are some cases where Zhao and Dong's algorithm works better than Vogt's, and other cases where exactly the opposite happens. The traditional JFA likelihood ratio is actually very simplistic: it just uses plug-in estimates rather than attempting to integrate over hidden variables, and no UBM adaptation at all. What I will show in this paper is that it can be made to work very well with very careful UBM adaptation.
So this business of UBM adaptation turns out to be very tricky, and anyone who has been around this field long enough has probably been bitten by this problem at some stage. In my own experience, I couldn't get JFA working at all until I stopped doing UBM adaptation. But that doesn't really make a lot of sense, because if you look at the history of subspace methods, eigenvoices and eigenchannels, they were originally implemented with UBM adaptation. If you speak to people in speech recognition, they will be surprised if you tell them that you're not doing UBM adaptation; it is essential, for instance, in subspace Gaussian mixture models.
So here are just some examples of the anomalous results that arise. These are the Bayesian model selection results: on the left-hand side with five hundred and twelve Gaussians in the UBM, on the right-hand side with sixty-four. In the case of the small UBM, Zhao and Dong's algorithm gives you a small improvement; it doesn't help with five hundred and twelve Gaussians.
Here are the results with a third line added: the first two lines are the same as in the last slide, Bayesian model selection with and without UBM adaptation, and the third line is the traditional JFA likelihood ratio.
So this, then, is what the paper is about, and what I want to show starts from the traditional JFA likelihood ratio. Maybe I should just recall briefly how that goes. You have a numerator and a denominator. In the numerator, you plug in the target speaker's supervector, you use that to center the Baum-Welch statistics, and you integrate over the channel factors.
In the denominator, you plug in the UBM supervector and you do exactly the same calculation, and you compare those two probabilities.
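Schematically, in my notation, with \(s\) the plug-in estimate of the target speaker's supervector and \(m\) the UBM supervector,

\[
  \mathrm{LR} \;=\; \frac{\displaystyle\int P(\mathcal{X}_{\text{test}} \mid s + Ux)\,\mathcal{N}(x \mid 0, I)\,dx}
                         {\displaystyle\int P(\mathcal{X}_{\text{test}} \mid m + Ux)\,\mathcal{N}(x \mid 0, I)\,dx}.
\]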
There is no UBM adaptation going on at all, and a plug-in estimate, which is not serious in the numerator, but in the denominator it really is problematic, because theory says you should be integrating over the entire speaker population rather than plugging in the mean value, the value that comes from the UBM supervector.
What I will show is that if you do the adaptation very carefully, adapting the UBM to some of the hidden variables but not all of them, then everything will work properly. That is, as long as you're using JFA as a classifier, calculating likelihood ratios. However, if you're using it as a feature extractor, and this turns out to give the best results, it turns out that you're better off avoiding UBM adaptation altogether. I'll give you an explanation for this: it has to do with the fact that the factorial prior is too weak, and the phenomenon is tied to factorial priors, not to subspace priors.
Really, for this problem, the first type of adaptation that you want to consider addresses the lexical mismatch between your enrollment and test utterances on the one hand, and the UBM, which might have been trained on some other data, on the other. In the JFA likelihood ratio, you're actually comparing the target speaker with the UBM speaker.
But if you consider what's going on here: if you have known lexical content in the trial, that is the thing which will most determine what the data looks like, not the UBM. You would be much better off comparing with a phrase-adapted background model than with the universal background model. So if you simply adapt the UBM to the lexical content of the phrase that is used in a particular trial, that will lead to a substantial improvement in performance.
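A minimal sketch of that kind of phrase adaptation, standard relevance MAP of the means from Baum-Welch statistics; the function and argument names are mine.

```python
import numpy as np

def map_adapt_means(N, F_stats, ubm_means, relevance=16.0):
    """Relevance-MAP adaptation of GMM means from Baum-Welch statistics.

    N        : (C,)   zeroth-order stats over the adaptation data
               (e.g. all training utterances of one phrase)
    F_stats  : (C, F) first-order stats over the same data
    ubm_means: (C, F) UBM means (the prior)
    relevance: the relevance factor r
    """
    alpha = (N / (N + relevance))[:, None]               # per-component weight
    ml_means = F_stats / np.maximum(N, 1e-10)[:, None]   # data-only estimate
    return alpha * ml_means + (1.0 - alpha) * ubm_means  # interpolate with the prior
```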
What's going on here is that in the RSR data there are thirty different phrases; the mean supervector of JFA is adapted to each of the phrases, but all of the other parameters are shared across phrases.
If you adapt to the channel effects in the test data, this will work fine. With these remarks I'm referring back to the early history of eigenchannel modeling: there are two alternative ways of going about that, you can combine the two together and you will get a slight further improvement, and there's no problem there. If you adapt to the speaker effects in the enrollment data, it will also work fine.
What I mean here is that you collect the Baum-Welch statistics from the test utterance with a GMM that has been adapted to the target speaker, and you get an improvement. If you perform multiple iterations of MAP adaptation to the lexical content, things work even better.
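A sketch of iterated MAP, reusing the hypothetical baum_welch_stats and map_adapt_means above, where X_phrase, weights, ubm_means and variances are assumed to be defined: realign the frames with the adapted model and re-adapt, a few times.

```python
adapted_means = ubm_means
for _ in range(3):
    # realign the adaptation frames with the current phrase-adapted model
    N, F_stats = baum_welch_stats(X_phrase, weights, adapted_means, variances)
    # re-adapt the means, keeping the UBM as the prior
    adapted_means = map_adapt_means(N, F_stats, ubm_means, relevance=16.0)
```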
So at this stage, if you look through those lines, you see that we've already got a forty percent improvement in error rates, just through doing the UBM adaptation carefully. This slide, unfortunately, we're going to have to skip because of the time constraints; it's interesting, but I just don't have time to deal with it.
Here are results with five hundred and twelve Gaussians. It turns out that doing careful adaptation with the UBM and sixty-four Gaussians can achieve about the same performance as working with five hundred and twelve Gaussians and no adaptation. If you try adaptation with five hundred and twelve Gaussians, things will not behave so well; this is a rather extreme case where you have many more Gaussians than you actually have frames in your test utterances. The remaining two lines present results that are obtained with z vectors as features rather than with likelihood ratio computations. The difference between the two is that NAP is used in one case but not the other.
The point there is that you don't need NAP, because you've already suppressed the channel effects in extracting the z vectors. And these are results on the full RSR test set, comparing the z vector classifier using Vogt's algorithm, that's to say no UBM adaptation, and Zhao and Dong's algorithm with UBM adaptation. You can see that you're better off using Vogt's algorithm; I'll explain that in a minute, it will only take a second.
So these are the conclusions. You can adapt to everything in sight and it will work, but there is one thing you should not do, and that is adapt to the speaker effects in the test utterance.
The reason for that, I believe, is the following. The factorial prior is extremely weak. If you have a single test utterance and you're doing UBM adaptation, then you're allowing the different mean vectors in the GMM to be displaced in statistically independent ways. That gives you an awful lot of freedom to align the data with the Gaussians, too much freedom.
See what happens if you have multiple enrollment utterances, which is normally the case in text-dependent speaker recognition: you still have a very weak prior, but you have a strong extra constraint. Across the enrollment utterances, the Gaussians cannot move in statistically independent ways; they have to move in lockstep. And that means that the adaptation algorithm will behave sensibly.
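Schematically, under the model sketched earlier, a z vector shared across R enrollment utterances sees their pooled statistics,

\[
  \mathrm{Cov}(z_c \mid \mathcal{E})^{-1} \;=\; I \;+\; D_c^{\top}\,\Sigma_c^{-1}\Big(\textstyle\sum_{r=1}^{R} N_c^{(r)}\Big) D_c,
\]

and the resulting shift \(D_c z_c\) is the same in every enrollment utterance; that is the lockstep constraint.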
If you do adaptation to the channel effects in the test utterance, things will also behave sensibly, and the reason for that is the subspace prior: channel effects are assumed to be confined to a low-dimensional subspace, and that imposes a strong constraint on the way the Gaussians can move.
So, final slide. If you're using JFA as a feature extractor, which is my recommendation, then the upshot of all this is that in the case of the test utterance, when you extract the feature vector, you cannot use UBM adaptation. And if you cannot use it in extracting a feature from the test utterance, you cannot use it in extracting a feature from the enrollment utterance either, because otherwise the features would not be comparable. So in other words, you have to use Vogt's algorithm rather than Zhao and Dong's.
Adaptation of the UBM to the lexical content still works very well; it gives a fifty percent error rate reduction compared with the ICASSP paper. There's a follow-on paper at Interspeech which shows how this idea of adaptation to phrases can be extended to give a simple procedure for domain adaptation, so you can train JFA on, say, text-independent data and use it on, say, a text-dependent task domain. And finally, these z vectors, at least on the RSR data, are very good features; there is no residual channel variability left to model in the backend. Okay, thank you.