okay this is going to be a somewhat technical talk, on the problem of uncertainty modeling in text-dependent speaker recognition
one of the issues that i'm concerned with here is what to do in a speaker recognition context where you have very little data, so that the features you extract are necessarily going to be noisy in the statistical sense
okay, this comes straight to the fore in text-dependent speaker recognition, where you may have just two seconds of data. it's also an important problem in text-independent speaker recognition, because of the need to set a uniform threshold even in cases where your test utterances are of variable duration. it will be interesting to see what happens with that particular problem in the forthcoming nist evaluation
some progress has been made with subspace methods: with i-vectors you try to quantify the statistical noise in the i-vector extraction process and feed that into the plda model. but i've taken that possibility off the table for present purposes, and said: look, subspace methods in general are not going to work in text-dependent speaker recognition because of the data distribution
okay, so what i attempted to do was to tackle this problem of modeling speaker variability in a situation where one is not able to characterize it by subspace methods
i realised while preparing the presentation that the paper in the proceedings is very dense and rather difficult to read. but the idea, although it's a bit tricky, is fairly simple, so i have made an effort in the slides to communicate the core idea, and if you are interested i recommend that you look at the slides rather than the paper. i'll post the slides on my web page
so for this task we took rsr2015 part 3, that's the random-digits portion of the rsr data i just mentioned. let me just mention two things about this. because of the design, we have five random digits at test time, whereas all ten digits are repeated three times at enrollment time in random order, so you only see half of the digits at test time. and it actually turns out that under those conditions gmm methods have an advantage, because you can use all of your enrollment data no matter what the test utterance is, whereas if you pre-segment the data into digits you are constraining yourself to using only the enrollment data that corresponds to the digits that actually occur in the test utterance
one other thing i should mention is that this paper is about the back end. we used a standard sixty-dimensional plp front end, which is maybe not ideal. i think this will come up in the next talk: you can get much better results on female speakers if you use a low-dimensional front end, which i think the authors of the next talk were the first to discover
so the model that i was using here is one which uses low-dimensional hidden variables to characterize channel effects, but which does not attempt to characterize speakers using subspace methods. instead there is a z-factor that characterizes speakers, and this z-vector is used as a feature vector for speaker recognition
and the problem i wanted to address was to design a back end that would take account of the fact that the number of observations available to estimate the components of this vector is very small. in general you have about one frame per mixture component: if you have a two-second utterance and a ubm with five hundred and twelve gaussians, you can do the calculation and see that you have extremely sparse data (two seconds at a typical frame rate of one hundred frames per second gives only about two hundred frames)
so there are two back ends that i'll present. one, the joint density back end, uses point estimates of the features that are extracted at enrollment time and at test time, and models the correlation between the two in order to construct a likelihood ratio. the innovation in this paper, the hidden supervector back end, treats those two feature vectors as hidden variables, as in the original formulation of jfa, so the key ingredient is to supply a prior distribution on the correlations between those hidden variables
how much time do i have left? sorry, i should have checked. how much? okay, good
okay, so let me just digress for a minute. the way uncertainty modeling is usually tackled in text-independent speaker recognition is that you try to characterize the uncertainty in a point estimate of an i-vector using a posterior covariance matrix, which is calculated using the zero-order statistics, and you do this on the enrollment side and on the test side independently. if you think about this, you realise that it isn't quite the right way to do it
okay, the reason is that if you are hypothesizing a target trial, then what you see on the test side has to be highly correlated with what you see on the enrollment side; they are not statistically independent. and there has to be a benefit from using those correlations to quantify the uncertainty in the feature that comes out of the test utterance. there is something called the law of total variance that says that, on average, when you condition one random variable on another, you reduce the variance
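For reference, this is the decomposition being appealed to here, in standard notation (not the paper's):

\[
\operatorname{Var}(Y) \;=\; \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big] \;+\; \operatorname{Var}\big(\mathbb{E}[Y \mid X]\big) \;\;\ge\;\; \mathbb{E}\big[\operatorname{Var}(Y \mid X)\big],
\]

so conditioning the test-side variable on the enrollment-side one can only reduce the expected uncertainty.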
okay, so the critical thing that i introduce in this paper is this correlation between the enrollment side and the test side
okay, so here are the mechanics of how the joint density back end works; it's pretty straightforward. the features are treated as point estimates. this was inspired by sandro cumani's work at the last odyssey: he implemented it at the level of i-vectors, and there's nothing to stop you from doing it at the level of supervectors as well. you obviously can't train correlation matrices of supervector dimension, but you can implement the idea at the level of individual mixture components. so that gives you a trainable back end for text-dependent speaker recognition even if you can't use subspace methods. that's our best back end, and it's the one that i used as a benchmark for these experiments
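To make the mechanics concrete, here is a minimal sketch of a per-component joint density score under my reading of the setup; the function names and the scipy-based scoring are my own, not the paper's code.

```python
# A per-mixture-component joint density score: enrollment and test point
# estimates are stacked and scored under a joint Gaussian whose cross-covariance
# encodes the target-trial correlation; the non-target model zeroes that block.
import numpy as np
from scipy.stats import multivariate_normal

def joint_density_llr(z_enrol, z_test, mu, C_target):
    """z_enrol, z_test: (F,) point estimates for one mixture component.
    mu: (2F,) joint mean; C_target: (2F, 2F) joint covariance trained on target trials."""
    F = len(z_enrol)
    x = np.concatenate([z_enrol, z_test])
    # Non-target hypothesis: same marginals, but enrollment and test independent.
    C_nontarget = C_target.copy()
    C_nontarget[:F, F:] = 0.0
    C_nontarget[F:, :F] = 0.0
    return (multivariate_normal.logpdf(x, mu, C_target)
            - multivariate_normal.logpdf(x, mu, C_nontarget))

# The score for a trial would then be the sum of these per-component LLRs.
```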
so the supervector back end is the hidden version of this. it says: you are in a position to observe baum-welch statistics, but you are not in a position to observe the z-factors; you have to make inferences about the posterior distribution of those features and base your likelihood ratio on that calculation
now it turns out that the probability calculations are formally, mathematically equivalent to calculations with an i-vector extractor that has just two gaussians in it. you pick a mixture component of the ubm; say you observe that mixture component once on the enrollment side and once on the test side, so you have two hidden gaussians. you have a variable number of observations on the enrollment side and a variable number of observations on the test side, and that's exactly the type of situation that we model with an i-vector extractor. so there is an i-vector extractor here, but it's only being used to do probability calculations; it is not going to be used to extract features
now, one thing about this i-vector extractor is that you're not going to use it to impose subspace constraints, because it only has the two gaussians. you don't need to say that those two gaussians lie in a low-dimensional subspace of the supervector space. so you might as well just take the total variability matrix to be the identity matrix and shift all of the burden of modeling the data onto the prior distribution
in i-vector modeling we always take a standard normal prior, zero mean and identity covariance matrix. that's because there is in fact nothing in general to be gained by using a non-standard prior: you can always compensate for a non-standard prior by fiddling with the total variability matrix. here we take the total variability matrix to be the identity, so instead you have to train the prior
and that involves doing... well, you can do the posterior calculations: if you look at those formulas, you see that they look just like the standard ones, except that i now have a mean and a precision matrix, which would be zero and the identity matrix in the case of the standard normal prior
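As an illustration, here is a minimal sketch of that posterior calculation for one mixture component, in my own notation: a general Gaussian prior on the stacked enrollment/test vector, reducing to the standard formulas when mu0 = 0 and P0 = I.

```python
# Gaussian posterior for the stacked hidden vector [z_enrol; z_test] of one
# mixture component, given a trained prior N(mu0, P0^{-1}) and centered
# Baum-Welch statistics.
import numpy as np

def hidden_posterior(mu0, P0, N_e, f_e, N_t, f_t, S_inv):
    """mu0: (2F,) prior mean; P0: (2F, 2F) prior precision.
    N_e, N_t: zero-order stats (scalars); f_e, f_t: (F,) centered first-order stats.
    S_inv: (F, F) inverse of the UBM covariance for this component."""
    F = len(f_e)
    # Precision contributed by the observations (block-diagonal).
    P_obs = np.zeros((2 * F, 2 * F))
    P_obs[:F, :F] = N_e * S_inv
    P_obs[F:, F:] = N_t * S_inv
    P_post = P0 + P_obs
    rhs = P0 @ mu0 + np.concatenate([S_inv @ f_e, S_inv @ f_t])
    mu_post = np.linalg.solve(P_post, rhs)
    return mu_post, np.linalg.inv(P_post)
```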
and you can do minimum divergence estimation, which is in effect a way of training the prior. if you think about the way minimum divergence estimation works, what you are doing is in fact estimating a prior; normally we then say, well, there's no gain in using a non-standard prior, so we standardize the prior and modify the total variability matrix instead. here we just estimate the prior. i put "estimate" in inverted commas, because estimating a prior is not something bayesians are supposed to do, but we do it all the time and it works
so how would you train this? you would have to organise your training data into target trials. for each trial and each mixture component in the ubm, you would have an observation on the enrollment side and an observation on the test side (or multiple observations), so you have baum-welch statistics, and you just implement this minimum divergence estimation procedure. you then get a prior distribution that tells you what correlations to expect between the enrollment data and the test data in the case of a target trial. if you want to handle non-target trials, you just impose a statistical independence assumption: you zero out the correlations
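A minimal sketch of how I understand that training step; the update follows the usual minimum divergence recipe of re-estimating the prior from aggregated posterior moments, and the function names are mine.

```python
# Re-estimate the target-trial prior from the posterior moments of the stacked
# hidden vectors over a set of training target trials; iterate to convergence.
import numpy as np

def minimum_divergence_update(post_means, post_covs):
    """post_means: list of (2F,) posterior means, one per training target trial.
    post_covs: list of (2F, 2F) posterior covariances."""
    mu = np.mean(post_means, axis=0)
    centered = [m - mu for m in post_means]
    Sigma = np.mean([C + np.outer(d, d) for C, d in zip(post_covs, centered)], axis=0)
    return mu, Sigma

def nontarget_prior(mu, Sigma, F):
    """Zero the enrollment/test cross-correlations for the non-target hypothesis."""
    Sigma_non = Sigma.copy()
    Sigma_non[:F, F:] = 0.0
    Sigma_non[F:, :F] = 0.0
    return mu, Sigma_non
```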
okay, so the way you would use this machinery to calculate a likelihood ratio is that, given enrollment data and test data, you calculate the evidence. that is just the likelihood of the data that you get when you integrate out the hidden variables. it's not usually done, but i think everybody who has an implementation of i-vectors should always calculate the evidence, because it is a very good diagnostic that tells you whether your implementation is correct. you have to evaluate an integral, but it's a gaussian integral, and the answer can be expressed in closed form in terms of the baum-welch statistics, as in the paper. so, in order to use this for speaker recognition, you evaluate the evidence in two different ways, one with the prior for target trials and one with the prior for non-target trials; you take the ratio of the two, and that gives you your likelihood ratio for speaker recognition
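To illustrate, here is a minimal sketch of that likelihood ratio for one mixture component, under the linear-Gaussian assumptions above and in my own notation; only the prior-dependent part of the evidence is needed, since the rest cancels in the ratio. This is my own derivation, not the paper's exact formulas.

```python
# Marginally, the centered sample means on the two sides are jointly Gaussian
# with covariance (prior covariance + observation noise), so the evidence ratio
# is a difference of two Gaussian log-densities.
import numpy as np
from scipy.stats import multivariate_normal

def evidence_llr(zbar_e, zbar_t, N_e, N_t, S, mu, Sigma_tar, Sigma_non):
    """zbar_e, zbar_t: (F,) centered sample means (first-order stats / zero-order stats).
    N_e, N_t: zero-order statistics (assumed > 0); S: (F, F) UBM covariance.
    mu, Sigma_tar, Sigma_non: trained prior for the target / non-target hypotheses."""
    x = np.concatenate([zbar_e, zbar_t])
    F = len(zbar_e)
    D = np.zeros((2 * F, 2 * F))          # observation-noise covariance
    D[:F, :F] = S / N_e
    D[F:, F:] = S / N_t
    return (multivariate_normal.logpdf(x, mu, Sigma_tar + D)
            - multivariate_normal.logpdf(x, mu, Sigma_non + D))
```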
so the mechanics of getting this to work depend critically on how you prepare the baum-welch statistics that summarize the enrollment data and the test data. the first thing you need to do concerns the enrollment utterances: each of those is potentially contaminated by channel effects, so you take the raw baum-welch statistics and filter out the channel effects, just using the jfa model. in that way you get a set of synthetic baum-welch statistics which characterizes the speaker: you simply pool the baum-welch statistics together after you have filtered out the channel effects. you do that on the enrollment side, you do the same thing on the test side, and you end up, in any given trial, having to compare one set of baum-welch statistics with another using this hidden supervector back end
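A minimal sketch of that preparation step as I understand it, assuming the usual JFA notation in which U x is the channel offset in the supervector domain; function and variable names are mine.

```python
# Filter channel effects out of raw Baum-Welch statistics and pool the resulting
# "synthetic" statistics across utterances.
import numpy as np

def filter_and_pool(stats, U, x_posteriors):
    """stats: list of (N, F) pairs per utterance, where N is (C,) zero-order stats
    and F is (C, feat_dim) first-order stats; U: (C*feat_dim, R) channel loading
    matrix; x_posteriors: list of (R,) posterior means of the channel factors."""
    C, feat_dim = stats[0][1].shape
    N_pooled = np.zeros(C)
    F_pooled = np.zeros((C, feat_dim))
    for (N, F_raw), x in zip(stats, x_posteriors):
        channel_offset = (U @ x).reshape(C, feat_dim)   # per-component channel shift
        # Remove the channel contribution from the first-order statistics.
        F_clean = F_raw - N[:, None] * channel_offset
        N_pooled += N
        F_pooled += F_clean
    return N_pooled, F_pooled
```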
now here's a new wrinkle that really makes this work. we know that the sort of achilles' heel of jfa models is the gaussian assumption, and the reason we do length normalization in between extracting i-vectors and feeding them to plda is to fix up the gaussian assumptions in plda. we have to do a similar trick here, but the normalization is a bit tricky, because you have to normalize baum-welch statistics; you are not normalizing a vector. obviously the magnitude of the first-order statistics is going to depend on the zero-order statistics, so it is not immediately obvious what the right thing to do is. the recipe that i used comes from going back to the jfa model and looking at how the jfa model is trained: the z-vectors are treated as hidden variables that come with both a point estimate and an uncertainty, a posterior covariance matrix that tells you how much the observations tell you about the underlying hidden vector. and the thing that turns out to be convenient to normalize is the expected norm of that hidden variable: instead of making the norm equal to one, you make the expected norm equal to one
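In my own notation, with \(\hat{z}\) the posterior mean and \(\operatorname{Cov}(z)\) the posterior covariance of the hidden vector, the quantity being normalized decomposes as

\[
\mathbb{E}\,\lVert z \rVert^{2} \;=\; \lVert \hat{z} \rVert^{2} \;+\; \operatorname{tr}\big(\operatorname{Cov}(z)\big),
\]

and the normalization rescales so that this expected squared norm equals one.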
okay, so a curious thing is that the second term on the right-hand side, the trace of the posterior covariance matrix, is actually the dominant term. that can happen because the uncertainty is so large, and there is an experiment in the paper that shows that you had better not neglect that term. as for the role of the relevance factor in the experiments that i report in the paper: as you fiddle with the relevance factor, you are actually fiddling with the relative magnitude of this term, so you do have to sweep over all possible relevance factors in order to get this thing working properly
okay, so here are some results using what i call global z-vectors; that's where we don't bother to pre-segment the data into digits. remember i said at the beginning that there was an advantage on this task to not segmenting, in other words to just ignoring the left-to-right structure that you are given in the problem
so there is a gmm-ubm benchmark, the joint density benchmark, and two versions of the hidden supervector back end, one without length normalization applied to the baum-welch statistics and one with it. you can see that the length normalization really is the key to getting this thing to work. i should mention that there is a further reduction in error rate available on the female side, from eight percent to six percent, which appears to have to do with the front end: we fixed a standard front end for these experiments, but it appears that if you use lower-dimensional feature vectors for female speakers you get better results, and i think that's the explanation
there is actually a fairly big improvement if you go from one hundred and twenty-eight gaussians to five hundred and twelve, even though the uncertainty in the case of five hundred and twelve is necessarily going to be larger. it was this phenomenon that originally motivated us to look at the uncertainty modelling problem
you can also implement this if you pre-segment the data into digits and extract what i call local z-vectors in the paper, and it works in that case as well. there is a trick that we use here, something i call component fusion: you can break the likelihood ratio up into contributions from the individual gaussians and weight them, where the weights are calculated using logistic regression. that helps quite a lot, but it requires that you have a development set in order to choose the fusion weights
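A minimal sketch of that fusion step, assuming per-component LLRs and a labelled development set are available; the use of scikit-learn here is my choice, not the paper's.

```python
# Component fusion: per-Gaussian likelihood-ratio contributions are combined
# with weights learned by (regularized) logistic regression on a dev set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion_weights(dev_llrs, dev_labels, C=1.0):
    """dev_llrs: (n_trials, n_components) per-component LLRs on the dev set.
    dev_labels: (n_trials,) 1 for target trials, 0 for non-target trials."""
    clf = LogisticRegression(C=C)          # C controls the regularization strength
    clf.fit(dev_llrs, dev_labels)
    return clf

def fused_score(clf, trial_llrs):
    """Weighted combination of the per-component LLRs (log-odds of the classifier)."""
    return clf.decision_function(trial_llrs.reshape(1, -1))[0]
```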
in fact you will see in the paper that, with these local z-vectors, although we did obtain an improvement on the evaluation set, it was not as big an improvement as we obtained on the development set. you need data if you are going to use a regularized logistic regression
so we found a way around that. instead of pre-segmenting the data into individual digits, we used a speech recognition system to collect the baum-welch statistics. i mean, in text-independent speaker recognition, if you can use a senone-discriminant neural network to collect baum-welch statistics, then the obvious thing to do in text-dependent speaker recognition, where you know the phonetic transcription, is just to use a speech recognizer to collect the baum-welch statistics. because individual senones are very unlikely to occur more than once in a digit string, you are implicitly imposing a left-to-right structure, but you don't have to do it explicitly. and that works just as well
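For reference, a minimal sketch of the statistics collection itself, using the standard definitions; the frame posteriors could come either from UBM components or from a recognizer's senones, and the code is mine.

```python
# Collect Baum-Welch statistics from frame-level posteriors: zero-order stats
# are the summed posteriors, first-order stats are posterior-weighted sums of
# the mean-centered frames.
import numpy as np

def collect_baum_welch_stats(frames, gammas, means):
    """frames: (T, feat_dim) acoustic features; gammas: (T, C) frame posteriors
    over the C classes; means: (C, feat_dim) class means used for centering."""
    N = gammas.sum(axis=0)                              # (C,) zero-order stats
    F = gammas.T @ frames - N[:, None] * means          # (C, feat_dim) centered first-order stats
    return N, F
```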
okay, so there are some fusion results for the two approaches, with and without paying attention to the left-to-right structure; if you fuse them, you do get better results
okay, so just to summarize. one thing that i didn't dwell on is that this can be implemented very efficiently: you can basically set things up in such a way that the linear algebra that needs to be performed at runtime involves only diagonal matrices. so it's nothing like the i-vector back end that i presented at the last interspeech conference, which was just a trial run for this and wasn't intended to be a realistic solution to the problem; that involved essentially extracting an i-vector per trial, which is not something you would normally do. this, on the other hand, is computationally very reasonable, so it is effective in practice. okay, that's all i have to say. thank you
okay, do we have questions?
you are normalising out the channel effects in the baum-welch statistics; do you also normalize for the phoneme variability there as well?
well, as i said, that's future work that i intend to do something about, but i think it is a problem we should pay attention to. we have done some preliminary work on it, but it isn't something i can report on here. in text-dependent speaker recognition it's not so much of an issue, because the phonetic content is essentially nailed down for you; it really matters in text-independent recognition, where the alignment is going to come from a neural network that's trying to discriminate senones
a second question: when you remove the channel effects, are you using a point estimate of the channel factors?
that's right; that's what the recipe calls for, and i think it can be justified. it's what the jfa recipe calls for: even though the channel variables are treated as hidden variables that have a posterior expectation and a posterior covariance matrix, if you look at the role they play, and if you are merely interested in filtering out the channel effects, it turns out that all you need is the posterior expectation. that is just what the model says
so the model is very simple; that's really all there is to it. okay, let's thank the speaker