This whole session is about compact representations for speaker identification, and the title of the first talk is "A Small Footprint i-Vector Extractor".
So, I repeat, this talk is about a small footprint i-vector extractor, that is, one that requires only a modest amount of memory. The trouble we have, the basic problem, is that the standard algorithms for extracting i-vectors are quadratic, in both their memory and their computational requirements.
What we present is a follow-up on the paper that Ondrej Glembek presented at last year's ICASSP, an approximation by which i-vectors can be extracted with minimal memory overhead. After discussing that work, what I intend to do is to show how that idea can be carried over to exact posterior computations.
The principal motivation for doing this work is that it's well known that you can introduce approximations at run time and suffer only minor degradation in recognition performance.
However, these approximations generally do cause problems if you want to do training. So, the
motivation for aiming at the exact posterior computations was to be able to do training
and, in particular, to be able to do training on a very large scale. Traditionally,
most work with i-vectors has been done with dimensions of four or six hundred. In other areas of pattern recognition, principal components analyzers of much higher dimension have been constructed, so the purpose of the paper was to be able to run experiments with very high-dimensional i-vector extractors.
As it happens, this didn't pay off. But the experiments needed to be done, you
know, in any case.
Okay, so the point of i-vectors, then, is that they provide a compact representation of an utterance, typically a vector of four or eight hundred dimensions, independently of the length of the utterance. So the time dimension is banished altogether, which greatly simplifies the problem. Essentially, it now becomes a traditional biometric pattern recognition problem without the complication introduced by arbitrary durations. So many standard techniques apply, and joint factor analysis becomes vastly simpler; so simple that it now has another name, it's called probabilistic linear discriminant analysis. And, of course, the simplicity of this representation has enabled fruitful research in other areas, like language recognition, and even speaker diarization, for i-vectors can be extracted from speaker turns as short as just one second.
So, the basic idea is that there is an implicit assumption that a given utterance can be represented by a Gaussian mixture model. If that GMM were observable, then the problem of extracting the i-vector would simply be a matter of applying a standard probabilistic principal components analysis to the GMM supervector. So the basic assumption is that the supervectors lie in a low-dimensional subspace; the basis of that space is known as the eigenvoices, and the coordinates of the supervector relative to that basis are the i-vector representation. The idea is that the components of the i-vector should represent high-level aspects of the utterance, which are independent of the phonetic content.
Because all of this apparatus is built on top of the UBM, the UBM can play the role of modelling the phonetic variability in the utterance, and the i-vector should then capture things like speaker characteristics, the room impulse response and the other global aspects of the utterance.
So, the problem that arises is that the GMM supervector is not observable. The way to get around the problem is by thinking of the Baum-Welch statistics, which are typically collected with the universal background model, as summarising a noisy observation of the GMM supervector. From the mathematical point of view, the only difference between this situation and standard probabilistic principal components analysis is that in the standard situation you get to observe every component of the vector exactly once, whereas in this situation you observe different parts of the vector different numbers of times. Other than that, there is nothing mysterious in the derivation.
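To make the setup concrete, here is a minimal numpy sketch of how zero- and first-order Baum-Welch statistics are typically collected with a diagonal-covariance UBM; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def baum_welch_stats(frames, ubm_weights, ubm_means, ubm_covs):
    """Zero- and first-order Baum-Welch statistics of an utterance,
    collected with a diagonal-covariance UBM (illustrative sketch).

    frames:      (T, D) acoustic feature vectors
    ubm_weights: (C,)   mixture weights
    ubm_means:   (C, D) component means
    ubm_covs:    (C, D) diagonal covariances
    """
    # Log-likelihood of every frame under every UBM component.
    log_gauss = -0.5 * (
        np.log(2 * np.pi * ubm_covs).sum(axis=1)                          # (C,)
        + ((frames[:, None, :] - ubm_means) ** 2 / ubm_covs).sum(axis=2)  # (T, C)
    )
    log_post = np.log(ubm_weights) + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                 # frame-level component posteriors

    N = post.sum(axis=0)                    # zero-order statistics, (C,)
    F = post.T @ frames                     # first-order statistics, (C, D)
    return N, F
```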
So, this is the mathematical model: the supervectors are assumed to be confined to a low-dimensional subspace of the supervector space, and the vector y is assumed, in the prior, to have a standard normal distribution. Now, the problem is, given the Baum-Welch statistics, to produce a point estimate of y, and that is the i-vector representation of the utterance. You can also write the model in terms of the individual components of the GMM; the standard assumption is that the covariance matrix here remains unchanged, the same for all utterances.
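The slide equations themselves are not in the transcript; in the notation usually used for this model they would read roughly as follows (m is the UBM mean supervector, T the eigenvoice matrix, with one block T_c per UBM component):

```latex
% Standard i-vector model (reconstruction in the usual notation, not copied
% from the slides): the supervector M lies in the affine subspace spanned by T.
\begin{align}
  M   &= m + T y, \qquad y \sim \mathcal{N}(0, I) \\
  M_c &= m_c + T_c\, y \quad \text{for each UBM component } c,
\end{align}
% with the covariance \Sigma_c of component c held fixed across utterances.
```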
Attempting to make that covariance utterance-dependent seems to lead to insuperable problems in practice; nobody, to my knowledge, has ever made any progress with that problem.
One aspect that is common to most implementations is that some of the parameters, namely the mean vectors and the covariance matrices, are simply copied from the UBM into the probabilistic model. That actually leads to a slight improvement in performance; I'll report some results later.
The main advantage, though, is that you can simplify the implementation by performing an affine transformation of the parameters, which enables you to take the mean vectors to be zero and the covariance matrices to be the identity, and that enables you to handle UBMs with full covariance matrices in the simplest way.
It's well known that using full covariance matrices does help.
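As a rough illustration of the kind of affine transformation being described, here is a sketch that whitens the statistics and the eigenvoice blocks with the UBM covariances, so that the means become zero and the covariances the identity; it assumes full-covariance components and per-component eigenvoice blocks, and none of the names are from the paper.

```python
import numpy as np

def whiten_model(ubm_means, ubm_covs, T_blocks, N, F_stats):
    """Affine change of variables that sends the UBM means to zero and the
    covariances to the identity (illustrative sketch, not the paper's code).

    ubm_means: (C, D) component means,  ubm_covs: (C, D, D) full covariances
    T_blocks:  (C, D, R) per-component blocks of the eigenvoice matrix
    N, F_stats: zero-order (C,) and first-order (C, D) Baum-Welch statistics
    """
    C = ubm_means.shape[0]
    T_w = np.empty_like(T_blocks)
    F_w = np.empty_like(F_stats)
    for c in range(C):
        L = np.linalg.cholesky(ubm_covs[c])     # Sigma_c = L L^T
        # Centre the first-order statistics and whiten them ...
        F_w[c] = np.linalg.solve(L, F_stats[c] - N[c] * ubm_means[c])
        # ... and apply the same whitening to the eigenvoice block.
        T_w[c] = np.linalg.solve(L, T_blocks[c])
    return T_w, F_w
```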
So, these are the standard equations for extracting the i-vectors, assuming that the model parameters are known. The problem is accumulating this matrix here, in which those N's are the zero-order statistics that are extracted with the UBM. The standard procedure is to precompute the terms here; these matrices are symmetric, so you only need the upper triangle.
The problem, then, from the memory point of view, is that because these precomputed matrices are quadratic in the i-vector dimension, you have to pay a heavy price in terms of memory. Those are some typical figures for a fairly standard sort of configuration.
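For reference, this is roughly what the standard exact computation looks like in the whitened parametrization, with the symmetric per-component matrices precomputed; it is the storage of those R by R matrices, one per UBM component, that makes the memory footprint quadratic in the i-vector dimension. Again, the code is an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

def extract_ivector_exact(T_w, F_w, N, precomputed=None):
    """Exact i-vector point estimate from whitened Baum-Welch statistics.

    T_w: (C, D, R) whitened eigenvoice blocks
    F_w: (C, D)    whitened, centred first-order statistics
    N:   (C,)      zero-order statistics
    precomputed:   optional (C, R, R) array holding T_c^T T_c; storing these
                   symmetric matrices is what costs memory quadratic in R.
    """
    C, D, R = T_w.shape
    if precomputed is None:
        precomputed = np.einsum('cdr,cds->crs', T_w, T_w)   # T_c^T T_c

    # Posterior precision of y:  I + sum_c N_c T_c^T T_c
    precision = np.eye(R) + np.tensordot(N, precomputed, axes=1)
    # Posterior mean of y (the i-vector):  precision^{-1} T^T F
    rhs = np.einsum('cdr,cd->r', T_w, F_w)
    return np.linalg.solve(precision, rhs)
```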
These are the standard training algorithms; the only point I wanted to make in putting up this equation here is that, both in training and in extracting the i-vector, which was the previous slide, the principal computation is a matter of calculating the posterior distribution of that factor y. So that is the problem: calculating the posterior distribution of y, not just the point estimate.
So, the contribution of this paper is to use a variational Bayes implementation of the probability model in order to solve this particular problem of computing the posterior.
So, the standard approach is to assume that the posterior distribution that you're interested in factorizes; in other words, you impose a statistical independence assumption. Estimating these terms here is carried out by a standard variational Bayes update procedure, which you can find, for instance, in Bishop's book. This notation here means that you take the vector y and calculate an expectation over all of its components except the particular component that you happen to be interested in when you're updating that particular term.
These update rules are guaranteed to increase the variational lower bound, and that's a useful property.
So, this is an iterative method; a single iteration consists of cycling over the components of the i-vector.
This is just to explain that the computation is actually brought down to something very simple. The assumptions are Gaussian, so the factors in the variational factorization are also Gaussian; to obtain them you just evaluate this expression. The point about the memory, then, is that, just as in the full posterior calculation, pre-computing these matrices here enables you to speed up the computation, but now at a constant memory cost: the things you have to pre-compute here are just the diagonal versions of those matrices, and for that the memory overhead is negligible.
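A minimal sketch, under the same whitened parametrization and with hypothetical names, of what the coordinate-wise variational Bayes updates could look like; only the diagonals of the T_c^T T_c matrices are precomputed, and each sweep cycles over the components of y.

```python
import numpy as np

def vb_ivector(T_w, F_w, N, n_iter=3):
    """Variational Bayes i-vector estimate with a diagonal posterior assumption.

    T_w: (C, D, R) whitened eigenvoice blocks
    F_w: (C, D)    whitened, centred first-order statistics
    N:   (C,)      zero-order statistics
    Only diag(T_c^T T_c) is precomputed, so the memory overhead is negligible.
    (Illustrative sketch, not the paper's code.)
    """
    C, D, R = T_w.shape
    diag_TT = np.einsum('cdr,cdr->cr', T_w, T_w)   # diag(T_c^T T_c) per component
    prec = 1.0 + N @ diag_TT                       # diagonal posterior precisions, (R,)

    y = np.zeros(R)
    resid = F_w.copy()                             # F_c - N_c T_c y, with y = 0
    for _ in range(n_iter):
        for r in range(R):                         # cycle over i-vector components
            col = T_w[:, :, r]                     # column r of every block, (C, D)
            # Put component r's own contribution back into the residual ...
            resid += (N * y[r])[:, None] * col
            # ... update y_r from the expectation over the other components ...
            y[r] = np.einsum('cd,cd->', col, resid) / prec[r]
            # ... then remove the updated contribution again.
            resid -= (N * y[r])[:, None] * col
    return y
```

Run on the same statistics, this converges to the same point estimate as the exact computation sketched above; only the posterior covariance is approximated.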
So, this is all based on the assumption that we can treat the posterior covariance matrix as diagonal. It's explained in the paper why, even if that assumption turns out to be wrong, the variational Bayes method is guaranteed to find the point estimate of the i-vector exactly.
See? So, the only error that's introduced here is in the posterior covariance matrix, which is assumed to be diagonal; there is no error in the point estimate of the posterior.
If you're familiar with numerical linear algebra, the mechanics correspond to something known as the Gauss-Seidel method, whose convergence in this case happens to be guaranteed because of the variational Bayes lower bound.
So the method is exact; the only real issue is how efficient it is. That turns out to raise the question of how good the assumption is that the posterior covariance matrix can be treated as diagonal.
There are two points to bear in mind here, in order to show why the assumption is reasonable. The first is that the i-vector model is not uniquely defined: you can perform a rotation of the i-vector coordinates and, provided that you perform a corresponding transformation on the eigenvoices, the model remains unchanged.
The prior on y continues to be the standard normal distribution, so you have freedom in rotating the basis.
The other point, and this was the point that Ondrej Glembek made in his ICASSP paper last year, is that, in general, this is a good approximation to the posterior precision matrix, provided you have sufficient data. Those w's there are just the mixture weights in the universal background model, and N is the total number of frames; in a scenario like the core condition, for example, there are sufficiently many frames that this approximation would be reasonable.
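The approximation being referred to, written out in the usual notation (this is a reconstruction, not copied from the slide), is that when there are enough frames, N_c is close to N w_c for every component, so that

```latex
% Posterior precision of y and its approximation, valid when the utterance is
% long enough that N_c \approx N w_c for every component c:
\begin{equation}
  I + \sum_c N_c\, T_c^{\top} \Sigma_c^{-1} T_c
  \;\approx\;
  I + N \sum_c w_c\, T_c^{\top} \Sigma_c^{-1} T_c ,
  \qquad N = \sum_c N_c .
\end{equation}
```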
If you combine those two things together, you can see that by diagonalizing this sum here, which you form just once using the mixture weights, you will produce a basis of the i-vector space with respect to which all the posterior precision matrices are approximately diagonal. That's the justification for the diagonal assumption: you have to use a preferred basis in order to do the calculations.
And using this basis guarantees that the variational Bayes algorithm will converge very quickly. Typically, three iterations are enough, three iterations independently of the rank, the dimensionality, of the i-vector.
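As an illustration, here is a sketch of how such a preferred basis could be computed once, offline, by diagonalizing the weighted sum and rotating the eigenvoice blocks accordingly; the names are hypothetical.

```python
import numpy as np

def preferred_basis(T_w, ubm_weights):
    """Rotate the i-vector coordinates so that, for typical utterances, the
    posterior precision matrices are approximately diagonal.
    (Illustrative sketch, not the paper's code.)
    """
    # Weighted sum over components, formed a single time from the UBM weights.
    W = np.einsum('c,cdr,cds->rs', ubm_weights, T_w, T_w)   # sum_c w_c T_c^T T_c
    eigvals, Q = np.linalg.eigh(W)                          # W = Q diag(eigvals) Q^T
    # Absorb the rotation into the eigenvoice blocks: in the new coordinates the
    # weighted sum is exactly diagonal, so per-utterance posterior precisions
    # are approximately diagonal whenever N_c is close to N w_c.
    return np.einsum('cdr,rs->cds', T_w, Q)
```

An i-vector computed in the rotated basis is simply Q transposed times the i-vector in the original coordinates, so the model itself is unchanged.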
And that's the basis of my contention that this algorithm's computational requirements are linear in the i-vector dimension. So the memory overhead is negligible, and the computation scales linearly rather than quadratically.
If you're using this in training, the preferred basis is going to change from one iteration to the next, so you should not overlook that.
Whenever you have a variational Bayes method, you have a variational lower bound, which is very similar to an auxiliary function and which, like an auxiliary function, is guaranteed to increase on successive iterations of your algorithm. So it's useful to be able to evaluate this, and the formula is given in the paper. It's guaranteed to increase on successive iterations of variational Bayes.
So, it can be used for debugging. In principle, it can also be used to monitor convergence, but it turns out that the overhead of using it for that purpose slows down the algorithm, so it's not used for that in practice. The point, I think, is that it can be used to monitor convergence when you are training an i-vector extractor with variational Bayes.
The point here is that the exact evidence, which is the thing you would normally use to monitor convergence, cannot be used in this particular case: if you're assuming that the posterior is diagonal, then you have to modify the calculation.
Okay, so here are a few examples of questions that I dealt with in the paper. One is: how accurate is the variational Bayes algorithm? To be clear, there is no issue at run time; you are guaranteed to get the exact point estimate of your i-vector, provided you run enough iterations. The only issue is the approximation in which you treat the posterior precision, or covariance, matrix as diagonal.
And those posterior precisions do enter into the training formulas, so it's conceivable that using the diagonal assumption on the posterior precisions could affect the way training behaves. So that's one point that needed to be checked.
As I mentioned at the beginning, this is well known, but I think it needed to be tested: if you make the simplifying transformation which allows you to take the mean vectors to be zero and the covariance matrices to be the identity, you are copying some parameters from the UBM into the probabilistic model for i-vectors. There is a question as to whether that is the right thing to do; obviously there is a plausible reason for it.
How efficient is variational Bayes? Obviously, there's going to be some price to be paid. In the standard implementation you can reduce the computational burden at the cost of several gigabytes of memory; with variational Bayes you no longer have the opportunity of using all that memory. So there is a question about efficiency.
And finally, there is the issue of training very high-dimensional i-vector extractors. You cannot train very high-dimensional i-vector extractors exactly using the standard approach; the variational Bayes approach does enable you to do it, but the question is whether there is any impairment in doing that.
Okay, so the testbed was the female det two trials, all telephone speech, of the extended core condition of the NIST two thousand and ten evaluation. The extended core condition has millions of trials, a very much larger number of trials than the original evaluation protocol.
I used the standard front end and a standard UBM with diagonal covariance matrices, trained on the usual data. In other respects the classifier was quite standard; I used heavy-tailed PLDA.
These are results obtained with the JFA executables, which is the way i-vectors were originally built; they are just there as a benchmark. There was a problem with the voice activity detection, which explains why the error rates were a little higher than expected.
With variational Bayes I actually got marginally better results, which turned out to be due to a more effective treatment of the covariance matrices: copying the covariance matrices from the UBM actually helps, and the effect is to reduce the other estimated variances.
I need to get to efficiency. My figure for extracting a four hundred-dimensional i-vector is typically about half a second. Almost all of that time is spent in BLAS routines: accumulating the posterior covariance matrices seems to take seventy-five percent of the time, an estimated quarter of a second, which suggests that compiler optimization may not be much help, since everything is going on inside the BLAS.
For the variational Bayes method, I've got an estimate of
point nine seconds instead of point five.
I've fixed the number of iterations at five.
The variational Bayes method really comes into its own when you work with higher-dimensional i-vector extractors. This is the last table: I did try several dimensions, up to sixteen hundred, and got good convergence behaviour and performance.
Okay, thank you.
Well, it depends on the dimensionality of the i-vector extractor. A couple of gigabytes; it's as big as a large vocabulary continuous speech recognizer. It's not as intelligent, but it's as big.
Just the eigenvoices, okay, together with the stuff you have to pre-compute in order to extract the i-vectors efficiently. Yeah, that's what requires it.
You still have to store the eigenvoices but, something you might not expect, that's not the big part. The big part is the bunch of triangular matrices that we store in order to calculate i-vectors efficiently using the standard approach. The point of this work was to use variational Bayes to avoid that computation.
I'm afraid that we have just used up the time for questions here, so I guess that Patrick will get lots of questions offline, and maybe at the end of the next talk he can answer them together with Sandro, who is giving that talk.