So, good afternoon and thank you, Patrick.
Well, I am Carlos Vaquero from Agnitio, from Spain, and I'm presenting our work on dataset shift in PLDA-based speaker verification, which is actually an analysis of several techniques that can be used in PLDA systems to mitigate the effect of dataset shift, but also an analysis of the limitations that PLDA systems have when dealing with dataset shift.
So, dataset shift is the mismatch that may appear between the joint distributions of inputs
and outputs
for training and testing.
Okay? In general, we have three types of dataset shift. The first one is covariate shift, which appears when the distribution of the inputs differs from training to testing; it's the most usual type of dataset shift, since it is related to channel variability, session variability or language mismatch. But there are also other types of dataset shift: for example, prior probability shift, which is related to variations in the operating point, or concept shift, which is related to adversarial environments, which in speaker verification would be spoofing attempts.
In this work we're focusing on covariate shift.
Covariate shift has been widely studied in speaker verification. We know that there are several techniques developed to compensate for channel or session variability or language mismatch, but most of these techniques work under the assumption that large datasets are available for training. The thing is: what happens in real situations, where we face a completely new and unknown condition and we don't have data to train these approaches? For example, here we have some results.
We are considering a JFA system that has to face condition one of NIST SRE zero eight, which is interview-interview, and we don't use any microphone data for training the channel. So we can see that JFA, when not using microphone data, is not much better than classical MAP, which doesn't use any compensation at all. But once we have the microphone data, we get a huge improvement.
So the thing is: what can we do in real scenarios that are unknown and unseen? Well, if we don't have any data, it's hard to do anything, but usually we can expect that some small amount of matched data is provided. So there is something we could do: we can define some probabilistic framework so that it is possible to perform an adaptation, even of an already trained model. Given mismatched development data and some matched data, we can adapt the model parameters so the system can work as soon as possible in this new scenario.
But to do this in a natural way, and to derive it easily, we would expect the speaker verification system to be a monolithic system that provides a single probabilistic framework to compute the likelihood of the model parameters given the data. Well, the first approaches, up to JFA, were monolithic, so they provided a framework in which it could be possible to define ways to adapt these parameters, given a small amount of data.
But current state-of-the-art PLDA systems are modular, so we have several model levels. We start with the first level, the UBM, which we train separately, and it provides sufficient statistics. We use those to train the i-vector extractor, a total variability subspace, and then we obtain i-vectors and use them to train the PLDA model, but we use them as features: the PLDA model has no knowledge of how these features were structured, just the prior distribution they have.
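As a rough illustration of this modularity, here is a toy sketch with random stand-ins for the trained models; the shapes are hypothetical and the least-squares projection is a simplification of the real i-vector posterior estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the model levels: UBM -> sufficient statistics ->
# i-vector extractor (total variability subspace) -> PLDA features.
T = rng.standard_normal((400, 60))   # total variability matrix (stats dim 400 -> 60-dim i-vector)

def extract_ivector(stats, T):
    """Project sufficient statistics onto the total variability subspace
    (simplified least-squares point estimate, not the full posterior)."""
    w, *_ = np.linalg.lstsq(T, stats, rcond=None)
    return w

stats = rng.standard_normal(400)     # centered first-order statistics from the UBM
ivec = extract_ivector(stats, T)
# The PLDA model downstream sees only `ivec` as a feature vector; it has
# no knowledge of the UBM or the extractor that produced it.
print(ivec.shape)                    # (60,)
```

The point of the sketch is the interface: each stage only consumes the output of the previous one, which is why changing a lower level invalidates everything above it.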
So this model has its advantages, because it's very easy to keep improving it: we can fix the UBM and work on the total variability matrix, which is fast to train, so we can try many things and improve it. And once the i-vector extractor is fixed, we can work a lot, and very quickly, on the PLDA model, and keep improving it.
But in terms of adapting this model to new situations, it has some problems. Either we work at the highest model level, which is PLDA, and adapt the PLDA parameters to face the new situation, or, if we want to work at lower model levels, we will need to retrain the whole system. For example, if we have adapted the UBM, our i-vector extractor is not valid anymore, so we will need to retrain it on the whole data. And this is not feasible in many applications, for example an application that you want to learn online as you get more data in a new situation: you would need to have all the development data every time you adapt the UBM, and it would take a long time to adapt it for even a small set of recordings. So that's not feasible in many applications.
Well, in any case, there are several techniques that have been done that we can apply in a PLDA system. The first thing we could do is adapt the UBM and then the subsequent model levels, but we will need to retrain the whole system. We can do it by pooling all the available data, the development data and the matched data, or we could do it with a weighting of the datasets. But this will not be feasible in many applications.
So, we can also work in the i-vector extractor. One thing that has been done is to train a new total variability matrix on the matched data and stack it with the original total variability matrix. Well, this approach has been shown to work, but usually you need quite a large amount of data to train the matched total variability matrix, and it also requires retraining the PLDA model, so it has some problems. And we can also work in the PLDA model. Here, what we are proposing is to simply use the length normalization, but with some sort of i-vector adaptation by centering, using the i-vector mean from the matched dataset.
Here I should also reference the study done by Jesus, which is another approach that could be used to compensate for covariate shift. So, these techniques have their problems, but they always work in the PLDA model, so the UBM and the i-vector extractor are not modified.
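The adaptation by centering described here can be sketched as follows, with hypothetical arrays standing in for real i-vectors: subtract the mean of the small matched set, then length-normalize:

```python
import numpy as np

def length_normalize(ivectors):
    """Project each i-vector onto the hypersphere of unit radius."""
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / norms

def adapt_by_centering(ivectors, matched_ivectors):
    """Re-center i-vectors with the mean of the small matched dataset,
    then length-normalize, so the shifted population lands closer to the
    region the PLDA model was trained on."""
    mu = matched_ivectors.mean(axis=0)
    return length_normalize(ivectors - mu)

rng = np.random.default_rng(0)
matched = rng.standard_normal((50, 400)) + 3.0   # small matched set from a shifted population
test = rng.standard_normal((10, 400)) + 3.0      # test i-vectors from the same shifted population
adapted = adapt_by_centering(test, matched)
print(np.allclose(np.linalg.norm(adapted, axis=1), 1.0))  # True: unit radius
```

Only the mean needs to be estimated from matched data, which is why this works with far fewer recordings than retraining a total variability matrix.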
To test these techniques, what we do is simulate covariate shift via language mismatch. So we assume that our system has been trained completely on English data, and we will evaluate it on mismatched groups of languages: we will consider Chinese, Hindi-Urdu and Russian. As development data we will use the NIST data from zero four to zero six, the Switchboard data and the Fisher data.
Here we have the number of sessions and speakers that we have for each language: for Chinese we have quite a large amount of data, while for Hindi-Urdu, for example, we don't have much development data. We will evaluate these approaches on the NIST SRE zero eight telephone-telephone condition, and we will consider all-to-all trials. Here we have the number of models and speakers for each language.
As the speaker verification system we will consider an i-vector PLDA system: a gender-dependent i-vector extractor of dimension four hundred, and then a gender-dependent PLDA, which is a mixture of two PLDA models, one trained with male data and one trained with female data, with full covariance matrix for the noise component and a speaker subspace of dimension one hundred and twenty. And the results are analyzed in terms of EER and minDCF.
So the first thing we do is analyze the effect of covariate shift in the data, and what we have done is analyze the i-vectors we have for the different languages. We have computed the Mahalanobis distance between the population of English i-vectors and the population of each other language's i-vectors, and we have seen that these distances are very large. This means that when we perform the i-vector length normalization on a language which is different from English, we project it onto a small region of the hypersphere of unit radius. So the distribution will not be as expected: all the i-vectors will be concentrated in a small region of the hypersphere.
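This population-level analysis can be sketched like this, with toy Gaussian samples standing in for the real English and shifted-language i-vector populations:

```python
import numpy as np

def mahalanobis_between(pop_a, pop_b):
    """Mahalanobis distance between the means of two i-vector populations,
    using the covariance of the reference population pop_a."""
    mu_a, mu_b = pop_a.mean(axis=0), pop_b.mean(axis=0)
    cov = np.cov(pop_a, rowvar=False)
    diff = mu_b - mu_a
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

rng = np.random.default_rng(0)
english = rng.standard_normal((500, 20))          # toy "English" i-vectors
chinese = rng.standard_normal((200, 20)) + 0.8    # toy shifted population
# A matched subset of the reference sits close; the shifted population sits far.
print(mahalanobis_between(english, english[:250])
      < mahalanobis_between(english, chinese))    # True
```

A large distance here means the shifted population, after length normalization, occupies a small off-center region of the unit hypersphere, which is the effect described in the talk.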
So this will have an effect on the accuracy, and not only because of the distribution of the i-vectors, since we may also be missing information in the UBM. But in the end, we see that it has an effect on the accuracy of the system, as we can see in this table, where only English data has been used for development: the other languages get worse results than English. It is true that we don't know the accuracy that we would get for these languages if we had enough data to train a complete evaluation system with them. But there is also no reason to believe that these languages are harder for a speaker verification system than English. So we could expect to get an accuracy which is somehow similar to English, maybe better, maybe worse, but somehow similar.
Well, here we are comparing the minDCF obtained with the proposed techniques for the three groups of languages. The first column for each language is the baseline, using only English development data. The second column is stacking the total variability matrices. The third is using i-vector adaptation, and the fourth is using s-norm. The last three columns are combinations of these techniques. What we can see is that most of these techniques work, in the sense that they improve the results of the system, but the improvement is quite small: if we wanted to reach an accuracy close to English, we are still too far.
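As a reference for the s-norm column, here is a minimal sketch of symmetric score normalization; the cohort score arrays are hypothetical stand-ins for scores computed against a matched cohort:

```python
import numpy as np

def snorm(raw_score, enroll_vs_cohort, test_vs_cohort):
    """Symmetric score normalization: average the z-normalized score
    against the enrollment-side and test-side cohort score distributions."""
    z_e = (raw_score - enroll_vs_cohort.mean()) / enroll_vs_cohort.std()
    z_t = (raw_score - test_vs_cohort.mean()) / test_vs_cohort.std()
    return 0.5 * (z_e + z_t)

rng = np.random.default_rng(0)
cohort_e = rng.normal(2.0, 1.0, 200)   # enrollment model scored against a matched cohort
cohort_t = rng.normal(2.5, 1.0, 200)   # test segment scored against a matched cohort
print(snorm(5.0, cohort_e, cohort_t))
```

Because the cohort comes from the matched data, s-norm also absorbs part of the score shift that covariate shift introduces, which is why it appears among the compensation techniques.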
So, this can also be seen in these DET curves, where we are representing the DET curves obtained for Chinese. We have the DET curve which uses only English data for development; the blue curve uses matched training data to perform i-vector adaptation; and the black curve uses matched Chinese data to perform i-vector adaptation and s-norm. We see that we get a slight improvement, but we are still too far from English, that is, from the results we would like to get.
There is also another important effect that the presence of covariate shift introduces: we will find a misalignment in the score distributions. This is something that is widely known, and you can see this effect here in the example we have, where we have represented the English and Chinese score distributions. We can see that the Chinese score distribution is shifted to the right, towards higher scores, which is probably also related to the fact that the i-vectors are concentrated in a small region. So, if we have a little amount of matched data, it's mandatory to use it for calibration. This is something that everybody knows and we have been doing: in all NIST evals, we always calibrate each condition separately, and we also use techniques with side information for calibration, where we add the language. But it's important to keep this in mind, because if we only have a little amount of data, and we need to use an independent part of the data for calibration, we will not have much data left for adaptation.
So, here we are representing the DCF for our languages: the actual DCF when we use English data for calibration, in red, and the actual DCF when we use matched data. It's clear that it's mandatory to use matched data for calibration.
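Matched-data calibration of the kind discussed here is typically an affine transform of the scores. A minimal sketch, fitting it by simple gradient descent on hypothetical matched target and non-target scores (in practice a dedicated tool like the BOSARIS toolkit would be used):

```python
import numpy as np

def train_linear_calibration(scores, labels, lr=0.1, iters=2000):
    """Fit s' = a*s + b by logistic regression so that calibrated scores
    behave like log-likelihood ratios on the matched condition."""
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))   # posterior of "target"
        grad = p - labels                              # logistic loss gradient
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b

rng = np.random.default_rng(0)
tgt = rng.normal(4.0, 1.0, 100)        # hypothetical matched target scores
non = rng.normal(0.0, 1.0, 400)        # hypothetical matched non-target scores
scores = np.concatenate([tgt, non])
labels = np.concatenate([np.ones(100), np.zeros(400)])
a, b = train_linear_calibration(scores, labels)
print(a > 0)   # True: calibration preserves the score ordering
```

Note that the trials used here must be held out from the matched data, which is exactly the trade-off the talk points out: every trial spent on calibration is one fewer available for adaptation.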
So, as conclusions of this work, we would say that dataset shift is usual in speaker recognition. There are many techniques developed to compensate for it, but most of them need a large amount of data to work properly, while in many real cases little data is provided. If we had monolithic systems, they would enable us to perform some sort of adaptation, but state-of-the-art techniques tend towards modularity, since development is much easier when we have a modular system like PLDA. There are techniques that can work with these modular systems, but they obtain only a slight increase in accuracy; there is still a huge gap to close. And finally, it's important to keep in mind that matched data is mandatory for calibration, so if we have a small amount of matched data for adaptation, we will need to use part of this data for calibration. So, that's all, thank you very much.
You mean, in this work?
You mean this work or in the literature?
I'm not sure, but you can see that, for example, your i-vectors don't match your expected prior distribution, or even at lower levels your statistics, or your MFCCs. But yes.
But it would be interesting. I think the problem is that, if you want to have a compensation like that, it would be interesting to have at some point a JFA or maybe an eigenchannel-based system that is described as a probabilistic framework that you could adapt, and define some technique for it. It would be interesting to do it.
So you mean using a smaller-dimensional i-vector extractor?
okay
But in any case, if you adapt your i-vector extractor, you will need to retrain your PLDA system.
Yeah, yeah. Have you tried to remove the means of the specific channel conditions? For example, for microphone data, to remove the telephone mean from the telephone data and the microphone mean from the microphone data?
No, I haven't tried that. Sounds risky. It may work, but it is like assuming that there is no rotation in the i-vectors, only a shift; if there is rotation, it will not work. I don't know. It would be interesting to try.
I've tried that and it was helping.
It was helping? Okay, that's interesting.
Okay.
Well, especially in those languages where I don't have much matched data. Yeah, that might be... I think in most languages it's pretty balanced, but there are some languages... I remember that, for example, Hindi-Urdu had, in total, seven speakers. So that was... it is quite unbalanced, but maybe we have female speakers. Well, okay.
Well, not for Chinese, for example; it depends on the language. But I would say that i-vector adaptation is the one that works, so it always gives some improvement. It's not much, but yes.
The matched data.
So these techniques try to use the matched data, but did they improve the accuracy of the system?
Not much. I don't think the improvement was significant, if there was improvement at all. Maybe there were some losses.
So you mean that if I get my model speakers from English, it would also help to perform some of these techniques to adapt them? Okay. Okay.
I see that sometimes you can't do anything without the data, because there are certain sources of variability... you need to see that variability in the first place. A general comment to all of us.
Yeah, okay. Well, in fact, there are techniques that provide more... the results presented in the last talk of this session are based on integrating out the PLDA parameters, so as to model the uncertainty of these parameters, so it should be more robust to dataset shift. But the point here is: if you have some amount of data, it's better to use it. But you're right.
You are completely right, of course.