0:00:21 | So, good afternoon and thank you, Patrick. |
---|
0:00:25 | Well, I am Carlos Vaqueros from Agnitio, from Spain, and I'm presenting our work on |
---|
0:00:30 | Dataset shift in PLDA-based speaker verification, |
---|
0:00:34 | which is, actually, an analysis on |
---|
0:00:37 | several techniques |
---|
0:00:38 | that can be used in |
---|
0:00:42 | PLDA systems to mitigate the effect of dataset shift. But it is also an analysis of the |
---|
0:00:49 | limitations that PLDA systems have when dealing with dataset shift. |
---|
0:00:56 | So, dataset shift is the mismatch that may appear between the joint distributions of inputs |
---|
0:01:02 | and outputs |
---|
0:01:04 | for training and testing. |
---|
0:01:07 | Okay? In general, we have three types of dataset shift. First one will be covariate |
---|
0:01:12 | shift, which appears |
---|
0:01:18 | when the distribution of the inputs differs from training to testing. It's the most |
---|
0:01:25 | usual type of dataset shift, since it is related to channel |
---|
0:01:32 | variability, session variability or language mismatch. But there are also other types of |
---|
0:01:39 | dataset shift, for example prior probability shift, which is related to variations in the operating |
---|
0:01:46 | point; |
---|
0:01:47 | or concept shift, which is related to adversarial environments, which in speaker verification would be |
---|
0:01:55 | spoofing attempts. |
---|
0:01:57 | In this work we're focusing on covariate shift. |
---|
0:02:03 | Covariate shift has been widely studied in speaker verification. |
---|
0:02:07 | We know that there are several techniques developed to compensate for channel/ session variability or |
---|
0:02:12 | language mismatch. But most of these techniques work under the assumption |
---|
0:02:21 | that large datasets are available for training. |
---|
0:02:27 | The thing is: what happens in real situations, where we face a completely new and |
---|
0:02:31 | unknown situation and we don't have data to train these approaches? For example, here |
---|
0:02:37 | we have some results. |
---|
0:02:40 | We are considering a JFA system |
---|
0:02:46 | facing condition one of NIST SRE 08, which is interview-interview, and we don't |
---|
0:02:52 | use any microphone data, only telephone data, to train the channel subspace. So, we can see that JFA, |
---|
0:03:01 | if not using microphone data, is not much better than classical MAP, which |
---|
0:03:05 | doesn't use any compensation at all. |
---|
0:03:09 | So, once we have the microphone data, we get |
---|
0:03:14 | a huge improvement. |
---|
0:03:15 | So the thing is, what can we do in real scenarios that are unknown |
---|
0:03:20 | and unseen? |
---|
0:03:22 | Well, if we don't have any data, it's hard to do anything, but usually we |
---|
0:03:26 | can expect that some small amount of matched data is provided. So, there |
---|
0:03:34 | is something that we could do. |
---|
0:03:37 | We can define some probabilistic framework, so that it is possible to perform an adaptation, |
---|
0:03:46 | even of a model trained |
---|
0:03:48 | on mismatched development data. Given some matched data, we can adapt |
---|
0:03:54 | the model parameters so the system can work as soon as possible in this new scenario. |
---|
0:04:01 | But to do this in a natural way and |
---|
0:04:09 | to derive it easily, we would expect that the |
---|
0:04:14 | speaker verification system be a monolithic system that provides a single probabilistic framework |
---|
0:04:20 | to compute the likelihood of the model parameters given the data. |
---|
0:04:27 | Well, the first approaches, such as JFA, were monolithic, so they provided a framework in |
---|
0:04:34 | which algorithms worked and which defined ways to adapt these parameters. It would |
---|
0:04:42 | be possible to define ways to adapt these parameters, given a small amount of data. |
---|
0:04:47 | But current state-of-the-art PLDA systems are modular, so we have several model levels. |
---|
0:04:57 | We start with the first level, the UBM; we train the UBM separately |
---|
0:05:03 | and it provides sufficient statistics. We use those to train the i-vector extractor, a total variability |
---|
0:05:09 | subspace, and then we obtain i-vectors and we use them to train the PLDA model. |
---|
0:05:15 | But we use them as features: the PLDA model has no knowledge of how these features |
---|
0:05:22 | were obtained, |
---|
0:05:25 | just the prior distribution they have. |
---|
0:05:29 | So this model has its advantages, because it's very easy to |
---|
0:05:39 | keep improving it: we can fix the UBM and work on the |
---|
0:05:44 | total variability matrix, which is fast to train, so |
---|
0:05:50 | we can try many things and improve it. And once the i-vector extractor is fixed, |
---|
0:05:55 | we can work a lot and very quickly on the PLDA model, and keep improving |
---|
0:05:59 | it. |
---|
0:06:00 | But, in terms of adapting this model to new situations, it has some |
---|
0:06:08 | problems. Either we work at the highest model level, that |
---|
0:06:14 | is PLDA, and we adapt the PLDA parameters to face the new situations, |
---|
0:06:22 | or if we want to work in |
---|
0:06:25 | lower model levels, we will need to retrain |
---|
0:06:29 | the whole system. |
---|
0:06:31 | For example, if we have adapted the UBM, our i-vector extractor is not valid anymore, |
---|
0:06:35 | so we will need to retrain it on the whole data. And this is not |
---|
0:06:38 | feasible in many applications, for example an application that you want to learn online as |
---|
0:06:45 | you get more data in a new situation. You would |
---|
0:06:51 | need to have all the development data every time you adapt the UBM, and |
---|
0:06:56 | it would take a long time to adapt it for even a small set |
---|
0:07:01 | of recordings. So that's not feasible in many applications. |
---|
0:07:14 | Well, in any case, there are several known techniques that we can |
---|
0:07:20 | apply |
---|
0:07:21 | in a PLDA system. The first thing we could do is adapt |
---|
0:07:28 | the UBM and then the subsequent model levels, but we will need to retrain the |
---|
0:07:34 | whole system. |
---|
0:07:35 | We can do it pooling all the available data, the development data and the matched |
---|
0:07:39 | data, or we could do it by weighting the datasets. But this will not be feasible |
---|
0:07:45 | in many applications. |
---|
0:07:47 | So, we can also work on the i-vector extractor. One thing that has been done |
---|
0:07:53 | is |
---|
0:07:55 | to train a new total variability matrix on the |
---|
0:08:01 | matched data |
---|
0:08:03 | and stack it with the original total |
---|
0:08:06 | variability matrix. |
---|
0:08:07 | Well, this approach has been shown to work, but usually you need quite a large amount |
---|
0:08:15 | of data to train the matched total variability matrix. And also, it will require |
---|
0:08:21 | retraining the PLDA model. |
---|
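As an aside, the stacking idea just described can be sketched in a few lines. This is an illustrative sketch with toy dimensions, not the actual system code: each total variability matrix maps the supervector space to a low-dimensional subspace, and stacking simply concatenates the two subspaces, so the resulting i-vectors are longer and the downstream PLDA model must be retrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration: in a real system the supervector
# dimension would be C Gaussians x F features (e.g. 2048 x 60).
supervector_dim = 100
T_original = rng.standard_normal((supervector_dim, 40))  # trained on development data
T_matched = rng.standard_normal((supervector_dim, 10))   # trained on the small matched set

# Stacking concatenates the column spaces: i-vectors now have 40 + 10
# dimensions, which is why the PLDA model has to be retrained on them.
T_stacked = np.hstack([T_original, T_matched])
print(T_stacked.shape)  # (100, 50)
```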
0:08:23 | So it will have some problems. We can also work |
---|
0:08:29 | on the |
---|
0:08:33 | PLDA model. Here, what we are proposing is simply to use length normalization, |
---|
0:08:41 | but performing |
---|
0:08:43 | some sort of i-vector adaptation by centering, |
---|
0:08:49 | using the i-vector mean from the matched dataset. |
---|
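A minimal sketch of this centering-plus-length-normalization step, assuming the i-vectors are plain numpy arrays (the function and variable names here are illustrative, not from the actual system):

```python
import numpy as np

def adapt_and_length_normalize(ivectors, matched_mean):
    """Center i-vectors on the matched-dataset mean, then length-normalize.

    ivectors: (N, D) array of raw i-vectors
    matched_mean: (D,) mean i-vector estimated from the small matched set
    """
    centered = ivectors - matched_mean  # shift towards the matched population
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms             # project onto the unit hypersphere

# Hypothetical usage: a handful of matched i-vectors is enough to estimate the mean.
matched = np.random.randn(50, 400) + 3.0  # stand-in for matched-language i-vectors
test = np.random.randn(10, 400) + 3.0
mean = matched.mean(axis=0)
normalized = adapt_and_length_normalize(test, mean)
```

The point of the centering is that i-vectors from the mismatched language no longer land in a small off-center region before the projection onto the unit hypersphere.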
0:08:59 | What else should I say? |
---|
0:09:02 | Here |
---|
0:09:03 | there should be some reference to the related work, the study done by Jesus, which is |
---|
0:09:10 | also another approach that could be used to compensate for covariate shift |
---|
0:09:16 | and will be presented a few talks after this one. |
---|
0:09:20 | So, these |
---|
0:09:22 | approaches avoid those problems, as they always work at the PLDA model level, so the UBM and the i-vector |
---|
0:09:28 | extractor are not modified. |
---|
0:09:32 | To test these techniques, what we do is we simulate covariate shift by introducing language |
---|
0:09:39 | mismatch. |
---|
0:09:40 | So we assume that our system has been trained completely on English data. |
---|
0:09:44 | We will evaluate it in mismatched groups of languages. We will consider Chinese, Hindi-Urdu and |
---|
0:09:54 | Russian. As the development data we will use the NIST data from zero four to |
---|
0:10:03 | zero six, the Switchboard data and the Fisher data. |
---|
0:10:08 | Here we have the number of sessions and speakers that we have for each language; |
---|
0:10:12 | for Chinese we have |
---|
0:10:14 | quite a large amount of data. |
---|
0:10:16 | For example, for Hindi-Urdu we don't have much development data. |
---|
0:10:21 | We will evaluate these approaches on the NIST SRE zero eight telephone-telephone condition. We |
---|
0:10:27 | will consider all-to-all trials. |
---|
0:10:32 | Here we have the number of models and speakers for each |
---|
0:10:35 | language. |
---|
0:10:37 | As the speaker verification system we will consider an i-vector PLDA system: a gender-dependent i-vector extractor |
---|
0:10:44 | of dimension four hundred. And then, we'll consider a gender-dependent PLDA, which is a mixture of |
---|
0:10:50 | two PLDA models, one trained with male data, one trained with female data, |
---|
0:10:59 | with a full covariance matrix for the residual component, and a speaker subspace of |
---|
0:11:04 | dimension one hundred and twenty. |
---|
0:11:06 | And the results are analyzed in terms of EER and minDCF. |
---|
0:11:16 | So the first thing we do is, we analyze the effect of covariate shift in |
---|
0:11:21 | the data. And what we have done is to analyze the i-vectors |
---|
0:11:25 | we have for different languages. So we have computed the Mahalanobis distance between the |
---|
0:11:33 | population of English i-vectors and the |
---|
0:11:38 | population of each other language's i-vectors. We have seen that these distances are |
---|
0:11:44 | very large. So, this means that when we perform i-vector length normalization on a |
---|
0:11:54 | language which is different from English, we project it onto a small region of the |
---|
0:12:00 | hypersphere of unit radius. So the distribution will not be as expected: |
---|
0:12:08 | all the i-vectors will be concentrated in a small region of the hypersphere. |
---|
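The distance analysis described here can be sketched as follows. This is a generic pooled-covariance Mahalanobis distance between the means of two i-vector populations, which may differ in detail from the exact measure used in the paper:

```python
import numpy as np

def mahalanobis_between_populations(x, y):
    """Mahalanobis distance between the means of two i-vector populations,
    using their pooled sample covariance.

    x: (N, D) i-vectors from one language, y: (M, D) from another.
    """
    mx, my = x.mean(axis=0), y.mean(axis=0)
    cx = np.cov(x, rowvar=False)  # (D, D) sample covariances
    cy = np.cov(y, rowvar=False)
    pooled = ((len(x) - 1) * cx + (len(y) - 1) * cy) / (len(x) + len(y) - 2)
    diff = mx - my
    # Solve pooled @ z = diff instead of inverting the covariance explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))
```

A large value between the English population and another language's population indicates exactly the off-center concentration described above.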
0:12:13 | So this will have an effect on the accuracy, and not only because of the distribution of i-vectors, |
---|
0:12:19 | since we are also missing information in the UBM. But in the end, we see |
---|
0:12:25 | that it has |
---|
0:12:25 | an effect on the accuracy of the system, as we can see in this table, where |
---|
0:12:31 | only English data has been used for development: |
---|
0:12:37 | the other languages obtain |
---|
0:12:38 | worse results than English. It is true that we don't know the accuracy that we |
---|
0:12:47 | will get for these languages, provided that we have enough data to train a model, |
---|
0:12:53 | to train a complete evaluation system with them. But there's no reason to believe |
---|
0:12:57 | that these languages are harder for a speaker verification system than English. So we could expect |
---|
0:13:04 | to get an accuracy which is |
---|
0:13:07 | somehow similar, maybe better, maybe worse, but somehow similar, |
---|
0:13:10 | to English. |
---|
0:13:13 | Well, here we are comparing the minDCF obtained for the proposed techniques |
---|
0:13:22 | for the three groups of languages that we test. |
---|
0:13:27 | So the first column for each language is the baseline, using only |
---|
0:13:31 | English development data. |
---|
0:13:33 | And the second column is |
---|
0:13:37 | stacking the two |
---|
0:13:41 | total variability matrices. |
---|
0:13:42 | The third is using i-vector adaptation. The fourth is using s-norm. |
---|
0:13:54 | And the last three columns are combinations of these techniques. |
---|
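For reference, the s-norm used in that fourth column can be sketched as below: a symmetric combination of z-norm and t-norm against cohort scores. The function name and cohort handling are illustrative, not from the actual system:

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization: average of the score z-normalized
    against an enrollment-side cohort and against a test-side cohort.

    enroll_cohort_scores: scores of the enrollment model against a cohort.
    test_cohort_scores: scores of the test segment against the same cohort.
    """
    z = (raw_score - enroll_cohort_scores.mean()) / enroll_cohort_scores.std()
    t = (raw_score - test_cohort_scores.mean()) / test_cohort_scores.std()
    return 0.5 * (z + t)
```

When the cohort is drawn from the matched language, this partly absorbs the score shift that the mismatch introduces.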
0:13:58 | So, we can see that most of these techniques work in the sense that |
---|
0:14:02 | they improve the |
---|
0:14:06 | results of the system, |
---|
0:14:06 | but the improvement is quite small. |
---|
0:14:12 | If we wanted to reach an accuracy close to English, which is |
---|
0:14:17 | here, |
---|
0:14:18 | we are still too far. |
---|
0:14:24 | So, this can be seen also in these DET curves, |
---|
0:14:28 | where we are representing the DET curves obtained for Chinese. |
---|
0:14:37 | We have the DET curve which uses |
---|
0:14:38 | only English data for development; the blue curve uses matched training data |
---|
0:14:45 | to |
---|
0:14:46 | perform i-vector adaptation; |
---|
0:14:49 | the black curve uses matched Chinese data |
---|
0:14:52 | to perform |
---|
0:14:53 | i-vector adaptation and s-norm. |
---|
0:14:55 | We see |
---|
0:14:56 | that we get a slight improvement, but we are still too far from |
---|
0:15:02 | English, |
---|
0:15:03 | that is, from the results we would like to get. |
---|
0:15:10 | There is also another important effect introduced by the presence of covariate shift: we |
---|
0:15:16 | will find a misalignment in the score distributions. |
---|
0:15:22 | It's something that is widely known and |
---|
0:15:25 | you can see this effect here, in the example we have. |
---|
0:15:29 | We have represented the English and |
---|
0:15:30 | Chinese score distributions. We can see that the Chinese score distributions |
---|
0:15:35 | are |
---|
0:15:38 | shifted to the right, towards |
---|
0:15:42 | higher scores; probably it's related also to the fact that |
---|
0:15:45 | the i-vectors are concentrated in a small region. |
---|
0:15:53 | So, |
---|
0:15:56 | it's mandatory, if we have a small amount of matched data, to use it |
---|
0:16:00 | for calibration. |
---|
0:16:02 | This is something that everybody knows and we have been doing: |
---|
0:16:08 | in all NIST evals, we always calibrate each condition separately. We also use techniques with |
---|
0:16:16 | side info |
---|
0:16:18 | for calibration, where we add the language or the condition. But it's important, |
---|
0:16:25 | because if we only have a little amount of data, and we need to use |
---|
0:16:30 | an independent |
---|
0:16:32 | part of the data for calibration, we will not have much data |
---|
0:16:36 | for adaptation. |
---|
0:16:38 | So, here we are representing minDCF for our languages. |
---|
0:16:44 | And, in red, the actual DCF when we use English data for calibration; and the actual DCF |
---|
0:16:51 | when we use matched data. |
---|
0:16:55 | It's mandatory to use matched data for calibration. |
---|
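A minimal sketch of what matched-data calibration could look like: a linear mapping s → a·s + b fitted by logistic regression on a small set of labeled matched trials. This is a generic sketch with synthetic scores, not the calibration actually used in the evaluation, and it omits the prior weighting a real calibration would apply:

```python
import numpy as np

def train_linear_calibration(scores, labels, lr=0.1, n_iter=10000):
    """Fit s_cal = a * s + b by minimizing the logistic (cross-entropy) loss
    on labeled trials. labels: 1 = target, 0 = non-target."""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))  # sigmoid of calibrated score
        grad_a = np.mean((p - labels) * scores)      # gradient of the mean loss
        grad_b = np.mean(p - labels)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Synthetic matched trials: target scores shifted right, as in the talk's plots.
rng = np.random.default_rng(0)
tgt = rng.normal(5.0, 1.0, 200)    # target scores
non = rng.normal(2.0, 1.0, 2000)   # non-target scores
scores = np.concatenate([tgt, non])
labels = np.concatenate([np.ones(200), np.zeros(2000)])
a, b = train_linear_calibration(scores, labels)
# Calibrated scores a * s + b can now be thresholded at the Bayes threshold.
```

Fitting a and b on even a small matched set realigns the shifted score distribution, which is why matched data is reserved for calibration first.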
0:16:58 | So, as conclusions of this work, |
---|
0:17:00 | we'll say that dataset shift is usual in speaker recognition. |
---|
0:17:08 | There are many techniques developed to compensate for this, but most of them need |
---|
0:17:14 | a large amount of data to work properly. |
---|
0:17:17 | But in many real cases little data is provided. |
---|
0:17:21 | So, having a monolithic system would enable us to perform some sort of |
---|
0:17:28 | adaptation. |
---|
0:17:29 | But state-of-the-art techniques tend to modularity, since development is much easier when we have a |
---|
0:17:36 | modular system. |
---|
0:17:37 | PLDA is one such modular system. |
---|
0:17:38 | There are techniques that can work with these |
---|
0:17:43 | modular systems, but they obtain only a slight increase in accuracy. |
---|
0:17:47 | There is still a huge gap to improve. |
---|
0:17:49 | And finally, it's important to keep in mind that matched data is mandatory for calibration, |
---|
0:17:55 | so if we have a |
---|
0:17:58 | small amount of data |
---|
0:17:59 | for adaptation, we will need to use part of this data for calibration. |
---|
0:18:04 | So, that's all, thank you very much. |
---|
0:18:28 | You mean, in this work? |
---|
0:18:40 | You mean this work or in the literature? |
---|
0:18:44 | I'm not sure, but you can see that, for example, your i-vectors don't match your |
---|
0:18:51 | distribution, your expected prior distribution, anymore; or |
---|
0:18:57 | even at lower levels, your statistics |
---|
0:18:59 | or MFCCs. |
---|
0:19:04 | but yes |
---|
0:19:14 | but it would be interesting. I think the problem is that, |
---|
0:19:19 | if you want to have a compensation |
---|
0:19:23 | basis, it would be interesting to have at some point a JFA or maybe eigenchannel-based |
---|
0:19:31 | system that is |
---|
0:19:34 | described as a probabilistic framework that you could adapt, and define some technique. It would be |
---|
0:19:43 | interesting to do it. |
---|
0:20:10 | So you mean using a smaller |
---|
0:20:12 | dimensional i-vector extractor? |
---|
0:20:18 | okay |
---|
0:20:23 | But in any case, you will... if you adapt your i-vector extractor, you will need |
---|
0:20:28 | to retrain your PLDA system. |
---|
0:20:41 | Yeah, yeah. Have you tried to remove the specific means |
---|
0:20:46 | of the specific channel conditions? |
---|
0:20:50 | for example |
---|
0:20:51 | microphone data |
---|
0:20:53 | or to remove |
---|
0:20:55 | the telephone mean from the telephone data, |
---|
0:20:56 | the microphone mean from the microphone data? |
---|
0:21:00 | No, I haven't tried that. |
---|
0:21:03 | Sounds risky. |
---|
0:21:07 | It may work, but it's |
---|
0:21:09 | like assuming that there is no rotation in the i-vectors, only a shift; |
---|
0:21:16 | if there is rotation, |
---|
0:21:18 | it will not work. |
---|
0:21:22 | I don't know |
---|
0:21:23 | It is interesting to try. I've tried that and it was helping |
---|
0:21:27 | It was helping? Ok, that's interesting. |
---|
0:21:43 | okay |
---|
0:21:54 | Well, especially in those languages where I don't have much matched data yet. Yeah, |
---|
0:22:00 | that might be... I think it's in most languages pretty balanced, but there are some |
---|
0:22:07 | languages... I think I remember that, for example, Hindi had |
---|
0:22:11 | Hindi-Urdu had ... |
---|
0:22:14 | in detail... seven speakers. So that was, |
---|
0:22:18 | as I remember... but it is probably quite unbalanced; maybe we have more |
---|
0:22:22 | female speakers. |
---|
0:22:47 | well |
---|
0:22:51 | okay |
---|
0:22:53 | Well, not for Chinese, for example. It depends on the language |
---|
0:22:59 | but |
---|
0:22:59 | I would say that i-vector adaptation is the one that |
---|
0:23:05 | works best, as it always gives some improvement. |
---|
0:23:09 | It's not much, but |
---|
0:23:11 | still. |
---|
0:23:22 | The matched data. |
---|
0:23:24 | So, when I say they work... |
---|
0:23:28 | these techniques try to use the matched data |
---|
0:23:33 | to improve the |
---|
0:23:35 | accuracy of the system. |
---|
0:23:44 | Not much; I don't think the improvement was significant, if there was improvement at all. Maybe there |
---|
0:23:51 | were some losses. |
---|
0:24:51 | So you mean that |
---|
0:24:54 | if I get |
---|
0:24:56 | my model speakers from English, it will help also if we |
---|
0:25:00 | perform some of these techniques to adapt to them? |
---|
0:25:08 | okay |
---|
0:25:13 | okay |
---|
0:25:35 | I see that sometimes you can't do something without the data, because there are |
---|
0:25:41 | certain |
---|
0:25:45 | sources of variability |
---|
0:26:04 | variability in the first place |
---|
0:26:13 | It's a general comment to |
---|
0:26:16 | all of us. |
---|
0:26:44 | Yeah, ok, well in fact, there are techniques that provide more robustness. The results presented |
---|
0:26:51 | recently |
---|
0:26:54 | are based on integrating out the |
---|
0:26:57 | PLDA parameters, so as to account for |
---|
0:27:01 | the uncertainty of these parameters, so it should be more robust to dataset shift. But |
---|
0:27:08 | the point here is: if you have some amount of data, |
---|
0:27:12 | it's better to use it. But you're right. |
---|
0:27:20 | You are completely right, of course. |
---|