0:00:21 | So, good afternoon and thank you, Patrick. |
---|
0:00:25 | Well, I am Carlos Vaqueros from Agnitio, from Spain, and I'm presenting our work on |
---|
0:00:30 | Dataset shift in PLDA-based speaker verification, |
---|
0:00:34 | which is, actually, an analysis on |
---|
0:00:37 | several techniques |
---|
0:00:38 | that can be used in |
---|
0:00:42 | PLDA systems to mitigate the effect of dataset shift. But it is also an analysis of the |
---|
0:00:49 | limitations that PLDA systems have when dealing with dataset shift. |
---|
0:00:56 | So, dataset shift is the mismatch that may appear between the joint distributions of inputs |
---|
0:01:02 | and outputs |
---|
0:01:04 | for training and testing. |
---|
0:01:07 | Okay? In general, we have three types of dataset shift. First one will be covariate |
---|
0:01:12 | shift, which appears |
---|
0:01:18 | when the distribution of the inputs differs from training to testing. It's the most |
---|
0:01:25 | usual type of dataset shift, since it is related to channel |
---|
0:01:32 | variability, session variability or language mismatch. But there are also other types of |
---|
0:01:39 | dataset shift, for example prior probability shift, which is related to variations in the operating |
---|
0:01:46 | point; |
---|
0:01:47 | or concept shift, which is related to adversarial environments, which in speaker verification would be |
---|
0:01:55 | spoofing attempts. |
---|
0:01:57 | In this work we're focusing on covariate shift. |
---|
0:02:03 | Covariate shift has been widely studied in speaker verification. |
---|
0:02:07 | We know that there are several techniques developed to compensate for channel/ session variability or |
---|
0:02:12 | language mismatch. But most of these techniques work under the assumption |
---|
0:02:21 | that large datasets are available for training. |
---|
0:02:27 | The thing is: what happens in real situations, where we face a completely new and |
---|
0:02:31 | unknown situation and we don't have data to train these approaches? For example, here |
---|
0:02:37 | we have some results. |
---|
0:02:40 | We are considering a JFA system |
---|
0:02:46 | facing condition one of NIST SRE 08, which is interview-interview, and we don't |
---|
0:02:52 | use any microphone data, only telephone data, to train the channel subspace. So, we can see that JFA, |
---|
0:03:01 | if not using microphone data, is not much better than classical MAP, which |
---|
0:03:05 | doesn't use any compensation at all. |
---|
0:03:09 | So, once we have the microphone data, we get |
---|
0:03:14 | a huge improvement. |
---|
0:03:15 | So the thing is, what can we do in real scenarios that are unknown |
---|
0:03:20 | and unseen? |
---|
0:03:22 | Well, if we don't have any data, it's hard to do anything, but usually we |
---|
0:03:26 | can expect that some small amount of matched data is provided. So, there |
---|
0:03:34 | is something that we could do. |
---|
0:03:37 | We can define some probabilistic framework, so that it is possible to perform an adaptation, |
---|
0:03:46 | even of a model trained |
---|
0:03:48 | on mismatched development data. Given some matched data, we can adapt |
---|
0:03:54 | the model parameters so the system can work as soon as possible in this new scenario. |
---|
0:04:01 | But to do this in a natural way and |
---|
0:04:09 | to derive it easily, we would expect that the |
---|
0:04:14 | speaker verification system be a monolithic system that provides a single probabilistic framework |
---|
0:04:20 | to compute the likelihood of the model parameters given the data. |
---|
0:04:27 | Well, the first approaches, such as JFA, were monolithic, so they provided a framework in |
---|
0:04:34 | which algorithms worked and which defined ways to adapt these parameters. It would |
---|
0:04:42 | be possible to define ways to adapt these parameters, given a small amount of data. |
---|
0:04:47 | But current state-of-the-art PLDA systems are modular, so we have several model levels. |
---|
0:04:57 | We start with the first level, the UBM; we train the UBM separately |
---|
0:05:03 | and it provides sufficient statistics. We use those to train the i-vector extractor, a total variability |
---|
0:05:09 | subspace, and then we obtain i-vectors and we use them to train the PLDA model. |
---|
0:05:15 | But we use them as features: the PLDA model has no knowledge of how these features |
---|
0:05:22 | were obtained, |
---|
0:05:25 | just the prior distribution they have. |
---|
0:05:29 | So this model has its advantages, because it's very easy to |
---|
0:05:39 | keep improving it: we can fix the UBM and work on the |
---|
0:05:44 | total variability matrix, which is fast to train, so |
---|
0:05:50 | we can try many things and improve it. And once the i-vector extractor is fixed, |
---|
0:05:55 | we can work a lot and very quickly on the PLDA model, and keep improving |
---|
0:05:59 | it. |
---|
0:06:00 | But, in terms of adapting this model to new situations, it has some |
---|
0:06:08 | problems. Either we work at the highest model level, that |
---|
0:06:14 | is PLDA, and we adapt the PLDA parameters to face the new situations, |
---|
0:06:22 | or if we want to work in |
---|
0:06:25 | lower model levels, we will need to retrain |
---|
0:06:29 | the whole system. |
---|
0:06:31 | For example, if we have adapted the UBM, our i-vector extractor is not valid anymore, |
---|
0:06:35 | so we will need to retrain it on the whole data. And this is not |
---|
0:06:38 | feasible in many applications, for example an application that you want to learn online as |
---|
0:06:45 | you get more data in a new situation. You would |
---|
0:06:51 | need to have all the development data every time you adapt the UBM, and |
---|
0:06:56 | it would take a long time to adapt it for even a small set |
---|
0:07:01 | of recordings. So that's not feasible in many applications. |
---|
0:07:14 | Well, in any case, there are several known techniques that we can |
---|
0:07:20 | apply |
---|
0:07:21 | in a PLDA system. The first thing we could do is adapt |
---|
0:07:28 | the UBM and then the subsequent model levels, but we will need to retrain the |
---|
0:07:34 | whole system. |
---|
0:07:35 | We can do it pooling all the available data, the development data and the matched |
---|
0:07:39 | data, or we could do it by weighting the datasets. But this will not be feasible |
---|
0:07:45 | in many applications. |
---|
0:07:47 | So, we can also work on the i-vector extractor. One thing that has been done |
---|
0:07:53 | is |
---|
0:07:55 | to train a new total variability matrix on the |
---|
0:08:01 | matched data |
---|
0:08:03 | and stack it with the original total |
---|
0:08:06 | variability matrix. |
---|
0:08:07 | Well, this approach has been shown to work, but usually you need quite a large amount |
---|
0:08:15 | of data to train the matched total variability matrix. And also, it will require |
---|
0:08:21 | retraining the PLDA model. |
---|
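As an aside, the stacking idea just described can be sketched in a few lines. This is an illustrative sketch with toy dimensions, not the actual system code: each total variability matrix maps the supervector space to a low-dimensional subspace, and stacking simply concatenates the two subspaces, so the resulting i-vectors are longer and the downstream PLDA model must be retrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration: in a real system the supervector
# dimension would be C Gaussians x F features (e.g. 2048 x 60).
supervector_dim = 100
T_original = rng.standard_normal((supervector_dim, 40))  # trained on development data
T_matched = rng.standard_normal((supervector_dim, 10))   # trained on the small matched set

# Stacking concatenates the column spaces: i-vectors now have 40 + 10
# dimensions, which is why the PLDA model has to be retrained on them.
T_stacked = np.hstack([T_original, T_matched])
print(T_stacked.shape)  # (100, 50)
```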
0:08:23 | So it will have some problems. We can also work |
---|
0:08:29 | on the |
---|
0:08:33 | PLDA model. Here, what we are proposing is simply to use length normalization, |
---|
0:08:41 | but performing |
---|
0:08:43 | some sort of i-vector adaptation by centering, |
---|
0:08:49 | using the i-vector mean from the matched dataset. |
---|
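A minimal sketch of this centering-plus-length-normalization step, assuming the i-vectors are plain numpy arrays (the function and variable names here are illustrative, not from the actual system):

```python
import numpy as np

def adapt_and_length_normalize(ivectors, matched_mean):
    """Center i-vectors on the matched-dataset mean, then length-normalize.

    ivectors: (N, D) array of raw i-vectors
    matched_mean: (D,) mean i-vector estimated from the small matched set
    """
    centered = ivectors - matched_mean  # shift towards the matched population
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms             # project onto the unit hypersphere

# Hypothetical usage: a handful of matched i-vectors is enough to estimate the mean.
matched = np.random.randn(50, 400) + 3.0  # stand-in for matched-language i-vectors
test = np.random.randn(10, 400) + 3.0
mean = matched.mean(axis=0)
normalized = adapt_and_length_normalize(test, mean)
```

The point of the centering is that i-vectors from the mismatched language no longer land in a small off-center region before the projection onto the unit hypersphere.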
0:08:59 | What else should I say? |
---|
0:09:02 | Here |
---|
0:09:03 | there should be some reference to the related work, the study done by Jesus, which is |
---|
0:09:10 | also another approach that could be used to compensate for covariate shift |
---|
0:09:16 | and will be presented a few talks after this one. |
---|
0:09:20 | So, these |
---|
0:09:22 | approaches avoid those problems, as they always work at the PLDA model level, so the UBM and the i-vector |
---|
0:09:28 | extractor are not modified. |
---|
0:09:32 | To test these techniques, what we do is we simulate covariate shift by introducing language |
---|
0:09:39 | mismatch. |
---|
0:09:40 | So we assume that our system has been trained completely on English data. |
---|
0:09:44 | We will evaluate it in mismatched groups of languages. We will consider Chinese, Hindi-Urdu and |
---|
0:09:54 | Russian. As the development data we will use the NIST data from zero four to |
---|
0:10:03 | zero six, the Switchboard data and the Fisher data. |
---|
0:10:08 | Here we have the number of sessions and speakers that we have for each language; |
---|
0:10:12 | for Chinese we have |
---|
0:10:14 | quite a large amount of data. |
---|
0:10:16 | For example, for Hindi-Urdu we don't have much development data. |
---|
0:10:21 | We will evaluate these approaches on the NIST SRE zero eight telephone-telephone condition. We |
---|
0:10:27 | will consider all-to-all trials. |
---|
0:10:32 | Here we have the number of models and speakers for each |
---|
0:10:35 | language. |
---|
0:10:37 | As the speaker verification system we will consider an i-vector PLDA system: a gender-dependent i-vector extractor |
---|
0:10:44 | of dimension four hundred. And then, we'll consider a gender-dependent PLDA, which is a mixture of |
---|
0:10:50 | two PLDA models, one trained with male data, one trained with female data, |
---|
0:10:59 | with a full covariance matrix for the residual component, and a speaker subspace of |
---|
0:11:04 | dimension one hundred and twenty. |
---|
0:11:06 | And the results are analyzed in terms of EER and minDCF. |
---|
0:11:16 | So the first thing we do is, we analyze the effect of covariate shift in |
---|
0:11:21 | the data. And what we have done is to analyze the i-vectors |
---|
0:11:25 | we have for different languages. So we have computed the Mahalanobis distance between the |
---|
0:11:33 | population of English i-vectors and the |
---|
0:11:38 | population of each other language's i-vectors. We have seen that these distances are |
---|
0:11:44 | very large. So, this means that when we perform i-vector length normalization on a |
---|
0:11:54 | language which is different from English, we project it onto a small region of the |
---|
0:12:00 | hypersphere of unit radius. So the distribution will not be as expected: |
---|
0:12:08 | all the i-vectors will be concentrated in a small region of the hypersphere. |
---|
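The distance analysis described here can be sketched as follows. This is a generic pooled-covariance Mahalanobis distance between the means of two i-vector populations, which may differ in detail from the exact measure used in the paper:

```python
import numpy as np

def mahalanobis_between_populations(x, y):
    """Mahalanobis distance between the means of two i-vector populations,
    using their pooled sample covariance.

    x: (N, D) i-vectors from one language, y: (M, D) from another.
    """
    mx, my = x.mean(axis=0), y.mean(axis=0)
    cx = np.cov(x, rowvar=False)  # (D, D) sample covariances
    cy = np.cov(y, rowvar=False)
    pooled = ((len(x) - 1) * cx + (len(y) - 1) * cy) / (len(x) + len(y) - 2)
    diff = mx - my
    # Solve pooled @ z = diff instead of inverting the covariance explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))
```

A large value between the English population and another language's population indicates exactly the off-center concentration described above.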
0:12:13 | So this will have an effect on the accuracy, and not only because of the distribution of i-vectors, |
---|
0:12:19 | since we are also missing information in the UBM. But in the end, we see |
---|
0:12:25 | that it has |
---|
0:12:25 | an effect on the accuracy of the system, as we can see in this table, where |
---|
0:12:31 | only English data has been used for development: |
---|
0:12:37 | the other languages obtain |
---|
0:12:38 | worse results than English. It is true that we don't know the accuracy that we |
---|
0:12:47 | will get for these languages, provided that we have enough data to train a model, |
---|
0:12:53 | to train a complete evaluation system with them. But there's no reason to believe |
---|
0:12:57 | that these languages are harder for a speaker verification system than English. So we could expect |
---|
0:13:04 | to get an accuracy which is |
---|
0:13:07 | somehow similar, maybe better, maybe worse, but somehow similar, |
---|
0:13:10 | to English. |
---|
0:13:13 | Well, here we are comparing the minDCF obtained for the proposed techniques |
---|
0:13:22 | for the three groups of languages that we test. |
---|
0:13:27 | So the first column for each language is the baseline, using only |
---|
0:13:31 | English development data. |
---|
0:13:33 | And the second column is |
---|
0:13:37 | stacking the two |
---|
0:13:41 | total variability matrices. |
---|
0:13:42 | The third is using i-vector adaptation. The fourth is using s-norm. |
---|
0:13:54 | And the last three columns are combinations of these techniques. |
---|
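For reference, the s-norm used in that fourth column can be sketched as below: a symmetric combination of z-norm and t-norm against cohort scores. The function name and cohort handling are illustrative, not from the actual system:

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization: average of the score z-normalized
    against an enrollment-side cohort and against a test-side cohort.

    enroll_cohort_scores: scores of the enrollment model against a cohort.
    test_cohort_scores: scores of the test segment against the same cohort.
    """
    z = (raw_score - enroll_cohort_scores.mean()) / enroll_cohort_scores.std()
    t = (raw_score - test_cohort_scores.mean()) / test_cohort_scores.std()
    return 0.5 * (z + t)
```

When the cohort is drawn from the matched language, this partly absorbs the score shift that the mismatch introduces.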
0:13:58 | So, we can see that most of these techniques work in the sense that |
---|
0:14:02 | they improve the |
---|
0:14:06 | results of the system, |
---|
0:14:06 | but the improvement is quite small. |
---|
0:14:12 | If we wanted to reach an accuracy close to English, which is |
---|
0:14:17 | here, |
---|
0:14:18 | we are still too far. |
---|
0:14:24 | So, this can be seen also in these DET curves, |
---|
0:14:28 | where we are representing the DET curves obtained for Chinese. |
---|
0:14:37 | We have the DET curve which uses |
---|
0:14:38 | only English data for development; the blue curve uses matched training data |
---|
0:14:45 | to |
---|
0:14:46 | perform i-vector adaptation; |
---|
0:14:49 | the black curve uses matched Chinese data |
---|
0:14:52 | to perform |
---|
0:14:53 | i-vector adaptation and s-norm. |
---|
0:14:55 | We see |
---|
0:14:56 | that we get a slight improvement, but we are still too far from |
---|
0:15:02 | English, |
---|
0:15:03 | that is, from the results we would like to get. |
---|
0:15:10 | There is also another important effect introduced by the presence of covariate shift: we |
---|
0:15:16 | will find a misalignment in the score distributions. |
---|
0:15:22 | It's something that is widely known and |
---|
0:15:25 | you can see this effect here, in the example we have. |
---|
0:15:29 | We have represented the English and |
---|
0:15:30 | Chinese score distributions. We can see that the Chinese score distributions |
---|
0:15:35 | are |
---|
0:15:38 | shifted to the right, towards |
---|
0:15:42 | higher scores; probably it's related also to the fact that |
---|
0:15:45 | the i-vectors are concentrated in a small region. |
---|
0:15:53 | So, |
---|
0:15:56 | it's mandatory, if we have a small amount of matched data, to use it |
---|
0:16:00 | for calibration. |
---|
0:16:02 | This is something that everybody knows and we have been doing: |
---|
0:16:08 | in all NIST evals, we always calibrate each condition separately. We also use techniques with |
---|
0:16:16 | side info |
---|
0:16:18 | for calibration, where we add the language or the condition. But it's important, |
---|
0:16:25 | because if we only have a little amount of data, and we need to use |
---|
0:16:30 | an independent |
---|
0:16:32 | part of the data for calibration, we will not have much data |
---|
0:16:36 | for adaptation. |
---|
0:16:38 | So, here we are representing minDCF for our languages. |
---|
0:16:44 | And, in red, the actual DCF when we use English data for calibration; and the actual DCF |
---|
0:16:51 | when we use matched data. |
---|
0:16:55 | It's mandatory to use matched data for calibration. |
---|
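A minimal sketch of what matched-data calibration could look like: a linear mapping s → a·s + b fitted by logistic regression on a small set of labeled matched trials. This is a generic sketch with synthetic scores, not the calibration actually used in the evaluation, and it omits the prior weighting a real calibration would apply:

```python
import numpy as np

def train_linear_calibration(scores, labels, lr=0.1, n_iter=10000):
    """Fit s_cal = a * s + b by minimizing the logistic (cross-entropy) loss
    on labeled trials. labels: 1 = target, 0 = non-target."""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))  # sigmoid of calibrated score
        grad_a = np.mean((p - labels) * scores)      # gradient of the mean loss
        grad_b = np.mean(p - labels)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Synthetic matched trials: target scores shifted right, as in the talk's plots.
rng = np.random.default_rng(0)
tgt = rng.normal(5.0, 1.0, 200)    # target scores
non = rng.normal(2.0, 1.0, 2000)   # non-target scores
scores = np.concatenate([tgt, non])
labels = np.concatenate([np.ones(200), np.zeros(2000)])
a, b = train_linear_calibration(scores, labels)
# Calibrated scores a * s + b can now be thresholded at the Bayes threshold.
```

Fitting a and b on even a small matched set realigns the shifted score distribution, which is why matched data is reserved for calibration first.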
0:16:58 | So, as conclusions of this work, |
---|
0:17:00 | we'll say that dataset shift is usual in speaker recognition. |
---|
0:17:08 | There are many techniques developed to compensate for this, but most of them need |
---|
0:17:14 | a large amount of data to work properly. |
---|
0:17:17 | But in many real cases little data is provided. |
---|
0:17:21 | So, having a monolithic system would enable us to perform some sort of |
---|
0:17:28 | adaptation. |
---|
0:17:29 | But state-of-the-art techniques tend to modularity, since development is much easier when we have a |
---|
0:17:36 | modular system. |
---|
0:17:37 | PLDA is one such modular system. |
---|
0:17:38 | There are techniques that can work with these |
---|
0:17:43 | modular systems, but they obtain only a slight increase in accuracy. |
---|
0:17:47 | There is still a huge gap to improve. |
---|
0:17:49 | And finally, it's important to keep in mind that matched data is mandatory for calibration, |
---|
0:17:55 | so if we have a |
---|
0:17:58 | small amount of data |
---|
0:17:59 | for adaptation, we will need to use part of this data for calibration. |
---|
0:18:04 | So, that's all, thank you very much. |
---|
0:18:28 | You mean, in this work? |
---|
0:18:40 | You mean this work or in the literature? |
---|
0:18:44 | I'm not sure, but you can see that, for example, your i-vectors don't match your |
---|
0:18:51 | distribution, your expected prior distribution, anymore; or |
---|
0:18:57 | even at lower levels, your statistics |
---|
0:18:59 | or MFCCs. |
---|
0:19:04 | but yes |
---|
0:19:14 | but it would be interesting. I think the problem is that, |
---|
0:19:19 | if you want to have a compensation |
---|
0:19:23 | basis, it would be interesting to have at some point a JFA or maybe eigenchannel-based |
---|
0:19:31 | system that is |
---|
0:19:34 | described as a probabilistic framework that you could adapt, and define some technique. It would be |
---|
0:19:43 | interesting to do it. |
---|
0:20:10 | So you mean using a smaller |
---|
0:20:12 | dimensional i-vector extractor? |
---|
0:20:18 | okay |
---|
0:20:23 | But in any case, you will... if you adapt your i-vector extractor, you will need |
---|
0:20:28 | to retrain your PLDA system. |
---|
0:20:41 | Yeah, yeah. Have you tried to remove the specific means |
---|
0:20:46 | of the specific channel conditions? |
---|
0:20:50 | for example |
---|
0:20:51 | microphone data |
---|
0:20:53 | or to remove |
---|
0:20:55 | the telephone mean from the telephone data, |
---|
0:20:56 | the microphone mean from the microphone data? |
---|
0:21:00 | No, I haven't tried that. |
---|
0:21:03 | Sounds risky. |
---|
0:21:07 | It may work, but it's |
---|
0:21:09 | like assuming that there is no rotation in the i-vectors, only a shift; |
---|
0:21:16 | if there is rotation, |
---|
0:21:18 | it will not work. |
---|
0:21:22 | I don't know |
---|
0:21:23 | It is interesting to try. I've tried that and it was helping |
---|
0:21:27 | It was helping? Ok, that's interesting. |
---|
0:21:43 | okay |
---|
0:21:54 | Well, especially in those languages where I don't have much matched data yet. Yeah, |
---|
0:22:00 | that might be... I think it's in most languages pretty balanced, but there are some |
---|
0:22:07 | languages... I think I remember that, for example, Hindi had |
---|
0:22:11 | Hindi-Urdu had ... |
---|
0:22:14 | in detail... seven speakers. So that was, |
---|
0:22:18 | as I remember... but it is probably quite unbalanced; maybe we have more |
---|
0:22:22 | female speakers. |
---|
0:22:47 | well |
---|
0:22:51 | okay |
---|
0:22:53 | Well, not for Chinese, for example. It depends on the language |
---|
0:22:59 | but |
---|
0:22:59 | I would say that i-vector adaptation is the one that |
---|
0:23:05 | works best, as it always gives some improvement. |
---|
0:23:09 | It's not much, but |
---|
0:23:11 | still. |
---|
0:23:22 | The matched data. |
---|
0:23:24 | So, when I say they work... |
---|
0:23:28 | these techniques try to use the matched data |
---|
0:23:33 | to improve the |
---|
0:23:35 | accuracy of the system. |
---|
0:23:44 | Not much; I don't think the improvement was significant, if there was improvement at all. Maybe there |
---|
0:23:51 | were some losses. |
---|
0:24:51 | So you mean that |
---|
0:24:54 | if I get |
---|
0:24:56 | my model speakers from English, it will help also if we |
---|
0:25:00 | perform some of these techniques to adapt to them? |
---|
0:25:08 | okay |
---|
0:25:13 | okay |
---|
0:25:35 | I see that sometimes you can't do something without the data, because there are |
---|
0:25:41 | certain |
---|
0:25:45 | sources of variability |
---|
0:26:04 | variability in the first place |
---|
0:26:13 | It's a general comment to |
---|
0:26:16 | all of us. |
---|
0:26:44 | Yeah, ok, well in fact, there are techniques that provide more robustness. The results presented |
---|
0:26:51 | recently |
---|
0:26:54 | are based on integrating out the |
---|
0:26:57 | PLDA parameters, so as to account for |
---|
0:27:01 | the uncertainty of these parameters, so it should be more robust to dataset shift. But |
---|
0:27:08 | the point here is: if you have some amount of data, |
---|
0:27:12 | it's better to use it. But you're right. |
---|
0:27:20 | You are completely right, of course. |
---|