0:00:17 | Would you just bear with me for a couple of minutes while I set out some background, and then I will try to explain in some detail what the technical problem is that we're trying to solve. |
0:00:30 | So for the JFA model, I have formulated it here in terms of GMM mean vectors and supervectors. The first term, m, is the mean supervector that comes from the universal background model. The second term involves a hidden variable x which is independent of the mixture component and is intended to model the channel effects across recordings. And the third term in that formulation involves a local hidden variable z that characterizes the speaker-phrase variability within a particular mixture component. |
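The three-term decomposition just described is the standard JFA supervector model; written out (the notation here is assumed, since the slide equation is not in the transcript):

```latex
s = m + Ux + Dz, \qquad x \sim \mathcal{N}(0, I), \quad z \sim \mathcal{N}(0, I)
```

where m is the UBM mean supervector, U is a low-rank matrix so that Ux confines the channel effects to a low-dimensional subspace shared across mixture components, and D is a diagonal matrix, so Dz contributes an independent offset to every supervector dimension.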
0:01:16 | So the typical approach would be to estimate the matrix U using the maximum likelihood criterion, which is exactly the criterion that is used to train an i-vector extractor. In practice, rather than use maximum likelihood, you usually end up using relevance MAP as an empirical estimate of the matrix D. The relation between the two is explained in the paper by Robbie Vogt going back to 2008. |
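The relation referred to can be summarized as follows (a sketch of the standard result, not a quotation from the paper): relevance MAP with relevance factor τ is equivalent to the Dz term of JFA when D is chosen so that

```latex
I = \tau\, D^{\top} \Sigma^{-1} D
\quad\Longleftrightarrow\quad
d_i^2 = \frac{\sigma_i^2}{\tau}
```

where Σ is the (diagonal) UBM covariance supermatrix, so each diagonal entry of D is just the corresponding UBM standard deviation scaled by the relevance factor.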
0:01:56 | the point i what stress here is that z vector is high dimensional we're not |
---|
0:02:03 | trying to explain the |
---|
0:02:05 | a speaker phrase variability by a low dimensional vector of hidden variables |
---|
0:02:12 | it's a factorial prior in the sense that the |
---|
0:02:17 | explanations for the different mixture components are statistically independent |
---|
0:02:22 | which really is a weakness |
---|
0:02:24 | we're not actually in a position with a prior like this |
---|
0:02:27 | to exploit the correlations between mixture components |
---|
0:02:37 | So to do calculations with this type of model, the standard method is an algorithm by Robbie Vogt which alternates between updating the two hidden variables x and z. |
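As a rough illustration of that alternation, here is a toy numpy sketch with made-up dimensions, using point-estimate (MAP) updates rather than the full variational posteriors of the actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): CF = supervector dimension, R = channel rank.
CF, R = 20, 2

m = rng.normal(size=CF)                  # UBM mean supervector
U = rng.normal(size=(CF, R))             # low-rank channel subspace
D = np.abs(rng.normal(size=CF)) + 0.5    # diagonal D matrix
x_true = rng.normal(size=R)
z_true = rng.normal(size=CF)
s_obs = m + U @ x_true + D * z_true + 0.1 * rng.normal(size=CF)

# Alternate MAP updates of x and z under standard normal priors and
# isotropic observation noise with variance sigma2.
sigma2 = 0.01
x = np.zeros(R)
z = np.zeros(CF)
for _ in range(50):
    # Update x given z: a ridge-regression solve against the residual.
    r_x = s_obs - m - D * z
    x = np.linalg.solve(U.T @ U / sigma2 + np.eye(R), U.T @ r_x / sigma2)
    # Update z given x: scalar per component because D is diagonal.
    r_z = s_obs - m - U @ x
    z = (D * r_z / sigma2) / (D**2 / sigma2 + 1.0)

recon = m + U @ x + D * z
```

Each step maximizes the joint log posterior in one variable with the other held fixed, so the alternation converges; the real algorithm additionally propagates posterior covariances, which is what yields the variational lower bound mentioned next.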
0:02:51 | It wasn't presented this way, but it's actually a variational Bayes algorithm, which means that it comes with variational lower bounds that you can use to do likelihood or evidence calculations. |
0:03:03 | That means that you can, for example, formulate the speaker recognition problem in exactly the same way as is done in PLDA, as a Bayesian model selection problem. The question is, if you're given enrollment utterances and test utterances and you want to account for that ensemble of data, whether you are better off positing a single z vector, or two z vectors, one for the enrollment data and one for the test data. |
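In evidence terms, the comparison just described (one shared z versus two independent z's; notation assumed here) is the ratio

```latex
\Lambda \;=\;
\frac{\displaystyle\int p(\mathcal{E} \mid z)\, p(\mathcal{T} \mid z)\, p(z)\, dz}
     {\displaystyle\int p(\mathcal{E} \mid z_1)\, p(z_1)\, dz_1
      \;\int p(\mathcal{T} \mid z_2)\, p(z_2)\, dz_2}
```

where E is the enrollment data, T the test data, the channel factors are integrated out inside each likelihood, and the variational lower bounds stand in for the intractable integrals.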
0:03:41 | There is something basically unsatisfactory about this, namely that it doesn't take account of the fact that what JFA is, is a model for how the UBM moves under speaker and channel effects. Traditionally, when we do these calculations, we use the universal background model to collect Baum-Welch statistics, and ignore the fact that, according to our model, the UBM actually shifts as a result of these hidden variables. |
0:04:17 | There is an important extension by Zhao and Dong that attempts to remedy this, and I was particularly interested in looking into it for the reason that I mentioned at the beginning: I believe that the UBM does have to be adapted in text-dependent speaker recognition. This is a principled way of doing that. It introduces an extra set of hidden variables, indicator variables which show how the frames are aligned with mixture components, and that can be interleaved into the variational Bayes updates in Vogt's algorithm, so that you get a coherent framework for handling the adaptation problem. |
0:05:11 | There's just one caveat that I think is worth pointing out about this algorithm. It requires that you take account of all of the hidden variables when you're doing the UBM adaptation and the evidence calculations. Of course, that's what you should do if the model is to be believed: if you take the model at face value, you should take account of all of the hidden variables. However, what's going on here is that this factorial prior is actually so weak that doing things by the book can lead you into problems. That's why I have flagged this here as a kind of caveat. |
0:06:00 | So in the paper I presented results on the RSR data using three types of classifier that come out of these calculations. The first is simply to use the z vectors, which can come either from Vogt's calculation or from Zhao and Dong's calculation, as features which, if they are extracted properly, should be purged of channel effects, and then to feed those into a simple backend like the cosine distance classifier. JFA as it was originally construed was intended to be not only a feature extractor but also a model that acts as a classifier; that accounts for the other two types. |
0:06:50 | However, in order to understand this problem of UBM adaptation, it's necessary also to look into what's going on with those Bayesian model selection algorithms: what happens when you apply them without UBM adaptation, as in Vogt's algorithm, or with UBM adaptation, as in Zhao and Dong's, and also to compare them with the likelihood ratio calculation which was traditional around 2008. |
0:07:36 | It turns out, when you look into these questions, that there's a whole bunch of anomalies that arise. On UBM adaptation: if you're using JFA as a feature extractor, UBM adaptation actually hurts. This is true for the z vectors; it's not true for i-vectors and not true for speaker factors, which behave reasonably. That was in this year's ICASSP paper. |
0:08:13 | On the other hand, if you look at the problem of maximum likelihood estimation of all the JFA model parameters, what you find is that it doesn't work at all without UBM adaptation; you do need UBM adaptation in order to get it to behave sensibly. |
0:08:34 | If you look at Bayesian model selection, you find that there are some cases where Zhao and Dong's algorithm works better than Vogt's, and other cases where exactly the opposite happens. |
0:08:49 | The traditional JFA likelihood ratio is actually very simplistic: it just uses plug-in estimates rather than attempting to integrate over hidden variables, and no UBM adaptation at all. What I will show in this paper is that it can be made to work very well with very careful UBM adaptation. |
0:09:12 | Okay, so this business of UBM adaptation turns out to be very tricky, and anyone who has been around in this field long enough has probably been bitten by this problem at some stage. In my own experience, I couldn't get JFA working at all until I stopped doing UBM adaptation. But that doesn't really make a lot of sense, because if you look at the history of subspace methods, eigenvoices and eigenchannels, they were originally implemented with UBM adaptation. And if you speak to people in speech recognition, they will be surprised if you tell them that you're not doing UBM adaptation; it is essential, for instance, in subspace Gaussian mixture models. |
0:10:06 | Okay, so here are just some examples of the anomalous results that arise. These are the Bayesian model selection results: on the left-hand side with 512 Gaussians in the UBM, on the right-hand side with 64. In the case of the small UBM, Zhao and Dong's algorithm gives you a small improvement; that doesn't happen with 512 Gaussians. |
0:10:47 | Here are the results in the third line; the first two lines are the same as on the last slide. The third line is the traditional JFA likelihood ratio, alongside Bayesian model selection with or without UBM adaptation. |
0:11:07 | So this, then, is what the paper is about. What I want to show starts from the traditional JFA likelihood ratio; maybe I'll just recall briefly how that goes. |
0:11:23 | You have a numerator and a denominator. In the numerator, you plug in the target speaker's supervector, use that to center the Baum-Welch statistics, and integrate over the channel factors. In the denominator, you plug in the UBM supervector and do exactly the same calculation, and you compare those two probabilities. |
0:11:55 | There is no UBM adaptation going on at all, and a plug-in estimate is used, which is not serious in the numerator, but in the denominator it really is problematic, because theory says you should be integrating over the entire speaker population rather than plugging in the mean value, the value that comes from the UBM supervector. |
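The plug-in ratio just described can be written (a sketch, with the notation assumed) as

```latex
\Lambda_{\text{plug-in}} \;=\;
\frac{\displaystyle\int p\big(\mathcal{X} \mid s_{\text{target}} + Ux\big)\,
      \mathcal{N}(x \mid 0, I)\, dx}
     {\displaystyle\int p\big(\mathcal{X} \mid m + Ux\big)\,
      \mathcal{N}(x \mid 0, I)\, dx}
```

where X is the test data and s_target is a point estimate obtained from the enrollment data; the step being criticized is the use of the fixed m in the denominator, where the speaker supervector ought to be integrated over its prior.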
0:12:20 | So what I will show is that if you do the adaptation very carefully, adapting the UBM to some of the hidden variables but not all of them, then everything will work properly, as long as you are using JFA as a classifier, that is, calculating likelihood ratios. |
0:12:48 | However, if you're using it as a feature extractor, and this turns out to give the best results, it turns out that you're better off avoiding UBM adaptation altogether. I'll give an explanation for this: it has to do with the fact that the factorial prior is too weak. This phenomenon is related to factorial priors, not to subspace priors. |
0:13:16 | Okay. For this problem, the first type of adaptation that you want to consider addresses the lexical mismatch between your enrollment and test utterances on the one hand, and the UBM, which might have been trained on some other data, on the other. In the JFA likelihood ratio, you're actually comparing the target speaker with the UBM speaker. |
0:13:47 | But consider what's going on here: if you have known lexical content in the trial, that is the thing which will most determine what the data looks like, not the UBM. You would be much better off comparing to a phrase-adapted background model than to the universal background model. So if you simply adapt the UBM to the lexical content of the phrase that is used in a particular trial, that will lead to a substantial improvement in performance. |
0:14:24 | What's going on here is that in the RSR data there are 30 different phrases. The mean supervector of JFA is adapted to each of the phrases, but all of the other parameters are shared across phrases. |
0:14:51 | If you adapt to the channel effects in the test data, this will work fine. These remarks refer to the early history of eigenvoice and eigenchannel modeling: there are two alternative ways of going about it, you can combine the two together, and you will get a slight improvement. There's no problem there. |
0:15:18 | If you adapt to the speaker effects in the enrollment data, it works fine. What I mean here is that you collect the Baum-Welch statistics from the test utterance with a GMM that has been adapted to the target speaker, and you get an improvement. And if you perform multiple iterations of MAP adaptation to the lexical content, things work even better. |
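A minimal sketch of the kind of iterated MAP mean adaptation being described (toy numpy code; hard alignments and synthetic data stand in for Gaussian posteriors and real Baum-Welch statistics):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy setup: C Gaussian means in F dimensions,
# with frames drawn around them.
C, F = 4, 2
ubm_means = rng.normal(size=(C, F))
frames = ubm_means[rng.integers(0, C, size=200)] \
    + 0.3 * rng.normal(size=(200, F))

def map_adapt_means(means, frames, relevance=16.0, n_iter=3):
    """Relevance-MAP re-estimation of the means, iterated so the frame
    alignments are recomputed against the adapted model each pass."""
    means = means.copy()
    C = len(means)
    for _ in range(n_iter):
        # Hard alignment for simplicity (a real system would use
        # posterior probabilities weighted by the mixture weights).
        d2 = ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        align = d2.argmin(axis=1)
        for c in range(C):
            sel = frames[align == c]
            n_c = len(sel)
            if n_c:
                # Classic relevance-MAP interpolation between the data
                # mean and the current (prior) mean.
                means[c] = (sel.sum(0) + relevance * means[c]) / (n_c + relevance)
    return means

adapted = map_adapt_means(ubm_means, frames)
```

Each pass moves every mean part of the way toward the sample mean of the frames assigned to it, which is why repeating the iterations tightens the fit to the lexical content.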
0:15:53 | So at this stage, if you look through those lines, you see that we've already got a forty percent improvement in error rates, just through doing the UBM adaptation carefully. |
0:16:11 | This slide, unfortunately, we're going to have to skip because of the time constraints. It's interesting, but I just don't have time to deal with it. |
0:16:25 | Here are results with 512 Gaussians. It turns out that doing careful adaptation with the UBM and 64 Gaussians can achieve about the same performance as working with 512 Gaussians and no adaptation. If you try adaptation with 512 Gaussians, things will not behave so well; this is a rather extreme case, where you have many more Gaussians than you actually have frames in your test utterances. |
0:17:01 | The remaining two lines present results that are obtained with z vectors as features, using likelihood ratio computations. The difference between the two is that NAP is used in one case but not the other. The point there is that you don't need NAP, because you've already suppressed the channel effects in extracting the z vectors. |
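The NAP comparison mentioned here can be sketched as follows (toy numpy code; the nuisance subspace W would normally be estimated from labeled channel variability, here a random orthonormal basis stands in for it):

```python
import numpy as np

rng = np.random.default_rng(2)

dim, k = 10, 2                                    # feature dim, nuisance rank (made up)
W, _ = np.linalg.qr(rng.normal(size=(dim, k)))    # orthonormal nuisance directions
P = np.eye(dim) - W @ W.T                         # NAP projection matrix

v = rng.normal(size=dim)    # a feature vector, e.g. a z vector
v_clean = P @ v             # component along the nuisance subspace removed
```

After projection, v_clean has no component along W, which is why NAP adds nothing if the channel effects were already removed when the features were extracted.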
0:17:38 | And these, then, are results on the full test set, the full RSR test set, comparing the z vector classifier using Vogt's algorithm, that's to say no UBM adaptation, and Zhao and Dong's algorithm with UBM adaptation. You can see that you're better off using Vogt's algorithm; I'll explain why in a minute, it will only take a second. |
0:18:08 | Okay, so these are the conclusions. You can adapt to everything in sight and it will work, but there is one thing you should not do, and that is adapt to the speaker effects in the test utterance. |
0:18:25 | The reason for that, I believe, is this: the factorial prior is extremely weak. If you have a single test utterance and you're doing UBM adaptation, then you're allowing the different mean vectors in the GMM to be displaced in statistically independent ways. That gives you an awful lot of freedom to align the data with the Gaussians, too much freedom. |
0:19:01 | Consider what happens if you have multiple enrollment utterances, which is normally the case in text-dependent speaker recognition. You still have a very weak prior, but you have a strong extra constraint: across the enrollment utterances, the Gaussians cannot move in statistically independent ways, they have to move in lockstep. And that means that the adaptation algorithm will behave sensibly. |
0:19:34 | If you do adaptation to the channel effects in the test utterance, things will also behave sensibly, and the reason for that is the subspace prior: channel effects are assumed to be confined to a low-dimensional subspace, and that imposes a strong constraint on the way the Gaussians can move. |
0:20:02 | So, final slide. If you're using JFA as a feature extractor, which is my recommendation, then the upshot of all this is that when you extract the feature vector from the test utterance, you cannot use UBM adaptation. And if you cannot use it in extracting a feature from the test utterance, you cannot use it in extracting a feature from the enrollment utterances either, because otherwise the features would not be comparable. In other words, you have to use Vogt's algorithm rather than Zhao and Dong's. |
0:20:48 | Adaptation of the UBM to the lexical content still works very well: there's a fifty percent error rate reduction compared with the ICASSP paper. |
0:21:02 | There's a follow-on paper at Interspeech which shows how this idea of adaptation to phrases can be extended to give a simple procedure for domain adaptation, so that you can train JFA on, say, text-independent data and use it on a text-dependent task domain. |
0:21:28 | And finally, these z vectors, at least on the RSR data, are very good features: there is no residual channel variability left to model in the backend. Okay, thank you. |