0:00:13 | I am [name unintelligible], the session chair. One second for an advertisement: if you see people wearing a [unintelligible], you may ask them about it. |
---|
0:00:25 | Okay. |
---|
0:00:26 | We're going to start off. The first paper is "Front-End Feature Transforms with Context Filtering for Speaker Adaptation". |
---|
0:00:35 | The paper is by [authors' names unintelligible], and it will be presented by [name unintelligible]. |
---|
0:00:56 | Okay, so the topic is front-end feature transforms with context filtering for speaker adaptation. |
---|
0:01:03 | So, here is the outline of the talk. First I'll briefly motivate it relative to other work and explain, basically, what we're trying to accomplish. |
---|
0:01:14 | Then I'll give an overview of the new technique, called maximum likelihood context filtering. |
---|
0:01:21 | And then we'll move straight into some experiments and results to see how it works. |
---|
0:01:26 | Okay, so the topic is front-end speaker adaptation. |
---|
0:01:31 | In terms of front-end transforms, we usually do linear transforms or conditionally linear transforms. Perhaps the most popular technique is feature-space MLLR, maybe more popularly named constrained MLLR. |
---|
0:01:46 | And of course there are discriminative techniques that have been developed, and nonlinear transformations. |
---|
0:01:53 | Some variants of fMLLR that have been worked on in recent years are quick fMLLR and full-covariance fMLLR. |
---|
0:02:03 | So today I'll tell you about another variant of fMLLR. |
---|
0:02:07 | But first, let's review fMLLR. The idea is: you're given a set of adaptation data, and you want to estimate a linear transformation A and a bias b, which can be concatenated into the single matrix W. |
---|
0:02:25 | So the key point about fMLLR, for the purposes of this talk, is that the A matrix is square — D by D in the notation used here — and that makes it particularly easy to learn. |
---|
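As a minimal sketch of the setup just described (dimensions and values here are illustrative, not taken from the paper), the square transform A and the bias b can be concatenated into one matrix W and applied to the feature vector augmented with a trailing one:

```python
import numpy as np

D = 3  # feature dimension (40 in the talk; small here for illustration)
rng = np.random.default_rng(0)

A = np.eye(D) + 0.01 * rng.standard_normal((D, D))  # square D x D transform
b = rng.standard_normal(D)                          # bias
W = np.hstack([A, b[:, None]])                      # D x (D+1), W = [A  b]

x = rng.standard_normal(D)
x_aug = np.append(x, 1.0)  # augment with 1 so that W @ x_aug = A @ x + b
y = W @ x_aug
```

The augmented-vector trick is what lets the transform and bias be estimated as one matrix.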
0:02:41 | So, of course, the main thing you need to deal with when you apply these transforms is the volume-change compensation. |
---|
0:02:50 | In the case of a linear transformation it's just the log-determinant of A — in red, you see it in our objective function Q. |
---|
0:02:57 | The second term there is just the typical term you see: it has the posterior probability of all the components of the acoustic model you're evaluating — that's gamma, subscripted by j, for each Gaussian. |
---|
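A hedged reconstruction of the objective being described (the exact constants are assumptions consistent with standard fMLLR write-ups, not read off the slide):

```latex
Q(\mathbf{W}) \;=\; \beta \,\log\bigl|\det \mathbf{A}\bigr|
\;-\; \tfrac{1}{2}\sum_{j}\sum_{t} \gamma_j(t)\,
\bigl(\mathbf{W}\bar{\mathbf{x}}_t - \boldsymbol{\mu}_j\bigr)^{\top}
\boldsymbol{\Sigma}_j^{-1}
\bigl(\mathbf{W}\bar{\mathbf{x}}_t - \boldsymbol{\mu}_j\bigr) \;+\; \text{const}
```

where $\bar{\mathbf{x}}_t = [\mathbf{x}_t^{\top}, 1]^{\top}$ is the augmented feature, $\gamma_j(t)$ are the Gaussian posteriors, and $\beta = \sum_{j,t}\gamma_j(t)$.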
0:03:13 | Okay, so when we start to think about the non-square case, what do we need to do? First, let's set up the notation. |
---|
0:03:23 | We use the notation x-hat of t to denote a vector with context; in this case the context size is one, so x of t minus one, t, and t plus one are concatenated to make x-hat of t. |
---|
0:03:36 | So the model is y of t equals A times x-hat of t plus b. We can condense this notation into the form W. |
---|
0:03:47 | The main difference here is that A is not square: in this case it is D by three-D, because the output y has the original dimension of the input x, but x-hat is three-D-dimensional. |
---|
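The context concatenation just described can be sketched as follows (the edge-padding choice is an assumption; the talk does not say how frame boundaries are handled):

```python
import numpy as np

def stack_context(X, n=1):
    """Concatenate each frame with n frames of left and right context.

    X: (T, D) array of feature frames.  Returns a (T, (2n+1)*D) array;
    edges are handled by repeating the first/last frame (an assumption).
    """
    T, D = X.shape
    padded = np.vstack([X[:1]] * n + [X] + [X[-1:]] * n)
    return np.hstack([padded[i:i + T] for i in range(2 * n + 1)])

X = np.arange(12, dtype=float).reshape(4, 3)  # T=4 frames, D=3
X_hat = stack_context(X, n=1)                 # rows are [x_{t-1}, x_t, x_{t+1}]
```

With n = 1 each row of `X_hat` is the 3D-dimensional vector that the D x 3D matrix A acts on.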
0:04:01 | So how do we estimate a non-square matrix by maximum likelihood? |
---|
0:04:08 | An important point is that there's no direct, obvious way to do this, and that's because you're changing the dimension of the space, so there is no determinant volume term you can use in a straightforward manner to accomplish this. |
---|
0:04:25 | So let's go back and look at how we get that term. Basically, what you say is that the log-likelihood under the transformation, of y, is equal to the log-likelihood of the input variable up to a constant — that is your Jacobian term. |
---|
0:04:43 | So in the case where you assume that A is square, you can readily confirm that the term is half the log ratio of the determinants of the input and output models, assuming they are Gaussian. |
---|
0:04:57 | so this slide is just showing how you would ride that |
---|
0:05:00 | there's L X |
---|
0:05:02 | a gaussian |
---|
0:05:03 | L Y |
---|
0:05:04 | i when you to |
---|
0:05:05 | are are get as you know the when your transform data |
---|
0:05:08 | and essentially you quite them the fine what C is in you find that |
---|
0:05:11 | it is the log ratio |
---|
0:05:13 | a a as we started before |
---|
0:05:15 | so on the bottom line and read |
---|
0:05:18 | if you break down that not show you see that uh the covariance of |
---|
0:05:22 | D variable Y |
---|
0:05:24 | uh |
---|
0:05:25 | this is just a known uh identity it's a a |
---|
0:05:29 | a a signal X transport |
---|
0:05:31 | transpose |
---|
0:05:32 | a transpose |
---|
0:05:34 | E |
---|
0:05:34 | a a signal X |
---|
0:05:35 | at |
---|
0:05:36 | transpose |
---|
0:05:37 | um so the compensation term ends up being log determine a |
---|
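A small numerical sketch of that identity and the resulting compensation term (the dimensions and the one-half scaling are illustrative assumptions consistent with the Gaussian log-likelihood):

```python
import numpy as np

rng = np.random.default_rng(1)
D, Dc = 3, 9                       # output dim D and context-expanded dim 3D

A = rng.standard_normal((D, Dc))   # non-square D x 3D transform
M = rng.standard_normal((Dc, Dc))
Sigma_xhat = M @ M.T + Dc * np.eye(Dc)  # a full-covariance estimate (SPD)

# Known identity: Sigma_y = A Sigma_xhat A^T.  The volume-compensation
# term, up to constants independent of A, is 1/2 * log det(A Sigma_xhat A^T).
Sigma_y = A @ Sigma_xhat @ A.T
sign, logdet = np.linalg.slogdet(Sigma_y)
compensation = 0.5 * logdet
```

Note that although A itself has no determinant, the D x D product A Sigma-x-hat A-transpose does, which is what makes the non-square case tractable.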
0:05:42 | So in our case, we're going to assume that the compensation term remains the same. We'll drop the log-determinant of Sigma-x-hat term, because it does not depend on A, and we're left with the log-determinant of A Sigma-x-hat A-transpose term, which reduces to the term we had in the case that A was square. |
---|
0:06:04 | So the modified objective becomes the following. And one point is: what is this Sigma-x-hat that is used? Well, what they did is use a full-covariance approximation over all the speech features to come up with that full-covariance Sigma-x-hat, and used it in this objective to learn A. |
---|
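Putting the pieces together, the modified objective sketched above would take roughly this form (the one-half and beta factors are assumptions chosen to stay consistent with the square-case objective earlier):

```latex
\tilde{Q}(\mathbf{W}) \;=\; \tfrac{\beta}{2}\,\log\det\bigl(\mathbf{A}\,\hat{\boldsymbol{\Sigma}}_{\hat{x}}\,\mathbf{A}^{\top}\bigr)
\;-\; \tfrac{1}{2}\sum_{j}\sum_{t} \gamma_j(t)\,
\bigl(\mathbf{W}\hat{\mathbf{x}}_t - \boldsymbol{\mu}_j\bigr)^{\top}
\boldsymbol{\Sigma}_j^{-1}
\bigl(\mathbf{W}\hat{\mathbf{x}}_t - \boldsymbol{\mu}_j\bigr)
```

with $\mathbf{A}$ now $D \times 3D$ and $\hat{\boldsymbol{\Sigma}}_{\hat{x}}$ the full-covariance approximation of the context-expanded features.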
0:06:30 | Okay, so in terms of optimizing this modified objective: the statistics that you need are of the same form as in the square case, though of course the sizes are different. |
---|
0:06:41 | The two main quantities that you need to optimize the objective are the ability to evaluate the objective Q and the derivative of the objective. |
---|
0:06:52 | The row-by-row iterative update that people normally use cannot be applied here — at least it's not obvious how to do it. We're looking at it now, and there are some ways to do something very similar. |
---|
0:07:07 | But for the purposes of this paper, a general gradient-based optimization package was used, the HCL package. As I mentioned before, it just needs the function and its gradient available at any point that the optimizer wants to evaluate. |
---|
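The HCL package itself is not sketched here; instead, here is a toy numpy-only gradient ascent on a simplified one-Gaussian, unit-covariance version of the objective, just to illustrate that evaluating Q and its gradient is all such a package needs (every value below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
D, Dc, T = 2, 6, 200

X_hat = rng.standard_normal((T, Dc))         # context-expanded features (T x 3D)
mu = np.zeros(D)                             # single Gaussian, unit covariance
Sigma = np.cov(X_hat.T) + 1e-3 * np.eye(Dc)  # full-covariance estimate of x-hat

def Q(A):
    """Simplified objective: volume compensation minus the quadratic term."""
    resid = X_hat @ A.T - mu
    _, logdet = np.linalg.slogdet(A @ Sigma @ A.T)
    return 0.5 * T * logdet - 0.5 * np.sum(resid ** 2)

def grad_Q(A):
    resid = X_hat @ A.T - mu                                 # (T, D)
    term1 = T * np.linalg.solve(A @ Sigma @ A.T, A @ Sigma)  # d/dA of logdet part
    return term1 - resid.T @ X_hat                           # minus quadratic part

# initialize: identity on the current frame, zeros on the context frames
A = np.hstack([np.zeros((D, D)), np.eye(D), np.zeros((D, D))])
q0 = Q(A)
for _ in range(200):
    A = A + 1e-4 * grad_Q(A)  # plain gradient ascent; HCL would converge faster
```

A quasi-Newton method would be the realistic choice; the fixed-step loop only shows the interface — a function value and a gradient at any query point.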
0:07:27 | Okay, so that's essentially the method. I'll try to leave some time for questions at the end, if there are any more details that are requested. |
---|
0:07:38 | Um, so moving right on to training data and models. The training data for the task we evaluated this technique on was collected in stationary noise; there's about eight hundred hours of it. |
---|
0:07:49 | A word-internal model with within-word phone context, eight hundred thirty context-dependent states and ten-K Gaussians, was trained. |
---|
0:08:01 | And the technique was tested on LDA forty-dimensional features, on models built using maximum likelihood and bMMI, and on a model with an fMMI transformation applied before we apply context filtering. |
---|
0:08:21 | In terms of test data: it was recorded in a car at three different speeds — zero, thirty and sixty miles per hour. There were four tasks: addresses, digits, commands and radio control. That's about twenty-six-K utterances and a total of a hundred and thirty thousand words. |
---|
0:08:40 | Here is the SNR distribution of this data in terms of speed. You can see that most of the noise is obtained for the sixty-miles-per-hour data, and for that data basically half of the data is below, say, twelve and a half dB. We estimated the SNR using a forced alignment. |
---|
0:09:05 | okay so for experiments |
---|
0:09:07 | uh a context filtering was tried for speaker adaptation |
---|
0:09:11 | training speaker dependent uh |
---|
0:09:13 | a a that being uh |
---|
0:09:15 | the canonical model |
---|
0:09:16 | so uh a and all C a and just a little uh nomenclature here |
---|
0:09:21 | it is uh |
---|
0:09:22 | maximum likelihood context filtering with context size and |
---|
0:09:26 | so one would be plus or minus one |
---|
0:09:29 | aim is included in the context |
---|
0:09:31 | when computing the transform |
---|
0:09:35 | So, for all the experiments, the transform was initialized with identity with respect to the current frame's parameters, and the side frames were initialized to zeros. |
---|
0:09:50 | Just for reference, they also tried using, for the centre part of the matrix, the fMLLR that was estimated using the usual technique. |
---|
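A minimal sketch of that initialization (dimensions are illustrative): identity on the centre block, zeros on the side blocks, so before any optimization the transform simply passes the current frame through.

```python
import numpy as np

D, n = 4, 1   # feature dimension and context size (illustrative values)
blocks = [np.zeros((D, D))] * n + [np.eye(D)] + [np.zeros((D, D))] * n
A0 = np.hstack(blocks)   # D x (2n+1)D: identity on the current-frame block
b0 = np.zeros(D)

# With this initialization the transform is a pass-through of the current frame.
x_hat = np.tile(np.arange(D, dtype=float), 2 * n + 1)  # [x_{t-1}; x_t; x_{t+1}]
y = A0 @ x_hat + b0
```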
0:10:03 | Okay, so in terms of results — I'll skip ahead to that. Clearly fMLLR brings a lot over the baseline on this data, and when you turn on context filtering you actually get some significant gains in the sixty-miles-per-hour column — you can see them highlighted in red. |
---|
0:10:22 | So this is actually a twenty-three percent relative gain in word error rate, and thirty percent in sentence error rate, over fMLLR. |
---|
0:10:31 | The other point here is that starting with fMLLR and then adapting actually doesn't give you any advantage over starting with an identity matrix. |
---|
0:10:41 | This plot is just showing how performance varies with the amount of data you provide to the transform estimation. |
---|
0:10:50 | So we can see that the relative degradation in performance when you have less data — in this case ten utterances versus all utterances, and I believe "all" is a hundred in this case — is actually smaller. |
---|
0:11:06 | I think the argument here is that you're using context, so you can do some averaging of the data you see, and that effectively regularizes the estimation to some extent — although there are more parameters to estimate, so it's kind of counter-intuitive, I think. |
---|
0:11:28 | Okay, this is just a picture of a typical fMLLR transform estimated using our system; it's for the most part diagonal. |
---|
0:11:38 | And this is the corresponding one-frame-of-context context-filtering transform. You can see that, interestingly, it's not symmetric: the mapping from the previous frame to the current frame is almost diagonal, and so is the current-to-current frame mapping, |
---|
0:11:58 | but the contribution of the future frame looks kind of random. |
---|
0:12:05 | One thing to keep in mind is that there is actually a whole subspace of almost-equivalent solutions to this problem, so it's not clear if this is an artifact of the optimization package — perhaps of the order in which it optimizes the subspace, and whatnot. |
---|
0:12:28 | Okay, so here are more results, collected using a bMMI model. And again we're seeing some significant gains over fMLLR — about a ten percent relative improvement on the sixty-miles-per-hour data. |
---|
0:12:47 | Once again, when we train an fMMI transform and then apply context filtering, we're still actually getting some gains: it's about a nine percent relative sentence-error-rate reduction over fMLLR. |
---|
0:13:06 | Okay, so to summarize: MLCF extends the full-rank square-matrix technique called fMLLR to non-square matrices, and there are some very nice gains on some pretty good systems — built using LDA, bMMI and fMMI — when we apply this technique. |
---|
0:13:26 | So, in terms of future work: trying a discriminative objective function is something that I think they're looking at, of course. |
---|
0:13:38 | Another question is how this technique interacts with traditional noise-robustness methods like spectral subtraction, dynamic noise adaptation, et cetera. |
---|
0:13:48 | Okay, so that's all I have. Hopefully this leaves some time for questions. |
---|
0:14:00 | So, the plot you have for improvement showed that you had to have ten utterances for each speed. How do you do that in a practical sense — are you going to keep track? |
---|
0:14:10 | Yeah. Let me just go to the slide. I mean, this is just investigating the amount of data that is needed for the transform to be effective. |
---|
0:14:21 | So this is useful in the sense that if you need to enroll a speaker, for example on a cell phone, he only needs to talk for ten utterances — and by the way, that's a good point: each utterance is only about three seconds. |
---|
0:14:34 | So we're talking about, you know, thirty seconds of data and we're already almost completely adapted to the speaker — as opposed to fMLLR, which actually seems to need about thirty utterances to be at that stage. |
---|
0:14:58 | Could we get a microphone for the third one, right there? |
---|
0:15:03 | So from this chart — excuse me — you're working on utterances collected at sixty miles an hour, or similar speech. In a real scenario people drive slow or on the highway, so over a sequence the SNR varies; it's not this scenario. Have you tested that scenario? |
---|
0:15:25 | Yeah, they didn't consider that in this work; this is a block optimization of the matrix, actually. So you just take a section of speaker data and see how many utterances are required to get decent gains. But that's certainly an important problem. |
---|
0:15:49 | Two more questions? |
---|
0:15:52 | I have a quick one. I was actually kind of interested, when you were looking at the results for context sizes one and two: did they actually do a visualization of the two contexts? I found the visual interesting, and I was wondering if they looked different. |
---|
0:16:16 | Oh, I see. Right — I think this is one of the only ones they actually looked at. I was very curious about that myself, so I put them in at the last moment, actually. |
---|
0:16:32 | Yeah, that's very true; it would certainly be interesting. But for the experiments they did, they found that performance was saturating at about a left and right context of two. So I think that asymmetry is something for future investigation and understanding. |
---|
0:16:55 | Let's thank the speaker. |
---|
0:17:00 | Maybe we're going to need a minute to set up the next speaker, so... |
---|