0:00:14 | Hi, everyone. |
---|
0:00:16 | I'm not going to apologise for the quality of my slides, because I just don't think it really matters. |
---|
0:00:23 | If you make them too refined, people seem to be trying to compensate for something. That's my philosophy. |
---|
0:00:33 | Okay. So this talk is actually very similar to some of the others in this session, and it's a strange session in that almost all of these talks are about some kind of faster fMLLR, with some kind of factorization. |
---|
0:00:49 | Now, one thing I was going to bring up: should we call it CMLLR, or should we call it fMLLR? |
---|
0:00:58 | For some reason I've gone over recently to the CMLLR side, but I'm hedging my bets in the actual talk. |
---|
0:01:06 | I think everyone really understands that they're the same thing. |
---|
0:01:12 | I'm not going to go through this slide in detail, because it's the same as many slides that have been presented already. |
---|
0:01:19 | The notation is a little bit different, though. This notation with a little plus is kind of my personal notation. |
---|
0:01:28 | I use it because I'm not comfortable with the Greek letter xi; it's hard for me to remember that xi is supposed to be the same as x. I just think it's easier to remember that the little plus means "append a one". In some people's work it's a zeta. |
---|
0:01:52 | That's also another slightly confusing difference between people's notations: sometimes people put the one at the end, and sometimes at the beginning. I think the IBM habit is to put it at the end, and that's what I'm doing. |
---|
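To make the "little plus" notation concrete, here is a minimal sketch in Python/NumPy (the dimensions and variable names are illustrative, not from the slides): appending a one at the end lets the affine transform Ax + b be written as a single matrix product W x+, with W = [A b].

```python
import numpy as np

dim = 39                            # e.g. 39-dimensional features
A = np.random.randn(dim, dim)       # square part of the transform
b = np.random.randn(dim)            # bias term

W = np.hstack([A, b[:, None]])      # W = [A b], a dim x (dim+1) matrix

x = np.random.randn(dim)            # a feature vector
x_plus = np.append(x, 1.0)          # x+ : x with a one appended at the end

# The affine transform A x + b becomes a single matrix-vector product:
assert np.allclose(W @ x_plus, A @ x + b)
```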
0:02:08 | This is another kind of introductory slide, but since we've had so many introductions to CMLLR, or fMLLR, in this session, I don't really think I need to go through it. |
---|
0:02:24 | One point I did put in here is that fMLLR works with relatively little adaptation data. |
---|
0:02:35 | Now, here, "a little adaptation data" means about thirty seconds or so, and someone else in this session actually mentioned that figure: after about thirty seconds or so you get almost all of the improvement that you're going to get. |
---|
0:02:48 | But thirty seconds is a little bit too much for many practical applications. If you have some telephone service where someone is going to, say, request a stock quote or do a web search or something, you might only have two or three seconds, or maybe five seconds, of audio, and that's not really enough for fMLLR to work. |
---|
0:03:09 | In fact, below about five seconds, in my experience, it's not going to give you an improvement, and it may actually make things worse, so you might as well turn it off. |
---|
0:03:20 | So that's the problem that this talk is addressing, and actually it's the same problem that many previous talks in the session have been addressing. I think all of the previous talks in the session have been addressing this problem. |
---|
0:03:36 | Okay, this slide summarises some of the prior approaches, and I should emphasise that I'm talking about the prior approaches to somehow regularising CMLLR. |
---|
0:03:49 | Obviously there are many other things you can do, like eigenvoices, the stuff that the previous speaker mentioned, or adapting other parameters, but I'm talking about CMLLR regularization. |
---|
0:04:02 | So, a simple thing you can do is just make the A matrix diagonal. That's a kind of an option in HTK, and it's a reasonable approach, you get a lot of improvement, but it's very ad hoc. |
---|
0:04:16 | You can also make it block diagonal. This approach had its origins in the era of the delta and delta-delta type features, so you'd have three blocks of thirteen by thirteen. |
---|
0:04:29 | Nobody really uses those features unprocessed anymore; I don't think any serious site uses them anymore without some kind of transformation. But you can still use the block-diagonal structure. In fact, one of the baselines we'll be presenting uses these blocks, even though they've lost their original meaning, and it still seems to work. |
---|
0:04:51 | Another set of approaches is Bayesian. There have been a couple of different papers, both called fMAPLR. |
---|
0:05:00 | fMAPLR means we do fMLLR, but we have a prior and we pick the MAP estimate. There's been a paper from Microsoft and one from IBM; they had slightly different priors, but it was the same basic idea. This is one of the baselines we're going to be using in our experiments. |
---|
0:05:19 | yeah |
---|
0:05:21 | a an issue with these approaches is that |
---|
0:05:24 | you you you probably like to have |
---|
0:05:27 | a prior that tells you how all of the rows of the transform colour late with each other |
---|
0:05:32 | but in practice that's not really uh do able |
---|
0:05:35 | so |
---|
0:05:36 | people we generally see priors |
---|
0:05:39 | the are ride the row by row or even completely diagonal |
---|
0:05:42 | so it's the prior over each individual parameter |
---|
0:05:46 | yeah |
---|
0:05:47 | The approach that we're using is parameter reduction using a basis, where the fMLLR matrix is some kind of weighted sum of basis matrices, or prototype matrices. |
---|
0:06:04 | So, the basic idea is similar to some of the previous talks. Sometimes people have a factorization and then a basis on, you know, the upper and lower halves or something like that, but we're talking about a basis expansion of just the raw transform. |
---|
0:06:24 | The basic form of it is given here, where the W subscript n are the kind of prototype, or generic, fMLLR matrices, and they're computed in advance somehow. For a given speaker, you have to estimate these coefficients. |
---|
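As a sketch of that expansion (the symbols here are my own, with a "default" transform W0 plus per-speaker coefficients, so that W(s) = W0 + sum_n d_n(s) W_n; the exact parameterization on the slide may differ):

```python
import numpy as np

dim, N = 39, 200                    # feature dim and basis size (illustrative)

# Prototype/basis matrices W_n, computed in advance somehow (random here).
basis = [np.random.randn(dim, dim + 1) for _ in range(N)]
# "Default" transform corresponding to no adaptation: [I 0].
W0 = np.hstack([np.eye(dim), np.zeros((dim, 1))])

def speaker_transform(d):
    """Build W(s) = W0 + sum_n d_n W_n from per-speaker coefficients d."""
    W = W0.copy()
    for d_n, W_n in zip(d, basis):
        W += d_n * W_n
    return W

d = 0.1 * np.random.randn(N)        # coefficients, estimated at test time
W_s = speaker_transform(d)          # this speaker's 39 x 40 transform
```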
0:06:44 | Now, this is not a really convex problem, but it's solvable in a kind of local sense, and I don't think that's a practical issue. |
---|
0:06:57 | In the previous work in this area, you decide the basis size in advance. So you decide, let's say, that we're going to make it two hundred. The number of parameters in the actual matrix, which is thirty-nine by forty, is like a couple of thousand. |
---|
0:07:15 | If you decide in advance that we're going to use two hundred coefficients, that does pretty well for typical configurations, if you have, you know, between ten and thirty seconds of speech. |
---|
0:07:26 | But you're going to get a degradation once you have a lot of data, because you're not really estimating all of the parameters that you could estimate. It eventually gets a bit worse when you have a lot of adaptation data. So that's the closest prior work. |
---|
0:07:43 | So, there are a couple of differences between what we're describing here and this prior work, which was done at IBM. One is that we allow the basis size to vary per speaker. |
---|
0:07:57 | We have a very simple rule: we just say the more data we have, the more coefficients we can estimate, and we make the number of coefficients proportional to the amount of data, as in the sketch below. |
---|
0:08:08 | Of course, you could do all kinds of fancy stuff with information criteria and so on, but I think this technique is easily complicated enough already without introducing new aspects, so we just picked a very simple rule. |
---|
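A minimal sketch of that proportional rule; the constant and the cap below are made-up values for illustration, not the ones from the paper.

```python
import math

def basis_size(num_frames, coeffs_per_frame=0.2, max_size=780):
    """Per-speaker basis size: proportional to the amount of adaptation
    data, capped at the full basis size. Constants are illustrative."""
    return min(max_size, int(math.ceil(coeffs_per_frame * num_frames)))

# e.g. 3 seconds of speech at 100 frames/second -> 60 coefficients
print(basis_size(300))
```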
0:08:20 | The other aspect is that we have a cheap way of estimating these basis matrices W_n that's a little bit more clever than just doing PCA. |
---|
0:08:33 | And lastly, we're just trying to popularise this type of method. We have a journal version of this paper in which we've tried to explain very clearly how to implement it, because it really does work, and it's very robust and everything, so it's something that I do recommend to you. |
---|
0:09:00 | This slide probably covers material that I've already discussed, so I'm going to go to this slide. |
---|
0:09:09 | The stuff like making it diagonal or block diagonal: that's all very well and good, but it's just a bit ad hoc, and it doesn't give you that much improvement. |
---|
0:09:20 | Also, you always have to decide: how much data do we have, can we afford to do the full one, or are we going to make it diagonal? There's this trade-off, and you can get into having count cutoffs and stuff, but it's a bit of a mess. |
---|
0:09:34 | yeah |
---|
0:09:35 | i to have anything and in methods you know if we could of done this in the bayesian way |
---|
0:09:40 | that's probably |
---|
0:09:42 | i think that's what more optimal because by picking a basis |
---|
0:09:45 | size you kind of making a hard decision |
---|
0:09:48 | a with the bayesian method you could do that and a soft way |
---|
0:09:52 | and you know i think making a soft decision was always better than a hard decision |
---|
0:09:56 | but |
---|
0:09:57 | the the problem with the bayesian approach is first is very hard to estimate the prior |
---|
0:10:02 | because |
---|
0:10:04 | the whole reason may in the situation as we don't have a ton of data per speaker |
---|
0:10:08 | right and |
---|
0:10:09 | assuming a training data is matched just T testing condition |
---|
0:10:13 | not gonna have a lot of data from you training data to estimate the |
---|
0:10:17 | the uh W major sees |
---|
0:10:19 | so how you gonna estimate of prior because you don't have good estimates of the |
---|
0:10:23 | uh |
---|
0:10:24 | the things are trying to get a prior on and a "'cause" you can do all of these |
---|
0:10:28 | no easy in schemes where you integrate and stuff but it just becomes a a big head a |
---|
0:10:32 | plus was always somebody choice choices to make |
---|
0:10:35 | but there is i like this basis type of method is you just a we can use a basis |
---|
0:10:40 | and everything just falls out and it's obvious what to do |
---|
0:10:44 | So, I want to talk about how to estimate the basis. Because we're going to decide the number of coefficients at test time, we need a kind of ordered basis, where you have the most important elements first and the least important elements last. |
---|
0:11:03 | If we were just going to say we're going to have N equal to two hundred and that's it, then it wouldn't really matter whether they were all mixed up, what order they were in. But because we're going to decide the number of coefficients at test time, we need to have this ordering. |
---|
0:11:19 | Now, approaches like PCA or, you know, SVD, those kinds of approaches, do actually give you this ordering. |
---|
0:11:30 | But I'm not very comfortable just saying we're going to do PCA, because, you know, who's to say that that makes sense? |
---|
0:11:38 | One obvious argument why it doesn't make sense is that if you were to rescale the different dimensions of your feature vector, that's going to change the solution that PCA gives you. I mean, it's going to change it in a way that's basically going to affect your decoding, and to me that says it's not the right thing to do. |
---|
0:12:04 | The framework that I think is most natural is maximum likelihood: we're going to try to pick the basis that maximises the likelihood on test data. |
---|
0:12:17 | I don't think I have time to go through the whole argument about what we're doing, and I wasn't planning to, but basically we end up using PCA, just in a slightly preconditioned space. |
---|
0:12:35 | So, W is a thirty-nine by forty matrix, typically, but we want to consider the correlations between the rows, and it's not really convenient to think of it as a matrix. Let's think of it as one big vector of size thirty-nine times forty, by concatenating the rows. |
---|
0:12:53 | Now, I don't know how well this argument will come across, but think about the objective function for each speaker. |
---|
0:13:05 | If that objective function were a quadratic function, and if we could somehow do a change of variables so that the quadratic part of that function were just proportional to the unit matrix, then it's possible to show that the kind of right solution is just doing weighted PCA. |
---|
0:13:23 | There's some kind of derivation there; I mean, it might be obvious to some people how you derive that, but let's just take it as given for now that that's true. |
---|
0:13:34 | So what we want to do is a change of variables so that the objective function is quadratic for each speaker, and of course it's not quite possible to do that, for a couple of reasons. |
---|
0:13:50 | First, the objective function is not really quadratic: it has this log-determinant and it's quite nonlinear. |
---|
0:13:56 | Secondly, you can take a Taylor series approximation around the kind of default matrix, the one with A equal to the identity and b equal to zero, and take the quadratic term in that Taylor series. But remember, W here is a big vector, so that quadratic term is a big matrix, maybe two thousand by two thousand, and it depends on the speaker's data, so it's not just a constant. |
---|
0:14:27 | But I don't think that matters in an important way. |
---|
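For reference, the standard CMLLR/fMLLR auxiliary function for one speaker makes the log-determinant nonlinearity explicit; the statistics notation here follows the usual fMLLR literature (beta is the frame count, w_i the i-th row of W, and k_i, G_i the linear and quadratic statistics) and is my addition, not the slide's:

```latex
\mathcal{Q}(\mathbf{W}) \;=\; \beta \log \lvert \det \mathbf{A} \rvert
  \;+\; \sum_{i=1}^{d} \Big( \mathbf{w}_i \mathbf{k}_i^{\top}
  \;-\; \tfrac{1}{2}\, \mathbf{w}_i \mathbf{G}_i \mathbf{w}_i^{\top} \Big),
\qquad \mathbf{W} = [\,\mathbf{A} \;\; \mathbf{b}\,].
```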
0:14:31 | So, it is possible to make an argument that once we work out the average of these quadratic terms, and then precondition so that that average is the unit matrix, then for each speaker the quadratic-term matrix is approximately unit. |
---|
0:14:56 | It's a situation where we'd like it to be unit, but it's not going to make a big difference if it's not quite unit, because all this is doing is a pre-transform: we pre-transform everything and then do PCA. |
---|
0:15:11 | And if you don't get the pre-transform quite right, if the rotation isn't quite correct or the preconditioning isn't accurate, it's not going to totally change the result of the PCA. What's going to happen is that, let's say, the first eigenvector gets mixed up a little bit with the second, and so on. So it's pretty close to maximum likelihood. |
---|
0:15:36 | Now, I think I've covered this material. The training-time computation basically involves computing big matrices like this and doing, like, an SVD and stuff; it's all described in the journal paper. |
---|
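A rough sketch, under my own simplifying assumptions, of that training-time recipe: vectorize per-speaker statistics, precondition by the averaged quadratic term so that it becomes the unit matrix, then take the leading principal directions. The real recipe, with the exact statistics and the SVD, is in the journal paper; the stand-in quantities below just keep the sketch runnable.

```python
import numpy as np

rows, cols = 39, 40
D = rows * cols                        # vectorized transform dimension
S = 100                                # number of training speakers

# Stand-ins for the real per-speaker statistics: one gradient-like vector
# per speaker, and an averaged quadratic term (Hessian) H_avg.
grads = np.random.randn(S, D)
H_avg = np.eye(D)                      # in reality, averaged over speakers

# Precondition: change variables so the average quadratic term is unit.
L = np.linalg.cholesky(H_avg)
g_tilde = np.linalg.solve(L, grads.T).T    # stats mapped into the new space

# PCA in the preconditioned space: eigenvectors of the scatter matrix,
# ordered by eigenvalue, so the most important basis elements come first.
scatter = g_tilde.T @ g_tilde
eigvals, eigvecs = np.linalg.eigh(scatter)
order = np.argsort(eigvals)[::-1]

# Map the leading directions back and reshape into basis matrices W_n.
lead = np.linalg.solve(L.T, eigvecs[:, order[:200]])
basis = [lead[:, n].reshape(rows, cols) for n in range(200)]
```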
0:15:54 | At test time there's an iterative update to estimate the coefficients, the d subscript n. It's not a convex problem, but it's pretty easy to get to a local optimum: you just do, like, steepest ascent, and you do a pretty exact line search. It's all described in the paper. |
---|
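And a schematic of that test-time loop; the objf and grad arguments here are placeholders standing in for the real fMLLR auxiliary function and its gradient, which I'm not reproducing.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_coeffs(objf, grad, n, num_iters=10):
    """Steepest ascent with a line search over the first n basis
    coefficients. objf(d) and grad(d) are placeholders for the real
    auxiliary function and its gradient."""
    d = np.zeros(n)
    for _ in range(num_iters):
        direction = grad(d)                        # steepest-ascent direction
        # Line search: maximise objf(d + s * direction) over the step s.
        step = minimize_scalar(lambda s: -objf(d + s * direction)).x
        d = d + step * direction
    return d

# Toy usage with a quadratic stand-in objective (maximum at d = 1):
f = lambda d: -0.5 * np.sum((d - 1.0) ** 2)
g = lambda d: 1.0 - d
print(estimate_coeffs(f, g, n=5))      # converges near [1, 1, 1, 1, 1]
```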
0:16:17 | So, this results slide requires a little bit of explanation. We have two kinds of test data: one is short utterances and one is long. One is like a digits type of task, stock quotes and stuff, and one is voicemail. |
---|
0:16:34 | We divide each of those into four subsets based on the length, and the x axis is the length of the utterance, so ten to the zero is one second and ten to the one is ten seconds. |
---|
0:16:48 | Each of the points on these lines corresponds to a bin of test data. So we divide it up, and the buckets on the left-hand side are short utterances, and on the right they're long. And each point is a relative improvement. |
---|
0:17:13 | The triangles, that triangle line on the bottom, that's just regular fMLLR, and it's actually making things worse for the first three bins; then it helps a bit. |
---|
0:17:26 | The absolute word error rate kind of jumps up and down a bit, because it's different types of data, so this is maybe not the ideal data to test this on; you have to look at the relative improvements. |
---|
0:17:42 | The very top line is our method, which is doing the best, so it's giving a lot more improvement than the other methods. The fMAPLR, the three-block, and the diagonal ones are a bit better than doing regular fMLLR; they get some improvement for the shorter amounts of data, but we get more improvement from our method. |
---|
0:18:05 | So the story is that if you have, let's say, between about three and ten seconds of data, I think this method will be a big improvement versus doing fMAPLR or diagonal or whatever. |
---|
0:18:21 | But if you have, let's say, more than thirty seconds, it really doesn't make a difference. |
---|
0:18:27 | So I think I've pretty much covered all of this, and I'm being asked to wrap up. |
---|
0:18:34 | I think we've covered all of this. So, I recommend that journal paper if you want to implement this, because I do describe very clearly how to do it, and I think this stuff does work. Okay. |
---|
0:18:49 | Any questions? |
---|
0:18:57 | Pretty stunned. |
---|
0:19:03 | Okay, well, we'll close the session there. Okay, thanks. Right. |
---|