0:00:18 | Uh-huh. |
0:00:22 | So there was some overlap between that talk and what I am going to talk about. Someone used to say that presentations are like painting a wall: it usually takes a couple of coats, and for some people it never sticks, and you know who you are. |
0:00:38 | So, basically, to start off: the purpose of the interest in this work is investigating techniques for forming subspaces over acoustic model parameters. The obvious aspect of that is that we are after more efficient characterisations of sources of variability, by representing those sources of variability in low-dimensional spaces. |
0:01:00 | So, in, say, speaker adaptation and speaker verification, there are clear advantages to forming subspaces that describe global variability over these supervectors, which correspond to the concatenated means of continuous-density Gaussian HMMs and GMMs. Examples go back to the eigenvoice approaches in speaker adaptation, which showed the benefits of forming these global subspaces, and they extend, obviously, into speaker verification, where things like joint factor analysis also benefit from these low-dimensional subspaces formed over these giant supervectors. |
0:01:47 | One of the more interesting developments that has occurred recently is the notion of extending these supervector-based techniques to modelling state-level variability, and we do that by basically forming multiple subspaces, rather than the single subspace that we form for these supervectors. This is the idea of the subspace Gaussian mixture model, proposed by Povey and colleagues in 2010. What we are doing is an experimental study to compare the ASR performance of both of these types of approach. |
0:02:29 | I guess it is important to note again that we are forming subspaces over model parameters. This is a simple example where we form these subspaces over supervectors in a speaker adaptation problem. In this little example we have a population of N speakers, we have data from those N speakers, and we have a little plot, in some arbitrary two-dimensional space, of the feature vectors of those speakers. We can train a mixture model, so we are describing each of these speakers in terms of model parameters: a mixture of Gaussians in a two-dimensional space. |
0:03:12 | Then, let us say we form a one-dimensional subspace that describes variation over these N-dimensional model parameters; in this case N can be quite large, since it is the concatenation of these mixture mean vectors. |
0:03:29 | So we can form speaker-dependent supervectors by concatenating the means of each of these distributions, and we can identify the subspace projection by using something like principal components analysis or maximum-likelihood clustering based on these N-dimensional supervectors. |
0:03:54 | Having done that, we can do adaptation. We now have some adaptation data; we also have speaker-independent model parameters, which we might have trained from multi-speaker training data; and we also have a subspace projection matrix that we will have trained on our supervectors. Adaptation then involves coming up with a vector which describes the position of our model with respect to our data in this low-dimensional space. That is basically what these subspace adaptation procedures are when we define our subspace over these supervectors. |
0:04:40 | The good thing about them, for example in speaker adaptation, is that we can do adaptation with seconds of speech rather than the minutes of speech you might need when you are doing sort of regression-based adaptation. |
0:04:55 | So that is all well and good, and these things are well known. Oh, and we get our adapted model out of it; I forgot about that. |
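To make that procedure concrete, here is a minimal sketch of the supervector pipeline just described, in NumPy. The array names, the use of plain PCA via the SVD, and the least-squares estimate of the adaptation coordinates are illustrative assumptions rather than the exact recipe used in this work.

```python
import numpy as np

def train_eigenvoice_basis(speaker_means, k=1):
    """speaker_means: array of shape (n_speakers, n_components, dim),
    the GMM means trained for each speaker.  Returns the mean supervector
    and the top-k principal directions over the speaker supervectors."""
    n_spk = speaker_means.shape[0]
    supervectors = speaker_means.reshape(n_spk, -1)   # concatenate the means
    m0 = supervectors.mean(axis=0)                    # speaker-independent supervector
    _, _, vt = np.linalg.svd(supervectors - m0, full_matrices=False)
    return m0, vt[:k].T                               # basis E, shape (n_components*dim, k)

def adapt(m0, E, target_supervector):
    """Estimate the low-dimensional coordinate y for a new speaker and the
    adapted supervector m0 + E @ y.  target_supervector stands in for a rough
    re-estimate of the means from a few seconds of adaptation data."""
    y, *_ = np.linalg.lstsq(E, target_supervector - m0, rcond=None)
    return y, m0 + E @ y
```

In practice the adaptation coordinates would usually be estimated by maximum likelihood directly from the adaptation frames rather than from a re-estimated supervector; the least-squares form just keeps the sketch short.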
0:05:02 | So that is all fine. The idea in the subspace GMM, as we will see next, is to go from doing subspace adaptation, or forming subspace models that capture sort of global variability, to modelling state-level variability. |
0:05:26 | Stop me if you have heard this one, but there are a number of generalisations when we go from these supervector-based approaches to the subspace GMM. The first one, as was pointed out in the earlier talk, is that we form our state-dependent observation probabilities in terms of a shared pool of full-covariance Gaussians, and we call that, basically, a universal background model, following the terminology in speaker verification. So we have on the order of hundreds of these, anywhere from a couple of hundred to a thousand shared full-covariance Gaussians, over which we define our distributions. |
0:06:18 | The next generalisation is that we are forming these subspace projections one for each of the Gaussians in the pool, and so we will have multiple projections rather than the single projection we had in the supervector-based approach. |
0:06:38 | The final generalisation is that the state-dependent means and the state-dependent weights are now basically obtained as projections within these subspaces. So the mean vector for the j-th state and the i-th mixture component is obtained from this projection, where M sub i is the subspace projection matrix and v sub j is the state vector that was mentioned before. We describe it here as an offset to the mean of the universal background model, but that is not terribly important. |
0:07:35 | Then the state-dependent weights are obtained from the weight projection vectors, these w sub i here, again applied to these state-dependent v vectors, and of course these weights are normalised, so they sum to one. |
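For readers without the slides, the two projections just described can be written roughly as follows, in the standard SGMM notation; the talk describes the mean as an offset from the UBM mean, which amounts to a reparameterisation of the same form.

```latex
\mu_{ji} = \mathbf{M}_i \mathbf{v}_j ,
\qquad
w_{ji} = \frac{\exp\left(\mathbf{w}_i^{\top} \mathbf{v}_j\right)}
              {\sum_{i'=1}^{I} \exp\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_j\right)} ,
```

where M_i and w_i are the projection matrix and weight projection vector tied to the i-th shared full-covariance Gaussian, and v_j is the state vector for state j.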
0:07:54 | The exponential makes it look vaguely like multiclass logistic regression, but in fact, as was probably pointed out in the earlier talk, the exponential is really important here when it comes to optimising the objective function with the expectation-maximisation algorithm. |
0:08:12 | So a really interesting aspect of this is that we have got just a really small number of parameters to represent the state, a single state vector v in this case, whereas in a traditional continuous-density HMM all the parameters are basically state-dependent; we have just got a big pile of Gaussians. So these subspace projection matrices and the UBM are all shared. |
0:08:35 | We can then extend this further by adding the notion of substates: instead of having a single v vector per state, now we can have any number of them, and that gives us a way to play with the parameterisation. So now we have a mixture, a weighted combination of substate densities per state, and the means and the weights are now indexed by substate m and state j. |
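With substates, the state-level likelihood then takes roughly this form (again in the standard notation; the substate mixture weights c_jm are assumed here):

```latex
p(\mathbf{x} \mid j) = \sum_{m=1}^{M_j} c_{jm} \sum_{i=1}^{I} w_{jmi}\,
    \mathcal{N}\left(\mathbf{x};\; \mathbf{M}_i \mathbf{v}_{jm},\; \Sigma_i\right),
\qquad
w_{jmi} = \frac{\exp\left(\mathbf{w}_i^{\top} \mathbf{v}_{jm}\right)}
               {\sum_{i'} \exp\left(\mathbf{w}_{i'}^{\top} \mathbf{v}_{jm}\right)} .
```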
0:09:10 | Somebody asked something earlier about the number of parameters in the model. The interesting thing is that the total is generally dominated, depending on the parameterisation of course, by the shared parameters: you generally have five or more times as many shared parameters as you have state-dependent parameters, and in the example system shown here that might be as much as ten to one. That is a little bit extreme, but it is not unusual. |
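As a rough illustration of where that imbalance comes from, here is a back-of-the-envelope count with hypothetical dimensions; the real system in the talk will differ, but the shared versus state-dependent split falls out of the same arithmetic.

```python
# Hypothetical SGMM dimensions, for illustration only.
D, S = 39, 40        # feature dimension, state-vector (subspace) dimension
I, J = 250, 1700     # shared full-covariance Gaussians, clustered states

shared = (
    I * D * S              # projection matrices M_i
    + I * S                # weight projection vectors w_i
    + I * D * (D + 1) // 2 # shared full covariance matrices
)
state_dependent = J * S    # one state vector v_j per state (single substate)

print(f"shared: {shared:,}, state-dependent: {state_dependent:,}, "
      f"ratio ~ {shared / state_dependent:.1f}:1")
```

With these made-up numbers the shared parameters outnumber the state-dependent ones by roughly nine to one, which is the kind of ratio the speaker quotes.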
0:09:41 | Okay, so, issues of training: we are doing maximum-likelihood training with the EM algorithm. Basically, the maximum-likelihood training of these subspace parameters and these state vectors is a really simple, straightforward extension of the case for the global subspace model, where we train these subspaces over supervectors. But what happens is that it does not work very well when you do not have the weight vectors; those additional degrees of freedom are important. Once you add them, there is basically an additional component to the ML auxiliary function whose solution does not have a unique optimum, and so you have to be very careful about how you optimise that auxiliary function. |
0:10:35 | As far as initialising things: you start by initialising the state context for the SGMM from the phonetic-context-clustered continuous-density HMM states. We initialise the means and full covariance matrices of the UBM just using unsupervised GMM training. And rather than initialising the other parameters of the system, we basically initialise the joint state and mixture-component posteriors as a product of the state posteriors from an initial CD-HMM and the mixture-component posteriors from our UBM. |
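A minimal sketch of that initialisation step, assuming per-frame state posteriors from the initial CD-HMM and component posteriors from the UBM are already available as arrays (the names and shapes here are illustrative):

```python
import numpy as np

def initial_joint_posteriors(state_post, ubm_post):
    """state_post: (n_frames, n_states) posteriors from the initial CD-HMM.
    ubm_post: (n_frames, n_components) posteriors from the UBM.
    Returns (n_frames, n_states, n_components) joint posteriors taken as the
    product of the two, i.e. treating state and component as independent
    given the frame, then renormalised per frame."""
    joint = state_post[:, :, None] * ubm_post[:, None, :]
    return joint / joint.sum(axis=(1, 2), keepdims=True)
```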
0:11:34 | So basically we are doing an experimental study to compare the performance of the subspace GMM with unsupervised subspace adaptation, and we are doing that using the Resource Management database. We will acknowledge that that is a fairly small corpus collected under constrained conditions: we have about four hours of training data and about a hundred speakers. The advantage for us, though, is that it is not very amenable to various other adaptation approaches; regression-based adaptation and VTLN and so on do not do much with it, so there is not the issue of overlap between the effects we get from the adaptation schemes we are doing here and other possible normalisation strategies. |
0:12:27 | So, our baseline system has seventeen hundred context-clustered states and about six Gaussians per state, which is pretty typical, and on the speaker-independent evaluation task it is at four point nine percent word error rate, which is pretty much in alignment with the state of the art. |
0:12:46 | Another point here is about the allocation of the SGMM parameters. For this particular system, the continuous-density HMM system has about eight hundred thousand parameters, and essentially all of those are state-dependent parameters. For the first row of this table, for the subspace GMM with a single substate per state, roughly ninety percent of the parameters are shared across all states: the shared parameters correspond to these subspace projection matrices and the full-covariance Gaussians, and there are about six hundred and thirty thousand of those, but only about sixty thousand state-dependent parameters, which correspond to these v vectors. |
0:13:38 | So for that particular system we have a somewhat smaller number of parameters than we do for the continuous-density HMM, but the overall size of the parameterisation is not that much different from the continuous-density HMM. And whatever the parameterisation is, in our case it tends to be heavily biased towards the shared parameters. |
0:14:08 | So the main result we have got here is this: we basically repeated the four point nine percent word error rate we got from the baseline continuous-density HMM, and for the subspace GMM the best performance we got was about three point nine percent, so that is about a twenty percent improvement, which is pretty substantial. I guess that is consistent with the earlier comment that when you have a small amount of data, your relative improvement is about twenty percent. |
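As a quick check on the numbers quoted here:

```latex
\frac{4.9 - 3.9}{4.9} \approx 0.20 ,
```

i.e. roughly a 20% relative reduction in word error rate.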
0:14:39 | The second and third rows basically compare different means of initialisation, and this scheme for initialising the posterior probabilities of the SGMM gives us a small but statistically significant improvement in performance over a flat start on this data. |
0:15:00 | And then there is a comparison of the SGMM with these supervector adaptation approaches, which are fairly well known these days. Basically, what we did is estimate a substate, I am sorry, a subspace projection matrix E here, which is defined over these supervectors of our continuous-density HMM; that is the first equation. And then, to do adaptation, we estimate this speaker-dependent y vector from a single unlabelled, untranscribed test utterance. |
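The two equations being referred to are presumably of the standard supervector/eigenvoice form; written out with the symbols used in the talk where possible (the rest are assumptions):

```latex
\mathbf{s}^{(r)} \;=\; \mathbf{s}_{0} + \mathbf{E}\, \mathbf{y}^{(r)} ,
```

where s_0 is the supervector of concatenated speaker-independent CD-HMM means, E is the subspace projection matrix estimated from the training speakers' supervectors (the first equation), and y^(r) is the low-dimensional speaker vector estimated from the single test utterance (the second equation).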
0:15:48 | The subspace dimension there is twenty, and what we found is that we got about, I guess, nine or ten percent improved performance from this supervector-based adaptation, which is not as big as what we got from the SGMM. |
0:16:12 | We also, on this corpus, tried the speaker adaptation, throwing in the speaker subspace with the SGMM model, and we did not get a really statistically significant improvement from that. I would suspect that the additional degrees of freedom in the speaker-dependent weights described in the earlier talk might have an impact there, but I have not done that, so that remains to be seen. Okay. |
0:16:51 | So this last slide is an anecdotal example of the distribution of these substate projection vectors, or rather of the first two dimensions of those substate projection vectors, in this case from an SGMM trained on the Spanish-language CallHome corpus. We have restricted this plot to a sort of scatter diagram of the centre states of the five Spanish vowels, with the ellipses indicating the locations of the cluster centroids. |
0:17:29 | This is very similar to a plot from earlier work on this model, on a different language corpus. Basically, the thing I found really interesting about it is that you see this very nice clustering of these state-dependent vectors for the different vowels. That is something you just do not see in a continuous-density HMM: you certainly cannot look at the means of the densities in a continuous-density HMM and see any kind of visible structure there. So it is a very interesting thing how this structure is sort of discovered automatically by the SGMM, and I think there really are some other interesting uses of this. |
0:18:15 | So, to summarise, since I guess I am about out of time: we got this rather substantial, I said twenty percent before, but it seems to be getting smaller as the talk goes on, an eighteen percent reduction in word error rate compared to the CD-HMM, and it basically did better than the unsupervised subspace-based speaker adaptation. |
0:18:42 | And there is this sort of general comment, which is very interesting but more anecdotal, that these state-level parameters seem to uncover some underlying structure in the data. |
0:18:59 | So, to take advantage of this interesting structure, we have started looking at how a speech therapy application can be done by exploiting this kind of structure we see in terms of describing the phonetic variability, and also, like a number of other people at talks at the conference, looking at multilingual acoustic modelling applications. Thanks very much. |
0:19:28 | Questions? |
0:19:37 | Ah, right. I think what we do is probably similar. That is right: I believe we initialise the M matrices, or rather the first column of the M matrices, to the means of the UBM, and the v vectors were initialised to unity. |
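A sketch of what that answer amounts to, under the common convention that "initialised to unity" means the first-coordinate unit vector (an assumption; it could equally mean an all-ones vector):

```python
import numpy as np

def init_projections(ubm_means, subspace_dim, n_states):
    """ubm_means: (I, D) means of the shared UBM Gaussians.
    Returns projection matrices M of shape (I, D, subspace_dim) whose first
    column is the UBM mean, and state vectors v of shape (n_states,
    subspace_dim) set to the first unit vector, so that initially
    M[i] @ v[j] reproduces the i-th UBM mean for every state j."""
    n_comp, feat_dim = ubm_means.shape
    M = np.zeros((n_comp, feat_dim, subspace_dim))
    M[:, :, 0] = ubm_means            # first column of each M_i = UBM mean
    v = np.zeros((n_states, subspace_dim))
    v[:, 0] = 1.0                     # v_j = e_1 ("unity")
    return M, v
```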
0:20:04 | Any more questions? |
0:20:07 | No? In that case, let us conclude this presentation. Thank you very much for this talk. |