0:00:18 Uh-huh. So there was some overlap between that and your talk, right. Someone used to say that presentations are like painting: it usually takes a couple of coats, and for some people it never sticks, and you know who you are.
0:00:38 So, basically, to start: the purpose of this work is investigating techniques for forming subspaces over acoustic model parameters, and the obvious aspect of that is that we get more efficient characterisations of sources of variability by representing those sources of variability in low-dimensional spaces.
0:01:00 Take, say, speaker adaptation and speaker verification: there are clear advantages to describing global variability by forming subspaces over these supervectors, which correspond to the concatenated means of continuous density GMMs and HMMs. Examples go back to the eigenvoice approaches, where the benefits of forming these global subspaces were shown for speaker adaptation, and this obviously extends into speaker verification, where things like joint factor analysis also benefit from these low-dimensional subspaces formed over these giant supervectors.
0:01:47 One of the more interesting developments that has occurred recently is the notion of extending these supervector-based techniques to modelling state-level variability, and we do that by basically forming multiple subspaces, rather than the single subspace that we form over these supervectors. That is the idea of the subspace Gaussian mixture model, proposed by Povey and colleagues in 2010. What we are doing is an experimental study to compare the ASR performance of both of these types of approach.
0:02:29 I guess it's important to note again that we're forming subspaces over model parameters. Here is a simple example where we form these subspaces over supervectors in a speaker adaptation problem. In this little example we have a population of N speakers, we have data from each of those N speakers, and I have a little plot, in some arbitrary two-dimensional space, of the feature vectors for those speakers. We can train a mixture model, so we describe each of these speaker populations in terms of model parameters: a mixture of Gaussians in a two-dimensional space. Then, let's say, we form a one-dimensional subspace that describes variation over these model parameters, whose dimension in this case can be quite large, since it is the concatenation of these mixture mean vectors. So we form speaker-dependent supervectors by concatenating the means of each of these distributions, and we can identify the subspace projection by using something like principal components analysis or maximum likelihood estimation, based on those supervectors.
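As a rough illustration of that construction, here is a minimal sketch in Python; the sizes and the use of PCA via scikit-learn are my own illustrative choices (n_speakers, n_components, feat_dim and subspace_dim are made-up values, not numbers from the talk):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Illustrative sizes only; the talk does not specify these.
n_speakers, n_components, feat_dim, subspace_dim = 100, 8, 2, 1

rng = np.random.default_rng(0)
supervectors = []
for s in range(n_speakers):
    feats = rng.normal(size=(500, feat_dim))   # stand-in for one speaker's feature vectors
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag").fit(feats)
    # Supervector = concatenation of that speaker's Gaussian means.
    supervectors.append(gmm.means_.reshape(-1))
supervectors = np.stack(supervectors)          # (n_speakers, n_components * feat_dim)

# Identify the subspace projection with PCA over the supervectors.
pca = PCA(n_components=subspace_dim).fit(supervectors)
E = pca.components_.T                          # columns span the supervector subspace
s_bar = pca.mean_                              # speaker-independent supervector
```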
0:03:54 Having done that, we can do adaptation. Now we'll have some adaptation data; we also have speaker-independent model parameters, which we might have trained from multi-speaker training data; and we also have the subspace projection matrix that we trained on our supervectors. Adaptation then involves coming up with a low-dimensional vector that describes the position of our model, with respect to our data, in this low-dimensional space. That's basically what these subspace adaptation procedures are, when we define our subspace over these supervectors. The good thing about them, in speaker adaptation for example, is that we can do adaptation with seconds of speech, rather than the minutes of speech you might need when you're doing, say, regression-based adaptation.
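In symbols, and just as a sketch of what was described rather than the exact formulation on the slides (the notation for the speaker-independent supervector, the projection matrix and the coordinate vector is mine): if $\bar{s}$ is the speaker-independent supervector and $E$ the subspace projection estimated from the training speakers' supervectors, then adaptation amounts to estimating a low-dimensional coordinate vector $y$ for the test speaker,

```latex
s_{\text{spk}} \;\approx\; \bar{s} + E\,y ,
\qquad
\hat{y} \;=\; \arg\max_{y}\; p\bigl(X_{\text{adapt}} \mid \bar{s} + E\,y\bigr) ,
```

so only the few entries of $y$ (twenty, in the comparison system described later in the talk) have to be estimated from the adaptation data, which is why a few seconds of speech can be enough.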
0:04:55 So that's all well and good, and these things are well known. Oh, and there's our model up there, I forgot about that. So that's all fine. The idea in the subspace GMM is that we can extend these techniques, which do subspace adaptation or form subspace models over sort of global variability, to modelling state-level variability. And stop me if you've heard this one.
0:05:28 There are a number of generalisations when we go from these supervector-based approaches to the subspace GMM. The first one, as Dan pointed out, is that we form our state-dependent observation probabilities in terms of a shared pool of full-covariance Gaussians, and we call that pool a universal background model, following the terminology used in speaker verification. We have on the order of hundreds of these, anywhere from a couple of hundred to a thousand shared full-covariance Gaussians, that we define our distributions over. The next generalisation is that we form these subspace projections, one for each of the Gaussians in the pool, so we'll have multiple projections rather than the single projection we had in the supervector-based approach.
0:06:38 The final generalisation is that the state-dependent means and the state-dependent weights are now basically obtained as projections within these subspaces. So the mean vector for the j-th state and the i-th mixture component is obtained from this projection, where the M sub i is the subspace projection matrix and the v sub j is the state vector that was mentioned before. We describe it here as an offset from the mean of the universal background model, but that's not terribly important. The state-dependent weights are then obtained from weight projection vectors, those are the w sub i here, again applied to these state-dependent v vectors. The weights are normalised, so they sum to one; the exponential makes it look vaguely like multiclass logistic regression, but in fact, as Dan probably already pointed out, this exponential is really important when it comes to optimising the objective function with the expectation-maximisation algorithm.
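For reference, here is my reconstruction of those equations in the usual SGMM notation (following Povey et al., 2010, rather than copying the slide verbatim): with a pool of $I$ shared full-covariance Gaussians,

```latex
p(\mathbf{x}\mid j) \;=\; \sum_{i=1}^{I} w_{ji}\,
    \mathcal{N}\!\bigl(\mathbf{x};\,\boldsymbol{\mu}_{ji},\,\boldsymbol{\Sigma}_i\bigr),
\qquad
\boldsymbol{\mu}_{ji} \;=\; \mathbf{M}_i \mathbf{v}_j ,
\qquad
w_{ji} \;=\; \frac{\exp\bigl(\mathbf{w}_i^{\top}\mathbf{v}_j\bigr)}
                  {\sum_{i'=1}^{I}\exp\bigl(\mathbf{w}_{i'}^{\top}\mathbf{v}_j\bigr)} ,
```

where only the state vectors $\mathbf{v}_j$ are state specific; the projection matrices $\mathbf{M}_i$, the weight projection vectors $\mathbf{w}_i$ and the covariances $\boldsymbol{\Sigma}_i$ are shared across all states.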
0:08:11 So one of the really interesting aspects of this is that we've got just a really small number of parameters representing each state, a single state vector v in this case, whereas in a traditional continuous density HMM all the parameters are basically state-dependent; we've just got a big pile of Gaussians. These sub-state projection matrices and the weight projection vectors are all shared.
0:08:35 We can then extend this further by adding the notion of sub-states. Now, instead of having a single v vector per state, we can have any number of them, which just allows us to play with the parameterisation. So now we have a weighted combination of sub-state densities per state, and the means and the weights are now indexed both by the sub-state m and the state j.
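Written out the same way (again my reconstruction in SGMM-style notation, not the slide itself), with $K_j$ sub-states for state $j$ and sub-state weights $c_{jm}$, the state likelihood becomes

```latex
p(\mathbf{x}\mid j) \;=\; \sum_{m=1}^{K_j} c_{jm} \sum_{i=1}^{I} w_{jmi}\,
    \mathcal{N}\!\bigl(\mathbf{x};\,\mathbf{M}_i\mathbf{v}_{jm},\,\boldsymbol{\Sigma}_i\bigr),
\qquad
w_{jmi} \;=\; \frac{\exp\bigl(\mathbf{w}_i^{\top}\mathbf{v}_{jm}\bigr)}
                   {\sum_{i'=1}^{I}\exp\bigl(\mathbf{w}_{i'}^{\top}\mathbf{v}_{jm}\bigr)} ,
```

so each extra sub-state costs only one more low-dimensional vector $\mathbf{v}_{jm}$ plus a scalar weight, while everything indexed only by $i$ stays shared.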
0:09:10 Somebody asked something earlier about the number of parameters in the model. The interesting thing is that the total is generally dominated by the shared parameters, depending on the parameterisation of course, but you generally have five or more times as many shared parameters as state-dependent parameters, and in the example systems we present that ratio might be as much as ten to one. That's a little bit extreme, but it's not unusual.
0:09:41 Okay, so, issues of training: we're doing maximum likelihood training with the EM algorithm. Basically, the maximum likelihood training of these subspace parameters and these state vectors is a really simple, straightforward extension of the case for the global subspace model, where we trained the subspaces over supervectors. As Dan found, though, that doesn't work very well when you don't have the weight vectors; those additional degrees of freedom are important. So there is basically an additional component to the ML auxiliary function for the weights, and that solution doesn't have a unique optimum, so you have to be very careful about how you optimise that auxiliary function.
0:10:35 As far as initialising things: we start by initialising the state context for the SGMM with the phonetic-context-clustered continuous density HMM states. We initialise the means and full covariance matrices of the Gaussian pool just using unsupervised GMM training. And rather than initialising the other parameters of the system directly, we basically initialise the joint state and mixture-component posteriors as the product of the state posteriors from an initial CD-HMM and the mixture-component posteriors from our UBM.
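A minimal sketch of that last step, with hypothetical array names (state_post, a per-frame state posterior from a seed CD-HMM, and ubm_component_post are assumptions for illustration, not code from the system described):

```python
import numpy as np

def joint_posteriors(state_post, ubm_component_post):
    """Initial SGMM occupancies: gamma[t, j, i] = p(state j | x_t) * p(UBM comp i | x_t).

    state_post:          (T, J) state posteriors from a seed CD-HMM forward-backward pass
    ubm_component_post:  (T, I) component posteriors from the shared full-covariance UBM
    """
    # Per-frame outer product gives the joint state / mixture-component posterior.
    return state_post[:, :, None] * ubm_component_post[:, None, :]   # (T, J, I)

# Toy usage with random, properly normalised posteriors.
T, J, I = 10, 5, 4
rng = np.random.default_rng(0)
state_post = rng.random((T, J)); state_post /= state_post.sum(1, keepdims=True)
ubm_post = rng.random((T, I)); ubm_post /= ubm_post.sum(1, keepdims=True)
gamma = joint_posteriors(state_post, ubm_post)
assert np.allclose(gamma.sum(axis=(1, 2)), 1.0)   # still sums to one per frame
```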
0:11:34 So, basically, we're doing an experimental study to compare the performance of the subspace GMM with unsupervised subspace adaptation, and we're doing that using the Resource Management database. We'll acknowledge that that's a fairly small corpus collected under constrained conditions: we have about four hours of training data and about a hundred speakers. The advantage for us, though, is that it's not very amenable to various other adaptation approaches: regression-based adaptation, VTLN and so on don't do much with it, so there isn't the issue of the effects we get from the adaptation schemes we're studying here overlapping with other possible normalisation strategies.
0:12:27 So, our baseline system has seventeen hundred context-clustered states and about six Gaussians per state, which is pretty typical, and the word error rate on the evaluation task is 4.9 percent, which is pretty much in line with the state of the art.
0:12:46 Another point here is about the allocation of the SGMM parameters. For this particular system, the continuous density HMM system we're starting from has about eight hundred thousand parameters, and essentially all of those are state-dependent parameters. For the system in the first row of this table, the subspace GMM with a single sub-state per state, roughly ninety percent of the parameters are shared across all states: the shared parameters correspond to these sub-state projection matrices and the full-covariance Gaussians, and there are about six hundred and thirty thousand of those, but only about sixty thousand state-dependent parameters, which correspond to these v vectors. So for that particular system the total number of parameters is about as small as it is for the continuous density HMM, and the other parameterisations are not that much different from the continuous density HMM either. But no matter what the parameterisation is, in our case it tends to be heavily biased towards the shared parameters.
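Just to make the bookkeeping concrete, here is a rough parameter count for a single-sub-state SGMM; the dimensions below (feat_dim, subspace_dim, n_gauss, n_states) are illustrative guesses, not the actual configuration of the system in the table:

```python
# Rough SGMM parameter bookkeeping (illustrative sizes, not the talk's exact system).
feat_dim = 39        # acoustic feature dimension
subspace_dim = 40    # dimension of the state vectors v_j
n_gauss = 400        # shared full-covariance Gaussians in the UBM pool
n_states = 1700      # context-clustered states, one sub-state each

shared = (
    n_gauss * feat_dim * subspace_dim           # projection matrices M_i
    + n_gauss * subspace_dim                    # weight projection vectors w_i
    + n_gauss * feat_dim * (feat_dim + 1) // 2  # shared full covariances Sigma_i
)
state_dependent = n_states * subspace_dim       # one v_j per state

print(shared, state_dependent, shared / state_dependent)
# With these made-up sizes the shared parameters outnumber the state-dependent
# ones by roughly an order of magnitude, which is the imbalance described above.
```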
0:14:08 So the main result we've got here: to repeat, the baseline continuous density HMM gives 4.9 percent word error rate, and for the subspace GMM the best performance we got was about 3.9 percent, so that's about a twenty percent improvement, which is pretty substantial, although, per Dan's comment, when you have a small amount of data your relative improvement tends to be about twenty percent anyway. The second and third rows basically compare different means of initialisation, and this scheme for initialising the posterior probabilities of the SGMM gives us a small but statistically significant improvement in performance over a flat start on this data.
0:15:00 Then there's the comparison of the SGMM with these supervector adaptation approaches, which are fairly well known these days. Basically, what we did was estimate a sub-state, sorry, a subspace projection matrix E, defined over the supervectors of our continuous density HMM; that's the first equation there. Then, to do adaptation, we estimate this speaker-dependent coordinate vector from a single unlabelled, untranscribed test utterance; the subspace dimension there is twenty. What we found is that we got about eight, or I guess it's nine or ten, percent improved performance from this supervector-based adaptation, which is not as big as what we got from the SGMM.
0:16:12 On this corpus we also tried throwing in the speaker subspace with the SGMM model, and we didn't get a really statistically significant improvement from that. I would suspect that the additional degrees of freedom in these speaker-dependent weights, as described in the earlier talk, might have had an impact there.
0:16:44 How am I doing, am I done? No? Okay, you gave me a look.
0:16:51 So, this last slide is an anecdotal example of the distribution of these sub-state projection vectors, or rather of the first two dimensions of those vectors, in this case from an SGMM trained on the Spanish-language CallHome corpus. We've restricted this plot to a sort of scatter diagram of the centre states of the five Spanish vowels, with the ellipses indicating the locations of the cluster centroids.
0:17:29 This is very similar to a plot that Burget and co-workers did on this kind of corpus.
0:17:36 Basically, the thing I found really interesting about this is that you see very nice clustering of these state-dependent vectors for the different vowels. That's something you just don't see in a continuous density HMM: you certainly can't look at the means of the densities in a continuous density HMM and see any kind of visible structure there. So it's a very interesting thing how this structure is sort of discovered automatically by the SGMM.
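For what it's worth, a plot like that can be reproduced from any trained SGMM in a few lines; here is a minimal sketch, where state_vectors (a states-by-dimensions array of the v vectors) and vowel_label (the centre-state vowel identity for each state) are hypothetical inputs, not artifacts from the system in the talk:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_state_vectors(state_vectors, vowel_label):
    """Scatter the first two dimensions of the SGMM state vectors, grouped by vowel."""
    v = np.asarray(state_vectors)
    for vowel in sorted(set(vowel_label)):
        idx = [k for k, lab in enumerate(vowel_label) if lab == vowel]
        plt.scatter(v[idx, 0], v[idx, 1], label=vowel, alpha=0.7)
        # Mark the cluster centroid for this vowel, as on the slide.
        cx, cy = v[idx, 0].mean(), v[idx, 1].mean()
        plt.annotate(vowel, (cx, cy), fontsize=14, weight="bold")
    plt.xlabel("state-vector dimension 1")
    plt.ylabel("state-vector dimension 2")
    plt.legend()
    plt.show()
```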
0:18:12 There really are some other interesting uses of this thing, so to summarise, since I guess I'm out of time: we got this rather substantial, I said twenty percent before, but it keeps getting poorer, so before it shrinks any further by the end of the talk, call it an eighteen percent reduction in word error rate compared to the CD-HMM, and the SGMM basically did better than the unsupervised subspace-based speaker adaptation. And then there's this sort of general, more anecdotal comment, which is very interesting, that these state-level parameters seem to uncover some of the underlying structure in the data. So, to take advantage of this interesting structure, we've started looking at what can be done in a speech therapy application by exploiting the kind of structure we see in terms of describing the phonetic variability, and also, like a number of other people we've talked to at the conference, looking at multilingual acoustic modelling applications. Thanks very much.
0:19:28 Questions?
0:19:37 Uh, great. I think ours is probably similar to, uh, right, thirty.
0:19:44 That's right, I believe we initialise the first column of the M matrices to the means of the UBM, and the v vectors were initialised to unity.
0:20:04 No more questions? No? In that case, thank you very much for this talk.