0:00:13 | I am [name unintelligible], the session chair. One second for an advertisement: if you see people wearing a [unintelligible], you may ask them about it. |
---|
0:00:25 | Okay. |
---|
0:00:26 | We're going to start off. The first paper is "Front-End Feature Transforms with Context Filtering for Speaker Adaptation". |
---|
0:00:35 | The paper is by [authors' names unintelligible], and it will be presented by [name unintelligible]. |
---|
0:00:56 | Okay, so the topic is front-end feature transforms with context filtering for speaker adaptation. |
---|
0:01:03 | So, here is the outline of the talk. First I'll briefly motivate it relative to other work and explain, basically, what we're trying to accomplish. |
---|
0:01:14 | Then I'll give an overview of the new technique, called maximum likelihood context filtering. |
---|
0:01:21 | And then we'll move straight into some experiments and results to see how it works. |
---|
0:01:26 | Okay, so the topic is front-end speaker adaptation. |
---|
0:01:31 | In terms of front-end transforms, we usually do linear transforms or conditionally linear transforms. Perhaps the most popular technique is feature-space MLLR, maybe more popularly named constrained MLLR. |
---|
0:01:46 | And of course there are discriminative techniques that have been developed, and nonlinear transformations. |
---|
0:01:53 | Some variants of fMLLR that have been worked on in recent years are quick fMLLR and full-covariance fMLLR. |
---|
0:02:03 | So today I'll tell you about another variant of fMLLR. |
---|
0:02:07 | But first, let's review fMLLR. The idea is: you're given a set of adaptation data, and you want to estimate a linear transformation A and a bias b, which can be concatenated into the single matrix W. |
---|
0:02:25 | So the key point about fMLLR, for the purposes of this talk, is that the A matrix is square — D by D in the notation used here — and that makes it particularly easy to learn. |
---|
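As a minimal sketch of the setup just described (dimensions and values here are illustrative, not taken from the paper), the square transform A and the bias b can be concatenated into one matrix W and applied to the feature vector augmented with a trailing one:

```python
import numpy as np

D = 3  # feature dimension (40 in the talk; small here for illustration)
rng = np.random.default_rng(0)

A = np.eye(D) + 0.01 * rng.standard_normal((D, D))  # square D x D transform
b = rng.standard_normal(D)                          # bias
W = np.hstack([A, b[:, None]])                      # D x (D+1), W = [A  b]

x = rng.standard_normal(D)
x_aug = np.append(x, 1.0)  # augment with 1 so that W @ x_aug = A @ x + b
y = W @ x_aug
```

The augmented-vector trick is what lets the transform and bias be estimated as one matrix.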
0:02:41 | So, of course, the main thing you need to deal with when you apply these transforms is the volume-change compensation. |
---|
0:02:50 | In the case of a linear transformation it's just the log-determinant of A — in red, you see it in our objective function Q. |
---|
0:02:57 | The second term there is just the typical term you see: it has the posterior probability of all the components of the acoustic model you're evaluating — that's gamma, subscripted by j, for each Gaussian. |
---|
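A hedged reconstruction of the objective being described (the exact constants are assumptions consistent with standard fMLLR write-ups, not read off the slide):

```latex
Q(\mathbf{W}) \;=\; \beta \,\log\bigl|\det \mathbf{A}\bigr|
\;-\; \tfrac{1}{2}\sum_{j}\sum_{t} \gamma_j(t)\,
\bigl(\mathbf{W}\bar{\mathbf{x}}_t - \boldsymbol{\mu}_j\bigr)^{\top}
\boldsymbol{\Sigma}_j^{-1}
\bigl(\mathbf{W}\bar{\mathbf{x}}_t - \boldsymbol{\mu}_j\bigr) \;+\; \text{const}
```

where $\bar{\mathbf{x}}_t = [\mathbf{x}_t^{\top}, 1]^{\top}$ is the augmented feature, $\gamma_j(t)$ are the Gaussian posteriors, and $\beta = \sum_{j,t}\gamma_j(t)$.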
0:03:13 | Okay, so when we start to think about the non-square case, what do we need to do? First, let's set up the notation. |
---|
0:03:23 | We use the notation x-hat of t to denote a vector with context; in this case the context size is one, so x of t minus one, t, and t plus one are concatenated to make x-hat of t. |
---|
0:03:36 | So the model is y of t equals A times x-hat of t plus b. We can condense this notation into the form W. |
---|
0:03:47 | The main difference here is that A is not square: in this case it is D by three-D, because the output y has the original dimension of the input x, but x-hat is three-D-dimensional. |
---|
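The context concatenation just described can be sketched as follows (the edge-padding choice is an assumption; the talk does not say how frame boundaries are handled):

```python
import numpy as np

def stack_context(X, n=1):
    """Concatenate each frame with n frames of left and right context.

    X: (T, D) array of feature frames.  Returns a (T, (2n+1)*D) array;
    edges are handled by repeating the first/last frame (an assumption).
    """
    T, D = X.shape
    padded = np.vstack([X[:1]] * n + [X] + [X[-1:]] * n)
    return np.hstack([padded[i:i + T] for i in range(2 * n + 1)])

X = np.arange(12, dtype=float).reshape(4, 3)  # T=4 frames, D=3
X_hat = stack_context(X, n=1)                 # rows are [x_{t-1}, x_t, x_{t+1}]
```

With n = 1 each row of `X_hat` is the 3D-dimensional vector that the D x 3D matrix A acts on.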
0:04:01 | So how do we estimate a non-square matrix by maximum likelihood? |
---|
0:04:08 | An important point is that there's no direct, obvious way to do this, and that's because you're changing the dimension of the space, so there is no determinant volume term you can use in a straightforward manner to accomplish this. |
---|
0:04:25 | So let's go back and look at how we get that term. Basically, what you say is that the log-likelihood under the transformation, of y, is equal to the log-likelihood of the input variable up to a constant — that is your Jacobian term. |
---|
0:04:43 | So in the case where you assume that A is square, you can readily confirm that the term is half the log ratio of the determinants of the input and output models, assuming they are Gaussian. |
---|
0:04:57 | so this slide is just showing how you would ride that |
---|
0:05:00 | there's L X |
---|
0:05:02 | a gaussian |
---|
0:05:03 | L Y |
---|
0:05:04 | i when you to |
---|
0:05:05 | are are get as you know the when your transform data |
---|
0:05:08 | and essentially you quite them the fine what C is in you find that |
---|
0:05:11 | it is the log ratio |
---|
0:05:13 | a a as we started before |
---|
0:05:15 | so on the bottom line and read |
---|
0:05:18 | if you break down that not show you see that uh the covariance of |
---|
0:05:22 | D variable Y |
---|
0:05:24 | uh |
---|
0:05:25 | this is just a known uh identity it's a a |
---|
0:05:29 | a a signal X transport |
---|
0:05:31 | transpose |
---|
0:05:32 | a transpose |
---|
0:05:34 | E |
---|
0:05:34 | a a signal X |
---|
0:05:35 | at |
---|
0:05:36 | transpose |
---|
0:05:37 | um so the compensation term ends up being log determine a |
---|
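A small numerical sketch of that identity and the resulting compensation term (the dimensions and the one-half scaling are illustrative assumptions consistent with the Gaussian log-likelihood):

```python
import numpy as np

rng = np.random.default_rng(1)
D, Dc = 3, 9                       # output dim D and context-expanded dim 3D

A = rng.standard_normal((D, Dc))   # non-square D x 3D transform
M = rng.standard_normal((Dc, Dc))
Sigma_xhat = M @ M.T + Dc * np.eye(Dc)  # a full-covariance estimate (SPD)

# Known identity: Sigma_y = A Sigma_xhat A^T.  The volume-compensation
# term, up to constants independent of A, is 1/2 * log det(A Sigma_xhat A^T).
Sigma_y = A @ Sigma_xhat @ A.T
sign, logdet = np.linalg.slogdet(Sigma_y)
compensation = 0.5 * logdet
```

Note that although A itself has no determinant, the D x D product A Sigma-x-hat A-transpose does, which is what makes the non-square case tractable.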
0:05:42 | So in our case, we're going to assume that the compensation term remains the same. We'll drop the log-determinant of Sigma-x-hat term, because it does not depend on A, and we're left with the log-determinant of A Sigma-x-hat A-transpose term, which reduces to the term we had in the case that A was square. |
---|
0:06:04 | So the modified objective becomes the following. And one point is: what is this Sigma-x-hat that is used? Well, what they did is use a full-covariance approximation over all the speech features to come up with that full-covariance Sigma-x-hat, and used it in this objective to learn A. |
---|
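Putting the pieces together, the modified objective sketched above would take roughly this form (the one-half and beta factors are assumptions chosen to stay consistent with the square-case objective earlier):

```latex
\tilde{Q}(\mathbf{W}) \;=\; \tfrac{\beta}{2}\,\log\det\bigl(\mathbf{A}\,\hat{\boldsymbol{\Sigma}}_{\hat{x}}\,\mathbf{A}^{\top}\bigr)
\;-\; \tfrac{1}{2}\sum_{j}\sum_{t} \gamma_j(t)\,
\bigl(\mathbf{W}\hat{\mathbf{x}}_t - \boldsymbol{\mu}_j\bigr)^{\top}
\boldsymbol{\Sigma}_j^{-1}
\bigl(\mathbf{W}\hat{\mathbf{x}}_t - \boldsymbol{\mu}_j\bigr)
```

with $\mathbf{A}$ now $D \times 3D$ and $\hat{\boldsymbol{\Sigma}}_{\hat{x}}$ the full-covariance approximation of the context-expanded features.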
0:06:30 | Okay, so in terms of optimizing this modified objective: the statistics that you need are of the same form as in the square case, though of course the sizes are different. |
---|
0:06:41 | The two main quantities that you need to optimize the objective are the ability to evaluate the objective Q and the derivative of the objective. |
---|
0:06:52 | The row-by-row iterative update that people normally use cannot be applied here — at least it's not obvious how to do it. We're looking at it now, and there are some ways to do something very similar. |
---|
0:07:07 | But for the purposes of this paper, a general gradient-based optimization package was used, the HCL package. As I mentioned before, it just needs the function and its gradient available at any point that the optimizer wants to evaluate. |
---|
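The HCL package itself is not sketched here; instead, here is a toy numpy-only gradient ascent on a simplified one-Gaussian, unit-covariance version of the objective, just to illustrate that evaluating Q and its gradient is all such a package needs (every value below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
D, Dc, T = 2, 6, 200

X_hat = rng.standard_normal((T, Dc))         # context-expanded features (T x 3D)
mu = np.zeros(D)                             # single Gaussian, unit covariance
Sigma = np.cov(X_hat.T) + 1e-3 * np.eye(Dc)  # full-covariance estimate of x-hat

def Q(A):
    """Simplified objective: volume compensation minus the quadratic term."""
    resid = X_hat @ A.T - mu
    _, logdet = np.linalg.slogdet(A @ Sigma @ A.T)
    return 0.5 * T * logdet - 0.5 * np.sum(resid ** 2)

def grad_Q(A):
    resid = X_hat @ A.T - mu                                 # (T, D)
    term1 = T * np.linalg.solve(A @ Sigma @ A.T, A @ Sigma)  # d/dA of logdet part
    return term1 - resid.T @ X_hat                           # minus quadratic part

# initialize: identity on the current frame, zeros on the context frames
A = np.hstack([np.zeros((D, D)), np.eye(D), np.zeros((D, D))])
q0 = Q(A)
for _ in range(200):
    A = A + 1e-4 * grad_Q(A)  # plain gradient ascent; HCL would converge faster
```

A quasi-Newton method would be the realistic choice; the fixed-step loop only shows the interface — a function value and a gradient at any query point.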
0:07:27 | Okay, so that's essentially the method. I'll try to leave some time for questions at the end, if there are any more details that are requested. |
---|
0:07:38 | Um, so moving right on to training data and models. The training data for the task we evaluated this technique on was collected in stationary noise; there's about eight hundred hours of it. |
---|
0:07:49 | A word-internal model with within-word phone context, eight hundred thirty context-dependent states and ten-K Gaussians, was trained. |
---|
0:08:01 | And the technique was tested on LDA forty-dimensional features, on models built using maximum likelihood and bMMI, and on a model with an fMMI transformation applied before we apply context filtering. |
---|
0:08:21 | In terms of test data: it was recorded in a car at three different speeds — zero, thirty and sixty miles per hour. There were four tasks: addresses, digits, commands and radio control. That's about twenty-six-K utterances and a total of a hundred and thirty thousand words. |
---|
0:08:40 | Here is the SNR distribution of this data in terms of speed. You can see that most of the noise is obtained for the sixty-miles-per-hour data, and for that data basically half of the data is below, say, twelve and a half dB. We estimated the SNR using a forced alignment. |
---|
0:09:05 | okay so for experiments |
---|
0:09:07 | uh a context filtering was tried for speaker adaptation |
---|
0:09:11 | training speaker dependent uh |
---|
0:09:13 | a a that being uh |
---|
0:09:15 | the canonical model |
---|
0:09:16 | so uh a and all C a and just a little uh nomenclature here |
---|
0:09:21 | it is uh |
---|
0:09:22 | maximum likelihood context filtering with context size and |
---|
0:09:26 | so one would be plus or minus one |
---|
0:09:29 | aim is included in the context |
---|
0:09:31 | when computing the transform |
---|
0:09:35 | So, for all the experiments, the transform was initialized with identity with respect to the current frame's parameters, and the side frames were initialized to zeros. |
---|
0:09:50 | Just for reference, they also tried using, for the centre part of the matrix, the fMLLR that was estimated using the usual technique. |
---|
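A minimal sketch of that initialization (dimensions are illustrative): identity on the centre block, zeros on the side blocks, so before any optimization the transform simply passes the current frame through.

```python
import numpy as np

D, n = 4, 1   # feature dimension and context size (illustrative values)
blocks = [np.zeros((D, D))] * n + [np.eye(D)] + [np.zeros((D, D))] * n
A0 = np.hstack(blocks)   # D x (2n+1)D: identity on the current-frame block
b0 = np.zeros(D)

# With this initialization the transform is a pass-through of the current frame.
x_hat = np.tile(np.arange(D, dtype=float), 2 * n + 1)  # [x_{t-1}; x_t; x_{t+1}]
y = A0 @ x_hat + b0
```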
0:10:03 | Okay, so in terms of results — I'll skip ahead to that. Clearly fMLLR brings a lot over the baseline on this data, and when you turn on context filtering you actually get some significant gains in the sixty-miles-per-hour column — you can see them highlighted in red. |
---|
0:10:22 | So this is actually a twenty-three percent relative gain in word error rate, and thirty percent in sentence error rate, over fMLLR. |
---|
0:10:31 | The other point here is that starting with fMLLR and then adapting actually doesn't give you any advantage over starting with an identity matrix. |
---|
0:10:41 | This plot is just showing how performance varies with the amount of data you provide to the transform estimation. |
---|
0:10:50 | So we can see that the relative degradation in performance when you have less data — in this case ten utterances versus all utterances, and I believe "all" is a hundred in this case — is actually smaller. |
---|
0:11:06 | I think the argument here is that you're using context, so you can do some averaging of the data you see, and that effectively regularizes the estimation to some extent — although there are more parameters to estimate, so it's kind of counter-intuitive, I think. |
---|
0:11:28 | Okay, this is just a picture of a typical fMLLR transform estimated using our system; it's for the most part diagonal. |
---|
0:11:38 | And this is the corresponding one-frame-of-context context-filtering transform. You can see that, interestingly, it's not symmetric: the mapping from the previous frame to the current frame is almost diagonal, and so is the current-to-current frame mapping, |
---|
0:11:58 | but the contribution of the future frame looks kind of random. |
---|
0:12:05 | One thing to keep in mind is that there is actually a whole subspace of almost-equivalent solutions to this problem, so it's not clear if this is an artifact of the optimization package — perhaps of the order in which it optimizes the subspace, and whatnot. |
---|
0:12:28 | Okay, so here are more results, collected using a bMMI model. And again we're seeing some significant gains over fMLLR — about a ten percent relative improvement on the sixty-miles-per-hour data. |
---|
0:12:47 | Once again, when we train an fMMI transform and then apply context filtering, we're still actually getting some gains: it's about a nine percent relative sentence-error-rate reduction over fMLLR. |
---|
0:13:06 | Okay, so to summarize: MLCF extends the full-rank square-matrix technique called fMLLR to non-square matrices, and there are some very nice gains on some pretty good systems — built using LDA, bMMI and fMMI — when we apply this technique. |
---|
0:13:26 | So, in terms of future work: trying a discriminative objective function is something that I think they're looking at, of course. |
---|
0:13:38 | Another question is how this technique interacts with traditional noise-robustness methods like spectral subtraction, dynamic noise adaptation, et cetera. |
---|
0:13:48 | Okay, so that's all I have. Hopefully this leaves some time for questions. |
---|
0:14:00 | So, the plot you have for improvement showed that you had to have ten utterances for each speed. How do you do that in a practical sense — are you going to keep track? |
---|
0:14:10 | Yeah. Let me just go to the slide. I mean, this is just investigating the amount of data that is needed for the transform to be effective. |
---|
0:14:21 | So this is useful in the sense that if you need to enroll a speaker, for example on a cell phone, he only needs to talk for ten utterances — and by the way, that's a good point: each utterance is only about three seconds. |
---|
0:14:34 | So we're talking about, you know, thirty seconds of data and we're already almost completely adapted to the speaker — as opposed to fMLLR, which actually seems to need about thirty utterances to be at that stage. |
---|
0:14:58 | Could we get a microphone for the third one, right there? |
---|
0:15:03 | So from this chart — excuse me — you're working on utterances collected at sixty miles an hour, or similar speech. In a real scenario people drive slow or on the highway, so over a sequence the SNR varies; it's not this scenario. Have you tested that scenario? |
---|
0:15:25 | Yeah, they didn't consider that in this work; this is a block optimization of the matrix, actually. So you just take a section of speaker data and see how many utterances are required to get decent gains. But that's certainly an important problem. |
---|
0:15:49 | Two more questions? |
---|
0:15:52 | I have a quick one. I was actually kind of interested, when you were looking at the results for context sizes one and two: did they actually do a visualization of the two contexts? I found the visual interesting, and I was wondering if they looked different. |
---|
0:16:16 | Oh, I see. Right — I think this is one of the only ones they actually looked at. I was very curious about that myself, so I put them in at the last moment, actually. |
---|
0:16:32 | Yeah, that's very true; it would certainly be interesting. But for the experiments they did, they found that performance was saturating at about a left and right context of two. So I think that asymmetry is something for future investigation and understanding. |
---|
0:16:55 | Let's thank the speaker. |
---|
0:17:00 | Maybe we're going to need a minute to set up the next speaker, so... |
---|