0:00:15 | This is kind of the transition from the systems in the previous session into the DNNs. |
0:00:25 | No, I don't think so; we all could have presented in both. But I think this is a good transition, because we did have some new things that I want to talk about. |
0:00:37 | This is work with my colleagues Greg Sell and Daniel Garcia-Romero from Johns Hopkins, both of whom were unfortunately unable to get spousal permission to attend. But they have good excuses: Greg's wife had their second child two weeks ago, and Daniel's is due in about two weeks. So they have a reason. |
0:00:54 | So I'm going to present an overview of the DNN i-vector system that we submitted to LRE 15. |
0:01:04 | I want to give a shout-out to NIST here for introducing this fixed training data condition, which actually allowed us to make a very competitive system with only three people, which is not very common in our field historically. |
0:01:20 | The approach that we used algorithmically (I'll go into more detail) uses DNNs. Unlike some of the previous presentations you've seen, we were able to get good performance not just with the bottleneck features but also with the DNN senone labels; I'll talk about that. |
0:01:38 | We used three different kinds of i-vectors, which I'll explain more. Everyone had acoustic systems, and those are very good. We were able to do quite well with the phonotactic i-vector system as well, and here we're trying for the first time a joint i-vector, which does both things at once. |
0:01:56 | Because we had a fairly powerful system that we were comfortable with, and we didn't trust that we had enough development data, we used, I think, the simplest and most naive fusion of anybody. It seemed to work for us, because we actually got a gain from fusion, which I think also made us one of the few. That was just to sum the scores together and then scale them with the duration model that I'll talk about. |
0:02:21 | And lastly, as I think has been mentioned, but I want to go into it a little bit more: because this was a limited-data task, data augmentation turned out to be very helpful for us. |
0:02:33 | So at the top I'll go through our basic i-vector system design, talk about the two ways that we used the DNNs, both of which have been touched on previously today, and talk about the alternate i-vectors we experimented with. |
0:02:50 | Then I'll talk more specifically about the LRE 15 task, how we used the data, and what we learned later about how we could have used the data. |
0:02:59 | And finally I'll talk about the results that we had in the submission, and some interesting things we've learned since, both about what other systems could have done and how we could have done better with the systems that we used. |
0:03:13 | So here's a block diagram of our LID system. |
0:03:21 | It's a little i-vector system, so it can be split into two parts. The first uses the unlabeled data to do the UBM and the T matrix learning. Then the supervised system is basically the two-covariance model: within-class and across-class covariances that are first used in LDA to reduce the dimension, and then the same matrices are used for the Gaussian scoring that follows. |
0:03:45 | As we've done for a while, rather than having a separate backend do the work, we do a discriminative refinement of these Gaussian parameters to produce a system that not only performs a little bit better but also produces naturally calibrated scores. |
0:04:00 | We do that in a two-step process. First we learn a scale factor on the within-class covariance, and then we go into all the class means and adjust them to provide better discriminative power. For that we're using the MMI algorithm from GMM training in a really simplified mode, and of course that's the same criterion as the multiclass cross-entropy that everybody uses every day. |
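The scoring described here, class means with a single shared within-class covariance, reduces to a linear classifier. A minimal numpy sketch of that idea; the names (`gaussian_backend_scores`, the `scale` argument standing in for the discriminatively learned covariance scale factor) are illustrative assumptions, not the authors' code:

```python
import numpy as np

def gaussian_backend_scores(X, means, W, scale=1.0):
    """Gaussian scoring with a shared within-class covariance W.

    Returns log N(x; m_k, W/scale) up to a class-independent constant,
    which is linear in x: m_k' P x - 0.5 m_k' P m_k with P = scale * W^-1.
    `scale` plays the role of the learned covariance scale factor
    (an assumption of this sketch).
    """
    P = scale * np.linalg.inv(W)                       # shared precision
    quad = np.einsum('kd,de,ke->k', means, P, means)   # m_k' P m_k per class
    return X @ P @ means.T - 0.5 * quad

def log_posteriors(scores, log_priors):
    """Closed-set class log-posteriors from Gaussian scores and priors."""
    z = scores + log_priors
    return z - np.logaddexp.reduce(z, axis=-1, keepdims=True)
```

Discriminative refinement would then adjust `scale` and `means` to optimize the multiclass cross-entropy of these posteriors.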
0:04:28 | Before laying out the data, let me talk more about how we used the DNN. Other people have mentioned it, but let me show some pictures so you can see better what we're doing. Normally you use the GMM to do the alignment and then compute the stats from that. |
0:04:43 | We're splitting that out in two ways using the DNNs. The first is simply to replace the MFCCs with bottleneck features from the DNN; we're just using a straightforward bottleneck, nothing fancy. |
0:04:56 | The second system is a little bit more complicated: we use the DNN to generate the frame posteriors for the senones, the clustered states. That is used to label the data and do the alignment, and then you use the UBM after that. |
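The swap being described only changes where the frame alignments come from; the sufficient statistics for i-vector training are computed the same way. A small sketch, with illustrative names, assuming one cut of features and per-frame posteriors:

```python
import numpy as np

def sufficient_stats(feats, posteriors):
    """Zeroth- and first-order Baum-Welch stats for i-vector training.

    feats:      (T, D) acoustic features for one cut
    posteriors: (T, C) frame alignments; in the classic recipe these come
                from the UBM, in the DNN variant from senone posteriors.
    """
    n = posteriors.sum(axis=0)   # zeroth order, shape (C,)
    f = posteriors.T @ feats     # first order, shape (C, D)
    return n, f
```

The acoustic i-vector uses both `n` and `f`; the phonotactic variant discussed later needs only the soft counts `n`.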
0:05:15 | I didn't have time to draw a DNN, but this is Daniel's best rendition of one. A couple of things that are perhaps particular about our system, or about the Kaldi way of doing things (which, by the way, we do highly recommend): it uses this p-norm nonlinearity, which is kind of like max pooling, so there's an expansion and a contraction at each layer; that's how the nonlinearity comes in. |
0:05:40 | What else? I think probably nobody does this these days, but we're not using fMLLR, which I think is common, for our purposes. |
0:05:48 | You can see we basically use the same architecture, either for the senone posteriors, or, for the one that's just going to be the bottleneck, we introduce the bottleneck: that's the little linear layer before the middle, that one there. |
0:06:06 | We have about nine thousand output states, so it is a pretty big UBM that we get out of this. And of course it's trained using Switchboard-1, because that's what we were given for the fixed data condition. |
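The p-norm nonlinearity mentioned above can be sketched in a few lines: each group of activations is collapsed to its p-norm, so the layer expands and then contracts, much like max pooling. The group size and p here are illustrative, not the submission's settings:

```python
import numpy as np

def pnorm(x, group_size, p=2):
    """Kaldi-style p-norm nonlinearity over the last axis.

    Each consecutive group of `group_size` activations is reduced to
    a single output, (sum_i |x_i|^p)^(1/p), giving a dimension-reducing
    nonlinearity similar in spirit to max pooling (which is p = inf).
    """
    grouped = x.reshape(*x.shape[:-1], -1, group_size)
    return (np.abs(grouped) ** p).sum(axis=-1) ** (1.0 / p)
```

With `p=2` and group size 2, the inputs `[3, 4]` and `[0, 5]` both map to 5.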
0:06:24 | So let me talk about the designs a little bit. The one that we're all familiar with, which we'll call the acoustic i-vector, is based on a Gaussian probability model, and I've written in little parentheses "given gamma" because the alignments are already known; otherwise it would be much more complicated. |
0:06:44 | Because of that, it's a big Gaussian supervector problem: there's a closed-form solution for the MAP estimate of the i-vector, and there's an EM algorithm for the T matrix estimation. |
0:06:55 | The second approach is the phonotactic one; I think it was mentioned that we've used it for a number of years before. I'll talk about the details later, but the key thing is we can still have sort of a Gaussian model for an i-vector, but the output of the latent model we're talking about is now the weights of the GMM instead of the means. |
0:07:19 | Those are naturally going to be count-based, so we need a multinomial probability model on the output, not a Gaussian probability model. The way we do that is to go from log space through the softmax into the probability domain. |
0:07:33 | Even though it's a fairly simple formula, unfortunately there's not a closed-form solution for the optimal i-vector, so there's a Newton's method iteration. And similarly there's not an EM algorithm for the T matrix that we know of yet, so there is an alternating maximization algorithm. |
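The phonotactic model just described, GMM weights generated through a softmax of a subspace, can be sketched directly. The function names and the use of plain gradient ascent (in place of the Newton iteration mentioned in the talk) are illustrative assumptions:

```python
import numpy as np

def weights(T, y, b):
    """GMM weights generated by latent vector y: w = softmax(b + T y)."""
    z = b + T @ y
    z -= z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def loglik_and_grad(T, y, b, counts):
    """Multinomial log-likelihood of soft counts under w(y), and its
    gradient in y. The objective is concave in y, but has no closed-form
    maximizer, hence the iterative optimization described in the talk."""
    w = weights(T, y, b)
    ll = counts @ np.log(w)
    grad = T.T @ (counts - counts.sum() * w)
    return ll, grad
```

A small gradient step from any point with nonzero gradient increases the log-likelihood, which is the basis of the iterative i-vector estimate.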
0:07:53 | So we had presented this phonotactic thing for LID before. In the meantime we thought: okay, we have two systems, an acoustic and a phonotactic; how are we going to combine them? |
0:08:05 | Obviously the first thing is score fusion, and yes, we did that, and yes, that works. |
0:08:09 | Then we were a little more adventurous: these two i-vector systems are doing the same thing, so why don't I stack the i-vectors together, get one big i-vector, and then run one i-vector system? Does that work? And yes, that works too. |
0:08:22 | Then we thought about it some more and said: well, why do I want two independent i-vector extractors? Why can't I make one latent variable that models both the means of the latent GMM that generated the cut and the weights of the GMM that generated the cut? |
0:08:38 | In fact, the math says that you can. I'll go into a little more detail, but basically this is a permutation of the subspace GMM that Dan Povey was talking about in 2008 and 2009 at the JHU workshop and since. So there are algorithms for doing this; we had to manipulate them a little bit for our purposes. |
0:09:02 | So, a couple of details on how to do this; we have some references in the paper. |
0:09:08 | Some things in particular that we're doing differently than if you just took it out of that prior work: first, it did everything with sort of ML estimates, so there wasn't any prior to back off to. Obviously for the acoustic we don't want to use ML i-vectors, we want to use MAP i-vectors. We've actually shown previously that for a phonotactic system MAP is also beneficial, and if we're going to do it jointly, it's critical that it be the same criterion for both things, because it is a joint optimization of the MAP objective: the overall likelihood plus the prior. |
0:09:44 | A nice trick we can do with this joint i-vector: since there's a closed-form solution for the acoustic part, we can initialize the Newton's method with the acoustic solution and then refine it using the phonotactic part as well. That gets us to a starting point pretty easily, where we can then do a greatly simplified Newton descent, in particular by pretending everything is independent of everything else. That is a huge speed improvement, because doing the full Hessian in this update, as anybody who's ever looked at it knows, is pretty tedious. |
0:10:15 | So once we do that, rather than being much slower than an acoustic i-vector system, it's essentially the same order; it's very simple. |
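The simplification described, pretending coordinates are independent so only the diagonal of the Hessian is used, can be illustrated generically. The toy quadratic objective and the cheap diagonal initializer below stand in for the real joint objective and the closed-form acoustic solution; they are assumptions of this sketch, not the paper's equations:

```python
import numpy as np

def diagonal_newton_step(y, grad, hess_diag):
    """One Newton step using only the Hessian diagonal: the full solve
    is replaced by elementwise division, which is the speed trick in
    the talk."""
    return y - grad / hess_diag

# Toy illustration: minimize 0.5 y'Ay - b'y (A not diagonal), starting
# from a cheap diagonal-only guess and refining with diagonal steps.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
y = b / np.diag(A)                    # cheap initial guess
for _ in range(50):
    grad = A @ y - b                  # gradient of the toy objective
    y = diagonal_newton_step(y, grad, np.diag(A))
y_star = np.linalg.solve(A, b)        # exact minimizer, for comparison
```

For a diagonally dominant curvature this iteration converges to the full solution; in the joint i-vector case the good acoustic initialization is what makes the approximate steps adequate.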
0:10:33 | So now to the LRE 15 task, which has been discussed; I guess this isn't news here. There is telephone and broadcast narrowband speech, with twenty languages in six confusable clusters. |
0:10:48 | The limited training condition is a very important element of what we were able to get away with. Of course that means both that you have limited data for the twenty languages, and also that you can only train your supervised DNN on the Switchboard English, because that's the only thing that had transcripts. |
0:11:06 | That's not our favorite thing to do; it was kind of limiting, but it allows NIST to exercise the technology. And because the languages didn't have much data, that was also key. |
0:11:20 | So, all of our systems: basically, because we had a small team, we didn't build too much complicated stuff; I've described really everything that we did. We had two different ways of using the DNN, and three different kinds of i-vectors that we could have built out of each of the two DNN systems. Out of that we could have done six things; I'll talk about a few that were interesting and the ones that we actually ran. But everything used the same classifier. |
0:11:48 | As I mentioned, because the systems are already calibrated by this MMI process, we didn't have to use a complicated backend. |
0:11:57 | The thing we did introduce, because we knew there was a range of durations that had to be exercised: I think the simplest way we could get there was to reuse some work that we had done previously on making a duration-dependent backend, where there's a continuous function which maps duration into a scale factor on the score, between the raw score and the true log-likelihood estimate that you're trying to make. |
0:12:25 | There's a justification for that function, but for our purposes the important thing is that it's very simply trainable, because it's just got two free parameters. So then you can use this cross-entropy criterion and figure out the best parameters. |
0:12:39 | And then, because we have a very simple system, we just add all the scores together, assume that they were independent estimates, and then rescale the whole thing to bring it back into range. We found that to be helpful for us. |
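The fusion just described, sum the system scores and rescale by a two-parameter function of duration, can be sketched as follows. The particular saturating form `a * d / (d + tau)` is an assumption for illustration; the paper's actual function may differ, but it is likewise a smooth two-parameter map trained with cross-entropy:

```python
import numpy as np

def duration_scale(dur, a, tau):
    """Illustrative two-parameter duration-to-scale map: short cuts are
    shrunk toward zero (less confident scores), long cuts approach the
    full scale `a`. `a` and `tau` are the two free parameters that
    would be trained with the multiclass cross-entropy criterion."""
    dur = np.asarray(dur, dtype=float)
    return a * dur / (dur + tau)

def fuse(score_list, dur, a, tau):
    """Naive fusion from the talk: sum the (assumed-independent) system
    scores, then rescale the total with the duration model."""
    return duration_scale(dur, a, tau) * np.sum(score_list, axis=0)
```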
0:12:58 | Another thing about LRE 15, which was mentioned, but maybe went past quickly for those less familiar with the task, is very important. NIST proposed this somewhat odd task of closed-set detection within each of the clusters. |
0:13:13 | What we did is generate for each cluster an ID score, which means that each cluster had ID posteriors that summed to one. Since there are six clusters, we gave NIST scores from the six, which means that if NIST had wanted to evaluate across-cluster performance, it would have been meaningless. |
0:13:32 | And we had to convert these ID posteriors to detection log-likelihood ratios, which is something we all know how to do here. |
0:13:39 | One thing I want to mention about our system is that we didn't do anything cluster-specific anywhere: we just trained a twenty-language LID system and then spun out the scores for each of the clusters, because that's what NIST wanted. I think we would like, in the future, a more generic LID task. |
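The posterior-to-detection-LLR conversion referred to above is the standard one under flat priors: for each language, compare its posterior against the average of the others. A minimal sketch (function name illustrative):

```python
import numpy as np

def id_posteriors_to_detection_llrs(log_post):
    """Convert closed-set ID log-posteriors (summing to one in
    probability over the last axis) to per-language detection LLRs:

        llr_k = log p_k - log( mean of the other classes' posteriors ),

    the usual conversion assuming flat priors within the set."""
    log_post = np.asarray(log_post)
    K = log_post.shape[-1]
    total = np.logaddexp.reduce(log_post, axis=-1, keepdims=True)
    # log of the summed "rest" mass, computed stably in the log domain
    log_rest = total + np.log1p(-np.exp(log_post - total))
    return log_post - (log_rest - np.log(K - 1))
```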
0:14:01 | Now, the key element that I mentioned is dealing with the limited training data, so we had to figure out what to do with that. |
0:14:11 | As I mentioned, we have the unsupervised and supervised parts. We took the theory, which was later proven not quite right, that we would use everything we could for the unsupervised part, which included Switchboard, which is English only, and English was not one of the languages. In fact, we could have done better than that; I'll talk about it. |
0:14:30 | Then for the classifier design, we did find it helpful to do augmentation and duration modeling of the cuts. So we could use all sides; we used segments whose durations were appropriate for the LID task. And we used augmentation to transform the limited clean data, to try to give us more examples of what i-vectors would look like. |
0:14:55 | To go into the augmentation a little bit more: many of these are standard things; the big thing in DNNs now is to do augmentation. So, sample-rate perturbation and additive noise; reverb is kind of a form of additive noise, but maybe more interesting, and we did throw that in. And multi-band compression is the kind of signal-processing thing that you might see applied to an audio signal. |
0:15:20 | But the thing I want to mention, which we actually don't have in the slides, but you can look in the paper: the most effective single augmentation for us in the task was to run a speech coder encoder-decoder over the data, which kind of makes sense as a thing to do, and as a former speech coding person I find it fairly attractive. |
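Of the augmentations listed, sample-rate (speed) perturbation is easy to sketch: resample the waveform by a factor near one. This is a minimal linear-interpolation version for illustration; a real system would use a proper polyphase resampler, and the factors below simply echo the plus-or-minus ten percent mentioned later in the Q&A:

```python
import numpy as np

def speed_perturb(signal, factor):
    """Speed/sample-rate perturbation by linear-interpolation resampling.

    factor > 1 speeds the signal up (fewer samples), factor < 1 slows
    it down; e.g. factor in {0.9, 1.1} for +/- ten percent perturbation.
    """
    n_out = int(round(len(signal) / factor))
    t = np.arange(n_out) * factor          # fractional source positions
    return np.interp(t, np.arange(len(signal)), signal)
```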
0:15:42 | So, our submission performance. These are the four things that we submitted; our primary was in fact the one at the bottom, which looks like it was a pretty good choice out of the ones available to us. |
0:15:54 | So we did a joint i-vector on the bottleneck features; I'll show more later, though I don't remember the exact dimensionalities in this submission. |
0:16:04 | Our senone-based system was actually slightly better than our bottleneck system, and again, that makes it the best sort of phonotactic system, I think, that anybody saw, because everyone else found the bottlenecks to be the only really good thing to do. |
0:16:18 | And fusion provided a gain, partly because we have simple fusion and partly because we have two systems which are pretty good. |
0:16:28 | We learned a couple of things post-eval that we found educational. The first one I won't go into in much detail here (it's in the paper), but within the family of Gaussian scoring there's a question of whether you count trials as independent or not, which in speaker recognition typically pertains when you only have one trial for enrollment. |
0:16:50 | The variant we submitted, which we usually see as slightly better, turned out for this eval to be slightly worse. I have no idea why. |
0:16:57 | The other thing, which might be a little bit more interesting, is the data usage. We spent quite a bit of time, even with the metadata, trying to decide what to do with the UBM and T. |
0:17:08 | But the thing that turned out to work best we didn't try, because we thought it was a dumb idea: just use only the LID data, and only the full cuts, which, I forget exactly, but I think is only three or four thousand cuts or something. That ought to be nowhere near enough to train a T matrix, we thought. But it was better. |
0:17:30 | So here again there are more numbers splitting things out. The first thing, which was kind of interesting for us: we went and ran this acoustic baseline, what we would have done with previous technology, and we are definitely better with all the stuff we have. I don't know if we're astoundingly better, but we're better. |
0:17:51 | Next, we split out, with the senone system, the three different kinds of i-vectors. The first thing is that the phonotactic system by itself is actually better than the acoustic system, which is what we had seen before. |
0:18:04 | A linguist might argue about whether it's really a phonotactic system, looking at counts of frame posteriors, but that aside, it's, I think, the best-performing phonotactic system that's out there for LID right now. And then you see also that the joint i-vector does give a noticeable gain over the acoustic. |
0:18:44 | Okay, and the fusion still works. So let me just conclude. We were able to get pretty good performance in this evaluation with a small team and a relatively straightforward system. |
0:18:58 | We think that there is still value in the senone-count system; it doesn't have to be just bottlenecks, and we were able to show that. We think that the phonotactic and the joint i-vectors (the joint i-vector especially) are a nice, simple way to capture that information, and that's one of the things that enables the senone system to be competitive. |
0:19:20 | We think it is helpful to use a really simple fusion if you have a discriminatively trained classifier to start with. And we find that data augmentation can be a very valuable thing for the management of limited data. |
0:19:35 | Thank you. |
0:19:43 | (Chair) We have time for some questions. |
0:19:55 | (Question, partly unintelligible) Thank you for the talk. For the counts you proposed to collect ... do you use the same classifier tools for the phonotactic counts as for the other i-vectors? |
0:20:15 | Yes, we always use the same backend, the Gaussian classifiers, no matter what kind of i-vectors. |
0:20:22 | (Q) Because the distribution is not Gaussian? |
0:20:24 | No, the intention is that the i-vector can still live in a Gaussian space; that's why we like this kind of subspace. There are other count-subspace algorithms, like latent Dirichlet allocation or non-negative matrix factorization, and some of those have been compared, where the subspace is in the linear probability space. I don't think that would be well modeled by a Gaussian; in fact I know it wouldn't be, I'm pretty comfortable, because it's positive. But by going into the log space, I think it does become reasonable. |
0:20:57 | (Q) So it really is analogous to LDA with the right tools. Okay, thank you. |
0:21:20 | (Q) I very much liked the additional processing that you're doing to kind of augment the data; you had, say, sample-rate perturbation and the speech coder versions. If you had to go back again, which ones do you think actually would help? |
0:21:35 | I think you mean in hindsight. There is a table in the paper; many of them are helpful, but the speech coder is the most helpful on its own. |
0:21:45 | (Q) So in the sample-rate conversion, was it a really big variation? |
0:21:51 | We did things like plus or minus ten percent and plus or minus five percent, but I think, I would say, that's big. |
0:22:02 | (Q) So was there a big difference, maybe, between the CTS and the broadcast-news parts? What would you be guessing? |
0:22:12 | We didn't break them apart. |
0:22:24 | (Q) Did you try other nonlinearities, not just the p-norm? |
0:22:30 | We have since, and, a little bit, it seems like for this particular task the sigmoids that some other people use are a little bit better. I'm not sure we think that's a universal statement. |
0:22:46 | Excuse me: the sigmoids are better for training the bottlenecks; for the senones, maybe not. So we have looked a little bit; there is more to explore. |
0:23:07 | (Chair) So if there are no more questions, we'll assume everybody here knows everything about language recognition. Coming up next: the same speaker again. |