0:00:13 | So, as was said, you have already been told what the i-vector is; let me go quickly through it.
0:00:21 | The i-vector is an information-rich, low-dimensional, fixed-length representation (a voice print) of an arbitrarily long utterance.
0:00:28 | We like these representations because they remove the time dimension and turn the speaker ID task into a pattern recognition problem, and you have already been shown how to deal with them.
0:00:40 | So, just to go quickly over the estimation again: what we want to model is the data that comes in. Here we have an example of an utterance.
0:00:54 | We usually model it using a Gaussian mixture model; we forget about the variances, remember only the means, and construct a supervector of the means.
0:01:06 | Now, what we do is look at more data: we extract the means of all the utterances, and we see that they follow some kind of distribution.
0:01:18 | This is what we assume in the i-vector model. We see that there is some offset, which is represented by the UBM mean, the m symbol in this picture, and then we have the total variability space, which is represented by the arrows: they show in which directions we can shift the mean to adapt it to the incoming utterance.
0:01:46 | That is, they describe the directions of the variability.
0:01:51 | The vector w is a latent variable, so we can impose a prior on it; we choose a Gaussian, the standard normal prior.
0:02:03 | Given some incoming data X, we compute the posterior, which is also Gaussian, with mean w_X and precision matrix L_X.
0:02:17 | Basically, what we call the i-vector is the mean of this posterior.
0:02:22 | Without going into any detail (this is just a cookbook-style talk): to compute the i-vector we need the statistics extracted with the UBM, that is, the zero-order statistics and the first-order statistics.
0:02:41 | Before we go any further, we do a little trick: we center the data around the UBM, so we find which cluster of the data belongs to which component of the UBM, and we shift it around accordingly.
0:02:56 | We also whiten the data using the UBM covariance matrices, which, as you may have already realized, virtually makes the covariances of the individual GMM components equal to identity.
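(A minimal NumPy sketch of this statistics extraction with the centering and whitening tricks just described; the names are illustrative, not from any particular toolkit, and a diagonal-covariance UBM is assumed.)

```python
import numpy as np

def collect_stats(frames, weights, ubm_means, ubm_covs):
    """frames: (T, F) features; weights: (C,) UBM weights;
    ubm_means: (C, F); ubm_covs: (C, F) diagonal covariances."""
    # Frame-by-component log-likelihoods under the diagonal-covariance UBM.
    ll = -0.5 * (((frames[:, None, :] - ubm_means) ** 2) / ubm_covs).sum(-1)
    ll += np.log(weights) - 0.5 * np.log(2 * np.pi * ubm_covs).sum(-1)
    # Posterior occupation probabilities (responsibilities), shape (T, C).
    gamma = np.exp(ll - ll.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    N = gamma.sum(axis=0)            # zero-order statistics, (C,)
    f = gamma.T @ frames             # first-order statistics, (C, F)
    f -= N[:, None] * ubm_means      # centering around the UBM means
    f /= np.sqrt(ubm_covs)           # whitening: component covariances -> I
    return N, f
```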
0:03:22 | So here is the cookbook equation for computing the i-vector. The i-vector w_X is basically a product of the inverse of the posterior precision, the factor loading matrix T, which describes the subspace, and the first-order statistics: w_X = L_X^{-1} T^T f(X).
0:03:42 | The precision matrix L_X is basically a sum, over all the Gaussians, of the associated pieces T_c^T T_c of the T matrix, weighted by the zero-order statistics of the incoming utterance: L_X = I + sum_c N_c(X) T_c^T T_c.
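(Put together with the statistics above, a sketch of this standard extraction could look as follows, assuming centered and whitened statistics and T_c being the (F, M) block of T that belongs to component c.)

```python
import numpy as np

def extract_ivector(N, f, T, F, M):
    """N: (C,) zero-order stats; f: (C, F) centered, whitened first-order
    stats; T: (C*F, M) factor loading matrix."""
    L = np.eye(M)                    # prior precision is the identity
    for c in range(len(N)):
        Tc = T[c * F:(c + 1) * F]    # (F, M) block for component c
        L += N[c] * (Tc.T @ Tc)      # L_X = I + sum_c N_c(X) T_c' T_c
    return np.linalg.solve(L, T.T @ f.reshape(-1))  # w_X = L_X^{-1} T' f(X)
```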
0:03:59 | Now a little analysis of what this function actually computes. We have C GMM components, an F-dimensional feature space, and an M-dimensional subspace, which describes our total variability space.
0:04:21 | The M^3 term is the inversion; there is not much we can do about that. The biggest problem is actually the sum in the precision computation, and then we have the product of the individual matrices with the first-order statistics.
0:04:45 | As for the memory complexity, we have to store everything we use in the computation. When computing the precision, we can pre-compute these products in advance, because they do not depend on the data, so the memory demands are really high for this model.
0:05:04 | If we mention that in a typical model we have thousands of Gaussians, this can be really large. The other thing we have to store is the T matrix itself. These two terms bound the memory complexity of the extractor.
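(To make this concrete, a back-of-the-envelope count; these numbers are mine, derived from the configuration quoted later in the talk, assuming 4-byte floats.)

```python
C, F, M = 2048, 60, 400          # Gaussians, feature dim, subspace dim
tt_blocks = C * M * M * 4 / 1e9  # pre-computed T_c' T_c terms: ~1.3 GB
t_matrix  = C * F * M * 4 / 1e9  # the T matrix itself:         ~0.2 GB
```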
0:05:21 | So the motivation for the simplification of these formulas was that we actually wanted to port the application to small-scale devices, as part of the MOBIO project.
0:05:30 | We also wanted to prepare the i-vector framework for discriminative training, where we thought that such equations could be quite difficult to compute gradients for.
0:05:43 | Let's first take a look at the first simplification. What we assume here, in the first assumption shown in the picture, is that the proportion of the data generated by each Gaussian of the UBM is constant across utterances, and that these proportions are given by the UBM weights.
0:06:05 | What happens is that the sum in the precision computation becomes independent of the data, and we can effectively pre-compute it in advance.
0:06:17 | So each time we compute the precision, instead of evaluating the whole sum in the formula, we only have a scaled addition of two matrices.
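(A sketch of what this buys, under the stated assumption N_c(X) ≈ n(X) ω_c, where n(X) is the utterance frame count and ω_c are the UBM weights; names again illustrative.)

```python
import numpy as np

def precompute_weighted_sum(T, weights, F, M):
    """Done once, offline: S = sum_c omega_c T_c' T_c."""
    S = np.zeros((M, M))
    for c in range(len(weights)):
        Tc = T[c * F:(c + 1) * F]
        S += weights[c] * (Tc.T @ Tc)
    return S

def extract_ivector_simplified(n_frames, f, T, S, M):
    L = np.eye(M) + n_frames * S     # scaled addition, no per-utterance sum
    return np.linalg.solve(L, T.T @ f.reshape(-1))
```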
0:06:42 | A little analysis: we effectively got rid of the C M^2 term in the computational complexity, and almost all of the memory complexity is gone as well; basically, we got rid of most of the data that we were storing before.
0:06:59 | Just as a reminder before the results section: the number of Gaussians is in the thousands, as was said, and the typical size of the subspace is in the hundreds, say four hundred.
0:07:14 | So that was the first simplification. We also had the thought, or we tried to assume, that we can find a linear orthogonalization transformation G, some G that would diagonalize the T_c^T T_c terms, the component-associated parts of the factor loading matrix T, which are bothering us in the precision computation.
0:07:44 | With such a transformation, we can multiply the equation from both sides and get something like this; then, to get the original precision, we would just multiply from both sides by the inverse of G, if our assumption was correct.
0:08:03 | The nice thing here is that we would be summing diagonal matrices, which can be implemented effectively, in C or in MATLAB, and the other nice thing is that the resulting precision matrix is diagonal as well.
0:08:21 | If you remember, we were inverting it in the i-vector extraction formula, so the inversion of this diagonal matrix is trivial here.
0:08:34 | To write it effectively, we can pack the diagonals of the G^T T_c^T T_c G terms into a single matrix, and we can then simply take a dot product with the vector of zero-order statistics; that is the first equation here.
0:08:58 | The lower-case diag symbol stands for extracting the diagonal of a matrix into a column vector, and the capital Diag symbol maps a column vector back onto a diagonal matrix.
0:09:13 | The i-vector extraction is then given by the second equation here. A nice thing about it is that the G transpose in the middle of the equation can be projected directly into the T matrix, which can give us some benefit, and, as we said, the L matrix can be inverted effectively.
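(A sketch of this diagonalized extraction, assuming a single G has been found that makes every G^T T_c^T T_c G approximately diagonal; the result lives in the G-rotated space, so multiply by G to map back if the original coordinates are needed.)

```python
import numpy as np

def precompute_diagonals(T, G, F, M):
    """Done once, offline: row c holds diag(G' T_c' T_c G)."""
    C = T.shape[0] // F
    D = np.empty((C, M))
    for c in range(C):
        TcG = T[c * F:(c + 1) * F] @ G
        D[c] = np.einsum('fm,fm->m', TcG, TcG)  # column-wise sums of squares
    return D

def extract_ivector_diag(N, f, T_proj, D):
    """T_proj = T @ G, i.e. G absorbed ("projected") into T offline."""
    l = 1.0 + N @ D                        # diagonal of the precision, (M,)
    return (T_proj.T @ f.reshape(-1)) / l  # trivial inversion of a diagonal
```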
0:09:31 | If we look at the analysis again: in the computational complexity we got rid of the quadratic terms, since we now work only with the diagonal, and in the memory complexity we got an extra C M term, but we got rid of the C M^2 term.
0:09:56 | The question is how we compute the G matrix. The first idea was to use PCA, which, as we will see, works.
0:10:06 | The second idea was to use heteroscedastic linear discriminant analysis (HLDA). Here is a simple example of what it does: basically it wants to rotate those two covariance matrices by forty-five degrees, so that the average within-class covariance becomes an identity matrix here.
0:10:30 | That was the inspiration; it is used with success in LVCSR tasks.
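(For the PCA option, one plausible reading, my sketch rather than the exact recipe from the talk: take G as the eigenvectors of the UBM-weight-averaged matrix, which diagonalizes the average exactly and each individual term only approximately.)

```python
import numpy as np

def compute_G_pca(T, weights, F, M):
    S = np.zeros((M, M))
    for c in range(len(weights)):
        Tc = T[c * F:(c + 1) * F]
        S += weights[c] * (Tc.T @ Tc)  # weighted average of T_c' T_c terms
    _, G = np.linalg.eigh(S)           # orthonormal eigenvectors as columns
    return G
```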
0:10:36 | Just a quick slide where I say something about training the T matrix, for those who do not know it. There is a pair of accumulators that we have to accumulate while training the T matrix: we go over all training utterances, do some computation, accumulate, and do an update at the end of this procedure.
0:10:59 | Without a deep theoretical explanation: inside this computation we see that we use both w, which is the latent i-vector, and the precision matrix. So if we know how to simplify this precision matrix, we can simplify the accumulators, and the whole training procedure gets simplified as well.
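(For reference, a sketch of the usual maximum-likelihood EM accumulators for T, shown only to make visible where w and the precision L enter, and hence where the simplifications help.)

```python
import numpy as np

def accumulate(N, f, T, F, M):
    """Per-utterance E-step contributions; the M-step after summing over
    utterances is T_c = Cacc_c @ inv(Aacc_c)."""
    L = np.eye(M)
    for c in range(len(N)):
        Tc = T[c * F:(c + 1) * F]
        L += N[c] * (Tc.T @ Tc)
    Linv = np.linalg.inv(L)
    w = Linv @ (T.T @ f.reshape(-1))          # posterior mean, E[w]
    Eww = Linv + np.outer(w, w)               # second moment, E[w w']
    Aacc = N[:, None, None] * Eww             # (C, M, M)
    Cacc = f[:, :, None] * w[None, None, :]   # (C, F, M), i.e. f_c E[w]'
    return Aacc, Cacc
```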
0:11:20 | The memory usage is reduced here as well: with this simple trick we get to about a half of the memory, which means we can effectively try to increase the other parameters. The numbers of parameters are shown here for comparison.
0:11:42 | Now for the experimental setup: we used MFCC features, the standard thing, with short-time cepstral mean and variance normalization, and we used delta and double-delta coefficients.
0:11:56 | For the training set we used different combinations of Switchboard II Phases 2 and 3, Switchboard Cellular, and the NIST 2004 to 2006 data; these were also used for training the T matrix.
0:12:09 | For the test set, we evaluated on the NIST SRE'10 extended core condition 5, which is telephone-telephone, female and male.
0:12:18 | One thing to mention on this slide is that we used exactly the same scoring as was mentioned in the previous talk, that is, the cosine distance with within-class covariance normalization.
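(A sketch of this scoring under the usual WCCN recipe; W below is the average within-speaker covariance of training i-vectors. This is the standard formulation, not code from the talk.)

```python
import numpy as np

def wccn_cosine_score(w_enroll, w_test, W):
    B = np.linalg.cholesky(np.linalg.inv(W))  # WCCN projection: W^{-1} = B B'
    a, b = B.T @ w_enroll, B.T @ w_test
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```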
0:12:28 | For the performance tests, because we measured the speed and the memory demands, we used the MATLAB environment, which was set to single-core, single-thread operation, and ran on one of our internal processors.
0:12:46 | We measured the speed on fifty randomly picked utterances from the Mixer corpus, for which we had the statistics computed in advance, so the statistics collection is not included in the analysis.
0:12:59 | The UBM was a diagonal-covariance, 2048-component UBM, and it was trained on the Fisher data.
0:13:09 | So, a summary of the numbers: we used 2048 Gaussians, the feature dimension was 60, and we used a 400-dimensional subspace.
0:13:19 | The 400 had been chosen as a trade-off between performance and technical conditions, by which I mean the configuration of the machines that computed the i-vectors.
0:13:31 | As I said, in one of our simplifications we were able to decrease the memory demands, so to be fair we also had to use an 800-dimensional subspace, just to see what happens.
0:13:50 | Here is a little constellation plot of the results. The X here is the baseline, and of course the little block down at the bottom marks the 800-dimensional traditional i-vector extraction.
0:14:09 | You can see that the simplified systems perform slightly poorer than the baseline, but this is just an informative picture. We see that the best non-traditional i-vector extractor goes from about 3.6 to about 3.8 percent equal error rate.
0:14:30 | The same can be said, analogically, for the norm minDCF: the systems are slightly worse, but this work was aimed elsewhere.
0:14:42 | Now to the analysis of the speed of the computation. With the baseline, extracting those fifty i-vectors took about thirteen seconds.
0:15:01 | You can see the relative numbers here. Compared with the 800-dimensional baseline there is a huge drop, because the complexity grows there.
0:15:17 | The nice thing to observe is that if we are able to train such a system somehow on our hardware, we can afford to use 800-dimensional i-vectors and still get to about ten percent of the original time that was necessary to compute those fifty i-vectors.
0:15:38 | Now let's take a look at the comparison of memory usage. For the baseline, the first column, I mean the second column, is what is constant: something that we cannot change, something that we have to store in memory.
0:15:53 | Going into the specific numbers, they show a dramatic decrease in memory needs for the simplified algorithms.
0:16:04 | So, if we want to use 800-dimensional i-vectors, we still need only a fraction of the memory of the traditional 800-dimensional baseline system, which, again, counts practically.
0:16:24 | This is just a proof that we can use those simplifications in the i-vector training procedure as well. We save space, but the simplifications also make the training process a lot faster.
0:16:38 | These numbers just show that the difference between a T matrix trained using the traditional i-vector extraction and one trained using the simplified i-vector extraction is small, so we can really train directly with the simplified one.
0:16:57 | So the conclusion is that we managed to simplify this state-of-the-art technique in terms of speed and memory, while sacrificing some of the recognition performance.
0:17:11 | We have also simplified the formulas so that they are easily differentiable, for the future work, which is going to be the discriminative training of the i-vector extractor: the matrix T, or G, and the others.
0:17:27 | And finally, we managed to fit the i-vector based system into a cellphone application, which was one of the tasks in the MOBIO project, which was on bi-modal mobile speaker recognition.
0:17:47 | Thank you.
0:17:53 | (Chair) Thank you. We have time for one or two questions.
0:18:02 | (Audience) No questions? Then I have one. You made two assumptions to simplify your algorithm. Did you verify in some way which of the assumptions was the worse one? That is, did you verify it in some way with the data, and not just by looking at the scores?
0:18:23 | No. We were only looking at the recognition performance, whether one or the other assumption was wrong; that is all.
0:18:33 | But there was a mismatch, of course: the proportion of the data generated by the Gaussians differs across utterances; it is not always equal to the UBM weights.
0:18:51 | And because we were using 2048 Gaussians, finding one single orthogonalization matrix is probably also not appropriate here. But we tried it, and it worked to some extent.
0:19:09 | (Chair) Okay. Any other questions?
0:19:20 | No, I did not combine the techniques; I have not combined them so far.
0:19:31 | Oh, I am sorry, I misheard. Yes, it was better than PCA.
0:19:40 | Than the baseline? No. Yeah, thank you, that is a good point.
0:19:52 | (Chair) Okay, there are no other questions, so let's thank the speaker again.