0:00:24 | This whole session should be about compact representations for speaker identification and the first talk |
---|
0:00:30 | is... the title of the first talk is "A Small Footprint i-Vector Extractor".
---|
0:00:39 | So, I repeat, what I mean
---|
0:00:44 | by a small footprint
---|
0:00:46 | i-vector extractor is
---|
0:00:47 | one that needs only a modest
---|
0:00:50 | amount of memory.
---|
0:01:00 | The trouble we have, the
---|
0:01:01 | basic problem, is that these kinds of algorithms for extracting i-vectors
---|
0:01:06 | are quadratic,
---|
0:01:07 | both of them, both the memory and the
---|
0:01:10 | computational requirements.
---|
0:01:13 | What we present
---|
0:01:15 | is a follow-up on the
---|
0:01:18 | paper that Ondrej
---|
0:01:20 | Glembek presented at last year's ICASSP,
---|
0:01:28 | an approximation showing how i-vectors could be extracted
---|
0:01:32 | with minimal memory
---|
0:01:33 | overhead. But after...
---|
0:01:34 | after discussing that work, what I intend to do
---|
0:01:39 | is to show how that idea...
---|
0:01:49 | The principal motivation for doing this work...
---|
0:01:53 | it's well known that
---|
0:01:56 | you can introduce approximations at run time and
---|
0:01:59 | have only minor degradation in the...
---|
0:02:02 | in the recognition performance.
---|
0:02:06 | However, these approximations generally do cause problems if you want to do training. So, the |
---|
0:02:12 | motivation for aiming at the exact posterior computations was to be able to do training |
---|
0:02:20 | and, in particular, to be able to do training on a very large scale. Traditionally, |
---|
0:02:26 | most work with i-vectors has been done with dimensions four or six hundred. In other |
---|
0:02:34 | areas of pattern recognition, principal components analyzers of much higher dimension than that are constructed, so
---|
0:02:42 | the |
---|
0:02:43 | purpose of the paper was to be able to run experiments with very high-dimensional
---|
0:02:50 | i-vector extractors. |
---|
0:02:52 | As it happens, this didn't pay off. But the experiments needed to be done, you |
---|
0:02:57 | know, in any case. |
---|
0:03:03 | Okay, so the... the point of i-vectors, then, is that they
---|
0:03:09 | they provide a compact representation of an utterance |
---|
0:03:14 | typically a vector of four or eight hundred dimensions, independently of the length of the utterance.
---|
0:03:19 | So the time dimension is banished altogether, which greatly simplifies the problem.
---|
0:03:29 | Essentially, it now becomes a traditional |
---|
0:03:33 | biometric pattern recognition problem without the complication introduced by |
---|
0:03:40 | arbitrary duration. So, many standard techniques apply and joint factor analysis becomes vastly simpler; so
---|
0:03:48 | simple that it now has another name: it's called probabilistic linear
---|
0:03:53 | discriminant analysis.
---|
0:03:56 | And, of course, the simplicity of this representation has led to,
---|
0:04:01 | well, fruitful research in other areas, like language recognition, and even speaker diarization, for i-vectors
---|
0:04:11 | can be extracted from speaker turns as short as just one second.
---|
0:04:20 | So, the basic idea is there is an implicit assumption that, given an utterance, it |
---|
0:04:25 | can be represented by a Gaussian mixture model. |
---|
0:04:30 | If the |
---|
0:04:33 | if that GMM were observable, |
---|
0:04:35 | then the problem of extracting the i-vector would simply be a matter of applying a |
---|
0:04:41 | standard probabilistic principal components analysis
---|
0:04:47 | to the GMM supervector |
---|
0:04:50 | So, the basic assumption is that the supervectors lie in a low-dimensional space; the basis
---|
0:04:56 | of that space |
---|
0:04:59 | is known as the eigenvoices, and the coordinates of the supervector relative to that basis
---|
0:05:06 | are the i-vector representation. So, the idea is that the components of the i-vector should
---|
0:05:15 | represent high level aspects of the utterance, which are independent of the phonetic content. |
---|
0:05:22 | Because all of this apparatus is built
---|
0:05:24 | on top of the UBM, the UBM can play the role of modelling the phonetic variability in the
---|
0:05:31 | utterance |
---|
0:05:32 | and the i-vector then should capture things like speaker characteristics |
---|
0:05:39 | room impulse response |
---|
0:05:40 | and the other global aspects of the |
---|
0:05:45 | utterance |
---|
0:05:48 | So, the problem that arises is that the |
---|
0:05:54 | the GMM supervector is not observable. The way to get around the problem is by |
---|
0:06:01 | thinking of the Baum-Welch statistics,
---|
0:06:04 | typically collected with the universal
---|
0:06:07 | background model, as... summarising a noisy observation of the... of the GMM supervector.
---|
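A minimal sketch, for readers following along, of how the zeroth- and first-order Baum-Welch statistics he refers to are commonly collected with a diagonal-covariance UBM; the function and variable names are illustrative, not taken from the speaker's own implementation.

```python
import numpy as np

def collect_baum_welch_stats(frames, ubm_weights, ubm_means, ubm_vars):
    """Zeroth- and first-order Baum-Welch statistics for one utterance.

    frames      : (T, D) acoustic feature vectors
    ubm_weights : (C,)   UBM mixture weights
    ubm_means   : (C, D) UBM component means
    ubm_vars    : (C, D) diagonal covariances of the UBM components
    """
    # Per-frame, per-component Gaussian log-likelihoods (diagonal covariance).
    diff = frames[:, None, :] - ubm_means[None, :, :]                    # (T, C, D)
    log_gauss = -0.5 * np.sum(diff**2 / ubm_vars
                              + np.log(2.0 * np.pi * ubm_vars), axis=2)  # (T, C)
    # Frame-level posteriors ("occupation counts") under the UBM.
    log_post = np.log(ubm_weights) + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)
    N = gamma.sum(axis=0)      # zeroth-order statistics, one count per component
    F = gamma.T @ frames       # first-order statistics, one D-vector per component
    return N, F
```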
0:06:20 | From
---|
0:06:21 | the mathematical point of view, the only difference between this situation and standard
---|
0:06:28 | probabilistic principal components analysis is that in the standard situation you get to observe every
---|
0:06:36 | component of the vector exactly once, and in this situation you observe different parts of the
---|
0:06:41 | vector different numbers of times.
---|
0:06:45 | Other than that, there is nothing mysterious in the derivation.
---|
0:06:49 | So, this is the mathematical model: the supervectors
---|
0:06:55 | are assumed to be confined to a low-dimensional
---|
0:07:00 | subspace of the supervector space.
---|
0:07:04 | The vector y is assumed to have, in the prior, a standard normal distribution. Now, the problem
---|
0:07:10 | is, given Baum-Welch statistics, to produce a point estimate of y, and that is the
---|
0:07:16 | i-vector representation of the utterance.
---|
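In symbols, the model being described is the usual one; the notation below is generic rather than necessarily the speaker's slides.

```latex
% Supervector model: m is the UBM supervector, T the matrix of eigenvoices.
\mu(s) = m + T\,y(s), \qquad y(s) \sim \mathcal{N}(0, I)

% The i-vector is the posterior point estimate (the posterior mean) of y(s)
% given the Baum-Welch statistics N(s), F(s) of the utterance.
\phi(s) = \mathrm{E}\big[\, y(s) \mid N(s), F(s) \,\big]
```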
0:07:21 | You can also write this in terms of the individual
---|
0:07:23 | components of the GMM.
---|
0:07:28 | the standard assumption is that the covariance matrix here remains unchanged, it's the same for |
---|
0:07:35 | all utterances. |
---|
0:07:36 | The...
---|
0:07:39 | Attempting to make that utterance-dependent seems to lead to insuperable problems in practice;
---|
0:07:46 | nobody, to my knowledge, has
---|
0:07:46 | ever made any progress with that problem.
---|
0:07:55 | One aspect that is common to most implementations is that |
---|
0:08:02 | some of the parameters, mainly the mean vectors and the covariance matrices,
---|
0:08:07 | are copied, technically, from the UBM, into the
---|
0:08:12 | probabilistic model |
---|
0:08:15 | That, actually, leads to a slight improvement in the performance |
---|
0:08:19 | I'll report some results
---|
0:08:22 | later |
---|
0:08:25 | The main advantage, though, is that you can simplify the implementation by performing an affine
---|
0:08:30 | transformation of the parameters,
---|
0:08:35 | which enables you to take the mean vectors to be zero and covariance matrices |
---|
0:08:39 | to be the identity |
---|
0:08:41 | and that enables you to handle UBMs with full covariance matrices |
---|
0:08:50 | in the simplest way. |
---|
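One common way to realize the affine transformation he describes is to centre and whiten the statistics per mixture component with the UBM parameters; this is a sketch of the standard change of variables, in notation of my own choosing.

```latex
% Centre the first-order statistics and whiten both the statistics and the
% corresponding block of the eigenvoice matrix with the UBM parameters:
\tilde{F}_c = \Sigma_c^{-1/2}\,\big(F_c - N_c\, m_c\big), \qquad
\tilde{T}_c = \Sigma_c^{-1/2}\, T_c

% In the transformed variables the component means are zero and the
% covariances are the identity, so full-covariance UBMs are handled in
% exactly the same way as diagonal ones.
```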
0:08:51 | It's well known that using
---|
0:08:53 | full covariance matrices
---|
0:08:55 | does help.
---|
0:08:59 | So, these are the standard equations for extracting the i-vectors, assuming that the model parameters |
---|
0:09:07 | are known. |
---|
0:09:09 | The matrices...
---|
0:09:10 | the
---|
0:09:13 | problem is accumulating this... this matrix here.
---|
0:09:19 | Those are the zero order statistics, |
---|
0:09:24 | that are extracted with the UBM. |
---|
0:09:29 | The standard procedure is to precompute the terms here. These matrices,
---|
0:09:34 | these here, they're symmetric matrices. |
---|
0:09:37 | So, you only need the |
---|
0:09:41 | upper triangle.
---|
0:09:41 | The problem, then, from the memory point of view, is that... because these are quadratic
---|
0:09:48 | in the i-vector dimension, you have to pay a heavy |
---|
0:09:52 | price in terms of memory. |
---|
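A minimal sketch of the standard exact computation being described, assuming the statistics and eigenvoice blocks have already been whitened so each Σ_c is the identity; names and shapes are illustrative. The precomputed per-component R×R matrices are what makes the memory cost quadratic in the i-vector dimension R.

```python
import numpy as np

def precompute_quadratic_terms(T_blocks):
    """Precompute T_c^T T_c for every component: C symmetric R x R matrices.

    This is the quadratic-in-R memory cost discussed above (only the upper
    triangle actually needs to be stored).
    """
    return [Tc.T @ Tc for Tc in T_blocks]

def extract_ivector_exact(N, F_blocks, T_blocks, TtT_blocks):
    """Exact i-vector posterior for one utterance (whitened statistics).

    N        : (C,) zeroth-order statistics
    F_blocks : list of C centred/whitened first-order statistics, each (D,)
    T_blocks : list of C eigenvoice blocks, each (D, R)
    """
    R = T_blocks[0].shape[1]
    precision = np.eye(R)       # posterior precision, I + sum_c N_c T_c^T T_c
    linear = np.zeros(R)        # accumulates T^T F
    for Nc, Fc, Tc, TtTc in zip(N, F_blocks, T_blocks, TtT_blocks):
        precision += Nc * TtTc
        linear += Tc.T @ Fc
    ivector = np.linalg.solve(precision, linear)  # posterior mean = point estimate
    return ivector, precision
```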
0:09:57 | So, those are some |
---|
0:09:58 | typical figures |
---|
0:09:59 | for a fairly standard sort of
---|
0:10:01 | configuration. |
---|
0:10:08 | that is |
---|
0:10:11 | These are the standard training algorithms; the only point I wanted to make in |
---|
0:10:17 | putting up this equation here is that: both in training and in extracting the i-vector, |
---|
0:10:25 | which was on
---|
0:10:26 | the previous slide, the principal |
---|
0:10:30 | computation is a matter of calculating the |
---|
0:10:34 | posterior distribution of that factor y. |
---|
0:10:39 | So, that is the problem, |
---|
0:10:40 | calculating the posterior distribution of y. |
---|
0:10:45 | Not just the point estimate.
---|
0:10:54 | So, the contribution of this paper |
---|
0:10:56 | is to use a variational Bayes implementation of the probability model |
---|
0:11:03 | in order to solve this |
---|
0:11:06 | particular problem of
---|
0:11:08 | doing the posterior calculation.
---|
0:11:12 | So, the standard assumption is to assume that the |
---|
0:11:18 | posterior distribution that you're interested in factorizes; in other words, that you have a statistical |
---|
0:11:25 | independence assumption
---|
0:11:27 | that you can impose.
---|
0:11:31 | Estimating these terms here is carried out by a standard variational Bayes update procedure, which you
---|
0:11:40 | can |
---|
0:11:41 | find |
---|
0:11:44 | in the reference,
---|
0:11:44 | Bishop's book.
---|
0:11:48 | This notation here means you take the vector y and |
---|
0:11:53 | you |
---|
0:11:55 | calculate an expectation over all components other than the particular component
---|
0:12:01 | that you happen to be interested in when you're updating
---|
0:12:02 | that particular term.
---|
0:12:06 | the... |
---|
0:12:09 | These update rules are guaranteed to increase the variational
---|
0:12:12 | lower bound, and that's a useful
---|
0:12:17 | property.
---|
0:12:20 | So, this is an iterative method;
---|
0:12:22 | a single iteration consists of
---|
0:12:27 | cycling over the |
---|
0:12:29 | components of the i-vector. |
---|
0:12:34 | This is just to explain that the computation is actually brought down to something very simple:
---|
0:12:41 | the assumptions are
---|
0:12:42 | Gaussian, so
---|
0:12:43 | the
---|
0:12:46 | factors in the variational factorization are also Gaussian. To get these normal distributions,
---|
0:12:52 | you just
---|
0:12:57 | evaluate this expression.
---|
0:12:58 | And the point about the memory, then, is that, just as in the full posterior
---|
0:13:05 | calculation, pre-computing these matrices here enables you to speed up the computation, at a cost in
---|
0:13:15 | memory.
---|
0:13:16 | The things you have to pre-compute here are just the diagonal versions of these
---|
0:13:22 | things here, and for that the memory overhead is negligible.
---|
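A sketch of how the coordinate-wise variational Bayes update being described could be implemented, again assuming whitened statistics and the diagonal posterior; the only precomputed quantities are the diagonals of T_c^T T_c, so the memory overhead is negligible and the cost per sweep is linear in the i-vector dimension. This is illustrative code under those assumptions, not the paper's implementation.

```python
import numpy as np

def extract_ivector_vb(N, F, T, diag_TtT, n_iter=3):
    """Variational Bayes i-vector extraction with a diagonal posterior.

    N        : (C,)       zeroth-order statistics
    F        : (C, D)     centred/whitened first-order statistics
    T        : (C, D, R)  eigenvoice matrix, split into per-component blocks
    diag_TtT : (C, R)     precomputed diagonals of T_c^T T_c (negligible memory)
    """
    C, D, R = T.shape
    b = np.einsum('cdr,cd->r', T, F)      # T^T F, accumulated once per utterance
    a_diag = 1.0 + N @ diag_TtT           # diagonal of the exact posterior precision
    y = np.zeros(R)                       # point estimate being refined
    u = np.zeros((C, D))                  # running products u_c = T_c y
    for _ in range(n_iter):               # a few sweeps suffice in practice
        for i in range(R):                # cycle over the i-vector components
            t_i = T[:, :, i]              # (C, D): column i of every block
            # Row i of the posterior precision applied to y, recovered from the
            # running products without ever forming the full R x R matrix.
            row_dot_y = y[i] + np.sum(N * np.einsum('cd,cd->c', t_i, u))
            off_diag = row_dot_y - a_diag[i] * y[i]
            y_new = (b[i] - off_diag) / a_diag[i]   # mean update for component i
            u += (y_new - y[i]) * t_i               # keep u_c = T_c y consistent
            y[i] = y_new
    return y, a_diag
```

Each coordinate update costs O(C·D), so one sweep over all R components is linear in R, which is the scaling claim made later in the talk.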
0:13:27 | So, this is all based on the assumption that the
---|
0:13:32 | posterior... that we can assume a diagonal posterior covariance matrix.
---|
0:13:40 | So, |
---|
0:13:44 | it's explained in the paper why the variational Bayes method,
---|
0:13:49 | even if that assumption turns out to be wrong,
---|
0:13:53 | is guaranteed to find the point estimate of the i-vector
---|
0:13:57 | exactly. |
---|
0:14:00 | See? So, the only |
---|
0:14:03 | error that's introduced here |
---|
0:14:05 | is in the posterior covariance matrix, |
---|
0:14:07 | which is assumed to be diagonal.
---|
0:14:11 | There's no error
---|
0:14:13 | in the point estimate of the posterior, thus it's...
---|
0:14:27 | If you're familiar with numerical methods,
---|
0:14:28 | the mechanics correspond to something known as the Gauss-Seidel method,
---|
0:14:37 | which in this case happens to be guaranteed to converge...
---|
0:14:40 | convergence happens to be guaranteed because of the
---|
0:14:43 | variational Bayes interpretation.
---|
0:14:45 | So the method is exact, the only real issue is how efficient |
---|
0:14:50 | it is. That turns out to raise the question of how good
---|
0:14:55 | is the
---|
0:14:57 | assumption that
---|
0:14:58 | the covariance matrix can be treated as diagonal.
---|
0:15:06 | Two points here to bear in mind,
---|
0:15:10 | in order to show why the assumption is
---|
0:15:12 | reasonable.
---|
0:15:14 | First is that the i-vector model is not uniquely defined. |
---|
0:15:18 | You can perform a rotation of the i-vector coordinates;
---|
0:15:25 | provided that you perform a corresponding transformation on the |
---|
0:15:29 | eigenvoices, the model remains unchanged. |
---|
0:15:32 | The
---|
0:15:34 | posterior...
---|
0:15:35 | the prior, rather, of
---|
0:15:36 | y continues to be the standard normal distribution.
---|
0:15:42 | You have freedom in
---|
0:15:43 | rotating the...
---|
0:15:46 | the basis.
---|
0:15:47 | The other point, this was the point that Ondrej Glembek
---|
0:15:55 | made in his
---|
0:15:57 | ICASSP paper last year: that, in general, this is a good approximation to the posterior
---|
0:16:04 | precision matrix, |
---|
0:16:06 | provided you have sufficient data. So, those W's there are just the mixture |
---|
0:16:11 | weights in the |
---|
0:16:16 | universal background model,
---|
0:16:18 | and T being the number of frames;
---|
0:16:20 | in, for example,
---|
0:16:21 | a scenario like the core condition, it may be something...
---|
0:16:29 | so that you have sufficiently many
---|
0:16:31 | frames that this
---|
0:16:32 | approximation here
---|
0:16:34 | would be reasonable.
---|
0:16:38 | If you combine those two things together, |
---|
0:16:41 | okay? You can say that by diagonalizing this sum here, which you
---|
0:16:46 | form just once, using the...
---|
0:16:51 | the mixture weights, you will produce a basis of the i-vector space with respect
---|
0:16:58 | to which all the posterior |
---|
0:17:00 | precision matrices are approximately diagonal. |
---|
0:17:04 | That's the justification for the |
---|
0:17:06 | diagonal assumption. You have to use a preferred |
---|
0:17:09 | basis |
---|
0:17:10 | in order to |
---|
0:17:13 | do the calculations. |
---|
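A sketch of how such a preferred basis could be computed, under the approximation he cites: with each N_c roughly equal to the mixture weight w_c times the number of frames, every utterance's posterior precision is approximately the identity plus a multiple of the weighted sum of the T_c^T T_c terms, so diagonalising that sum once and rotating the eigenvoices makes the posterior precisions approximately diagonal for all utterances. Names are illustrative and whitened blocks are assumed.

```python
import numpy as np

def to_preferred_basis(T, ubm_weights):
    """Rotate the i-vector basis so that posterior precisions are ~diagonal.

    T           : (C, D, R) whitened eigenvoice blocks
    ubm_weights : (C,)      UBM mixture weights
    """
    # Weighted accumulation sum_c w_c T_c^T T_c, formed a single time.
    S = np.einsum('c,cdi,cdj->ij', ubm_weights, T, T)
    _, V = np.linalg.eigh(S)     # orthogonal eigenvector matrix
    T_rot = T @ V                # rotate every block: T_c <- T_c V
    return T_rot, V
```

Because a rotation of a standard normal prior is again standard normal, this change of basis leaves the model itself unchanged, which is the first of the two points above.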
0:17:15 | And using this... using this basis guarantees that the variational Bayes algorithm |
---|
0:17:21 | will converge very quickly. |
---|
0:17:22 | Typically, three iterations are enough, |
---|
0:17:26 | three iterations independently |
---|
0:17:27 | of the |
---|
0:17:29 | rank or the dimensionality
---|
0:17:30 | of the i-vector. |
---|
0:17:34 | And that's the basis of my contention that this algorithm's
---|
0:17:39 | computational requirements are
---|
0:17:40 | linear in the i-vector dimension.
---|
0:17:47 | So, memory overhead is negligible and |
---|
0:17:49 | the computation |
---|
0:17:52 | scales linearly |
---|
0:17:53 | rather than quadratically.
---|
0:18:01 | If you're using this, |
---|
0:18:04 | the preferred basis is going to change,
---|
0:18:09 | so you should not overlook that.
---|
0:18:17 | Whenever you have a variational Bayes method, you have a variational lower bound, which is |
---|
0:18:23 | very similar to an auxiliary function and
---|
0:18:29 | which, like the auxiliary function,
---|
0:18:31 | is guaranteed to increase
---|
0:18:33 | on successive iterations of your
---|
0:18:39 | algorithm.
---|
0:18:43 | So, it's useful to be able to evaluate this, and |
---|
0:18:47 | the formula is given in the paper. |
---|
0:18:50 | It's guaranteed to increase on successive |
---|
0:18:52 | iterations of variational Bayes. |
---|
0:18:54 | So, |
---|
0:18:57 | it's used for debugging. In principle,
---|
0:18:59 | it can be used to monitor convergence, but it actually turns out that the |
---|
0:19:02 | overhead of using it for that purpose |
---|
0:19:06 | slows down the algorithm, so
---|
0:19:07 | it's not used for that in practice. |
---|
0:19:14 | Where it is useful, I think, is that it can be used to monitor convergence when
---|
0:19:17 | you are training an i-vector extractor with variational Bayes.
---|
0:19:23 | The point here is that the |
---|
0:19:25 | exact evidence, which is |
---|
0:19:28 | the thing you would normally use
---|
0:19:29 | to monitor convergence, |
---|
0:19:34 | cannot be used |
---|
0:19:36 | in this particular case |
---|
0:19:39 | if you're assuming that the posterior |
---|
0:19:44 | is diagonal, then you have to |
---|
0:19:46 | modify the calculation |
---|
0:20:03 | Okay, so a few examples |
---|
0:20:05 | of questions that I dealt with in the paper. |
---|
0:20:10 | One is: how accurate is the
---|
0:20:12 | variational Bayes algorithm?
---|
0:20:15 | To be clear here, there is no issue |
---|
0:20:21 | at run time, you are guaranteed to get the exact |
---|
0:20:21 | point-estimate of your i-vector, |
---|
0:20:25 | provided you
---|
0:20:27 | run enough iterations.
---|
0:20:33 | The only issue is |
---|
0:20:36 | the approximation, when you treat the posterior precision or covariance matrix as
---|
0:20:44 | diagonal.
---|
0:20:47 | And those posterior precisions
---|
0:20:49 | do enter into the
---|
0:20:50 | training, so it's conceivable that using the
---|
0:20:59 | diagonal assumption on the posterior precisions could affect
---|
0:21:00 | the way training behaves.
---|
0:21:02 | So, that's one point |
---|
0:21:04 | that needs to be checked. |
---|
0:21:07 | I mentioned at the beginning, |
---|
0:21:09 | this is well known, but |
---|
0:21:11 | I think it needed to be tested.
---|
0:21:14 | If you make the simplifying transformation, which allows you to take the mean vectors to |
---|
0:21:21 | be zero and the covariance
---|
0:21:22 | matrices to be the identity,
---|
0:21:27 | you're copying some parameters
---|
0:21:29 | from the UBM into the probabilistic model for |
---|
0:21:33 | i-vectors. |
---|
0:21:34 | There is a question as to whether that is justified;
---|
0:21:37 | obviously there is a
---|
0:21:38 | plausible reason for it.
---|
0:21:46 | How efficient is variational Bayes? |
---|
0:21:49 | Obviously, there's going to be some price to be paid. |
---|
0:21:53 | In the standard implementation
---|
0:21:57 | you can reduce the computational burden
---|
0:21:59 | at the cost of |
---|
0:22:00 | several gigabytes of |
---|
0:22:04 | memory, |
---|
0:22:07 | so here you no longer have the opportunity
---|
0:22:08 | of using all that memory.
---|
0:22:12 | So there is a question about efficiency.
---|
0:22:15 | And finally, there is an issue of |
---|
0:22:20 | training very high dimensional i-vector extractors. |
---|
0:22:23 | You cannot train very high dimensional
---|
0:22:24 | i-vector extractors exactly using the standard approach. The
---|
0:22:31 | variational
---|
0:22:32 | Bayes approach does enable you to do it,
---|
0:22:34 | but there is a price to pay for doing that.
---|
0:22:42 | Ok, so the testbed was |
---|
0:22:45 | female det two trials |
---|
0:22:46 | this is a matter of telephone speech, |
---|
0:22:50 | extended core condition of the NIST |
---|
0:22:52 | two thousand and ten |
---|
0:22:55 | evaluation. |
---|
0:22:59 | The extended core condition has millions of trials,
---|
0:23:03 | a very much larger number of trials
---|
0:23:05 | than in the original
---|
0:23:07 | evaluation protocol.
---|
0:23:13 | The standard front end, the standard UBM |
---|
0:23:15 | diagonal covariance matrices, trained on the usual...
---|
0:23:25 | In other respects, the classifier |
---|
0:23:27 | was quite standard;
---|
0:23:30 | it used heavy-tailed PLDA.
---|
0:23:45 | These were |
---|
0:23:49 | results obtained with JFA executables, which is
---|
0:23:54 | the way i-vectors were originally
---|
0:23:56 | built, and they were just
---|
0:24:00 | produced as a benchmark.
---|
0:24:01 | There was a problem with the activity detection, which explains why the
---|
0:24:06 | error rates were a little higher than expected.
---|
0:24:10 | With variational Bayes I actually got marginally better
---|
0:24:13 | results, which turned out to be due to a more effective way of copying the
---|
0:24:21 | covariance matrices. Copying the covariance matrices...
---|
0:24:24 | actually, its
---|
0:24:27 | effect is to reduce the estimated variances.
---|
0:24:42 | I need to get to |
---|
0:24:45 | efficiency |
---|
0:24:47 | My figures for extracting a four-hundred-dimensional i-vector
---|
0:24:52 | are typically about half a second. |
---|
0:24:54 | Almost all of the time is spent in |
---|
0:24:55 | BLAS routines, accumulating the posterior covariance matrices |
---|
0:24:59 | seems to take seventy-five percent of the time.
---|
0:25:06 | An estimate of a quarter of a second, which
---|
0:25:07 | suggests that compiler optimization may
---|
0:25:09 | be helpful; everything is going on inside the BLAS.
---|
0:25:17 | For the variational Bayes method, I've got an estimate of |
---|
0:25:22 | point nine seconds instead of point five. |
---|
0:25:26 | I've fixed the number of iterations at five. |
---|
0:25:33 | The variational Bayes method really comes into its own
---|
0:25:35 | when you work
---|
0:25:37 | with higher-dimensional
---|
0:25:39 | i-vector extractors. |
---|
0:25:47 | Here's one last table. I did try several
---|
0:25:51 | dimensions, up to sixteen
---|
0:25:52 | hundred, and got a very good idea of the performance.
---|
0:26:05 | Okay, thank you. |
---|
0:26:17 | Well, it depends on the ... on the dimensionality of the i-vector extractor. |
---|
0:26:25 | A couple of gigabytes; it's as big as a large vocabulary continuous speech recognizer. It's
---|
0:26:30 | not as intelligent, but it's as big. |
---|
0:26:37 | Just the eigenvoices, okay, together with the stuff you have to pre-compute in order to |
---|
0:26:44 | extract the i-vectors efficiently. |
---|
0:26:53 | Yeah, that's what it requires.
---|
0:26:57 | You still have to store the eigenvoices but, something you may not know, that's
---|
0:27:02 | not the big part. The big part is a bunch of triangular matrices that
---|
0:27:07 | we store in order to calculate i-vectors efficiently, using the standard approach. |
---|
0:27:14 | The point of this was to use the variational Bayes to avoid that computation.
---|
0:27:23 | I'm afraid that we've just spent the time for questions here and so I... I |
---|
0:27:27 | guess that Patrick will get lots of questions offline and maybe even at the end |
---|
0:27:31 | of this talk he can answer those questions together with Sandro, who is giving the
---|
0:27:35 | next talk |
---|