0:00:15 | Thanks. The next speaker is going to present work done together with his student.
0:00:24 | Hi, thank you for the introduction.
0:00:30 | To put the work into the right context: the central theme is the use of i-vectors with PLDA.
0:00:41 | The intention in this paper is to reduce the computation in i-vector extraction, so we call it rapid computation of i-vectors.
0:00:53 | Okay, before going into the details, let me spend a couple of slides on the background and the motivation of the work.
0:01:02 | The i-vector extraction process can be seen as a compression process, where we compress across time and across the supervector space.
0:01:13 | The output, which is a low and fixed-dimensional vector called the i-vector, captures not only the speaker information but also the characteristics of the recording devices, the microphones used, and the transmission channel, including the encoding method used in transmitting the speech signal, as well as the acoustic environment.
0:01:38 | So, to put this in a mathematical form: this is the i-vector model, and the i-vector is the MAP estimate of the latent variable.
0:01:53 | As you can see here, we have a single latent variable which is tied across the Gaussians and tied across the frames, and it is this tying that gives us the compression across time and across the supervector space.
0:02:11 | We assume that we know the alignment of frames to Gaussians; in the actual implementation, this frame alignment to the Gaussians could be given by the GMM posteriors, although in most cases only a single best posterior is used.
0:02:28 | Now, if we look at this latent variable, there is the assumption that the prior of this latent variable is the standard Gaussian distribution, that is, zero mean and unit variance.
0:02:43 | So, given the observation sequence, we can estimate the posterior, which is another Gaussian, with mean phi and covariance L-inverse.
0:02:54 | Of course, this phi, which we take as the i-vector, is the posterior mean of the latent variable x.
0:03:02 | One can see that the i-vector is determined by the posterior covariance, the total variability matrix T, Sigma, which is the covariance matrix of the UBM, and F, the centered first-order statistics.
0:03:16 | And L-inverse, which is the posterior covariance, is in turn determined by the zeroth-order statistics.
0:03:23 | So one point to note is that, in order to compute, or extract, the i-vector, we have to compute the posterior covariance, because it is part of the equation.
0:03:38 | Another point is that in this paper we use what we call whitened statistics: we absorb the UBM covariance Sigma into T and F, as shown here, so we get these simplified equations without the Sigma appearing.
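For reference, the relations described here, reconstructed in the usual total variability notation (the symbols may differ slightly from the slides):

```latex
L    = I + \sum_{c} N_c \, T_c^{\top} \Sigma_c^{-1} T_c ,
\qquad
\phi = L^{-1} T^{\top} \Sigma^{-1} F ,
```

and, with whitened statistics \tilde{T}_c = \Sigma_c^{-1/2} T_c and \tilde{F}_c = \Sigma_c^{-1/2} F_c, the UBM covariance drops out:

```latex
L    = I + \sum_{c} N_c \, \tilde{T}_c^{\top} \tilde{T}_c ,
\qquad
\phi = L^{-1} \tilde{T}^{\top} \tilde{F} .
```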
0:04:04 | Okay, so we have only one objective in this paper, and that is to reduce the computational complexity of i-vector extraction, while keeping the memory requirement low and with hopefully no degradation in performance.
0:04:21 | Okay, so why is this important? It is important because implementations of very fast extraction of i-vectors could be performed on handheld devices, or for large-scale cloud-based applications where a single server may have to serve requests from hundreds or thousands of clients at the same time.
0:04:48 | Also, recently we have seen an increasing number of Gaussians in the systems; for example, in papers that are going to be presented in the coming sessions, the number goes from one thousand up to ten thousand, so direct computation would be rather prohibitive in these scenarios.
0:05:10 | Another point worth mentioning is that the emphasis here is on the rapid computation of i-vectors, rather than on the estimation of the T matrix, because the T matrix is estimated only once, usually offline, where we can use a huge amount of computational resources, and it is then kept fixed.
0:05:32 | Okay, so here is the problem statement.
0:05:39 | The computational bottleneck of i-vector extraction lies in the estimation of the posterior mean, which requires us to first estimate the posterior covariance.
0:05:54 | There are a couple of existing solutions to this problem, including the eigen-decomposition method, where the covariance is approximately decomposed with a fixed eigenbasis, a factored-subspace approach, and also the use of sparse coding to simplify the posterior covariance estimation.
0:06:18 | So in this paper, what we propose is to compute the posterior mean directly, without the need to evaluate the posterior covariance.
0:06:29 | We do this by means of, first, what we call an informative prior, which I am going to show later, and second, a uniform occupancy assumption; with the combination of these two, we can do a fast extraction of i-vectors without the need to estimate the posterior covariance.
0:06:52 | Okay, so in the conventional formulation of i-vector extraction, we assume a standard Gaussian prior.
0:07:03 | Now, if we consider an informative prior with mean mu and covariance Sigma-p, then the i-vector extraction is given by this equation, where we have two additional terms: one determined by the covariance of the prior and one by the mean mu.
0:07:28 | Now, if we consider the case where this mean is zero and the prior covariance is the identity matrix, then this term disappears and this one reduces to the identity, so we get back the standard form.
0:07:42 | So in this paper we propose to use this form of informative prior, where the mean is zero but the covariance is given by this: T is the total variability matrix, and we take the inverse of the inner product of the total variability matrix with itself as the prior covariance.
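Spelled out in the same notation (a reconstruction, assuming whitened statistics and with N the diagonal matrix of occupancy counts repeated over each Gaussian's block): with a Gaussian prior x ~ N(mu_p, Sigma_p), the posterior mean becomes

```latex
\phi = \bigl( \Sigma_p^{-1} + T^{\top} N T \bigr)^{-1} \bigl( T^{\top} F + \Sigma_p^{-1} \mu_p \bigr) ,
```

so mu_p = 0 and Sigma_p = I recover the standard form, while the proposed prior mu_p = 0, Sigma_p = (T'T)^{-1} gives

```latex
\phi = \bigl( T^{\top} (I + N) \, T \bigr)^{-1} T^{\top} F .
```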
0:08:03 | Okay, so now let us derive the fast formulation.
0:08:07 | As I said, in the i-vector extraction formula we now have an additional term due to the prior; so if we plug the proposed prior into the i-vector extraction formula, this is what we get. We can always ensure that T-transpose T has an inverse, because T is always full rank given a sufficient amount of training data.
0:08:30 | Then we can take this T-transpose T out of the bracket, take the inverse, and this is what we get.
0:08:41 | Then we use this matrix inversion identity, which I copied from the Matrix Cookbook.
0:08:49 | Okay, so the idea is that if you have matrices P and Q, we can reorder the product of P and Q by pushing this P to the front.
0:09:00 | So if you look at this formula, this piece is the same as this one: we can say this is the P and this is the Q, then we can push this P to the front and sort out the rest.
0:09:18 | So if you do that, you get this formula. From linear algebra, this is a projection matrix, and a projection matrix can always be written in this form with an orthonormal matrix U1, meaning that each column of U1 is orthogonal to the other columns and has unit norm, and U1 spans the same subspace as the T matrix.
0:09:47 | And this orthonormalizing property is actually introduced by the prior; that is why we call the prior that we use the subspace orthonormalizing prior.
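Filling in the algebra being referred to (a reconstruction, so the slides may organize it differently): the identity is the push-through identity, applied here with P = (T'T)^{-1} T' and Q = N T,

```latex
(I + PQ)^{-1} P = P \, (I + QP)^{-1}
\quad\Longrightarrow\quad
\phi = (T^{\top}T)^{-1} T^{\top} \bigl( I + N \, T (T^{\top}T)^{-1} T^{\top} \bigr)^{-1} F ,
```

where T (T'T)^{-1} T' = U_1 U_1' is exactly the orthogonal projection onto the column space of T mentioned here.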
0:10:04 | So if we stopped at this point, we would already have achieved what we were looking for, namely avoiding the estimation of the posterior covariance, because we can estimate the posterior mean directly.
0:10:16 | But the thing is that if you use this formula, it is going to incur even more computation, because we are dealing with T times T-transpose, which is a very big matrix.
0:10:30 | So that is the reason why we have to introduce another assumption, which we call the uniform occupancy assumption, and which speeds up the computation.
0:10:39 | Okay, so to do so, we first perform a singular value decomposition of T into U, S and V, where S is the diagonal matrix of singular values and U is a square orthogonal matrix.
0:11:11 | Okay, so one thing to note is that U1, which is the U1 in the previous slide, spans the same subspace as T, and U2 is orthogonal to U1; we then use this property to simplify the formula.
0:11:31 | So we can express this term in this form, because this is equal to this, and then this can be expressed in this form because of that property.
0:11:48 | Then we can multiply N into this, so we have I plus N times this term. Next, we let A equal I plus N, and then we apply the matrix inversion lemma; this is what we get.
0:12:07 | And then we apply again the same matrix inversion identity that we used before: here we have the P, and here the Q and the P, so we can push this P to the front, and then we have Q and then P.
0:12:28 | So the point is that we want to express this term on the left in terms of A-inverse and in terms of U2, which is orthogonal to U1 and therefore orthogonal to T.
0:12:49 | So here is where the uniform occupancy assumption comes in. A is I plus N, and N itself is a diagonal matrix, so if you look at the individual elements of this matrix here, what we get is N_c divided by one plus N_c.
0:13:15 | So the uniform occupancy assumption says that, for all the Gaussian components, the occupancy count divided by one plus the occupancy count is the same. Here we do not need to know what the value, or what an appropriate value, would be; we only assume that it is the same for all the components.
0:13:43 | By doing so, we can convert this into this form.
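Written out, the step just described amounts to the following (a reconstruction under the stated assumptions):

```latex
I + N \, U_1 U_1^{\top}
  = (I + N) \bigl( I - (I+N)^{-1} N \, U_2 U_2^{\top} \bigr) ,
\qquad
\bigl[ (I+N)^{-1} N \bigr]_{cc} = \frac{N_c}{1 + N_c} \approx \alpha ,
```

and because T' U_2 = 0, the U_2 term vanishes once T-transpose multiplies from the left, which is why the actual value of alpha never needs to be known; what remains is phi ≈ (T'T)^{-1} T' (I + N)^{-1} F.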
0:13:48 | And this is the i-vector extractor: if we now multiply this T-transpose with U2, then these two terms cancel.
0:14:00 | So we end up with this formula for i-vector extraction, which is very fast, because we can pre-compute these terms, and the remaining matrix is diagonal, so taking its inverse is very simple.
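To make the end result concrete, here is a minimal NumPy sketch of a fast extractor of this kind. It assumes whitened statistics and the closed form phi ≈ (T'T)^{-1} T' (I + N)^{-1} F reached above; the names and toy sizes are illustrative, not the authors' code.

```python
import numpy as np

# Toy sizes: C Gaussians, F-dimensional (already whitened) features, rank-M total
# variability matrix T. All names and sizes here are illustrative only.
C, F, M = 8, 4, 3
rng = np.random.default_rng(0)
T = rng.standard_normal((C * F, M))        # stacked T_c blocks (whitened)

# Pre-computed once, offline: the projection part of the fast extractor.
W = np.linalg.solve(T.T @ T, T.T)          # (T' T)^{-1} T', shape M x (C*F)

def fast_ivector(N, F_stats):
    """Approximate i-vector under the subspace orthonormalizing prior plus the
    uniform occupancy assumption: w ~ (T'T)^{-1} T' (I + N)^{-1} F."""
    scale = np.repeat(1.0 / (1.0 + N), F)  # diagonal of (I + N)^{-1}
    return W @ (scale * F_stats)

# One utterance: occupancy counts N_c and stacked centred first-order statistics.
N = rng.uniform(1.0, 20.0, size=C)
F_stats = rng.standard_normal(C * F)
print(fast_ivector(N, F_stats))
```

The projection W is computed once offline; per utterance only a diagonal scaling and a single matrix-vector product remain.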
0:14:21 | Okay, now let us look at the computational complexity.
0:14:25 | We have a comparison of four different algorithms. First, there is the baseline i-vector extraction in the standard form: we have to compute the products with these matrices, T_c-transpose T_c, for all the C components, so this is of order C F M-squared, and the M-cubed term is due to the matrix inversion. In terms of memory, we have to store the entire T matrix, so this is C F M.
0:15:05 | Okay, so for the fast baseline, we can actually pre-compute the T_c-transpose T_c terms and store them; the cost of storing these is C M-squared, but with that we reduce the computational cost from this to this.
0:15:25 | Then, for the proposed method using the informative prior but without the uniform occupancy assumption, the computational complexity and the memory cost are essentially the same as for the fast baseline, because we can pre-compute the corresponding terms and store them.
0:15:46 | As for the proposed fast method, the computation reduces to a product with a matrix that we can pre-compute and store in memory, so in terms of computational complexity the proposed fast method is twelve times faster than the fast baseline, and orders of magnitude faster than the slow baseline.
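As a rough sanity check on those figures, the small script below plugs illustrative sizes into the per-utterance multiply counts just described; the Gaussian count C is an assumption, while F and M follow the experimental setup mentioned later in the talk.

```python
# Rough per-utterance multiply counts for the extractors compared above.
C, F, M = 512, 57, 400   # assumed UBM size; feature dimension and T rank from the experiments

baseline      = C * F * M**2 + M**3   # accumulate L from the T_c blocks, then invert it
fast_baseline = C * M**2 + M**3       # T_c' T_c pre-computed and stored
proposed      = C * F * M             # one matrix-vector product with the stored projection

print(f"fast baseline vs proposed: {fast_baseline / proposed:.1f}x")
print(f"slow baseline vs proposed: {baseline / proposed:.1f}x")
```

With these particular sizes the fast-baseline-to-proposed ratio comes out close to the quoted factor of twelve; the exact figure naturally depends on the real system sizes.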
0:16:18 | Okay, so in one of the presentations earlier we heard about uncertainty propagation, where we need the posterior covariance.
0:16:29 | In that kind of application, the posterior covariance could actually be computed using the same fast method, given by this equation here, using the same informative prior as well as the uniform occupancy assumption, and with the same reduced computational complexity.
0:16:51 | We can also use this informative prior, given by the inverse of T-transpose T, in the E-step of the EM estimation of the T matrix; of course we only use it in the E-step, in the sense that it is the covariance associated with the prior that leads to this form.
0:17:20 | Okay, the experiments. The experiments were conducted on the NIST SRE'10 extended task, common conditions one to nine.
0:17:29 | We use gender-dependent UBMs with 57-dimensional MFCC features; the UBM is trained on Switchboard and SRE'04, '05 and '06 data, and we use the same data to train the T matrix, with a rank of four hundred.
0:17:47 | The scoring is based on PLDA; before passing to the PLDA we reduce the dimension of the i-vectors to two hundred using LDA, followed by length normalization, and for the PLDA we use speaker factors plus a full covariance to model the session variability.
0:18:10 | Okay, so this table shows the results for the baseline, the proposed exact method and the proposed fast method; the first row shows the EER and the second row the minDCF.
0:18:23 | So now, if we compare these results with these, we can see that there is not really much difference, so we can say that using the informative prior we propose does not seem to degrade the performance.
0:18:46 | Okay, then if we look at common condition five, which is the telephone condition, for the proposed fast method the degradation is actually about ten percent in EER and four point five percent in minDCF.
0:19:06 | And if we look across all the nine common conditions, the relative degradation ranges from ten to sixteen percent in EER, whereas for the minDCF it ranges from about six or seven percent up to twenty point four percent.
0:19:27 | Okay, so here is the other system we compare with: we take the whitened and centered first-order statistics, normalized by the occupancy counts, we use these as supervectors, and we perform PCA.
0:19:50 | We then project all the test and training utterances into the low-dimensional subspace, and use the result for the PLDA.
0:20:02 | So why do we do that? Because if you look at this formula, this part can be seen as a transformation matrix, this is the input vector, and the result is the projection of this input vector into a low-dimensional vector.
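A minimal sketch of that comparison setup as just described; the normalization details and all names and sizes are assumptions for illustration, not the authors' recipe.

```python
import numpy as np

# PCA projection of occupancy-normalized first-order statistics, used in place of i-vectors.
C, F, M, n_train = 8, 4, 3, 100
rng = np.random.default_rng(1)

def normalized_stats(N, F_stats):
    """Centred, whitened first-order statistics divided by the occupancy counts."""
    return F_stats / np.repeat(N + 1e-8, F)

# Build a PCA basis from (stand-in) training supervectors.
X = rng.standard_normal((n_train, C * F))   # rows = normalized stats of training utterances
X = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:M]                                  # top-M principal directions, M x (C*F)

def project(N, F_stats):
    """Low-dimensional vector fed to the PLDA back-end."""
    return P @ normalized_stats(N, F_stats)

print(project(rng.uniform(1.0, 20.0, size=C), rng.standard_normal(C * F)))
```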
0:20:34 | By comparing these results with those of our fast method, the results show that using a T matrix trained with EM in the conventional way gives better performance than the PCA projection.
0:20:47 | Now, this result shows the comparison of a T matrix trained without the informative prior, that is with the standard Gaussian prior, against one trained with the informative prior.
0:21:00 | Comparing these two, we can see that the proposed exact method actually gives a slightly better result.
0:21:11 | Okay, so in conclusion, we introduced two new concepts for the rapid computation of i-vectors.
0:21:18 | The first one is what we call the subspace orthonormalizing prior; the use of this prior allows us to avoid computing the posterior covariance before computing the posterior mean.
0:21:33 | And then we use the uniform occupancy assumption, which reduces the computational complexity.
0:21:41 | So, with the combined use of these two, the assumption and the informative prior, we speed up the i-vector extraction process, but in exchange we have a slight degradation in terms of accuracy.
0:21:57 | That is all I have.
0:22:03 | We have time for a few questions.
0:22:15 | You said that the performance of the exact method is about the same as the baseline; is that the one without the uniform occupancy assumption?
0:22:45 | Exactly, that is without the use of the uniform occupancy assumption, just using the subspace orthonormalizing prior.
0:22:56 | Because we wanted to see, by introducing the two steps, first the subspace orthonormalizing prior and then the uniform occupancy assumption, what the effect of each is: whether by introducing the subspace orthonormalizing prior alone we get a better performance or a slightly worse performance.