0:00:15 | so, this session |
---|
0:00:18 | will have five |
---|
0:00:21 | papers |
---|
0:00:21 | the first one |
---|
0:00:23 | is entitled |
---|
0:00:29 | "Variance-Spectra based Normalization for I-vector Standard and Probabilistic Linear Discriminant Analysis" |
---|
0:00:36 | the authors come from several groups; it is a collaborative work |
---|
0:00:47 | so |
---|
0:00:49 | please, present the paper |
---|
0:00:54 | yes, thank you |
---|
0:00:55 | so, as was just mentioned, this is a collaborative work, so actually |
---|
0:01:01 | it covers quite a lot, because |
---|
0:01:06 | the work was started some time ago with colleagues, |
---|
0:01:11 | and so I want to start with some analysis of what we did before |
---|
0:01:17 | and also try to improve the work that has been done previously |
---|
0:01:22 | so |
---|
0:01:23 | coming back to i-vectors, |
---|
0:01:28 | I will start with a brief description of our system, which is based |
---|
0:01:34 | on a classical i-vector framework |
---|
0:01:37 | yeah |
---|
0:01:40 | I will talk mostly about the post-processing of the i-vectors, between the i-vector extraction and the |
---|
0:01:45 | PLDA, |
---|
0:01:46 | which is the part of the system where we try to improve the discriminancy, |
---|
0:01:52 | usually by using LDA approaches, |
---|
0:01:55 | and also to compensate for the session variability; one way to do it is |
---|
0:02:00 | to use the length normalization. there are plenty of ways to do this, but I |
---|
0:02:03 | will focus on these two |
---|
0:02:06 | and as the discriminancy is related to the variance |
---|
0:02:10 | of the data, we will look at |
---|
0:02:14 | the between- and within-class variability |
---|
0:02:19 | so |
---|
0:02:20 | we start with the description of the system |
---|
0:02:26 | so the system is just a classical UBM / i-vector system |
---|
0:02:30 | everything is gender dependent from the beginning to the end, |
---|
0:02:33 | so there are two of each model |
---|
0:02:36 | we extract MFCCs, sixty dimensions, and the voice activity detection is based on a speech recognizer |
---|
0:02:44 | and the UBM training is very classical, using a large amount of data |
---|
0:02:50 | based on NIST SRE 04, 05 and 06 |
---|
0:02:52 | and Switchboard |
---|
0:02:55 | then the i-vector extractor, also gender dependent, and |
---|
0:03:01 | we use only telephone data from NIST SRE 04, 05 and Switchboard |
---|
0:03:06 | I think it's quite the state of the art |
---|
0:03:09 | so this is just a rough idea of the number of sessions |
---|
0:03:12 | and for the normalization and classification training, which includes both |
---|
0:03:18 | the G-PLDA training |
---|
0:03:19 | and the LDA training and everything we will see in the following, |
---|
0:03:24 | we used gender-dependent subsets of the various sets of data |
---|
0:03:28 | based on NIST SRE 04, 05, 06 and Switchboard, and we use only these |
---|
0:03:32 | because of the number of sessions |
---|
0:03:35 | and we restrain the development set to segments for which the nominal duration |
---|
0:03:42 | is higher than one hundred eighty seconds |
---|
0:03:46 | so now let's look at some tools that can be useful when we talk about variability |
---|
0:03:52 | so first I would just recall discriminancy and covariances; |
---|
0:03:58 | we |
---|
0:03:59 | commonly use the covariance matrices: the total covariance, the between-class covariance, the |
---|
0:04:05 | within-class covariance |
---|
0:04:07 | but it's very common in speaker verification, instead of using |
---|
0:04:12 | the between- and within-class covariance matrices, to use the scatter |
---|
0:04:16 | matrices |
---|
0:04:18 | so the definition is roughly similar, and either one can be used |
---|
0:04:21 | for several applications |
---|
0:04:25 | the main difference |
---|
0:04:27 | is that, |
---|
0:04:28 | unlike the scatter matrices, the covariance matrices do not take into account |
---|
0:04:32 | the number of sessions per speaker, so the weight of a speaker does not |
---|
0:04:36 | depend on its number of sessions |
---|
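To illustrate this distinction, here is a minimal sketch (Python/NumPy, not the authors' code; the array names are hypothetical) of both estimators: the covariance matrices give every speaker the same weight, while the scatter matrices weight each speaker by its number of sessions.

```python
import numpy as np

def class_matrices(X, labels):
    """X: (n_sessions, dim) i-vectors; labels: speaker id per session.
    Returns covariance-style (B, W) and scatter-style (S_b, S_w) estimates."""
    dim = X.shape[1]
    mu = X.mean(axis=0)                              # global mean (conventions vary slightly)
    B = np.zeros((dim, dim)); W = np.zeros((dim, dim))
    S_b = np.zeros((dim, dim)); S_w = np.zeros((dim, dim))
    speakers = np.unique(labels)
    for s in speakers:
        Xs = X[labels == s]
        d = (Xs.mean(axis=0) - mu)[:, None]
        C_s = np.cov(Xs, rowvar=False, bias=True)    # within-speaker covariance
        B += d @ d.T                                 # covariance matrices: one vote per speaker
        W += C_s
        S_b += len(Xs) * (d @ d.T)                   # scatter matrices: one vote per session
        S_w += len(Xs) * C_s
    return B / len(speakers), W / len(speakers), S_b, S_w
```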
0:04:39 | so as both are commonly used, we just ran a few experiments |
---|
0:04:45 | to see, |
---|
0:04:47 | in our system, |
---|
0:04:49 | which one of the two is more efficient |
---|
0:04:52 | so when talking about classification, what we are interested in is to |
---|
0:04:58 | maximise the between- |
---|
0:05:01 | speaker variability and reduce the within-speaker variability, |
---|
0:05:05 | and one way to do this is to look at the covariances |
---|
0:05:10 | and so the tool that we need |
---|
0:05:13 | is the variance spectrum, which is very common |
---|
0:05:21 | so on this graph we can see three plots, which are |
---|
0:05:25 | the total variance, the between-class variance |
---|
0:05:28 | and the within-class variance, so the speaker and session variability |
---|
0:05:35 | so we compute the between-class covariance matrix |
---|
0:05:38 | B, |
---|
0:05:39 | then we rotate all the data of the development set into the eigenvector basis |
---|
0:05:44 | of B, |
---|
0:05:46 | we compute the covariance matrices in this basis, |
---|
0:05:49 | and then we just plot the diagonal of each matrix, so you can see that |
---|
0:05:52 | the variability |
---|
0:05:53 | in the first dimensions is higher for the speaker and also for the sessions |
---|
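A minimal sketch of the variance-spectrum tool as just described (illustrative Python/NumPy, names are hypothetical): rotate the development data into the eigenvector basis of the between-class covariance B and take the diagonals of the total, between-class and within-class covariance matrices.

```python
import numpy as np

def variance_spectra(X, labels):
    """Diagonals of total/between/within covariances in the eigenbasis of B."""
    speakers = np.unique(labels)
    means = np.stack([X[labels == s].mean(axis=0) for s in speakers])
    B = np.cov(means, rowvar=False, bias=True)                 # between-class (equal speaker weights)
    W = np.mean([np.cov(X[labels == s], rowvar=False, bias=True)
                 for s in speakers], axis=0)                   # within-class
    T = np.cov(X, rowvar=False, bias=True)                     # total
    eigval, eigvec = np.linalg.eigh(B)
    P = eigvec[:, ::-1]                                        # eigenbasis of B, decreasing eigenvalues
    return {name: np.diag(P.T @ M @ P)
            for name, M in [("total", T), ("between", B), ("within", W)]}
    # plot each returned array against the dimension index to reproduce the three curves
```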
0:06:01 | so now, one way to maximize this ratio is to use the very |
---|
0:06:05 | common LDA, which is just maximizing the Rayleigh coefficient; |
---|
0:06:10 | this Rayleigh coefficient can be defined using the within- and between-class covariance matrices |
---|
0:06:17 | or using the scatter matrices |
---|
0:06:19 | so in this work the LDA will be used to reduce the i-vector dimensionality from six |
---|
0:06:24 | hundred, and this is constant for all the experiments we have |
---|
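For reference, the Rayleigh coefficient maximized by LDA, written with either pair of matrices (standard notation, assumed here rather than read off the slides):

$$ J(v) = \frac{v^{\top} B\, v}{v^{\top} W\, v} \qquad\text{or}\qquad J(v) = \frac{v^{\top} S_b\, v}{v^{\top} S_w\, v}, $$

and the LDA basis is given by the leading generalized eigenvectors, i.e. the solutions of $B v = \lambda W v$.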
0:06:29 | and to complete the system description, |
---|
0:06:33 | we tried two scorings; the first one is based on the two-covariance model |
---|
0:06:39 | that has been used by Niko Brümmer a couple of years ago, and we can write it |
---|
0:06:42 | like |
---|
0:06:46 | this |
---|
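As a reminder of the two-covariance model (standard formulation, not a quote from the slides): the speaker mean y is drawn from N(μ, B) and each i-vector from N(y, W), and the verification score is the log-likelihood ratio obtained by integrating y out:

$$ s(w_1, w_2) = \log \frac{\int \mathcal{N}(w_1; y, W)\,\mathcal{N}(w_2; y, W)\,\mathcal{N}(y; \mu, B)\, dy}{\int \mathcal{N}(w_1; y, W)\,\mathcal{N}(y; \mu, B)\, dy \;\int \mathcal{N}(w_2; y, W)\,\mathcal{N}(y; \mu, B)\, dy} $$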
0:06:47 | and the second one |
---|
0:06:49 | is based on the PLDA, using the Gaussian assumption; |
---|
0:06:53 | the version we used keeps the eigenchannel matrix of the |
---|
0:07:00 | PLDA, |
---|
0:07:01 | but full rank, whereas Kenny's original version was using a diagonal covariance |
---|
0:07:06 | so the number of speaker factors in the PLDA is set to |
---|
0:07:11 | be consistent with the LDA, |
---|
0:07:14 | and the number of channel factors is six hundred (full rank), because it's a way to |
---|
0:07:19 | compensate for the diagonal residual covariance |
---|
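For completeness, the Gaussian PLDA model referred to here, in its usual notation (the dimensions are those stated above, the rest is the standard formulation):

$$ w = m + V y + U x + \varepsilon, \qquad y \sim \mathcal{N}(0, I),\; x \sim \mathcal{N}(0, I),\; \varepsilon \sim \mathcal{N}(0, \Sigma), $$

with V the speaker-factor (eigenvoice) matrix, U the eigenchannel matrix (full rank here) and Σ a diagonal residual covariance.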
0:07:24 | so the problem with all these |
---|
0:07:27 | tools, including the two models I just presented, is that everything is based on the Gaussian |
---|
0:07:34 | assumption, and |
---|
0:07:36 | for those working with the heavy-tailed PLDA, |
---|
0:07:39 | you know that there we are talking about a Student's t distribution, |
---|
0:07:45 | and it is now very commonly |
---|
0:07:49 | admitted in the community that the i-vectors are not following the Gaussian distribution but something |
---|
0:07:54 | a bit more heavy-tailed, like |
---|
0:07:56 | a Student's t |
---|
0:07:57 | distribution |
---|
0:07:59 | so what we do is that we try to process these i-vectors and |
---|
0:08:04 | make the distribution Gaussian |
---|
0:08:07 | and one way to do this has been proposed, initially by two teams at the same |
---|
0:08:12 | time; |
---|
0:08:13 | the idea of Garcia-Romero and Espy-Wilson |
---|
0:08:18 | is to normalize the magnitude of the i-vectors, |
---|
0:08:21 | so using a formula like this one: |
---|
0:08:25 | we center the i-vectors and then we just normalize their length |
---|
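The magnitude normalization mentioned here is, in the usual notation, a centering followed by a projection onto the unit sphere (μ being the development-set mean):

$$ w' = \frac{w - \mu}{\lVert w - \mu \rVert} $$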
0:08:31 | so using this method the distributions become a bit more Gaussian |
---|
0:08:37 | and we can see that the effect is |
---|
0:08:40 | very efficient |
---|
0:08:42 | so just using the |
---|
0:08:46 | two-covariance model, |
---|
0:08:48 | we can see a gain in both equal error rate |
---|
0:08:50 | and min DCF, on the NIST two thousand eight |
---|
0:08:55 | and NIST two thousand ten extended data, |
---|
0:08:58 | with this simple transformation |
---|
0:09:01 | so everything until now is very common; so, going back to the tool |
---|
0:09:06 | introduced previously, |
---|
0:09:07 | we would like to show the effect of length normalization |
---|
0:09:12 | on the variance spectra, |
---|
0:09:15 | and as you can clearly see |
---|
0:09:17 | the curves are exactly the same except for the range of the values, |
---|
0:09:22 | because by normalizing the magnitude |
---|
0:09:24 | we can see that |
---|
0:09:26 | the values become smaller, but it doesn't change the shape much |
---|
0:09:32 | so |
---|
0:09:35 | fortunately |
---|
0:09:36 | in the initial papers the length normalization was introduced together with |
---|
0:09:41 | whitening, so it has to be done after whitening of the data |
---|
0:09:45 | so there are several steps in this algorithm: the whitening is just |
---|
0:09:50 | using the total covariance matrix to whiten the i-vectors, and then we apply the |
---|
0:09:56 | length normalization |
---|
0:09:58 | at the same time our lab introduced the Eigen Factor Radial method, which is just whitening |
---|
0:10:03 | plus length normalization, but done iteratively, |
---|
0:10:07 | and by doing this iteratively, the interest of this method is that it |
---|
0:10:12 | converges very fast |
---|
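A minimal sketch of this iterative whitening-plus-length-normalization scheme (illustrative Python, not the authors' implementation; the function name is made up): at each iteration the mean and total covariance are re-estimated on the development set, every i-vector is whitened with the inverse square root of the total covariance and then length-normalized.

```python
import numpy as np

def iterative_whiten_length_norm(dev, others=(), n_iter=2):
    """dev: (n, d) development i-vectors; others: extra sets transformed with the same parameters."""
    dev = dev.copy()
    others = [o.copy() for o in others]
    for _ in range(n_iter):
        mu = dev.mean(axis=0)
        T = np.cov(dev, rowvar=False, bias=True)          # total covariance of the dev set
        eigval, eigvec = np.linalg.eigh(T)
        inv_sqrt_T = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12))) @ eigvec.T
        def transform(X):
            Y = (X - mu) @ inv_sqrt_T                     # whitening
            return Y / np.linalg.norm(Y, axis=1, keepdims=True)   # length normalization
        dev = transform(dev)
        others = [transform(o) for o in others]
    return dev, others
```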
0:10:13 | and we introduce some properties |
---|
0:10:17 | that we can use further |
---|
0:10:20 | so the properties are that the mean of the development set converges to |
---|
0:10:25 | zero very fast, |
---|
0:10:26 | the total covariance matrix becomes the identity, |
---|
0:10:34 | and going from this, all the eigenvectors of the |
---|
0:10:39 | between-class covariance matrix |
---|
0:10:42 | become also eigenvectors of the within-class covariance matrix, |
---|
0:10:47 | and thus, using all these properties together, |
---|
0:10:50 | it happens that the eigenvectors of the |
---|
0:10:54 | between- and within-class covariance matrices |
---|
0:10:57 | are now solutions of the |
---|
0:11:00 | LDA optimization; |
---|
0:11:01 | that means that after all this |
---|
0:11:03 | the LDA basis is directly given by these eigenvectors |
---|
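A one-line justification of this property (my own summary of the reasoning): once the total covariance is the identity,

$$ T = B + W = I \;\Rightarrow\; W = I - B, $$

so any eigenvector v of B with eigenvalue λ satisfies W v = (1 − λ) v; B and W share their eigenvectors, and the Rayleigh coefficient v⊤Bv / v⊤Wv = λ/(1 − λ) is maximized by the leading eigenvectors of B, which are therefore exactly the LDA directions.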
0:11:08 | so that was one of the conclusions |
---|
0:11:12 | of our first paper |
---|
0:11:13 | and here we can see the effect of this normalization on |
---|
0:11:18 | the variance spectra |
---|
0:11:19 | so before any treatment, the i-vectors look like this, |
---|
0:11:26 | and after one... |
---|
0:11:29 | after one iteration, which is exactly what Garcia-Romero |
---|
0:11:33 | proposed, |
---|
0:11:34 | and so with a single iteration |
---|
0:11:37 | we can see that the total covariance spectrum becomes flatter |
---|
0:11:42 | after two iterations |
---|
0:11:44 | even better |
---|
0:11:45 | and after three |
---|
0:11:47 | almost perfect at least for the human eye |
---|
0:11:50 | so you can see that |
---|
0:11:52 | the big advantage of this process is that the first dimensions |
---|
0:11:55 | of the data no longer contain the major portion |
---|
0:12:00 | of the variability, |
---|
0:12:01 | the major portion of the session variability |
---|
0:12:04 | so what happens actually is that |
---|
0:12:06 | after this treatment the i-vectors become |
---|
0:12:09 | optimal for the Rayleigh coefficient optimization; that means this should be the |
---|
0:12:15 | optimal input for the LDA |
---|
0:12:19 | so to illustrate this, here are some results using the LDA and then the two-covariance |
---|
0:12:27 | model for scoring |
---|
0:12:29 | and |
---|
0:12:30 | so the baseline is just the length normalization; when I say length normalization, |
---|
0:12:34 | it is without any whitening, |
---|
0:12:36 | just the magnitude normalization |
---|
0:12:40 | so you can see that using the |
---|
0:12:43 | Eigen Factor Radial doesn't improve |
---|
0:12:46 | the performance after one iteration |
---|
0:12:48 | if we use the scatter matrices to compute the LDA, |
---|
0:12:52 | but in the case we compute the |
---|
0:12:54 | LDA using the between- and within-class covariance matrices |
---|
0:12:57 | we can see that, for the female part at least, it improves the performance |
---|
0:13:02 | and after two iterations |
---|
0:13:04 | we can see that the conclusion is the same: |
---|
0:13:08 | the scatter matrices |
---|
0:13:11 | seem not optimal, so it's better to use the between- and within-class covariance matrices, |
---|
0:13:18 | the initial definition |
---|
0:13:21 | so after these results we tried to apply the same treatment |
---|
0:13:26 | before the PLDA, which is maybe more robust than the two-covariance model |
---|
0:13:31 | so |
---|
0:13:32 | this is the baseline using only length normalization, and when we apply two iterations of Eigen Factor |
---|
0:13:37 | Radial, which was optimal in the previous case, |
---|
0:13:40 | we see that this data is not adapted for the PLDA, |
---|
0:13:44 | so the performance, at best, |
---|
0:13:47 | stays the same or even gets worse |
---|
0:13:49 | but |
---|
0:13:50 | then |
---|
0:13:51 | we extended this work by still looking at the covariances, |
---|
0:13:58 | but |
---|
0:13:59 | thinking that after the length normalization everything is on the sphere; that means we |
---|
0:14:04 | have a spherical surface, and on a surface like this |
---|
0:14:08 | it is very difficult to estimate the covariance matrix, |
---|
0:14:11 | because when you look at each speaker, |
---|
0:14:13 | from one side of this sphere to the other, the within-speaker variability is |
---|
0:14:19 | very different |
---|
0:14:21 | and if we just take the average of these |
---|
0:14:23 | to estimate the development-set within-class covariance matrix, |
---|
0:14:27 | then it doesn't make sense anymore, because the |
---|
0:14:31 | matrix is adequate for some speakers but obviously not for others |
---|
0:14:35 | so what we propose in this paper is to keep the i-vectors on the surface, because |
---|
0:14:42 | now it's commonly admitted that it is |
---|
0:14:46 | really useful to use this normalization for the session compensation, |
---|
0:14:50 | but we want to adapt the principal directions to the decision boundaries; |
---|
0:14:56 | that means |
---|
0:14:57 | we want the within-class covariance matrix to become |
---|
0:15:01 | diagonal, and even better if it's just the identity times |
---|
0:15:06 | a constant |
---|
0:15:08 | so we decided to apply exactly the same algorithm as previously, |
---|
0:15:12 | an iterative process which is using the same steps, except that we replace |
---|
0:15:17 | the total covariance matrix |
---|
0:15:19 | by the within-class covariance matrix |
---|
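A minimal sketch of this variant (illustrative only, same caveats as before): the same loop as the previous normalization, but the whitening matrix now comes from the within-class covariance estimated on the labelled development speakers, and the final length normalization keeps the vectors on the unit sphere.

```python
import numpy as np

def spherical_nuisance_norm(dev, labels, n_iter=2):
    """Iteratively whiten by the within-class covariance, then length-normalize."""
    dev = dev.copy()
    for _ in range(n_iter):
        mu = dev.mean(axis=0)
        speakers = np.unique(labels)
        W = np.mean([np.cov(dev[labels == s], rowvar=False, bias=True)
                     for s in speakers], axis=0)          # within-class covariance
        eigval, eigvec = np.linalg.eigh(W)
        inv_sqrt_W = eigvec @ np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12))) @ eigvec.T
        dev = (dev - mu) @ inv_sqrt_W                     # spread the session variability evenly
        dev /= np.linalg.norm(dev, axis=1, keepdims=True) # back onto the unit sphere
    return dev
```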
0:15:22 | and so by doing this |
---|
0:15:24 | we can see, on the spectra of the same development set, that one iteration |
---|
0:15:29 | makes |
---|
0:15:31 | the within-class variance flatten very fast; this is the session variability, and we can see that |
---|
0:15:35 | it's almost evenly spread |
---|
0:15:37 | over the dimensions |
---|
0:15:41 | and after two iterations, |
---|
0:15:44 | from the human point of view it still looks exactly the same, but |
---|
0:15:48 | it makes a difference in the performance; |
---|
0:15:51 | with a few more iterations we can see that it's completely flat, and what's the effect |
---|
0:15:56 | of this |
---|
0:15:57 | when we use it? that's what I'm going to show in a few |
---|
0:16:00 | minutes |
---|
0:16:01 | but before that, I just want to point out that this process can also be used |
---|
0:16:05 | to initialize the PLDA matrices |
---|
0:16:09 | actually |
---|
0:16:11 | most of us are using a PCA |
---|
0:16:16 | to initialize the PLDA matrices, because it |
---|
0:16:20 | provides the first principal |
---|
0:16:23 | directions of the space, |
---|
0:16:25 | so that's a very good starting point |
---|
0:16:29 | but actually what we propose here is to use this process: we rotate |
---|
0:16:34 | all the i-vectors into the eigenvector basis of B, |
---|
0:16:39 | and then we initialize the speaker factor |
---|
0:16:43 | matrix |
---|
0:16:45 | by using the first eigenvectors, |
---|
0:16:49 | up to the chosen number of speaker factors; |
---|
0:16:51 | then for the eigenchannel matrix we use the |
---|
0:16:56 | Cholesky decomposition of |
---|
0:16:59 | the within-class covariance matrix |
---|
0:17:00 | actually, |
---|
0:17:02 | if you do not want to use the eigenchannel matrix, |
---|
0:17:07 | you can just initialize Sigma using the same decomposition, |
---|
0:17:11 | I think |
---|
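A sketch of this initialization as I understand it (hypothetical function, illustrative only, not the authors' exact recipe): the speaker-factor matrix is taken from the leading eigenvectors of B, scaled by their eigenvalues, and the channel part — either the full-rank eigenchannel matrix or directly the residual covariance — comes from the within-class covariance.

```python
import numpy as np

def init_plda(B, W, n_speaker_factors):
    """B, W: between/within-class covariances of the training i-vectors.
    Returns initial (V, U, Sigma) for EM training of a Gaussian PLDA."""
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:n_speaker_factors]
    # speaker factors: leading eigenvectors of B scaled so that V @ V.T approximates B
    V = eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))
    # eigenchannel matrix: Cholesky factor of W, so that U @ U.T == W (W must be positive definite)
    U = np.linalg.cholesky(W)
    # alternative when no eigenchannels are used: start the residual covariance at W itself
    Sigma = W.copy()
    return V, U, Sigma
```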
0:17:12 | so here are some results using the same system as before, just the i-vectors plus |
---|
0:17:18 | the normalization process |
---|
0:17:20 | and |
---|
0:17:21 | so I just want to mention that, |
---|
0:17:23 | with a random initialization of the PLDA, the performance can vary |
---|
0:17:30 | depending on the initialization point, |
---|
0:17:32 | so we performed several experiments with different initializations and then averaged the results |
---|
0:17:40 | so you can see the baseline that I previously presented, and also the Eigen Factor Radial method, |
---|
0:17:45 | which is not efficient in this case, |
---|
0:17:48 | and you can see that using the spherical nuisance normalization, |
---|
0:17:52 | as we call this |
---|
0:17:54 | normalization, |
---|
0:17:55 | the performance |
---|
0:17:56 | improves in this case |
---|
0:18:01 | so |
---|
0:18:02 | now |
---|
0:18:03 | with the initialization |
---|
0:18:06 | process that I just described, we can see that the performance is better, |
---|
0:18:10 | but I just want to temper the fact that the performance here is the best: actually |
---|
0:18:14 | it's just the fact that, |
---|
0:18:16 | in this case, |
---|
0:18:19 | the performance when using this initialization is just the lower bound |
---|
0:18:23 | of what we obtained by using random initialization |
---|
0:18:26 | so that means it's maybe not always better, but it guarantees a certain level of |
---|
0:18:36 | performance |
---|
0:18:38 | so |
---|
0:18:40 | to conclude this presentation, I just want to |
---|
0:18:44 | emphasize the fact that we used |
---|
0:18:49 | this tool, the variance spectra, which is very well known but |
---|
0:18:53 | maybe not much used; |
---|
0:18:56 | in this room maybe only a few of you |
---|
0:18:59 | use it |
---|
0:19:00 | but this tool was used to analyze the performance of the system, and it can actually also |
---|
0:19:05 | be used |
---|
0:19:06 | right after obtaining the i-vectors: |
---|
0:19:10 | it's a very good indicator of the quality |
---|
0:19:13 | of the |
---|
0:19:13 | extractor, |
---|
0:19:14 | because just by looking at the spectra you can have a rough idea of the performance |
---|
0:19:17 | you will get |
---|
0:19:19 | and I think one of my colleagues is doing some experiments on this at the moment and will present |
---|
0:19:23 | this |
---|
0:19:24 | in his thesis, I think, very soon |
---|
0:19:29 | so |
---|
0:19:30 | this tool happens to be useful for analysis purposes |
---|
0:19:34 | then, |
---|
0:19:38 | coming back to our previous paper, we showed that iterating the process, |
---|
0:19:43 | the whitening plus length normalization, |
---|
0:19:45 | improves the performance slightly; it's not that big an improvement, but |
---|
0:19:49 | why not do it twice or three times, |
---|
0:19:52 | and |
---|
0:19:54 | also that the covariance matrices, |
---|
0:19:56 | at least in our case, perform better than the scatter matrices |
---|
0:20:00 | then, to end this talk, just remember that the spherical nuisance normalization |
---|
0:20:08 | improves the performance in the case of |
---|
0:20:11 | PLDA scoring |
---|
0:20:13 | and also, |
---|
0:20:14 | something I mentioned before, when you use this type of process to initialize |
---|
0:20:19 | the PLDA matrices |
---|
0:20:21 | you don't need to perform |
---|
0:20:24 | so many EM iterations |
---|
0:20:26 | so for the case I presented, we obtained the best performance |
---|
0:20:31 | using one hundred iterations of EM |
---|
0:20:34 | with a random initialization; |
---|
0:20:36 | using this process we just need to make ten iterations |
---|
0:20:41 | so even if the PLDA is not that demanding |
---|
0:20:44 | to train, |
---|
0:20:45 | it is a way to reduce the time |
---|
0:20:49 | so now, if you have any questions |
---|
0:20:51 | yeah |
---|
0:20:59 | (inaudible audience question) |
---|
0:21:51 | yeah, actually, frankly, I don't like the length normalization, because it's |
---|
0:21:56 | purely a |
---|
0:21:57 | non-linear process which is a bit ad hoc, but |
---|
0:22:03 | I think we need to find a way to address this issue |
---|
0:22:06 | by finding something more |
---|
0:22:09 | consistent |
---|
0:22:10 | (inaudible) |
---|