0:00:15 | The next presentation is "Regularization of All-Pole Models for Speaker Verification Under Additive Noise" conditions, and Cemal will present it. |
0:00:34 | Hello everyone, I am Cemal from Turkey, and today I am going to talk about the regularization of linear prediction methods for speaker verification. |
0:00:46 | This was joint work with Tomi Kinnunen from the University of Eastern Finland, Rahim Saeidi from Radboud University, and Professor Paavo Alku from Aalto University. |
0:01:04 | As we all know, speaker recognition systems usually achieve quite high recognition accuracy if the speech samples were collected under clean and controlled conditions. |
0:01:19 | However, speaker recognition performance drops considerably in the case of channel mismatch or additive noise. |
0:01:31 | It was previously shown that spectrum estimation has a big impact on speaker recognition results under additive noise. So what do I mean by spectrum estimation? |
0:01:43 | Here is a speech frame: on the left-hand side is the FFT spectrum of the clean speech, and on the right-hand side is its noisy version at a 0 dB SNR level. |
0:02:01 | If you look at the spectra, the spectrum is distorted in the noisy condition, and this distortion causes degradation in speaker recognition performance. |
0:02:17 | So if we had a spectrum estimation method that was not affected that much by the noise, we would not have so much degradation in performance. Unfortunately, we do not have such a spectrum estimation method, which is why we were looking for a better way of estimating the spectrum. |
0:02:38 | Yesterday we were told not to touch the MFCCs, that they are the best, but I am sorry to tell you that we do need to touch them. Basically, we are still simply using MFCCs; nothing is wrong with them. We just replace the FFT spectrum estimate with a new spectrum estimation method. That is what we are doing. |
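As a minimal sketch of this swap (not the authors' code; the coefficients, model order, and FFT size below are arbitrary illustrative choices), the all-pole power spectrum that would replace the FFT power spectrum inside MFCC extraction can be computed like this:

```python
import numpy as np

def allpole_spectrum(a, nfft=512):
    """Power spectrum of an all-pole model 1/A(z), where
    A(z) = 1 - a_1 z^-1 - ... - a_p z^-p."""
    A = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a))), nfft)
    return 1.0 / np.abs(A) ** 2

# This spectrum would feed the mel filterbank and DCT stages in
# place of the FFT power spectrum; everything downstream of the
# spectrum estimate stays unchanged.
spec = allpole_spectrum([0.5, -0.2])
print(spec.shape)  # (257,)
```

The rest of the MFCC pipeline (mel filtering, log, DCT) does not need to know which estimator produced the spectrum, which is what makes this substitution so simple.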
0:03:10 | The first spectrum estimation method that I am going to talk about is conventional linear prediction. As you know, linear prediction assumes that a speech sample can be estimated from its previous samples. The objective is to compute the predictor coefficients alpha by minimizing the energy of the residual error. The optimum alpha is computed by multiplying the inverse of the Toeplitz autocorrelation matrix by the autocorrelation sequence. |
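As a sketch (the frame and predictor order here are arbitrary), the autocorrelation method just described fits in a few lines of NumPy:

```python
import numpy as np

def lp_coefficients(frame, order):
    """Conventional LP: solve the Toeplitz normal equations
    R * alpha = r for the predictor coefficients."""
    n = len(frame)
    # biased autocorrelation estimates r[0..order]
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    # Toeplitz autocorrelation matrix R[i, j] = r[|i - j|]
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]
    return np.linalg.solve(R, r[1 : order + 1])

rng = np.random.default_rng(0)
frame = np.sin(0.3 * np.arange(200)) + 0.01 * rng.standard_normal(200)
alpha = lp_coefficients(frame, order=10)
print(alpha.shape)  # (10,)
```

In practice one would use a Levinson-Durbin recursion instead of a general solver, but the direct solve keeps the structure of the normal equations visible.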
0:03:45 | Another all-pole modeling method is temporally weighted linear prediction. It follows the same idea as linear prediction: again we assume that a speech sample can be estimated from its previous samples, but this time we compute the optimum predictor coefficients by minimizing the weighted energy of the residual error. Here psi_n is the weighting function, and the short-time energy is used as the weighting function. Again, the optimum predictor coefficients are obtained by multiplying the inverse of the autocorrelation matrix by the autocorrelation sequence, but this time, because of the weighting function, it is a modified autocovariance matrix. |
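A rough sketch of weighted LP, assuming psi_n is the short-time energy of the preceding samples (the window length M and the small floor on the weights are arbitrary choices of this sketch):

```python
import numpy as np

def wlp_coefficients(frame, order, M=20):
    """Weighted LP: minimize sum_n psi_n * e_n^2, with psi_n a
    short-time-energy weight (an assumed form of the weighting)."""
    n = len(frame)
    # short-time energy over the last M samples, with a small floor
    psi = np.convolve(frame ** 2, np.ones(M))[:n] + 1e-8
    # delayed-signal matrix X[n, i-1] = x[n - i], i = 1..p
    X = np.column_stack(
        [np.concatenate((np.zeros(i), frame[: n - i])) for i in range(1, order + 1)]
    )
    C = X.T @ (psi[:, None] * X)   # weighted autocovariance matrix
    c = X.T @ (psi * frame)        # weighted correlation vector
    return np.linalg.solve(C, c)

rng = np.random.default_rng(1)
alpha = wlp_coefficients(rng.standard_normal(300), order=8)
print(alpha.shape)  # (8,)
```

Setting psi_n to a constant reduces this to the (covariance-method) normal equations of ordinary LP, which is the sense in which WLP generalizes it.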
0:04:45 | However, this LP spectrum estimation can be problematic, especially from a speech coding point of view: if the speech was produced by a high-pitched speaker, it will cause sharp peaks in the spectrum. This needed to be solved from the speech coding perspective, and to solve this problem, regularized linear prediction was recently proposed by Ekman. |
0:05:25 | To smooth the sharp peaks in the spectra produced by high-pitched speakers, they modified conventional linear prediction by adding a penalty function phi(a), which is a function of the predictor coefficients, weighted by a regularization factor lambda. |
0:05:45 | The penalty function in the original paper was selected as the formula shown here, where A'(omega) is the frequency derivative of the RLP spectrum envelope estimate. The reason for selecting this kind of penalty function is that it admits a closed-form, non-iterative solution. If we look at the paper we can find the details; the penalty function can be written in matrix form as given here. |
0:06:44 | Here F is the Toeplitz matrix of the windowed autocorrelation sequence. In the paper they use the boxcar window, which is a special case: if the window v(m) is the boxcar window, the F matrix corresponds to the autocorrelation matrix that we use in conventional linear prediction. In this study we also consider the Blackman and Hamming windows; the difference appears when computing the F matrix. D is just a diagonal matrix in which each diagonal element equals its row (or column) number. Given the penalty function, the optimum predictor coefficients can be computed by the equation given here. |
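Putting the pieces together, the closed-form solution has the structure alpha = (R + lambda * D F D)^(-1) r. The exact construction of the penalty matrix should be checked against Ekman's paper; the sketch below is only one reading of the slide, with a boxcar window by default so that F reduces to the LP autocorrelation matrix:

```python
import numpy as np

def rlp_coefficients(frame, order, lam=1e-7, window=None):
    """Regularized LP sketch: alpha = (R + lam * D @ F @ D)^-1 r,
    where F is the Toeplitz matrix of the windowed autocorrelation
    and D = diag(1..p). The penalty structure is an assumption of
    this sketch; consult Ekman's original paper for the exact form."""
    n = len(frame)
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]
    v = np.ones(order) if window is None else window   # boxcar default
    F = (v * r[:order])[idx]   # Toeplitz matrix of windowed autocorrelation
    D = np.diag(np.arange(1.0, order + 1))
    return np.linalg.solve(R + lam * D @ F @ D, r[1 : order + 1])

rng = np.random.default_rng(2)
x = rng.standard_normal(400)
a_lp = rlp_coefficients(x, order=10, lam=0.0)   # lambda = 0: plain LP
a_rlp = rlp_coefficients(x, order=10)           # regularized
print(a_lp.shape, a_rlp.shape)  # (10,) (10,)
```

Note that lam=0 recovers the conventional LP solution, which is exactly the reduction mentioned later when discussing the lambda optimization.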
0:07:42 | So this is what Ekman proposed in his original work. We wanted to build on it because the method had been applied to clean speech, from a speech coding point of view, whereas we wanted to see the performance of this method in speaker recognition, especially under additive noise conditions. |
0:08:06 | In the literature there are some works that use the double autocorrelation sequence to estimate the spectrum of given speech samples. The most recent one is by Shimamura, from Interspeech 2010; they used the double autocorrelation sequence to estimate the speech spectrum in the presence of additive noise and analyzed it in noisy word recognition experiments. |
0:08:41 | In this work we propose to use this double autocorrelation to compute the F matrix, which I explained in the previous slide, that is, to compute the penalty function. So besides the Blackman and Hamming windows, we also use the double autocorrelation sequence in the penalty function to see its effect on speaker verification. |
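The double autocorrelation sequence is, as the name suggests, the autocorrelation applied to the autocorrelation sequence itself; the unit normalization below is an assumption of this sketch, as is the exact lag bookkeeping:

```python
import numpy as np

def autocorr(x, lags):
    """One-sided biased autocorrelation estimates r[0..lags]."""
    n = len(x)
    return np.array([x[: n - k] @ x[k:] for k in range(lags + 1)])

def double_autocorr(frame, lags):
    """Autocorrelation of the autocorrelation sequence. For
    (near-)white additive noise, the noise mainly perturbs the
    low lags of the first pass, so the second pass tends to be
    less affected - the motivation cited for this estimator."""
    r1 = autocorr(frame, 2 * lags)   # first pass, with extra lags
    r2 = autocorr(r1, lags)          # second pass
    return r2 / r2[0]                # scale-normalized (assumption)

rng = np.random.default_rng(3)
d = double_autocorr(rng.standard_normal(256), lags=12)
print(d.shape)  # (13,)
```

In the proposed method this sequence would feed the Toeplitz F matrix of the penalty term in place of the windowed single-pass autocorrelation.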
0:09:11 | If you look at the average residual error and the penalty function as a function of the predictor order p: as expected, as the predictor order increases the error decreases, but, again as expected, the penalty function phi increases with the predictor order. That is by design: the main idea of the regularization is to smooth the spectrum, so it should penalize the sharpening of the spectrum. |
0:09:51 | Now look at the different spectrum estimation methods on the same speech sample: the clean speech sample on the left-hand side and its noisy version, again at a 0 dB SNR level, on the right-hand side. We have the FFT spectrum and the regularized linear prediction spectra for the different window functions. As we can see, the Blackman and Hamming windows do not affect the spectrum much; the results look very similar to each other. However, when we use the double autocorrelation sequence in the regularization, we get a much smoother spectrum. |
0:10:37 | Another point, and I think this is the main problem in the additive noise condition: if we look at the dynamic range of the given spectra, for example the maximum and minimum values of the spectrum in both the clean and the noisy conditions, there is a large variation between the clean and noisy cases. I think this mismatch is the main cause of the performance loss of the regularization. However, with the double autocorrelation sequence, the mismatch becomes smaller in comparison with the two other methods. So according to this figure we expect to see little difference in recognition performance between the Blackman and Hamming windows, because they produce almost the same spectra, but we should see some differences with the double autocorrelation sequence. |
0:11:43 | Ekman proposed regularization for conventional autocorrelation-based linear prediction. We also apply regularization to the other all-pole models, weighted linear prediction and its stabilized version, stabilized weighted linear prediction, because the regularization is independent of the all-pole model that we use: we just need to compute the autocorrelation matrix and the autocorrelation sequence. Once we have computed these two, we can regularize regardless of the method; whichever method we are using, we simply regularize it. |
0:12:35 | Looking at the experimental setup, we used the NIST 2002 corpus with GMM-UBM modeling. For the features, we first applied spectral subtraction to the noisy speech samples, and then extracted 12 MFCCs and their first and second order derivatives, with cepstral mean and variance normalization. We also applied T-norm normalization to the log-likelihood scores. |
0:13:06 | We used two different noise types in the experiments, factory and babble noise from the NOISEX-92 database, at five different SNR levels. I would like to point out that we added noise only to the test speech samples; we did not touch the training samples, which are the original NIST samples. "Clean" is maybe not the correct term, because this is telephone speech, so it also includes some convolutive noise, not additive noise, but by "clean" I mean the original NIST samples. |
0:13:50 | The first thing we need to do when using regularization is to optimize the lambda parameter, because it has a big impact: as the regularization formula here in the dark box shows, when lambda equals zero the method reduces to conventional linear prediction. So we need to optimize it first. |
0:14:14 | In our experiments, to optimize it, we ran the speaker recognition experiments on the original case, I mean the original training and original test sets, for different values of lambda. Looking at the equal error rate as a function of lambda, we get the smallest equal error rate when lambda is 10^-7, so in the remaining experiments we will use this lambda value for regularized linear prediction. For regularized weighted linear prediction and its stabilized version, we optimized the lambda values in the same way; that is, we optimized a separate lambda value for each method. |
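The lambda tuning amounts to a grid search over the equal error rate. In the real experiments each EER comes from a full recognition run, so the trial function below is a purely hypothetical stand-in used only to exercise the search:

```python
import numpy as np

def eer(tar, non):
    """Equal error rate: sweep thresholds and return the point
    where false-accept and false-reject rates cross."""
    thr = np.sort(np.concatenate((tar, non)))
    far = np.array([np.mean(non >= t) for t in thr])
    frr = np.array([np.mean(tar < t) for t in thr])
    i = int(np.argmin(np.abs(far - frr)))
    return 0.5 * (far[i] + frr[i])

def pick_lambda(lambdas, run_trial):
    """Grid search: run_trial(lam) -> (target_scores, nontarget_scores)
    stands in for a full speaker recognition experiment."""
    eers = [eer(*run_trial(lam)) for lam in lambdas]
    return lambdas[int(np.argmin(eers))]

# Hypothetical stand-in: only lambda near 1e-7 separates the scores.
def fake_trial(lam):
    if abs(np.log10(lam) + 7) < 0.5:
        return np.array([5.0, 6.0, 7.0]), np.array([-1.0, 0.0, 1.0])
    return np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.5, 2.5])

grid = [10.0 ** -e for e in range(3, 12)]
best = pick_lambda(grid, fake_trial)
print(abs(np.log10(best) + 7) < 0.5)  # True
```

Running the search once per all-pole model, as described above, yields one tuned lambda per method.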
0:15:10 | In the first experiment we just wanted to see the effect of the autocorrelation windowing on recognition performance. In the table, the boldface number in each row shows the best value for that row. As I mentioned when we looked at the spectra, the different window functions do not have a big effect on recognition performance; however, using the double autocorrelation sequence for regularization reduces the error rate significantly. So in the remaining experiments we are going to use the double autocorrelation sequence for regularization. |
0:16:08 | Then we come to the regularization of the other all-pole modeling techniques, I mean weighted linear prediction and stabilized weighted linear prediction. In the table, FFT is the baseline that we normally use in MFCC extraction. As we can see, in the clean case regularization does not improve, but also does not harm, the performance. In the noisy cases, however, especially in the 0 dB and -10 dB cases, the regularization improves the recognition accuracy compared to the unregularized version in each pair, for example LP vs. RLP. Looking at the numbers, for example at -10 dB babble noise, the EER is reduced from 20% to 16%, and the same holds for the other all-pole models, regularized weighted linear prediction and regularized stabilized weighted linear prediction (RSWLP). |
0:17:08 | To show some DET curves: this is babble noise at the -10 dB SNR level, with the FFT baseline, conventional LP, and regularized LP using the double autocorrelation sequence. We can again see the large improvement in the case of regularized LP. The same goes for weighted linear prediction: we cannot see much difference between the conventional FFT and weighted linear prediction, but when we regularize it, the recognition performance is improved. And the same for stabilized weighted linear prediction: if we regularize it, it also improves the recognition accuracy. |
0:17:56 | To summarize our observations: first, the regularization does not harm clean-condition performance. Different window functions do not affect the recognition performance a lot, but using the double autocorrelation sequence to compute the F matrix in the regularization, that is, in the spectrum envelope estimate, improves the recognition accuracy. We also applied regularization to other kinds of all-pole modeling techniques, such as weighted linear prediction and stabilized weighted linear prediction. Thank you. |
0:19:10 | This regularization can help us improve the recognition performance because spectral distortion is the main problem in the case of additive noise: if we can somehow penalize this distortion, we can improve our recognition performance. That was the main point. |
0:19:51 | Not in the slides, no, but in the paper we have some deeper analysis of the regularization, in terms of spectral distortion and so on; we have more experiments on that. |
0:20:49 | To achieve the smallest minDCF. |
0:21:01 | To get the smallest EER when we are optimizing the lambda. |