0:00:15 | The next presentation is "Regularization of All-Pole Models for Speaker Verification Under Additive Noise" conditions, and Cemal will present it. |
0:00:34 | Hello everyone, I am Cemal from Turkey, and today I am going to talk about the regularization of linear prediction methods for speaker verification. |
0:00:46 | This was joint work with Tomi Kinnunen from the University of Eastern Finland, Rahim Saeidi from Radboud University, and Professor Paavo Alku from Aalto University. |
0:01:04 | As we all know, speaker recognition systems usually achieve quite high recognition accuracy if the speech samples were collected under clean and controlled conditions. |
0:01:19 | However, speaker recognition performance drops considerably in the case of channel mismatch or additive noise. |
0:01:31 | It was previously shown that spectrum estimation has a big impact on speaker recognition results under additive noise. So what do I mean by spectrum estimation? |
0:01:43 | Here is a speech frame: on the left-hand side is the FFT spectrum of the clean speech, and on the right-hand side is its noisy version at a 0 dB SNR level. |
0:02:01 | If you look at the spectra, the spectrum is distorted in the noisy condition, and this distortion causes degradation in speaker recognition performance. |
0:02:17 | So if we had a spectrum estimation method that was not affected that much by the noise, we would not have so much degradation in performance. Unfortunately, we do not have such a spectrum estimation method, which is why we were looking for a better way of estimating the spectrum. |
0:02:38 | Yesterday we were told not to touch the MFCCs, that they are the best, but I am sorry to tell you that we do need to touch them. Basically, we are still simply using MFCCs; nothing is wrong with them. We just replace the FFT spectrum estimate with a new spectrum estimation method. That is what we are doing. |
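As a minimal sketch of this swap (not the authors' code; the coefficients, model order, and FFT size below are arbitrary illustrative choices), the all-pole power spectrum that would replace the FFT power spectrum inside MFCC extraction can be computed like this:

```python
import numpy as np

def allpole_spectrum(a, nfft=512):
    """Power spectrum of an all-pole model 1/A(z), where
    A(z) = 1 - a_1 z^-1 - ... - a_p z^-p."""
    A = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a))), nfft)
    return 1.0 / np.abs(A) ** 2

# This spectrum would feed the mel filterbank and DCT stages in
# place of the FFT power spectrum; everything downstream of the
# spectrum estimate stays unchanged.
spec = allpole_spectrum([0.5, -0.2])
print(spec.shape)  # (257,)
```

The rest of the MFCC pipeline (mel filtering, log, DCT) does not need to know which estimator produced the spectrum, which is what makes this substitution so simple.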
0:03:10 | The first spectrum estimation method that I am going to talk about is conventional linear prediction. As you know, linear prediction assumes that a speech sample can be estimated from its previous samples. The objective is to compute the predictor coefficients alpha by minimizing the energy of the residual error. The optimum alpha is computed by multiplying the inverse of the Toeplitz autocorrelation matrix by the autocorrelation sequence. |
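As a sketch (the frame and predictor order here are arbitrary), the autocorrelation method just described fits in a few lines of NumPy:

```python
import numpy as np

def lp_coefficients(frame, order):
    """Conventional LP: solve the Toeplitz normal equations
    R * alpha = r for the predictor coefficients."""
    n = len(frame)
    # biased autocorrelation estimates r[0..order]
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    # Toeplitz autocorrelation matrix R[i, j] = r[|i - j|]
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]
    return np.linalg.solve(R, r[1 : order + 1])

rng = np.random.default_rng(0)
frame = np.sin(0.3 * np.arange(200)) + 0.01 * rng.standard_normal(200)
alpha = lp_coefficients(frame, order=10)
print(alpha.shape)  # (10,)
```

In practice one would use a Levinson-Durbin recursion instead of a general solver, but the direct solve keeps the structure of the normal equations visible.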
0:03:45 | Another all-pole modeling method is temporally weighted linear prediction. It follows the same idea as linear prediction: again we assume that a speech sample can be estimated from its previous samples, but this time we compute the optimum predictor coefficients by minimizing the weighted energy of the residual error. Here psi_n is the weighting function, and the short-time energy is used as the weighting function. Again, the optimum predictor coefficients are obtained by multiplying the inverse of the autocorrelation matrix by the autocorrelation sequence, but this time, because of the weighting function, it is a modified autocovariance matrix. |
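A rough sketch of weighted LP, assuming psi_n is the short-time energy of the preceding samples (the window length M and the small floor on the weights are arbitrary choices of this sketch):

```python
import numpy as np

def wlp_coefficients(frame, order, M=20):
    """Weighted LP: minimize sum_n psi_n * e_n^2, with psi_n a
    short-time-energy weight (an assumed form of the weighting)."""
    n = len(frame)
    # short-time energy over the last M samples, with a small floor
    psi = np.convolve(frame ** 2, np.ones(M))[:n] + 1e-8
    # delayed-signal matrix X[n, i-1] = x[n - i], i = 1..p
    X = np.column_stack(
        [np.concatenate((np.zeros(i), frame[: n - i])) for i in range(1, order + 1)]
    )
    C = X.T @ (psi[:, None] * X)   # weighted autocovariance matrix
    c = X.T @ (psi * frame)        # weighted correlation vector
    return np.linalg.solve(C, c)

rng = np.random.default_rng(1)
alpha = wlp_coefficients(rng.standard_normal(300), order=8)
print(alpha.shape)  # (8,)
```

Setting psi_n to a constant reduces this to the (covariance-method) normal equations of ordinary LP, which is the sense in which WLP generalizes it.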
0:04:45 | However, this LP spectrum estimation can be problematic, especially from a speech coding point of view: if the speech was produced by a high-pitched speaker, it will cause sharp peaks in the spectrum. This needed to be solved from the speech coding perspective, and to solve this problem, regularized linear prediction was recently proposed by Ekman. |
0:05:25 | To smooth the sharp peaks in the spectra produced by high-pitched speakers, they modified conventional linear prediction by adding a penalty function phi(a), which is a function of the predictor coefficients, weighted by a regularization factor lambda. |
0:05:45 | The penalty function in the original paper was selected as the formula shown here, where A'(omega) is the frequency derivative of the RLP spectrum envelope estimate. The reason for selecting this kind of penalty function is that it admits a closed-form, non-iterative solution. If we look at the paper we can find the details; the penalty function can be written in matrix form as given here. |
0:06:44 | Here F is the Toeplitz matrix of the windowed autocorrelation sequence. In the paper they use the boxcar window, which is a special case: if the window v(m) is the boxcar window, the F matrix corresponds to the autocorrelation matrix that we use in conventional linear prediction. In this study we also consider the Blackman and Hamming windows; the difference appears when computing the F matrix. D is just a diagonal matrix in which each diagonal element equals its row (or column) number. Given the penalty function, the optimum predictor coefficients can be computed by the equation given here. |
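Putting the pieces together, the closed-form solution has the structure alpha = (R + lambda * D F D)^(-1) r. The exact construction of the penalty matrix should be checked against Ekman's paper; the sketch below is only one reading of the slide, with a boxcar window by default so that F reduces to the LP autocorrelation matrix:

```python
import numpy as np

def rlp_coefficients(frame, order, lam=1e-7, window=None):
    """Regularized LP sketch: alpha = (R + lam * D @ F @ D)^-1 r,
    where F is the Toeplitz matrix of the windowed autocorrelation
    and D = diag(1..p). The penalty structure is an assumption of
    this sketch; consult Ekman's original paper for the exact form."""
    n = len(frame)
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]
    v = np.ones(order) if window is None else window   # boxcar default
    F = (v * r[:order])[idx]   # Toeplitz matrix of windowed autocorrelation
    D = np.diag(np.arange(1.0, order + 1))
    return np.linalg.solve(R + lam * D @ F @ D, r[1 : order + 1])

rng = np.random.default_rng(2)
x = rng.standard_normal(400)
a_lp = rlp_coefficients(x, order=10, lam=0.0)   # lambda = 0: plain LP
a_rlp = rlp_coefficients(x, order=10)           # regularized
print(a_lp.shape, a_rlp.shape)  # (10,) (10,)
```

Note that lam=0 recovers the conventional LP solution, which is exactly the reduction mentioned later when discussing the lambda optimization.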
0:07:42 | So this is what Ekman proposed in his original work. We wanted to build on it because the method had been applied to clean speech, from a speech coding point of view, whereas we wanted to see the performance of this method in speaker recognition, especially under additive noise conditions. |
0:08:06 | In the literature there are some works that use the double autocorrelation sequence to estimate the spectrum of given speech samples. The most recent one is by Shimamura, from Interspeech 2010; they used the double autocorrelation sequence to estimate the speech spectrum in the presence of additive noise and analyzed it in noisy word recognition experiments. |
0:08:41 | In this work we propose to use this double autocorrelation to compute the F matrix, which I explained in the previous slide, that is, to compute the penalty function. So besides the Blackman and Hamming windows, we also use the double autocorrelation sequence in the penalty function to see its effect on speaker verification. |
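The double autocorrelation sequence is, as the name suggests, the autocorrelation applied to the autocorrelation sequence itself; the unit normalization below is an assumption of this sketch, as is the exact lag bookkeeping:

```python
import numpy as np

def autocorr(x, lags):
    """One-sided biased autocorrelation estimates r[0..lags]."""
    n = len(x)
    return np.array([x[: n - k] @ x[k:] for k in range(lags + 1)])

def double_autocorr(frame, lags):
    """Autocorrelation of the autocorrelation sequence. For
    (near-)white additive noise, the noise mainly perturbs the
    low lags of the first pass, so the second pass tends to be
    less affected - the motivation cited for this estimator."""
    r1 = autocorr(frame, 2 * lags)   # first pass, with extra lags
    r2 = autocorr(r1, lags)          # second pass
    return r2 / r2[0]                # scale-normalized (assumption)

rng = np.random.default_rng(3)
d = double_autocorr(rng.standard_normal(256), lags=12)
print(d.shape)  # (13,)
```

In the proposed method this sequence would feed the Toeplitz F matrix of the penalty term in place of the windowed single-pass autocorrelation.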
0:09:11 | If you look at the average residual error and the penalty function as a function of the predictor order p: as expected, as the predictor order increases the error decreases, but, again as expected, the penalty function phi increases with the predictor order. That is by design: the main idea of the regularization is to smooth the spectrum, so it should penalize the sharpening of the spectrum. |
0:09:51 | Now look at the different spectrum estimation methods on the same speech sample: the clean speech sample on the left-hand side and its noisy version, again at a 0 dB SNR level, on the right-hand side. We have the FFT spectrum and the regularized linear prediction spectra for the different window functions. As we can see, the Blackman and Hamming windows do not affect the spectrum much; the results look very similar to each other. However, when we use the double autocorrelation sequence in the regularization, we get a much smoother spectrum. |
0:10:37 | Another point, and I think this is the main problem in the additive noise condition: if we look at the dynamic range of the given spectra, for example the maximum and minimum values of the spectrum in both the clean and the noisy conditions, there is a large variation between the clean and noisy cases. I think this mismatch is the main cause of the performance loss of the regularization. However, with the double autocorrelation sequence, the mismatch becomes smaller in comparison with the two other methods. So according to this figure we expect to see little difference in recognition performance between the Blackman and Hamming windows, because they produce almost the same spectra, but we should see some differences with the double autocorrelation sequence. |
0:11:43 | Ekman proposed regularization for conventional autocorrelation-based linear prediction. We also apply regularization to the other all-pole models, weighted linear prediction and its stabilized version, stabilized weighted linear prediction, because the regularization is independent of the all-pole model that we use: we just need to compute the autocorrelation matrix and the autocorrelation sequence. Once we have computed these two, we can regularize regardless of the method; whichever method we are using, we simply regularize it. |
0:12:35 | Looking at the experimental setup, we used the NIST 2002 corpus with GMM-UBM modeling. For the features, we first applied spectral subtraction to the noisy speech samples, and then extracted 12 MFCCs and their first and second order derivatives, with cepstral mean and variance normalization. We also applied T-norm normalization to the log-likelihood scores. |
0:13:06 | We used two different noise types in the experiments, factory and babble noise from the NOISEX-92 database, at five different SNR levels. I would like to point out that we added noise only to the test speech samples; we did not touch the training samples, which are the original NIST samples. "Clean" is maybe not the correct term, because this is telephone speech, so it also includes some convolutive noise, not additive noise, but by "clean" I mean the original NIST samples. |
0:13:50 | The first thing we need to do when using regularization is to optimize the lambda parameter, because it has a big impact: as the regularization formula here in the dark box shows, when lambda equals zero the method reduces to conventional linear prediction. So we need to optimize it first. |
0:14:14 | In our experiments, to optimize it, we ran the speaker recognition experiments on the original case, I mean the original training and original test sets, for different values of lambda. Looking at the equal error rate as a function of lambda, we get the smallest equal error rate when lambda is 10^-7, so in the remaining experiments we will use this lambda value for regularized linear prediction. For regularized weighted linear prediction and its stabilized version, we optimized the lambda values in the same way; that is, we optimized a separate lambda value for each method. |
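The lambda tuning amounts to a grid search over the equal error rate. In the real experiments each EER comes from a full recognition run, so the trial function below is a purely hypothetical stand-in used only to exercise the search:

```python
import numpy as np

def eer(tar, non):
    """Equal error rate: sweep thresholds and return the point
    where false-accept and false-reject rates cross."""
    thr = np.sort(np.concatenate((tar, non)))
    far = np.array([np.mean(non >= t) for t in thr])
    frr = np.array([np.mean(tar < t) for t in thr])
    i = int(np.argmin(np.abs(far - frr)))
    return 0.5 * (far[i] + frr[i])

def pick_lambda(lambdas, run_trial):
    """Grid search: run_trial(lam) -> (target_scores, nontarget_scores)
    stands in for a full speaker recognition experiment."""
    eers = [eer(*run_trial(lam)) for lam in lambdas]
    return lambdas[int(np.argmin(eers))]

# Hypothetical stand-in: only lambda near 1e-7 separates the scores.
def fake_trial(lam):
    if abs(np.log10(lam) + 7) < 0.5:
        return np.array([5.0, 6.0, 7.0]), np.array([-1.0, 0.0, 1.0])
    return np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.5, 2.5])

grid = [10.0 ** -e for e in range(3, 12)]
best = pick_lambda(grid, fake_trial)
print(abs(np.log10(best) + 7) < 0.5)  # True
```

Running the search once per all-pole model, as described above, yields one tuned lambda per method.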
0:15:10 | In the first experiment we just wanted to see the effect of the autocorrelation windowing on recognition performance. In the table, the boldface number in each row shows the best value for that row. As I mentioned when we looked at the spectra, the different window functions do not have a big effect on recognition performance; however, using the double autocorrelation sequence for regularization reduces the error rate significantly. So in the remaining experiments we are going to use the double autocorrelation sequence for regularization. |
0:16:08 | Then we come to the regularization of the other all-pole modeling techniques, I mean weighted linear prediction and stabilized weighted linear prediction. In the table, FFT is the baseline that we normally use in MFCC extraction. As we can see, in the clean case regularization does not improve, but also does not harm, the performance. In the noisy cases, however, especially in the 0 dB and -10 dB cases, the regularization improves the recognition accuracy compared to the unregularized version in each pair, for example LP vs. RLP. Looking at the numbers, for example at -10 dB babble noise, the EER is reduced from 20% to 16%, and the same holds for the other all-pole models, regularized weighted linear prediction and regularized stabilized weighted linear prediction (RSWLP). |
0:17:08 | To show some DET curves: this is babble noise at the -10 dB SNR level, with the FFT baseline, conventional LP, and regularized LP using the double autocorrelation sequence. We can again see the large improvement in the case of regularized LP. The same goes for weighted linear prediction: we cannot see much difference between the conventional FFT and weighted linear prediction, but when we regularize it, the recognition performance is improved. And the same for stabilized weighted linear prediction: if we regularize it, it also improves the recognition accuracy. |
0:17:56 | To summarize our observations: first, the regularization does not harm clean-condition performance. Different window functions do not affect the recognition performance a lot, but using the double autocorrelation sequence to compute the F matrix in the regularization, that is, in the spectrum envelope estimate, improves the recognition accuracy. We also applied regularization to other kinds of all-pole modeling techniques, such as weighted linear prediction and stabilized weighted linear prediction. Thank you. |
0:19:10 | This regularization can help us improve the recognition performance because spectral distortion is the main problem in the case of additive noise: if we can somehow penalize this distortion, we can improve our recognition performance. That was the main point. |
0:19:51 | Not in the slides, no, but in the paper we have some deeper analysis of the regularization, in terms of spectral distortion and so on; we have more experiments on that. |
0:20:49 | To achieve the smallest minDCF. |
0:21:01 | To get the smallest EER when we are optimizing the lambda. |