The next presentation is "Regularization of All-Pole Models for Speaker Verification under Additive Noise Conditions", and Cemal will present it.
Hello everyone, I am Cemal from Turkey, and today I am going to talk about regularization of linear prediction methods for speaker verification.
This was joint work with Tomi Kinnunen from the University of Eastern Finland, Rahim Saeidi from Radboud University, and Professor Paavo Alku from Aalto University.
As we all know, speaker recognition systems usually achieve quite high recognition accuracy if the speech samples were collected under clean and controlled conditions.
However, speaker recognition performance drops a lot in the case of channel mismatch or additive noise.
We have previously shown that spectrum estimation has a big impact on speaker recognition performance under additive noise.
So what do I mean by spectrum estimation? Here is a speech frame: on the left side we have the FFT spectrum of the clean speech, and on the right-hand side its noisy version at 0 dB SNR.
If you look at the spectra, the spectrum is distorted under the noisy condition, and this distortion causes degradation in speaker recognition performance.
If we had a spectrum estimation method that was not affected that much by the noise, we would not have so much degradation in performance. Unfortunately we do not have such a spectrum estimation method, and that is why we were looking for a better way of estimating the spectrum.
So yesterday ... we were told not to touch MFCC, that it is the best, but I am sorry to tell you that this is exactly what we need to touch.
Basically and simply, we still use MFCCs; nothing is wrong with them. We just replace the FFT spectrum estimation with a new spectrum estimation method. That is what we are doing.
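As a rough sketch of what replacing the FFT spectrum inside MFCC extraction can look like in practice (a minimal numpy/librosa illustration; the parameter values and the mel filterbank helper are assumptions, not the exact front-end used in this work):

    import numpy as np
    import librosa                      # used only for the mel filterbank
    from scipy.fft import dct

    def mfcc_from_spectrum(power_spectrum, sr=8000, n_fft=512, n_mels=27, n_ceps=12):
        # Map any one-sided power spectrum (n_fft//2 + 1 bins) to 12 MFCCs.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        log_mel = np.log(mel_fb @ power_spectrum + 1e-12)
        return dct(log_mel, type=2, norm='ortho')[1:n_ceps + 1]

    def fft_power_spectrum(frame, n_fft=512):
        # Baseline: FFT power spectrum of a Hamming-windowed frame.
        return np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # The point of the talk: keep mfcc_from_spectrum unchanged and feed it an
    # all-pole (LP / WLP / SWLP, regularized or not) spectrum instead of the FFT one.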
The first spectrum estimation method that I am going to talk about is conventional linear prediction.
As you know, linear prediction assumes that a speech sample can be estimated from its previous samples.
The objective is to compute the predictor coefficients alpha by minimizing the energy of the residual error, and the optimum alpha is obtained by multiplying the inverse of the Toeplitz autocorrelation matrix by the autocorrelation sequence.
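A minimal sketch of this (autocorrelation-method LP with a direct Toeplitz solve; the Hamming windowing and predictor order are assumptions, and Levinson-Durbin would be the usual faster alternative):

    import numpy as np
    from scipy.linalg import toeplitz, solve

    def lp_coefficients(frame, p=20):
        # Conventional LP: minimize residual energy -> solve R a = r,
        # i.e. alpha = inverse(Toeplitz autocorrelation matrix) * autocorrelation vector.
        x = frame * np.hamming(len(frame))
        r = np.correlate(x, x, mode='full')[len(x) - 1:]   # autocorrelation lags r[0], r[1], ...
        R = toeplitz(r[:p])                                 # p x p Toeplitz autocorrelation matrix
        return solve(R, r[1:p + 1])                         # predictor coefficients alpha

    def lp_power_spectrum(a, n_fft=512):
        # All-pole spectrum 1 / |A(e^jw)|^2 with A(z) = 1 - sum_k a_k z^-k (gain omitted).
        A = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a))), n_fft)
        return 1.0 / np.abs(A) ** 2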
Another all-pole modeling method is temporally weighted linear prediction.
It is based on the same idea as linear prediction: again we assume that a speech sample can be estimated from its previous samples, but this time we compute the optimum predictor coefficients by minimizing the weighted energy of the residual error.
Here psi_n is the weighting function, and the short-time energy is used as the weighting function.
Again the optimum predictor coefficients are obtained by multiplying the inverse of the autocorrelation matrix by the autocorrelation sequence, but this time both have to be modified because of the weighting function: the inverse of the weighted autocovariance matrix is multiplied by the weighted autocorrelation sequence.
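A sketch of weighted LP under one common assumption, namely that the weight psi[n] is the short-time energy of the previous p samples (the exact weighting window used in this setup is not spelled out here):

    import numpy as np
    from scipy.linalg import solve

    def wlp_coefficients(frame, p=20, m=None):
        # Weighted LP: minimize sum_n psi[n] * e(n)^2 with a short-time-energy weight.
        x = np.asarray(frame, dtype=float)
        m = p if m is None else m
        # psi[n]: energy of the m samples preceding n (assumed STE weighting)
        psi = np.array([x[max(0, n - m):n] @ x[max(0, n - m):n] for n in range(len(x))])
        psi = np.maximum(psi, 1e-12)
        C = np.zeros((p, p))   # weighted autocovariance matrix
        c = np.zeros(p)        # weighted correlation vector
        for n in range(p, len(x)):
            past = x[n - p:n][::-1]          # x[n-1], ..., x[n-p]
            C += psi[n] * np.outer(past, past)
            c += psi[n] * x[n] * past
        return solve(C, c)                   # optimum predictor coefficients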
However, this LP spectrum estimate can sometimes be problematic, especially from the speech coding point of view: if we have speech produced by a high-pitched speaker, it will eventually cause sharp peaks in the spectrum, and this needs to be solved from the speech coding perspective.
Recently, to solve this problem, regularized linear prediction was proposed by Ekman to smooth the spectrum produced by high-pitched speakers.
To smooth these sharp peaks in the spectrum, they modify conventional linear prediction by adding a penalty function phi(a), which is a function of the predictor coefficients, weighted by a regularization factor lambda.
Here is the penalty function; in the original paper it was selected as the formula shown here, where A' is the frequency derivative of the RLP spectrum and ... is a spectrum envelope estimate.
The reason for selecting this kind of penalty function is that it yields a closed-form, non-iterative solution; that is why they use this penalty function.
If we look at the paper, we can find the details of how the penalty function can be written in the matrix form given here.
Here F is the Toeplitz matrix of the windowed autocorrelation sequence. In the paper they use the boxcar window, which is a special case, because if we use the boxcar window as v(m), the F matrix corresponds to the autocorrelation matrix that we use in conventional linear prediction.
In this study we also consider the Blackman and Hamming windows, which make a difference when computing the F matrix, as I will show in the next slide.
D is just a diagonal matrix in which each diagonal element equals its row (or column) number.
Given the penalty function, the optimum predictor coefficients can be computed with the equation given here.
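Reading the closed form on the slide the way Ekman's RLP paper presents it, i.e. as (R + lambda * D F D) a = r, a sketch could look like the following (the choice of window v(m) and the exact form of the solution are assumptions based on that reading):

    import numpy as np
    from scipy.linalg import toeplitz, solve

    def rlp_coefficients(frame, p=20, lam=1e-7, v=None):
        # Regularized LP (assumed closed form): (R + lam * D F D) a = r, where
        # F is the Toeplitz matrix of the windowed autocorrelation sequence and
        # D = diag(1, 2, ..., p).
        x = frame * np.hamming(len(frame))
        r_full = np.correlate(x, x, mode='full')[len(x) - 1:]
        R = toeplitz(r_full[:p])
        r = r_full[1:p + 1]
        v = np.ones(p) if v is None else np.asarray(v)[:p]   # boxcar window recovers F = R
        F = toeplitz(v * r_full[:p])
        D = np.diag(np.arange(1.0, p + 1.0))
        return solve(R + lam * (D @ F @ D), r)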
So this is what Ekman proposed in his original work, and we wanted to build on it, because that method was applied to clean speech and from the speech coding point of view; we wanted to see the performance of this method in speaker recognition, and especially under additive noise conditions.
In the literature there exist some works which use the double autocorrelation sequence to estimate the spectrum of given speech samples.
The most recent one is by Shimamura from Interspeech 2010; they use this double autocorrelation sequence to estimate the speech spectrum in the presence of additive noise, and they analyzed it in noisy word recognition experiments.
In this work we propose to use this double autocorrelation to compute the F matrix, which I explained in the previous slide, in the penalty function.
So basically, besides the Blackman and Hamming windows, we also use this double autocorrelation sequence in the penalty function, to see its effect on speaker verification.
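A sketch of the proposed variant: build F from the double autocorrelation sequence (the autocorrelation of the autocorrelation sequence) and drop it into the RLP sketch above in place of the windowed-autocorrelation F; the normalization and lag range are illustrative assumptions:

    import numpy as np
    from scipy.linalg import toeplitz

    def double_autocorrelation_F(frame, p=20):
        # F built from the 'double' autocorrelation: autocorrelate the signal,
        # then autocorrelate the resulting autocorrelation sequence.
        x = frame * np.hamming(len(frame))
        r = np.correlate(x, x, mode='full')[len(x) - 1:]     # ordinary autocorrelation
        rr = np.correlate(r, r, mode='full')[len(r) - 1:]    # autocorrelation of r
        return toeplitz(rr[:p] / rr[0])                      # assumed normalization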
If you look at the average residual error and the penalty function as a function of the predictor order p: as expected, as the predictor order increases the residual error decreases, but, again as expected, the penalty function phi increases, because, as I said at the beginning, the main idea of the regularization is to smooth the spectrum, so it should penalize an overly detailed spectrum.
Now let us look at the different spectrum estimation methods on this spectrum. This is again a speech frame, the clean speech sample on the left-hand side and, on the right-hand side, its noisy version, again at 0 dB SNR.
Here we have the FFT spectrum and the regularized linear prediction spectra for the different window functions.
As we see, the Blackman and Hamming windows do not affect the spectrum very much; they look very similar to each other. However, when we use the double autocorrelation sequence in the regularization, we get a much smoother spectrum.
Another point, and I think this is the main problem in the additive noise condition: if we look at the difference between the given spectra, for example the maximum and minimum values of the spectrum for both the clean and noisy conditions, there is a lot of variation between the clean and noisy cases. I think this mismatch causes the main problem for the recognition performance.
However, if you look at the double autocorrelation sequence, the mismatch and the variations become smaller in comparison with the two other methods.
So according to this figure, I expect to see not much difference in recognition performance between the Blackman and Hamming windows, because they produce almost the same spectra, but we should see some differences with the double autocorrelation sequence.
Ekman proposed regularization for conventional autocorrelation-based linear prediction; we also apply regularization to the other all-pole models, weighted linear prediction and its stabilized version, stabilized weighted linear prediction, because the regularization is independent of the all-pole model that we use.
We just need to compute the autocorrelation matrix and the autocorrelation sequence; once we have computed those two, we can regularize independently of the method. So whichever method we are using, it does not matter: we just regularize the method that we use, as in the sketch below.
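Since the penalty only needs the method's own normal-equation matrix and vector, the same regularization step can be reused for LP, WLP or SWLP; a sketch, using the same assumed penalty form as above:

    import numpy as np
    from scipy.linalg import solve

    def regularize_allpole(R, r, F, lam):
        # Works for any all-pole model: pass in the (weighted) autocorrelation or
        # autocovariance matrix R and vector r of that model, e.g. C and c from
        # the WLP sketch, plus the F matrix and the tuned lambda.
        p = len(r)
        D = np.diag(np.arange(1.0, p + 1.0))
        return solve(R + lam * (D @ F @ D), r)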
If we look at the experimental setup, we use the NIST 2002 corpus with GMM-UBM modeling.
For the features, we first apply spectral subtraction to the noisy speech samples, then we extract 12 MFCCs and their first- and second-order derivatives, followed by cepstral mean and variance normalization. We also apply T-norm normalization to the log-likelihood scores.
We use two different types of noise in the experiments, factory and babble noise from the NOISEX-92 database, at five different SNR levels.
I would like to point out that we added noise only to the test set speech samples; we did not touch the training samples, which are the original NIST samples. Maybe "clean" is not the correct term, because this is telephony speech, so it also includes some noise, convolutive noise I mean, not additive noise, but I refer to the original NIST samples as clean.
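For the noisy test condition, the noise is scaled to the requested SNR before being added to the test utterance; a generic sketch of that step (not the exact tooling used in these experiments):

    import numpy as np

    def add_noise_at_snr(speech, noise, snr_db):
        # Scale the noise segment so the speech-to-noise power ratio equals snr_db.
        noise = noise[:len(speech)]
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
        return speech + scale * noise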
The first thing we need to do when using regularization is to optimize the lambda parameter, because it has a big impact: if you look at the regularization formula here in the dark box, when lambda equals zero it reduces to conventional linear prediction. So we need to optimize it first.
In our experiments, to optimize it, we run the speaker recognition experiments on the original case, I mean original training and original test data, for different values of lambda. When we look at the equal error rate as a function of lambda, we get the smallest equal error rate when lambda is 10 to the power of -7, so in the further experiments we will use this lambda value in regularized linear prediction.
For regularized weighted linear prediction and its stabilized version we optimize the lambda value in a similar way; we optimize the lambda value for each method separately.
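The lambda tuning itself is just a grid search on the clean condition; a sketch, where run_verification is a hypothetical helper assumed to return target and non-target scores for a given lambda:

    import numpy as np

    def equal_error_rate(tar, non):
        # Simple EER estimate: find the threshold where miss and false-alarm rates cross.
        thresholds = np.sort(np.concatenate([tar, non]))
        miss = np.array([np.mean(tar < t) for t in thresholds])
        fa = np.array([np.mean(non >= t) for t in thresholds])
        i = np.argmin(np.abs(miss - fa))
        return (miss[i] + fa[i]) / 2.0

    def pick_lambda(lambdas, run_verification):
        # Keep the lambda giving the smallest EER, optimized separately per method.
        eers = [equal_error_rate(*run_verification(lam)) for lam in lambdas]
        return lambdas[int(np.argmin(eers))]

    # e.g. best_lam = pick_lambda([10.0 ** k for k in range(-10, -3)], run_verification)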
In the first experiment we just want to see the effect of the autocorrelation windowing on the recognition performance.
As we can see from the table, the boldface numbers show the best value in each row. When we look at the different windows, as I mentioned when we looked at the spectra, the different window functions do not have a big effect on the recognition performance; however, using the double autocorrelation sequence for the regularization reduces the error rate significantly.
So in the remaining experiments we are going to use the double autocorrelation sequence for the regularization, and also for the regularization of the other all-pole modeling techniques, I mean weighted linear prediction and stabilized weighted linear prediction.
When we look at this table, FFT is our baseline that we normally use in MFCC extraction. As we can see, in the clean case regularization does not improve but also does not harm the performance, but in the noisy cases, especially the 0 dB and -10 dB cases, the regularization improves the recognition accuracy compared to the un-regularized version of each pair, for example LP vs. RLP.
When we look at the numbers there, for example -10 dB babble noise, the EER reduces from 20% to 16%, and the same holds for the other all-pole models, regularized weighted linear prediction and regularized stabilized weighted linear prediction (RSWLP).
To show some DET curves: this is the babble noise, -10 dB SNR case, with FFT, conventional LP, and regularized LP using the double autocorrelation sequence. We can again see the large improvement obtained with regularized LP.
The same holds for weighted linear prediction: we cannot see much difference between the conventional FFT and weighted linear prediction, but when we regularize it, the recognition performance is improved.
And the same for stabilized weighted linear prediction: if we regularize it, it also improves the recognition accuracy.
To summarize our observations: the first thing is that regularization does not harm the performance in the clean condition.
Different window functions do not affect the recognition performance a lot, but the double autocorrelation sequence, used to compute the F matrix in the regularization as the spectrum envelope estimate, improves the recognition accuracy.
We also applied regularization to other kinds of all-pole modeling techniques, such as weighted linear prediction and stabilized weighted linear prediction.
Thank you.
This regularization can help us improve the recognition performance because the distortion of the spectrum is the main problem in the case of additive noise; if we can penalize this distortion somehow, we can improve our recognition performance. That was the main point.
Not in the slides, no, but in the paper we have some deeper analysis of the regularization in terms of spectral distortion and so on; we have more experiments on that.
We optimized the lambda to get the smallest EER rather than the minimum MinDCF.