The next presentation is "Regularization of All-Pole Models for Speaker Verification under Additive Noise Conditions", and Cemal will present it.
Hello everyone, I am Cemal from Turkey, and today I am going to talk about regularization of linear prediction methods for speaker verification.
This was joint work with Tomi Kinnunen from the University of Eastern Finland, Rahim Saeidi from Radboud University, and Professor Paavo Alku from Aalto University.
As we all know, speaker recognition systems usually achieve quite high recognition accuracy if the speech samples were collected under clean and controlled conditions.
However, speaker recognition performance drops a lot in the case of channel mismatch or additive noise.
We have previously shown that spectrum estimation has a big impact on speaker recognition performance under additive noise.
So what do I mean by spectrum estimation? Here is a speech frame: on the left side we have the FFT spectrum of the clean speech, and on the right-hand side its noisy version at 0 dB SNR.
If you look at the spectra, the spectrum is distorted under the noisy condition, and this distortion causes degradation in speaker recognition performance.
If we had a spectrum estimation method that was not affected that much by the noise, we would not have so much degradation in performance. Unfortunately we do not have such a spectrum estimation method, and that is why we were looking for a better way of estimating the spectrum.
So yesterday ... we were told not to touch MFCC, that it is the best, but I am sorry to tell you that this is exactly what we need to touch.
Basically and simply, we still use MFCCs; nothing is wrong with them. We just replace the FFT spectrum estimation with a new spectrum estimation method. That is what we are doing.
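As a rough sketch of what replacing the FFT spectrum inside MFCC extraction can look like in practice (a minimal numpy/librosa illustration; the parameter values and the mel filterbank helper are assumptions, not the exact front-end used in this work):

    import numpy as np
    import librosa                      # used only for the mel filterbank
    from scipy.fft import dct

    def mfcc_from_spectrum(power_spectrum, sr=8000, n_fft=512, n_mels=27, n_ceps=12):
        # Map any one-sided power spectrum (n_fft//2 + 1 bins) to 12 MFCCs.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        log_mel = np.log(mel_fb @ power_spectrum + 1e-12)
        return dct(log_mel, type=2, norm='ortho')[1:n_ceps + 1]

    def fft_power_spectrum(frame, n_fft=512):
        # Baseline: FFT power spectrum of a Hamming-windowed frame.
        return np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # The point of the talk: keep mfcc_from_spectrum unchanged and feed it an
    # all-pole (LP / WLP / SWLP, regularized or not) spectrum instead of the FFT one.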
The first spectrum estimation method that I am going to talk about is conventional linear prediction.
As you know, linear prediction assumes that a speech sample can be estimated from its previous samples.
The objective is to compute the predictor coefficients alpha by minimizing the energy of the residual error, and the optimum alpha is obtained by multiplying the inverse of the Toeplitz autocorrelation matrix by the autocorrelation sequence.
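A minimal sketch of this (autocorrelation-method LP with a direct Toeplitz solve; the Hamming windowing and predictor order are assumptions, and Levinson-Durbin would be the usual faster alternative):

    import numpy as np
    from scipy.linalg import toeplitz, solve

    def lp_coefficients(frame, p=20):
        # Conventional LP: minimize residual energy -> solve R a = r,
        # i.e. alpha = inverse(Toeplitz autocorrelation matrix) * autocorrelation vector.
        x = frame * np.hamming(len(frame))
        r = np.correlate(x, x, mode='full')[len(x) - 1:]   # autocorrelation lags r[0], r[1], ...
        R = toeplitz(r[:p])                                 # p x p Toeplitz autocorrelation matrix
        return solve(R, r[1:p + 1])                         # predictor coefficients alpha

    def lp_power_spectrum(a, n_fft=512):
        # All-pole spectrum 1 / |A(e^jw)|^2 with A(z) = 1 - sum_k a_k z^-k (gain omitted).
        A = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a))), n_fft)
        return 1.0 / np.abs(A) ** 2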
Another all-pole modeling method is temporally weighted linear prediction.
It is based on the same idea as linear prediction: again we assume that a speech sample can be estimated from its previous samples, but this time we compute the optimum predictor coefficients by minimizing the weighted energy of the residual error.
Here psi_n is the weighting function, and the short-time energy is used as the weighting function.
Again the optimum predictor coefficients are obtained by multiplying the inverse of the autocorrelation matrix by the autocorrelation sequence, but this time both have to be modified because of the weighting function: the inverse of the weighted autocovariance matrix is multiplied by the weighted autocorrelation sequence.
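A sketch of weighted LP under one common assumption, namely that the weight psi[n] is the short-time energy of the previous p samples (the exact weighting window used in this setup is not spelled out here):

    import numpy as np
    from scipy.linalg import solve

    def wlp_coefficients(frame, p=20, m=None):
        # Weighted LP: minimize sum_n psi[n] * e(n)^2 with a short-time-energy weight.
        x = np.asarray(frame, dtype=float)
        m = p if m is None else m
        # psi[n]: energy of the m samples preceding n (assumed STE weighting)
        psi = np.array([x[max(0, n - m):n] @ x[max(0, n - m):n] for n in range(len(x))])
        psi = np.maximum(psi, 1e-12)
        C = np.zeros((p, p))   # weighted autocovariance matrix
        c = np.zeros(p)        # weighted correlation vector
        for n in range(p, len(x)):
            past = x[n - p:n][::-1]          # x[n-1], ..., x[n-p]
            C += psi[n] * np.outer(past, past)
            c += psi[n] * x[n] * past
        return solve(C, c)                   # optimum predictor coefficients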
However, this LP spectrum estimate can sometimes be problematic, especially from the speech coding point of view: if we have speech produced by a high-pitched speaker, it will eventually cause sharp peaks in the spectrum, and this needs to be solved from the speech coding perspective.
Recently, to solve this problem, regularized linear prediction was proposed by Ekman to smooth the spectrum produced by high-pitched speakers.
To smooth these sharp peaks in the spectrum, they modify conventional linear prediction by adding a penalty function phi(a), which is a function of the predictor coefficients, weighted by a regularization factor lambda.
Here is the penalty function; in the original paper it was selected as the formula shown here, where A' is the frequency derivative of the RLP spectrum and ... is a spectrum envelope estimate.
The reason for selecting this kind of penalty function is that it yields a closed-form, non-iterative solution; that is why they use this penalty function.
If we look at the paper, we can find the details of how the penalty function can be written in the matrix form given here.
Here F is the Toeplitz matrix of the windowed autocorrelation sequence. In the paper they use the boxcar window, which is a special case, because if we use the boxcar window as v(m), the F matrix corresponds to the autocorrelation matrix that we use in conventional linear prediction.
In this study we also consider the Blackman and Hamming windows, which make a difference when computing the F matrix, as I will show in the next slide.
D is just a diagonal matrix in which each diagonal element equals its row (or column) number.
Given the penalty function, the optimum predictor coefficients can be computed with the equation given here.
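Reading the closed form on the slide the way Ekman's RLP paper presents it, i.e. as (R + lambda * D F D) a = r, a sketch could look like the following (the choice of window v(m) and the exact form of the solution are assumptions based on that reading):

    import numpy as np
    from scipy.linalg import toeplitz, solve

    def rlp_coefficients(frame, p=20, lam=1e-7, v=None):
        # Regularized LP (assumed closed form): (R + lam * D F D) a = r, where
        # F is the Toeplitz matrix of the windowed autocorrelation sequence and
        # D = diag(1, 2, ..., p).
        x = frame * np.hamming(len(frame))
        r_full = np.correlate(x, x, mode='full')[len(x) - 1:]
        R = toeplitz(r_full[:p])
        r = r_full[1:p + 1]
        v = np.ones(p) if v is None else np.asarray(v)[:p]   # boxcar window recovers F = R
        F = toeplitz(v * r_full[:p])
        D = np.diag(np.arange(1.0, p + 1.0))
        return solve(R + lam * (D @ F @ D), r)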
So this is what Ekman proposed in his original work, and we wanted to build on it, because that method was applied to clean speech and from the speech coding point of view; we wanted to see the performance of this method in speaker recognition, and especially under additive noise conditions.
In the literature there exist some works which use the double autocorrelation sequence to estimate the spectrum of given speech samples.
The most recent one is by Shimamura from Interspeech 2010; they use this double autocorrelation sequence to estimate the speech spectrum in the presence of additive noise, and they analyzed it in noisy word recognition experiments.
In this work we propose to use this double autocorrelation to compute the F matrix, which I explained in the previous slide, in the penalty function.
So basically, besides the Blackman and Hamming windows, we also use this double autocorrelation sequence in the penalty function, to see its effect on speaker verification.
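A sketch of the proposed variant: build F from the double autocorrelation sequence (the autocorrelation of the autocorrelation sequence) and drop it into the RLP sketch above in place of the windowed-autocorrelation F; the normalization and lag range are illustrative assumptions:

    import numpy as np
    from scipy.linalg import toeplitz

    def double_autocorrelation_F(frame, p=20):
        # F built from the 'double' autocorrelation: autocorrelate the signal,
        # then autocorrelate the resulting autocorrelation sequence.
        x = frame * np.hamming(len(frame))
        r = np.correlate(x, x, mode='full')[len(x) - 1:]     # ordinary autocorrelation
        rr = np.correlate(r, r, mode='full')[len(r) - 1:]    # autocorrelation of r
        return toeplitz(rr[:p] / rr[0])                      # assumed normalization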
If you look at the average residual error and the penalty function as a function of the predictor order p: as expected, as the predictor order increases the residual error decreases, but, again as expected, the penalty function phi increases, because, as I said at the beginning, the main idea of the regularization is to smooth the spectrum, so it should penalize an overly detailed spectrum.
Now let us look at the different spectrum estimation methods on this spectrum. This is again a speech frame, the clean speech sample on the left-hand side and, on the right-hand side, its noisy version, again at 0 dB SNR.
Here we have the FFT spectrum and the regularized linear prediction spectra for the different window functions.
As we see, the Blackman and Hamming windows do not affect the spectrum very much; they look very similar to each other. However, when we use the double autocorrelation sequence in the regularization, we get a much smoother spectrum.
Another point, and I think this is the main problem in the additive noise condition: if we look at the difference between the given spectra, for example the maximum and minimum values of the spectrum for both the clean and noisy conditions, there is a lot of variation between the clean and noisy cases. I think this mismatch causes the main problem for the recognition performance.
However, if you look at the double autocorrelation sequence, the mismatch and the variations become smaller in comparison with the two other methods.
So according to this figure, I expect to see not much difference in recognition performance between the Blackman and Hamming windows, because they produce almost the same spectra, but we should see some differences with the double autocorrelation sequence.
Ekman proposed regularization for conventional autocorrelation-based linear prediction; we also apply regularization to the other all-pole models, weighted linear prediction and its stabilized version, stabilized weighted linear prediction, because the regularization is independent of the all-pole model that we use.
We just need to compute the autocorrelation matrix and the autocorrelation sequence; once we have computed those two, we can regularize independently of the method. So whichever method we are using, it does not matter: we just regularize the method that we use, as in the sketch below.
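Since the penalty only needs the method's own normal-equation matrix and vector, the same regularization step can be reused for LP, WLP or SWLP; a sketch, using the same assumed penalty form as above:

    import numpy as np
    from scipy.linalg import solve

    def regularize_allpole(R, r, F, lam):
        # Works for any all-pole model: pass in the (weighted) autocorrelation or
        # autocovariance matrix R and vector r of that model, e.g. C and c from
        # the WLP sketch, plus the F matrix and the tuned lambda.
        p = len(r)
        D = np.diag(np.arange(1.0, p + 1.0))
        return solve(R + lam * (D @ F @ D), r)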
If we look at the experimental setup, we use the NIST 2002 corpus with GMM-UBM modeling.
For the features, we first apply spectral subtraction to the noisy speech samples, then we extract 12 MFCCs and their first- and second-order derivatives, followed by cepstral mean and variance normalization. We also apply T-norm normalization to the log-likelihood scores.
We use two different types of noise in the experiments, factory and babble noise from the NOISEX-92 database, at five different SNR levels.
I would like to point out that we added noise only to the test set speech samples; we did not touch the training samples, which are the original NIST samples. Maybe "clean" is not the correct term, because this is telephony speech, so it also includes some noise, convolutive noise I mean, not additive noise, but I refer to the original NIST samples as clean.
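For the noisy test condition, the noise is scaled to the requested SNR before being added to the test utterance; a generic sketch of that step (not the exact tooling used in these experiments):

    import numpy as np

    def add_noise_at_snr(speech, noise, snr_db):
        # Scale the noise segment so the speech-to-noise power ratio equals snr_db.
        noise = noise[:len(speech)]
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
        return speech + scale * noise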
The first thing we need to do when using regularization is to optimize the lambda parameter, because it has a big impact: if you look at the regularization formula here in the dark box, when lambda equals zero it reduces to conventional linear prediction. So we need to optimize it first.
In our experiments, to optimize it, we run the speaker recognition experiments on the original case, I mean original training and original test data, for different values of lambda. When we look at the equal error rate as a function of lambda, we get the smallest equal error rate when lambda is 10 to the power of -7, so in the further experiments we will use this lambda value in regularized linear prediction.
For regularized weighted linear prediction and its stabilized version we optimize the lambda value in a similar way; we optimize the lambda value for each method separately.
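The lambda tuning itself is just a grid search on the clean condition; a sketch, where run_verification is a hypothetical helper assumed to return target and non-target scores for a given lambda:

    import numpy as np

    def equal_error_rate(tar, non):
        # Simple EER estimate: find the threshold where miss and false-alarm rates cross.
        thresholds = np.sort(np.concatenate([tar, non]))
        miss = np.array([np.mean(tar < t) for t in thresholds])
        fa = np.array([np.mean(non >= t) for t in thresholds])
        i = np.argmin(np.abs(miss - fa))
        return (miss[i] + fa[i]) / 2.0

    def pick_lambda(lambdas, run_verification):
        # Keep the lambda giving the smallest EER, optimized separately per method.
        eers = [equal_error_rate(*run_verification(lam)) for lam in lambdas]
        return lambdas[int(np.argmin(eers))]

    # e.g. best_lam = pick_lambda([10.0 ** k for k in range(-10, -3)], run_verification)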
In the first experiment we just want to see the effect of the autocorrelation windowing on the recognition performance.
As we can see from the table, the boldface numbers show the best value in each row. When we look at the different windows, as I mentioned when we looked at the spectra, the different window functions do not have a big effect on the recognition performance; however, using the double autocorrelation sequence for the regularization reduces the error rate significantly.
So in the remaining experiments we are going to use the double autocorrelation sequence for the regularization, and also for the regularization of the other all-pole modeling techniques, I mean weighted linear prediction and stabilized weighted linear prediction.
When we look at this table, FFT is our baseline that we normally use in MFCC extraction. As we can see, in the clean case regularization does not improve but also does not harm the performance, but in the noisy cases, especially the 0 dB and -10 dB cases, the regularization improves the recognition accuracy compared to the un-regularized version of each pair, for example LP vs. RLP.
When we look at the numbers there, for example -10 dB babble noise, the EER reduces from 20% to 16%, and the same holds for the other all-pole models, regularized weighted linear prediction and regularized stabilized weighted linear prediction (RSWLP).
To show some DET curves: this is the babble noise, -10 dB SNR case, with FFT, conventional LP, and regularized LP using the double autocorrelation sequence. We can again see the large improvement obtained with regularized LP.
The same holds for weighted linear prediction: we cannot see much difference between the conventional FFT and weighted linear prediction, but when we regularize it, the recognition performance is improved.
And the same for stabilized weighted linear prediction: if we regularize it, it also improves the recognition accuracy.
To summarize our observations: the first thing is that regularization does not harm the performance in the clean condition.
Different window functions do not affect the recognition performance a lot, but the double autocorrelation sequence, used to compute the F matrix in the regularization as the spectrum envelope estimate, improves the recognition accuracy.
We also applied regularization to other kinds of all-pole modeling techniques, such as weighted linear prediction and stabilized weighted linear prediction.
Thank you.
This regularization can help us improve the recognition performance because the distortion of the spectrum is the main problem in the case of additive noise; if we can penalize this distortion somehow, we can improve our recognition performance. That was the main point.
Not in the slides, no, but in the paper we have some deeper analysis of the regularization in terms of spectral distortion and so on; we have more experiments on that.
We optimized the lambda to get the smallest EER rather than the minimum MinDCF.