So hi everyone, I'll present our work on iterative Bayesian and MMSE-based noise compensation techniques for speaker recognition in the i-vector space.
So let's start by setting up the problem.
Here we are working on noise; noise is one of the biggest problems in speaker recognition, and a lot of techniques have been proposed in the past years to deal with it in different domains, such as speech enhancement techniques, feature compensation, model compensation, and robust scoring, and in the last years DNN-based techniques for robust feature extraction, robust computation of statistics, or i-vector-like representations of speech.
So what we are proposing here is a combination of two algorithms in order to clean up noisy i-vectors. We are using a clean front end, so a system trained using clean data, and a clean back end, so a clean scoring model.
So, the first algorithm: in previous work we presented I-MAP. It's an additive noise model operating in the i-vector space, based on two hypotheses: the Gaussianity of the i-vector distribution and the Gaussianity of the noise distribution in the i-vector space. Here I'm not saying that noise is additive in the i-vector space; we just use this model to represent the relationship between clean and noisy i-vectors, just to be clear.
So using the MAP criterion we can derive this equation, and we end up with a model where, given a noisy i-vector y0, we can denoise it, clean it up, using the clean i-vector distribution hyperparameters and the noise distribution hyperparameters.
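To make the form of this concrete, here is a minimal sketch of the MAP estimate under the stated Gaussian hypotheses, assuming the additive model y0 = x + n with x ~ N(mu_x, cov_x) and n ~ N(mu_n, cov_n); the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def imap_denoise(y0, mu_x, cov_x, mu_n, cov_n):
    """MAP estimate of the clean i-vector x given the noisy i-vector
    y0 = x + n, under Gaussian hypotheses x ~ N(mu_x, cov_x) and
    n ~ N(mu_n, cov_n). A sketch of the standard Gaussian posterior
    mean, not necessarily the authors' exact implementation."""
    prec_x = np.linalg.inv(cov_x)   # precision of the clean i-vector prior
    prec_n = np.linalg.inv(cov_n)   # precision of the noise distribution
    post_cov = np.linalg.inv(prec_x + prec_n)
    # Blend the prior mean with the noise-mean-compensated observation.
    return post_cov @ (prec_x @ mu_x + prec_n @ (y0 - mu_n))
```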
In practice this algorithm is implemented like this. Given a test segment, we start by checking its SNR level; if the segment is clean, we are okay. If it's not, we extract the noisy version of the i-vector, y0, and then, using a voice activity detection system, we extract noise from the signal using the silence intervals. Then we inject this noise into clean training utterances. This way we have clean i-vectors and their noisy versions using the test noise, so we can build the noise model using a Gaussian distribution, and then we can use the previous equation to clean up the noisy i-vectors.
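For the noise-model step, a minimal sketch, assuming we already have the clean training i-vectors and their noise-injected counterparts as paired rows (the names are mine, for illustration):

```python
import numpy as np

def estimate_noise_model(clean_ivecs, noisy_ivecs):
    """Fit the Gaussian noise distribution in the i-vector space from
    paired clean / noise-injected i-vectors (one i-vector per row).
    Sketch only; the paper's estimator may differ in detail."""
    offsets = noisy_ivecs - clean_ivecs      # per-utterance noise offsets
    mu_n = offsets.mean(axis=0)              # noise mean in i-vector space
    cov_n = np.cov(offsets, rowvar=False)    # noise covariance
    return mu_n, cov_n
```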
So, the novelty of this paper is: how can we improve I-MAP? The problem is that we cannot apply I-MAP many times successively, iteratively, because we cannot guarantee the Gaussian hypothesis on the residual noise. So the solution that we came up with is to use another algorithm and to alternate iteratively between these two algorithms in order to achieve better denoising of the i-vectors.
This second algorithm is called the Kabsch algorithm; it's used mainly in chemistry to align different molecules. Here we're applying it to i-vectors: we're starting from noisy i-vectors, and we want to estimate the best translation and rotation matrix in order to go to the clean version.
So formally, the formulation of the problem is known as the Procrustes problem. It starts with two data matrices: the noisy i-vectors represented as a matrix, and the clean version. This way we can estimate the best rotation matrix R that relates the two.
So in training, as we said, we are estimating a translation vector and a rotation matrix. To get rid of the translation we start by centering the data: we compute the centroid of the clean data and of the noisy data, and then we center the clean and noisy i-vectors. Then we can compute the best rotation matrix between the noisy i-vectors and their clean versions using an SVD decomposition.
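A minimal sketch of this training step, assuming paired noisy and clean i-vectors stacked as rows; the sign-correction step is the standard Kabsch detail that keeps the result a proper rotation:

```python
import numpy as np

def kabsch_train(noisy_ivecs, clean_ivecs):
    """Estimate the centroids (translation) and the rotation matrix R
    mapping centered noisy i-vectors onto their clean versions,
    via the Kabsch / SVD solution. One paired i-vector per row."""
    mu_noisy = noisy_ivecs.mean(axis=0)
    mu_clean = clean_ivecs.mean(axis=0)
    Yc = noisy_ivecs - mu_noisy              # centered noisy data
    Xc = clean_ivecs - mu_clean              # centered clean data
    H = Yc.T @ Xc                            # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Force det(R) = +1 so R is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag(np.append(np.ones(len(mu_noisy) - 1), d))
    R = Vt.T @ D @ U.T
    return mu_noisy, mu_clean, R
```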
Once we've done this, when we have the best translation and rotation for a given noise, at test time we extract the test i-vector, and we start by applying the first translation: here we subtract the centroid of the noisy i-vectors. Then we apply the rotation, and then the other translation, to end up with its clean version.
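And the corresponding test-time application, plus a hypothetical sketch of the alternation between the two algorithms; it relies on imap_denoise and kabsch_train defined above, and n_iters and the parameter packing are illustrative:

```python
def kabsch_apply(y, mu_noisy, mu_clean, R):
    """Subtract the noisy centroid, rotate, then translate to the
    clean centroid: the three steps described above."""
    return R @ (y - mu_noisy) + mu_clean

def iterative_denoise(y0, imap_params, kabsch_params, n_iters=2):
    """Sketch of alternating the two algorithms on a noisy i-vector.
    imap_params = (mu_x, cov_x, mu_n, cov_n);
    kabsch_params = (mu_noisy, mu_clean, R)."""
    y = y0
    for _ in range(n_iters):
        y = imap_denoise(y, *imap_params)      # Bayesian cleanup
        y = kabsch_apply(y, *kabsch_params)    # geometric re-alignment
    return y
```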
So we use NIST and Switchboard data for training, and NIST 2008 for test, the det7 condition. We are using 19 MFCC coefficients plus energy plus their first and second derivatives, a 512-component GMM, our i-vectors have 400 dimensions, and we are using two-covariance scoring.
So here we are applying each algorithm independently and then combining the two. With the first algorithm, I-MAP, we can achieve from 40% to 60% relative equal error rate improvement for each noise. With the second algorithm we achieved up to 45% equal error rate improvement. But when we combine the two, for one or two iterations, we can end up with up to 85% equal error rate improvement. Here I presented the results for male data; for female data the error rates are a little bit higher, but it's efficient for both.
And here we compare the two algorithms and their combination on a heterogeneous setup, that is, when we use a lot of data, noisy and clean, for enrollment and test, with different SNR levels on the enrollment and test sides, and we can see that it remains efficient in this context.
So as a summary: using I-MAP or the Kabsch algorithm we can improve the equal error rate by 40% to 60%, but the interesting part is that combining the two can achieve far better gains. Thank you.
So, do we have questions?
Is the rotation matrix noise-independent, or is it noise-dependent? Yes, it's different for each noise, sorry. Yes, here we're estimating, for each different noise, a different translation and rotation matrix.
We just want to show the efficiency of this technique, but in the future, in another paper that will be published at Interspeech, I guess, well, it's accepted, we propose another approach that does not assume a certain model of noise in the i-vector space, and that can be trained using many noises and used efficiently on test data with different noises. So here it's just to show how far we can go in the best-case scenario, but in the other paper we show how we can extend this to deal with many noises.
Nice presentation. So, if you go back many years ago, Lim and Oppenheim had a sequential MAP estimation that was used for speech enhancement; it iterated back and forth between noise suppression filters and speech parameterization, so you're iterating back and forth between two algorithms here. You showed results with one iteration, two iterations. So, well, maybe two questions here: is there any way to come up with some form of convergence criterion that you can assess? And second, is there any way to look at the i-vectors as you go through the two iterations to see which i-vectors are actually changing the most? That might tell you a little bit more about which vectors are more sensitive to the type of noise.
So the first question was: is there any way to look at a convergence criterion, because when you iterate you need to know whether you have converged, okay. Well, here what we did is just to iterate many times and see at which level we start making the results worse. So it's not really, we haven't gone there yet.
So if you look at the two noise types, you had fan noise and I think you had car noise, so both are low-frequency type noises. Can you see if you have similar changes in the i-vectors in both those noise types?
Yes, maybe I can't comment on that because I haven't done the full analysis, but what I can tell you for sure is that the efficiency depends on which noise you're dealing with. It's efficient for all of them, but it can vary, in a way that makes it more efficient if we have different noises between enrollment and test.
Thank you for the nice presentation. A while ago I tried to read the original I-MAP paper, so if you don't mind I'll just ask a question about the original I-MAP, not the iterative one. Sorry, I didn't understand: the original I-MAP? Yes, not the iterative one. Okay, so, I mean, can you go back to the block diagram of this, of I-MAP? Yes.
So you're extracting noise from the signal, or somehow estimating the noise in the signal. And then you go up to the very noisy end of 0 dB, where the speech and noise are of similar or the same strength. Can you tell us how you are extracting noise from the signal at 0 dB?
So here we're using an energy-based voice activity detection system, but we're making the threshold more strict in order to avoid ending up with speech confused as noise. So it's not that we developed a sophisticated voice activity detection system specifically for this task; we're avoiding, as much as possible, ending up with speech, by using a very strict threshold on the energy.
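As a rough illustration of what a deliberately strict energy threshold looks like (the frame sizes and the 15 dB margin are assumptions, not values from the talk):

```python
import numpy as np

def noise_frames(signal, frame_len=400, hop=160, margin_db=15.0):
    """Keep only very low-energy frames as 'noise': an energy-based
    selection with a strict threshold, so speech is unlikely to be
    mistaken for noise. Illustrative sketch only."""
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop]
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    threshold = energy_db.max() - margin_db   # well below the loudest frames
    return frames[energy_db < threshold]
```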
It's just quite amazing, the level of improvement you gain, from twenty-something to eight percent; it is quite something. It feels that you have a very good model of the noise here, and if you have such a thing, then it would make sense also to just check with speech enhancement, I mean, an MMSE-based approach like Wiener filtering. If you have a good model to counteract the noise, then it would be good to also compare with that, to do feature enhancement, noise reduction, and compare with that as well. Just a comment. Yes, okay.
Okay, there don't seem to be any more questions, so let's thank the speaker.