Hello. I am from the computer science laboratory of Avignon University, France.

The title of our work is "Denoising x-vectors for Robust Speaker Recognition".

In this work we focus on compensating for additive noise in speaker recognition systems that use x-vector embeddings.

First, we discuss the problem of additive noise and its effect on speaker recognition, specifically in the x-vector framework.

After that, we give a summary of previous works that are known to compensate for additive noise at different levels.

Then we discuss the different denoising techniques that we used to compensate for additive noise on x-vectors.

Here you can see the names of the denoising techniques that we used: i-MAP and denoising autoencoders, which are existing techniques, and the Gaussian denoising autoencoder and the stacked denoising autoencoder, which are new architectures that we introduce in this paper.

After that, I will talk about the experimental protocol and the results achieved by the denoising techniques in noisy environments.

Here you can see the problem of additive noise. There are recent techniques for speaker modeling, like deep learning techniques, that use data augmentation to add information about noise, reverberation, and so on, in order to create a system that is robust in noisy environments.

But even when we use a state-of-the-art speaker modeling system like the x-vector, if we encounter new noises that were not seen during data augmentation, the results degrade dramatically. This problem motivates us to do compensation for additive noise, as was done before in the i-vector framework. In a speaker recognition system we are not looking for a clean signal itself; we just want to preserve the performance of recognizing speakers.

We can do noise compensation at different levels: at the signal level, at the feature level (for example, doing noise compensation on MFCCs), or at higher levels like the x-vector and i-vector speaker modeling level. In our research we try to do compensation at the x-vector/i-vector level, because at this level the vectors have a Gaussian distribution and a lower dimensionality, so working at this level is easier.

In previous works we can see that some researchers work at the signal level. In the first row you can see a paper from 2019 in which different techniques, convolutional and BLSTM networks, are used to denoise features like the log-magnitude spectrum and the STFT. In another paper you can see that the denoising is done on raw speech.

In previous research in the i-vector domain, several statistical and neural techniques were proposed for denoising. For example, Ben Kheder et al. proposed i-MAP to map from noisy to clean i-vectors, and there are also other techniques, like denoising autoencoders, that act in the same manner and try to map from noisy i-vectors to clean i-vectors.

Based on that, because these denoising techniques yield good results in the i-vector domain, we can apply the previous techniques to x-vectors as well, or we can propose new techniques to denoise in the x-vector space.

The first technique that we used is i-MAP, a statistical technique for denoising in the i-vector space. In i-MAP we assume that both clean and noisy x-vectors follow a Gaussian distribution, and that the noise random variable is the difference between the noisy and clean vectors.

Here you can see the probability involved, where X0 is the noisy x-vector and X is the clean x-vector. We use a MAP estimator to estimate, from X0, the clean, denoised version of the x-vector.

Here you can see the final solution of this formula, sketched below. Sigma_n is the covariance of the noise vectors estimated from the training data, mu_n is the average of the noise vectors, Sigma_x is the covariance of the clean x-vectors used for training, and mu_x is the average of the clean x-vectors.
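
A sketch of the closed-form solution, under the Gaussian assumptions above (notation mine, so it may differ superficially from the exact form on the slide):

```latex
% Sketch: MAP estimate of the clean vector X given the noisy observation
% X_0 = X + N, with X ~ N(mu_x, Sigma_x) and N ~ N(mu_n, Sigma_n) independent.
\hat{x} \;=\; \mu_x \;+\; \Sigma_x\,(\Sigma_x + \Sigma_n)^{-1}\,\bigl(x_0 - \mu_n - \mu_x\bigr)
```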

The second technique used in our paper for denoising is the denoising autoencoder. A conventional denoising autoencoder tries to minimize L(x, f(y)), where L is the loss function, y is the distorted (noisy) x-vector, f(y) is the output of the denoising autoencoder, and x is the original clean x-vector. Briefly, a denoising autoencoder is trained to minimize the distance between the denoised x-vectors and the clean x-vectors.

We use this architecture in our research. Here you can see that in the input and output layers we use 512 nodes with a linear activation function. The number of nodes in these layers is the same because we want to map noisy x-vectors exactly to clean x-vectors, so the input and output layers must have exactly the same dimension as the x-vectors.

In the hidden layer we use 1024 nodes with a nonlinear hyperbolic tangent activation function. The loss function used for denoising in this paper is the mean squared error, and our denoising autoencoder is trained with stochastic gradient descent.

It should be mentioned that we used 1024 nodes in the hidden layer because with a small number of nodes in this layer we may have a loss of information; it is better to use a larger number of nodes in the hidden layer. A sketch of this architecture follows.
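
A minimal sketch of the autoencoder just described, assuming PyTorch; the 512/1024 sizes, tanh hidden activation, MSE loss, and SGD come from the talk, while the learning rate and training-step details are illustrative assumptions:

```python
# Sketch of the DAE: 512-d x-vectors in and out (linear activations),
# one 1024-node tanh hidden layer, MSE loss, trained with SGD.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),  # input -> hidden
            nn.Tanh(),               # nonlinear hidden activation
            nn.Linear(hidden, dim),  # hidden -> output (linear)
        )

    def forward(self, noisy):
        return self.net(noisy)

model = DenoisingAutoencoder()
criterion = nn.MSELoss()  # mean squared error
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # lr is an assumption

def train_step(noisy_xvec, clean_xvec):
    """One SGD step mapping noisy x-vectors to their clean counterparts."""
    optimizer.zero_grad()
    loss = criterion(model(noisy_xvec), clean_xvec)
    loss.backward()
    optimizer.step()
    return loss.item()
```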

Another technique used in our paper is the combination of the denoising autoencoder and i-MAP, here applied to x-vectors even though i-MAP was originally used for the i-vector system. In this architecture we have noisy x-vectors; we first try to denoise these vectors with the denoising autoencoder, and then we give the output of the denoising autoencoder to i-MAP. By doing this step we push our system to produce denoised x-vectors that have a known statistical distribution. A sketch of this two-step pipeline follows.
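
A minimal sketch of the pipeline, assuming NumPy; `imap_denoise` is a hypothetical helper implementing the Gaussian MAP estimate shown earlier, and `dae_model` stands for any callable mapping a noisy x-vector to a first-pass denoised vector:

```python
# Sketch: the DAE output is refined by i-MAP. The statistics mu_x, sigma_x,
# mu_n, sigma_n are assumed to be estimated from training pairs beforehand.
import numpy as np

def imap_denoise(x0, mu_x, sigma_x, mu_n, sigma_n):
    """Gaussian MAP estimate of the clean vector given a noisy one."""
    gain = sigma_x @ np.linalg.inv(sigma_x + sigma_n)
    return mu_x + gain @ (x0 - mu_n - mu_x)

def dae_then_imap(noisy_xvec, dae_model, mu_x, sigma_x, mu_n, sigma_n):
    first_pass = dae_model(noisy_xvec)  # step 1: neural denoising
    return imap_denoise(first_pass, mu_x, sigma_x, mu_n, sigma_n)  # step 2
```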

In another technique that we introduce, which we call the Gaussian denoising autoencoder, given noisy x-vectors at the input, we force the denoising autoencoder to produce a Gaussian distribution for the denoised x-vectors. Here you can see the loss function that imposes this restriction on the output of the denoising autoencoder. Again, mu_n is the average of the noisy x-vectors, Sigma_n is the covariance of the noisy x-vectors, mu_x is the average of the clean x-vectors, and Sigma_x is the covariance of the clean x-vectors. A rough sketch of such a loss follows.
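
As a rough sketch only, a distribution-matching loss consistent with this description: a reconstruction term plus penalties pulling the statistics of the denoised outputs toward the clean statistics (the weights lambda_1 and lambda_2 are illustrative assumptions, not the paper's exact expression):

```latex
% Sketch: MSE reconstruction plus penalties pushing the batch mean and
% covariance of the denoised outputs f(y) toward the clean statistics.
L = \lVert f(y) - x \rVert^2
  + \lambda_1 \lVert \mu_{f(y)} - \mu_x \rVert^2
  + \lambda_2 \lVert \Sigma_{f(y)} - \Sigma_x \rVert_F^2
```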

The final technique that we used is the stacked denoising autoencoder. This type of denoising autoencoder tries to find an estimation of the noise; by estimating the noise we can achieve better results. Indeed, we ran an experiment in which we gave the network the exact information about the noise, and we obtained very good results, close to the clean environment.

We use this architecture, in which we first give the noisy x-vectors to the first denoising autoencoder block and obtain a first estimation of the clean x-vector. By calculating the difference between the noisy x-vectors and the output of the first block, we find an estimation of the noise, and we give this information to the second block. We repeat in the same manner, so that a better estimation of the noise is available as extra information in the next block, which yields better results at the output. We train all these blocks jointly; a sketch follows.
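
A minimal sketch of this stacked idea, again assuming PyTorch; how the noise estimate is injected into the next block (here, concatenated with the noisy input) and the number of blocks are illustrative assumptions:

```python
# Sketch: each block after the first receives the noisy x-vector together
# with the current noise estimate (noisy minus the previous block's output).
import torch
import torch.nn as nn

class StackedDenoiser(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024, n_blocks: int = 2):
        super().__init__()
        self.first = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, dim))
        # Later blocks see [noisy, noise_estimate], hence 2 * dim inputs.
        self.rest = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(),
                          nn.Linear(hidden, dim))
            for _ in range(n_blocks - 1)])

    def forward(self, noisy):
        denoised = self.first(noisy)          # first estimate of the clean vector
        for block in self.rest:
            noise_est = noisy - denoised      # current estimate of the noise
            denoised = block(torch.cat([noisy, noise_est], dim=-1))
        return denoised                       # all blocks are trained jointly
```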

We have several datasets in our paper. MUSAN is used for data augmentation to train the x-vector extraction network, and separate noise sets are used to create the noisy x-vectors for training and testing the denoising techniques. Also, two corpora are used: VoxCeleb2 with data augmentation is used to train the x-vector extraction network, and a combination of VoxCeleb1 and VoxCeleb2 is used to create noisy x-vectors to train the denoising techniques.

FABIOLE is a French corpus that is used for test and enrollment in our experiments. We divide the FABIOLE corpus into subsets based on the duration of the files, to calculate results for different durations.

Here you can see the steps that we followed in our experiments. First, we trained the x-vector network following the Kaldi recipe; to train this network we used VoxCeleb2 with data augmentation. Then we used this network to create training data for the denoising techniques: we created about four million noisy/clean pairs of x-vectors from VoxCeleb1 and VoxCeleb2. We also extracted the enrollment and test x-vectors from the FABIOLE speech corpus.

We also added noise to our test data to create a noisy version. We used a separate noise set because we want to make our system robust against unseen noises: we used MUSAN to perform data augmentation when training the x-vector network, but in this step we used a different noise collection. Moreover, the noise files that are used to create the noisy x-vectors for training the denoising models are different from the noises that are used for the test, so the noises used in the test are unseen.

After that, we train PLDA and we do the scoring; PLDA is used as the back-end scoring technique. But before scoring, we do denoising in order to reduce the effect of noise on our test files. The outline below summarizes this pipeline.
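
As an outline only (every helper name here is hypothetical; it just fixes the order of operations, with denoising applied before PLDA scoring):

```python
# Hypothetical outline of the evaluation pipeline: extract x-vectors, denoise
# the noisy test side, then score trials with a PLDA back-end.
def evaluate(enroll_wavs, noisy_test_wavs, extractor, denoise, plda):
    enroll = [extractor(w) for w in enroll_wavs]
    test = [denoise(extractor(w)) for w in noisy_test_wavs]    # denoise first
    return [[plda.score(e, t) for t in test] for e in enroll]  # then score
```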

Here you can see the results; we use the equal error rate metric for the different experiments. In the first row you can see the results for different durations. For example, in the first column, for utterances shorter than two seconds we have 11.59% equal error rate when there is no noise, and for utterances longer than twelve seconds we have 0.8%.

In the second row we can see the impact of noise: for example, for short utterances the equal error rate increases from eleven to fifteen, and for utterances longer than twelve seconds it increases from 0.8 to 5.1.

These results show that it is important to do denoising before scoring. Since the noise is unseen by the x-vector network, our assumption holds, and using a denoising component before scoring is very important.

Here you can see the results obtained by the statistical i-MAP technique: for utterances longer than twelve seconds, the equal error rate is reduced from 5.1 to 2.6.

The next row shows the results obtained after applying the denoising autoencoder in the x-vector space. In the following row we see the results obtained by the combination of the denoising autoencoder and i-MAP. The row after that shows the results obtained with the Gaussian loss function that we used in our experiments to train the denoising autoencoder, imposing that the denoised x-vectors belong to a Gaussian distribution.

Here you see the results for the stacked denoising autoencoder. In the next row you can see the results when we use just two blocks, the first and the second block. As you can see, in both cases the results are better than the previous techniques for utterances between eight and ten seconds, between ten and twelve seconds, and longer than twelve seconds.

In the last row you see the results for the configuration where we use the full stacked denoising autoencoder, exactly the architecture shown in this figure. In this case, in almost all conditions we have better results than the previous techniques.

In our paper we showed that data augmentation and deep learning techniques are important for achieving a noise-robust speaker recognition system, but once we are in the x-vector space we can obtain better results if we use denoising techniques.

We showed that a simple statistical method like i-MAP, which was used in the i-vector space, can be used to denoise x-vectors as well.

After that, we showed that merging the advantages of the statistical i-MAP and the denoising autoencoder can give even better results.

Finally, we introduced a new technique called the stacked denoising autoencoder, which tries to find information about the noise and uses this information in the deeper blocks of the stack. With this technique we achieved, in almost all cases, better results than the statistical i-MAP technique and the conventional denoising autoencoders.

Thanks for your attention.