and she is in computer science at a university in France.
The title of our work is "Stacked Denoising Autoencoders for Robust Speaker Recognition".
In this work we focus on additive noise and how to compensate for additive noise in speaker recognition systems that use x-vector embeddings.
Firstly, we discuss the problem of additive noise and its effect on speaker recognition, specifically in the x-vector framework.
After that, we give a summary of previous works that are known to compensate for additive noise at different levels.
Then we discuss the different denoising techniques that we used to compensate for additive noise in the x-vector domain. Here you can see the names of the denoising techniques that we used: i-MAP and conventional denoising autoencoders are older techniques, while the Gaussian denoising autoencoder and the stacked denoising autoencoder are new architectures that we introduce in this paper.
After that, I will talk about the experimental protocol and the results achieved by the denoising autoencoders in noisy environments.
Here you can see the problem of additive noise. There are new techniques used for speaker modeling, such as deep learning techniques that use data augmentation to add information about noise, reverberation and so on, in order to create a system that is robust in noisy environments.
However, even when we use a state-of-the-art speaker modeling system like the x-vector extractor, if we encounter new noises that were not seen during data augmentation, the results degrade dramatically. This problem motivates us to do compensation for additive noise in the x-vector framework.
In a speaker recognition system we are not looking to recover the clean signal itself; we just want to have good performance in recognizing speakers.
We can do noise compensation at different levels, I mean at lower levels like the signal or the features, for example doing noise compensation on MFCCs, or we can work at higher levels like the x-vector (speaker modeling) level. In our research we try to do the compensation at the x-vector level, because at this level the vectors have a Gaussian distribution, and working at this level, with lower dimensions, is easier.
In previous works we can see that some researchers work at the signal level. For example, in the first row you can see a paper from 2019 in which different techniques, such as convolutional and BLSTM networks, are used to denoise features like the log-magnitude spectrum and the STFT. In another paper you can see that the denoising is done on raw speech.
In previous research in the i-vector domain, several statistical and neural techniques were proposed for denoising. For example, Ben Kheder et al. proposed i-MAP to map from noisy to clean i-vectors, and there are also other techniques, like the joint use of i-MAP and denoising autoencoders, that act in the same manner and try to map from noisy i-vectors to clean i-vectors. Based on that, because the denoising techniques give good results in the i-vector domain, we can apply the previous techniques to x-vectors as well, or we can propose new techniques to do denoising in the x-vector space.
The first technique that is used for denoising is i-MAP, a statistical technique that was used for denoising in the i-vector space. In i-MAP we assume that both the clean and the noisy x-vectors follow a Gaussian distribution (this is an assumption), and that the noise random variable is the difference between the clean and the noisy vectors.
Here you can see the probability P(x | x_0), where x_0 is the noisy x-vector and x is the clean x-vector. We use MAP (maximum a posteriori) estimation to estimate x̂, a clean, denoised version of the x-vector, from the noisy observation.
Here you can see the final solution of this formula. Σ_N is the covariance of the noisy x-vectors that are used for training, μ_N is the average of the noisy x-vectors, Σ_X is the covariance of the clean x-vectors that are used for training, and μ_X is the average of the clean x-vectors.
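The closed-form solution itself is shown on the slide rather than read out here; as a sketch under the stated Gaussian assumptions (treating μ_N and Σ_N as the statistics of the additive noise term n = x_0 − x), the standard MAP estimate of the clean x-vector would be

\hat{x} = \left(\Sigma_X^{-1} + \Sigma_N^{-1}\right)^{-1}\left(\Sigma_X^{-1}\,\mu_X + \Sigma_N^{-1}\left(x_0 - \mu_N\right)\right)

This is the textbook Gaussian posterior mean; the exact i-MAP expression used in the paper may be parameterized differently, for example directly in terms of the noisy-vector statistics.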
The second technique that is used in our paper for denoising is the conventional denoising autoencoder. A conventional denoising autoencoder tries to minimize L(x, f(x̃)), where L is the loss function, x̃ is the distorted (noisy) x-vector, and f(x̃) is the output of the denoising autoencoder given the noisy x-vector. Briefly, a denoising autoencoder tries to minimize the distance between the reconstructed noisy x-vectors and the clean x-vectors.
We use this architecture in our research. Here you can see that in the input and output layers we use five hundred and twelve nodes with a linear activation function; the number of nodes in these layers is the same because we want to exactly map the noisy x-vectors, so we want to have exactly the same dimension as the noisy x-vectors in the input and output layers. In the hidden layer we use one thousand and twenty-four nodes with a nonlinear hyperbolic tangent activation function.
Here you can see the training details. The loss function that is used for denoising in this paper is the mean squared error, and the denoising autoencoder is trained with stochastic gradient descent. It should be mentioned that we used one thousand and twenty-four nodes in the hidden layer because, if we use a small number of nodes in this layer, we may lose information; it is better to use more nodes in the hidden layer.
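As an illustration, here is a minimal PyTorch sketch of such a denoising autoencoder, following the description above (512-dimensional input and output, a 1024-node tanh hidden layer, linear output, MSE loss, SGD); the learning rate is an illustrative placeholder, not a value from the paper.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """512-d input/output (x-vector dimension), 1024-d tanh hidden layer,
    linear output layer, as described in the talk."""
    def __init__(self, xvec_dim=512, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(xvec_dim, hidden_dim),
            nn.Tanh(),                        # nonlinear hidden activation
            nn.Linear(hidden_dim, xvec_dim),  # linear output layer
        )

    def forward(self, noisy_xvec):
        return self.net(noisy_xvec)

def train_step(model, optimizer, noisy_batch, clean_batch):
    """One SGD step minimizing the MSE between denoised and clean x-vectors."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy_batch), clean_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

model = DenoisingAutoencoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is an assumption
```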
Another technique that is used in our paper is the combination of the denoising autoencoder and i-MAP; here we call it x-MAP because we use it for the x-vector system. In this architecture we have noisy x-vectors; we first try to denoise these vectors with the denoising autoencoder, then we give the output of the denoising autoencoder to x-MAP. By doing this step we push our system to produce x-vectors that have a normal statistical distribution.
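A rough sketch of this combination, assuming the Gaussian MAP estimate given earlier as the statistical step; the function names such as imap_denoise are illustrative placeholders, not an API from the paper.

```python
import numpy as np
import torch

def imap_denoise(x0, mu_x, sigma_x, mu_n, sigma_n):
    """Gaussian MAP estimate of the clean x-vector given a noisy one.
    Assumes clean ~ N(mu_x, sigma_x) and additive noise ~ N(mu_n, sigma_n);
    a sketch of the statistical step, not the paper's exact formula."""
    a = np.linalg.inv(sigma_x)   # inverse clean covariance
    b = np.linalg.inv(sigma_n)   # inverse noise covariance
    return np.linalg.solve(a + b, a @ mu_x + b @ (x0 - mu_n))

def dae_then_map(dae_model, noisy_xvec, mu_x, sigma_x, mu_n, sigma_n):
    """Combined pipeline: neural denoising first, then the statistical MAP step."""
    with torch.no_grad():
        first_pass = dae_model(torch.as_tensor(noisy_xvec, dtype=torch.float32)).numpy()
    return imap_denoise(first_pass, mu_x, sigma_x, mu_n, sigma_n)
```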
In another technique that we introduce, called the Gaussian denoising autoencoder, we give noisy x-vectors as input and we push the denoising autoencoder to give a Gaussian distribution to the denoised x-vectors. Here you can see the loss function that imposes these restrictions on the output of the denoising autoencoder. Here again, μ_N is the average of the noisy x-vectors, Σ_N is the covariance of the noisy x-vectors, μ_X is the average of the clean x-vectors, and Σ_X is the covariance of the clean x-vectors.
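The exact regularized loss is shown on the slide rather than read out here; as one possible sketch of the idea (not the paper's actual formula, which also involves the noisy statistics μ_N and Σ_N), a reconstruction term can be combined with penalties that pull the batch statistics of the denoised outputs toward the clean x-vector statistics:

\mathcal{L} = \frac{1}{B}\sum_{i=1}^{B} \lVert f(\tilde{x}_i) - x_i \rVert^2 + \lambda_1 \lVert \hat{\mu} - \mu_X \rVert^2 + \lambda_2 \lVert \hat{\Sigma} - \Sigma_X \rVert_F^2

where \hat{\mu} and \hat{\Sigma} are the mean and covariance of the denoised batch, and the weights \lambda_1, \lambda_2 are illustrative assumptions.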
The final technique that we used is the stacked denoising autoencoder. This type of denoising autoencoder tries to find an estimation of the noise. By estimating the noise we can achieve better results, because we did an experiment in which we gave the exact information about the noise to the system, and we achieved very good results, close to the clean environment. We use this architecture in which we first give the noisy x-vectors to the first denoising autoencoder and obtain a first estimation of the denoised x-vector. By calculating the difference between the noisy x-vectors and the output of the first block, we try to find an estimation of the noise, and we give this information to the second block; we repeat in the same manner to have a better estimation of the noise. Using this information in the next block leads to better results at the output. We train all these blocks jointly.
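As an illustration of this structure, here is a minimal PyTorch sketch, assuming each block has the same 512–1024–512 shape as before and that the noise estimate is concatenated with the noisy x-vector at the input of the next block; these details are assumptions, and the paper's exact block design may differ.

```python
import torch
import torch.nn as nn

class StackedDenoiser(nn.Module):
    """Sketch of the stacked denoising autoencoder described above.
    Block 1 produces a first denoised estimate; the residual
    (noisy - estimate) serves as a noise estimate that is fed, together
    with the noisy x-vector, to the next block."""
    def __init__(self, xvec_dim=512, hidden_dim=1024, n_blocks=3):
        super().__init__()
        self.first = nn.Sequential(
            nn.Linear(xvec_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, xvec_dim),
        )
        self.rest = nn.ModuleList([
            nn.Sequential(
                nn.Linear(2 * xvec_dim, hidden_dim), nn.Tanh(),
                nn.Linear(hidden_dim, xvec_dim),
            )
            for _ in range(n_blocks - 1)
        ])

    def forward(self, noisy):
        estimate = self.first(noisy)           # first denoised estimate
        for block in self.rest:
            noise_est = noisy - estimate       # current estimate of the noise
            estimate = block(torch.cat([noisy, noise_est], dim=-1))
        return estimate

# All blocks are trained jointly, e.g. with an MSE loss against clean x-vectors.
```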
We have several datasets in our paper. MUSAN is used for the data augmentation to train the x-vector extraction network, and noise files are used to create the noisy x-vectors for training and test that are used in the denoising techniques. Two VoxCeleb corpora are also used for training: VoxCeleb, with data augmentation, is used to train the x-vector network, and a combination of VoxCeleb1 and VoxCeleb2 is used to create the noisy x-vectors to train the denoising techniques. FABIOLE is a French corpus that is used for the test and enrollment in our experiments. We divide the FABIOLE corpus into subsets based on the duration of the files, to calculate results for different durations.
Here you can see the steps that we followed in our experiments. Firstly, we trained the x-vector extractor; to train this network we used VoxCeleb data augmented with MUSAN. Then we used this network to create the training data for the denoising techniques: we created about four million noisy and clean pairs of x-vectors from VoxCeleb1 and VoxCeleb2. We also extracted the enrollment and test x-vectors from the FABIOLE speech corpus, and we added noise to our test data to create a noisy version. We used a separate set of noises because we want to make our system robust against unseen noises: we used MUSAN to perform the data augmentation to train the x-vector network, but in this step we used different noise files. The noise files that are used to create the noisy x-vectors for training are different from the noises that are used for the test, so the noises used in the test are unseen. After that we trained PLDA and did the scoring; PLDA is used as the backend scoring technique, but before scoring we apply denoising to reduce the effect of noise on our test files.
Here you can see the results; the equal error rate metric is used for the different experiments. In the first row you can see the results for different durations: for example, in the first column, for utterances shorter than two seconds we have an equal error rate of 11.59 when we don't have noise, and for utterances longer than twelve seconds we have 0.8. In the second row we can see the impact of noise: for example, for short utterances the equal error rate increases from eleven to fifteen, and for utterances longer than twelve seconds it increases from 0.8 to 5.1. These results show that it is important to do denoising before scoring: the system is not robust against unseen noises, so our assumption is true and using a denoising component before scoring is very important.
Here you can see the results obtained by the statistical x-MAP technique: for utterances longer than twelve seconds, the equal error rate is reduced from 5.1 to 2.6. The next row shows the results obtained after applying the denoising autoencoder to the x-vectors. In the next row we see the results obtained by the combination of the denoising autoencoder and x-MAP. In the last row you see the results obtained with the Gaussian loss function, the new loss function that we used in our experiments to train the denoising autoencoder and to impose that the denoised x-vectors belong to a Gaussian distribution.
Here you see the results for the stacked denoising autoencoder: first, the results when we use just two blocks, the first and the second block. As you can see, in both cases the results are better than the previous techniques for utterances between eight and ten seconds, between ten and twelve seconds, and longer than twelve seconds. In the last row you see the results for the case where we use the stacked denoising autoencoder with exactly the same architecture that is shown in this slide. In this case, in almost all cases, we have better results than the previous techniques.
In our paper we showed that it is important to use data augmentation and deep learning techniques to achieve a noise-robust speaker recognition system, but when we are in the x-vector space we can obtain better results if we use denoising techniques. We showed that simple statistical methods like i-MAP that were used in the i-vector space can be used for x-vector denoising as well. After that, we showed that leveraging the advantages of the statistical technique and the denoising autoencoder together can give better results. Finally, we introduced a new technique called the stacked denoising autoencoder that tries to find information about the noise and use this information in the deeper blocks of the stacked denoising autoencoder. With this technique, in almost all cases we achieve better results than the statistical technique i-MAP or conventional denoising autoencoders. Thanks for your attention.