In this speech I am going to present our paper on denoising autoencoders in the i-vector space for speaker recognition.
Let me start from the motivation and goals of our work. Then I would like to go into the details of the systems under consideration; in particular, I will focus on the front-end, and a few words will be said about the back-end and scoring.
The next section is dedicated to improvements of the denoising autoencoder system: dropout regularization, which we tried to apply to this technique, and deep architectures will be considered in this section. Next, the denoising autoencoder in the domain mismatch scenario will be presented, and finally I will conclude my presentation.
OK, let me start from our motivation and goals. Last year we published our work about the application of a denoising autoencoder to the speaker verification task, and this DAE-based system showed some improvements compared to the commonly used baseline system, I mean PLDA on the raw i-vectors. This motivated us to carry out a more detailed investigation.
Our goals were to study the proposed solution in the i-vector space, to analyse different strategies of initialisation and training of the PLDA back-end parameters, to investigate and explore different deep architectures, and to investigate the DAE-based system in domain mismatch conditions.
Now to the dataset and experimental setup we used in our work. As you can see, as training data we used telephone channel recordings from the NIST SRE corpora. For evaluation we used the NIST SRE 2010 protocol, condition five extended. Our results are presented in terms of equal error rate and minimum detection cost function.
And now to our front-end and i-vector extractor. As you can see, we used MFCCs with their first and second derivatives as acoustic features. The frame alignment was based on DNN posteriors with an eleven-frame context window; we used about two thousand one hundred tied triphone (senone) states, including twenty non-speech states. Instead of using hard VAD decisions, we used soft decisions based on the DNN outputs; you can see the corresponding formula on the slide. We also applied cepstral mean and variance normalization in this way, in the statistics space. As you can see, only the triphone states corresponding to speech are used to calculate the sufficient statistics. Finally, four-hundred-dimensional i-vectors were extracted for our first experiments.
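To make the soft-decision idea concrete, here is a minimal sketch (my own illustration, not the authors' code; the variable names and the exact weighting scheme are assumptions) of speech-posterior-weighted sufficient statistics with mean and variance normalization done in the statistics space:

```python
# Hypothetical sketch of soft-decision statistics extraction.
import numpy as np

def weighted_stats(feats, post, speech_cols):
    """feats: (T, D) MFCC+deltas, post: (T, C) DNN senone posteriors,
    speech_cols: indices of the speech senones (non-speech states are dropped)."""
    post = post[:, speech_cols]                     # keep only speech states
    frame_speech = post.sum(axis=1, keepdims=True)  # soft VAD weight per frame

    # Mean/variance normalization "in the statistics space":
    # speech-posterior-weighted statistics instead of a hard VAD mask.
    mu = (frame_speech * feats).sum(0) / frame_speech.sum()
    var = (frame_speech * (feats - mu) ** 2).sum(0) / frame_speech.sum()
    feats = (feats - mu) / np.sqrt(var + 1e-8)

    N = post.sum(axis=0)          # zeroth-order stats per speech senone
    F = post.T @ feats            # first-order stats per speech senone
    return N, F
```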
Now a few words about the DAE system and its training procedure. For the denoising transform we used generative RBM pre-training with the contrastive divergence algorithm. To train the denoising transform we used speaker- and session-dependent i-vectors together with the mean of all i-vectors of the same speaker, and we modelled the joint distribution of these i-vectors. After training, we unfold the RBM and fine-tune it to obtain a denoising autoencoder.
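As a rough illustration of the denoising objective (my own sketch, not the authors' implementation; the RBM pre-training step is omitted and the layer sizes are assumptions), a one-hidden-layer autoencoder can be trained to map each session i-vector to the mean i-vector of its speaker:

```python
# Minimal denoising-autoencoder sketch: session i-vector -> speaker-mean i-vector.
import numpy as np

rng = np.random.default_rng(0)
D, H = 400, 600                       # i-vector and hidden dimensions (hypothetical)
W1 = rng.normal(0, 0.01, (D, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.01, (H, D)); b2 = np.zeros(D)

def forward(x):
    h = np.tanh(x @ W1 + b1)          # encoder
    return h, h @ W2 + b2             # decoder output = "denoised" i-vector

def train_step(x, target, lr=1e-3):
    """x: (B, D) session i-vectors, target: (B, D) speaker-mean i-vectors."""
    global W1, b1, W2, b2
    h, y = forward(x)
    err = (y - target) / len(x)       # gradient of mean squared error
    gW2 = h.T @ err; gb2 = err.sum(0)
    dh = (err @ W2.T) * (1 - h ** 2)  # backprop through tanh
    gW1 = x.T @ dh; gb1 = dh.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
```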
On the next slide I would like to present the systems under consideration. As you can see, we used a conventional PLDA-based system as our baseline, with whitening and length normalisation as pre-processing.
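For reference, one common way to implement this pre-processing step is sketched below (the ZCA-style whitening is my assumption, since the talk does not specify the exact whitening variant):

```python
# Whitening + length normalization of i-vectors.
import numpy as np

def fit_whitening(ivectors):
    mu = ivectors.mean(axis=0)
    cov = np.cov(ivectors - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-10)) @ vecs.T   # ZCA-style whitening
    return mu, W

def preprocess(ivectors, mu, W):
    x = (ivectors - mu) @ W
    return x / np.linalg.norm(x, axis=1, keepdims=True)        # length normalization
```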
The next system is based on the RBM transform, also with whitening and length normalisation as pre-processing. Finally, our next system is the DAE-based one: it is just an autoencoder which is fine-tuned from the RBM by means of the fine-tuning procedure. As for the parameter transmission, or substitution, shown on this slide, I will focus on it on my next slides; it turned out to be very important for our system.
For scoring we used the two-covariance model. It can be viewed as a simple case of PLDA, and the score can be expressed in terms of the between-speaker and within-speaker covariance matrices.
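As a reminder (my own rendering of the standard two-covariance formulation, assuming centered i-vectors), with $B$ and $W$ denoting the between-speaker and within-speaker covariance matrices, the score for an enrollment i-vector $w_1$ and a test i-vector $w_2$ is the log-likelihood ratio

$$ s(w_1, w_2) = \log \frac{\mathcal{N}\!\left(\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}; \mathbf{0}, \begin{bmatrix} B+W & B \\ B & B+W \end{bmatrix}\right)}{\mathcal{N}(w_1; \mathbf{0}, B+W)\;\mathcal{N}(w_2; \mathbf{0}, B+W)}. $$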
Now a few words about the parameter substitution. During our experiments we figured out that the best performance of the DAE-based system is achieved when we substitute the whitening and PLDA back-end parameters from the RBM system into the DAE-based system, the denoising autoencoder. It is an empirical finding, but it is very important for this system.
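The substitution can be sketched as follows (a hedged illustration with hypothetical helper names, not the authors' code): the back-end is estimated on RBM projections but then used to score DAE projections.

```python
# Back-end (whitening + PLDA) fitted on RBM outputs, applied to DAE outputs.
def build_substituted_backend(train_ivecs, labels, rbm, dae,
                              fit_whitening, preprocess, fit_plda):
    mu, W = fit_whitening(rbm(train_ivecs))          # whitening from RBM outputs
    plda = fit_plda(preprocess(rbm(train_ivecs), mu, W), labels)

    def score(enroll, test):
        e = preprocess(dae(enroll), mu, W)           # DAE projections at test time
        t = preprocess(dae(test), mu, W)
        return plda.score(e, t)
    return score
```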
Let me show you our first results with these systems on the NIST SRE 2010 protocol. As you can see, we observed a gain over the baseline system when we applied our DAE-based system with parameter replacement, both for the commonly used NIST SRE 2010 protocol and for our second corpus, the RusTelecom test set. Some information about the RusTelecom corpus is given on the slide.
For the analysis of the DAE-based system we decided to use a cluster variability criterion. It is based on the within-speaker and between-speaker covariance matrices. If you take a look at this figure, you can see that the autoencoder-based projections have much stronger cluster variability. In this case we did not apply any normalization to our RBM and DAE-based projections; by normalization I mean that no whitening was applied to the RBM and DAE outputs.
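One standard way to quantify such cluster variability (the exact form is my assumption; the talk only says the criterion compares the two covariance matrices) is the class-separability measure

$$ J = \operatorname{tr}\!\left(S_w^{-1} S_b\right), $$

where $S_w$ and $S_b$ are the within-speaker and between-speaker covariance matrices of the projected i-vectors; a larger $J$ means the speaker clusters are more compact relative to their spread.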
Additionally, we decided to use cosine scoring as an independent way to assess the properties of our projections. You can see from these results that, without whitening, the DAE-based system achieves the best performance among all the systems.
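For completeness, cosine scoring here is simply the cosine similarity between two projected i-vectors:

```python
# Cosine scoring between an enrollment and a test i-vector.
import numpy as np

def cosine_score(enroll, test):
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))
```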
By the way, we also tried a simple (non-denoising) autoencoder in speaker recognition, but it turned out to be not as good as the DAE-based system.
And now to whitening and length normalization. When we apply these parameters to the RBM- and DAE-based projections, we obtain these results, and we can see that the lines are very similar and close to each other. In this situation we applied the DAE-based whitening, the one estimated for the DAE-based system, and it turned out to be not so good for this system.
Now, on the next slide, we applied the parameter substitution: we decided to use the whitening parameters from our RBM system, and in this situation we achieve good performance of the system over the baseline, as you can see. In the figure you can also see that the discriminative properties are in this case stronger for the DAE-based projections.
To summarize all of this, I prepared a table with the combined results; among all the systems, the DAE-based system with parameter substitution, I mean the whitening, achieves the best performance.
And now to the PLDA-based scoring. In this table you can see the results we obtained in different experiments with different configurations of our system. Again, in the last line of the table you can see that a good improvement can be achieved by using parameter substitution from the RBM system. But the question of why this happens is still open for us; we did not manage to answer it.
Now I will discuss some improvements for the DAE-based system. First, we decided to apply dropout regularisation, both for the RBM training and for the fine-tuning. As you can see, dropout helps to improve the system when we used it at the RBM training stage, but unfortunately applying it at the stage of discriminative fine-tuning was not helpful for us.
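As a quick illustration of the regularisation being discussed (a generic sketch; the dropout probability of 0.5 is a common default, not a value quoted in the talk):

```python
# Inverted dropout applied to the hidden layer during training.
import numpy as np

def hidden_with_dropout(x, W1, b1, p_drop=0.5, training=True,
                        rng=np.random.default_rng()):
    h = np.tanh(x @ W1 + b1)
    if training:
        mask = rng.random(h.shape) > p_drop
        h = h * mask / (1.0 - p_drop)   # rescale to keep the expected activation
    return h
```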
As for deeper architectures, we tried two schemes. You can see the first one on this slide; it is called stacked DAEs. After training the first DAE, its output can be used as the input for the next DAE, and then we try to fine-tune them all together, jointly. But this did not help to improve the system. With the second stacking scheme we managed to obtain good results, but in this scenario we again need to substitute the whitening parameters from the RBM, that is, from the generatively pre-trained system, and we get a small improvement from that.
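A minimal sketch of the stacking idea, assuming (as mentioned later in the Q&A) that whitening and length normalization are injected between the stages; the helper names reuse the hypothetical functions sketched earlier:

```python
# Two stacked denoising autoencoders with whitening/length-norm injected in between.
def stacked_projection(ivecs, dae1, dae2, fit_whitening, preprocess):
    x1 = dae1(ivecs)
    mu, W = fit_whitening(x1)          # re-estimated between the two stages
    x1 = preprocess(x1, mu, W)
    return dae2(x1)
```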
The next question I would like to focus on is the domain mismatch problem. We investigated our DAE-based system in domain mismatch conditions. We used the Domain Adaptation Challenge dataset and setup. As back-ends we used cosine scoring, the two-covariance model, referred to here as PLDA, and a simplified PLDA with a four-hundred-dimensional speaker subspace. It should be noted that in our experiments we completely ignored the labels of the in-domain data; we used it only to estimate the whitening parameters of the systems.
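In other words (a hypothetical sketch of this setup, reusing the helpers above), the models and back-ends are trained on out-of-domain data, while only the unlabeled in-domain i-vectors are used for whitening:

```python
# Domain-mismatch pre-processing: whitening estimated on unlabeled in-domain data.
def domain_adapted_preprocess(in_domain_ivecs, eval_ivecs, fit_whitening, preprocess):
    mu, W = fit_whitening(in_domain_ivecs)   # no speaker labels are needed here
    return preprocess(eval_ivecs, mu, W)
```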
Now to the results. For the baseline system, when we use in-domain data for training, you can see the results for cosine scoring, and you can see the effect of applying the DAE-based system on the cosine scoring. But when we used out-of-domain data to train our systems, you can see the degradation, both for cosine and for PLDA scoring. And in the final rows of this table you can see the improvement when we used whitening parameters estimated on in-domain data. These are the same results but for the simplified PLDA scoring; it is just a little bit better than PLDA.
And now to the conclusions. We presented a study of the denoising autoencoder in the i-vector space. We figured out that the best performance of the DAE-based system is achieved by employing back-end parameters taken directly from the RBM output. The question is still open why the RBM transform provides better back-end parameters for this setup. Dropout helps to improve the results when applied at the RBM training stage, but it did not help when we applied it during fine-tuning. Deeper architectures in the form of stacked denoising autoencoders provide a further improvement. All our findings regarding the speaker verification system in matched conditions also hold true in the mismatched-condition case. And the last one: using whitening parameters from the target domain along with a DAE trained on the out-of-domain set allows us to avoid the significant performance gap caused by domain mismatch. That's it, thank you.
Time for questions. Michael?
On this slide, where you show the stacked denoising autoencoders, did you try more than two layers?
Yes, but in this case we need to inject whitening and length normalization between the layers; this slide shows two stages with the whitening and length normalization injection, I mean.
And when you used two stacked denoising autoencoders you improved the results, so did you try a third one? What happens when you stack more than one on top of the other, is the trend preserved?
I see. Well, we decided not to go deeper in this direction, because we found out that this result is very similar to our first one, based on only a single DAE.
Although we have probably already discussed this issue: about your open question, why copying the PLDA and length normalization parameters from the RBM rather than from the final DAE stage gives better performance. My initial guess would be overfitting during the back-propagation: since you are using the same dataset, maybe the residual covariance matrix used in PLDA becomes artificially small, in terms of its trace, let's say. So you could check the traces of the two matrices, the one you estimate from the RBM and the one you estimate after fine-tuning, and see whether the covariance matrices are artificially small; that might be a result of overfitting.
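A small check along the lines the questioner suggests could look like this (my own hypothetical helper, not anything from the paper): compute the trace of the within-class (residual) covariance on RBM projections and on fine-tuned DAE projections of the same training data, and compare the two values.

```python
# Trace of the within-class covariance of projected i-vectors.
import numpy as np

def within_class_trace(projections, labels):
    resid = np.vstack([projections[labels == s] - projections[labels == s].mean(0)
                       for s in np.unique(labels)])
    return np.trace(np.cov(resid, rowvar=False))
```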
Well, this was our assumption as well, and we tried to check it after our paper was submitted, but we figured out that this is not the reason, because we tried to split our dataset into two parts and to use separate data to train the PLDA back-end parameters. The results show that it is not the overfitting that would occur when training the system on the same data.
We also tried to explain the situation by using a Gaussianity assumption: I mean that after the DAE projection we obtain a more or less Gaussian distribution, and that could be the explanation in this case, but it seems to us that this is also not the answer.
There is time for another question.
Just a question on the first step of your system, which I think could be important: you say that you are using twenty non-speech states. I am quite amazed by this huge number; could you say something about that?
You mean the huge number of non-speech states? We used a standard Kaldi recipe from our speech recognition department; they advised us to use this configuration for the system, and we trained our DNN in this way. It provides good voice activity detection for our system. And, as I mentioned, we also used this capability to apply the soft decisions in the statistics space. I mean, we performed the cepstral mean normalization in the statistics space by excluding the non-speech states; the non-speech states are excluded from our consideration.
Let's thank the speaker again. Thank you.