Hello everyone. I am a PhD student from The Chinese University of Hong Kong. I will present our work titled "Bayesian Neural Network based x-vector System for Speaker Verification", accepted to Odyssey 2020.
This work proposes to incorporate Bayesian neural networks into automatic speaker verification systems to improve the systems' generalization ability.
In this presentation, I will first introduce ASV systems and the strategies widely used in developing these systems. This is followed by some related works on Bayesian learning approaches, such as the use of Bayesian modeling in the machine learning community. Then I will talk about our approach, including the motivation and how to apply Bayesian learning to ASV systems. Next, our experimental setup and results will be presented to verify the effectiveness of our approach. This is followed by the final conclusions.
Automatic speaker verification (ASV) systems aim at confirming whether a spoken utterance matches the speaker's identity claim. We have observed an ever-increasing use of ASV systems in our daily lives, including voice authentication in electronic devices, e-banking authentication, and so on.
There are three representative frameworks for developing ASV systems. The i-vector based systems were proposed to model the speaker and channel variations together, and they use a speaker-discriminative back end for verification. Benefiting from the powerful discriminative ability of neural networks, speaker embedding systems were proposed to extract speaker-discriminative representations from utterances, which achieve state-of-the-art performance. With the development of end-to-end techniques, many researchers also focus on constructing ASV systems in an end-to-end manner.
A challenging issue for ASV system development is the mismatch between the training and evaluation data, such as the mismatch in speaker population and the variations in channel and environmental background. The speaker populations used for training and evaluation commonly have no overlap, especially in practical applications. Working well under this mismatch requires the speaker representations to generalize well on unseen speaker data. The channel and environmental variation mismatches also commonly exist in practical applications, where the training and evaluation data are collected from different types of recorders and environments. These mismatches likewise place a high demand on the model's generalization ability.
To address this issue, previous efforts have applied adversarial training to alleviate the channel and environmental variations in the representations. While these approaches achieve improvements by alleviating the effects of channel and environmental mismatches, they rarely consider the speaker population mismatch, which could also lead to system performance degradation.
In this work, we focus on the x-vector system and try to incorporate Bayesian neural networks to improve the system's generalization ability across all three of the aforementioned mismatches.
Bayesian neural networks have been shown to be effective in improving the generalization ability of discriminatively trained DNN systems. In the machine learning community, Barber et al. proposed an ensemble learning based variational inference framework for Bayesian neural networks, and Blundell et al. proposed a novel backpropagation-compatible algorithm for learning a probability distribution over the network parameters.
In the speech area, Bayesian neural networks have been applied to speech recognition, Bayesian learning of hidden unit contributions has been adopted for speaker adaptation, and Bayesian learning has also been applied to language modeling.
Now I will introduce our approach, and the motivation will be presented first. Let's talk about the traditional x-vector system, whose parameters are estimated under the maximum likelihood strategy. This point estimation, as shown in Figure 1, tends to overfit when given limited training data, or when there exists a mismatch between the training and evaluation data. In the case of a mismatched speaker population, the overfitted model parameters may result in speaker representations whose distribution corresponds tightly to the seen speaker identities. However, such representations cannot generalize well on unseen speaker data.
The cases of channel and environmental mismatch are similar. For instance, under channel mismatch, the overfitted model parameters may partially rely on channel information to classify speakers, due to the use of different recorders for different speakers in the training data. However, when generalizing to channel-mismatched evaluation data, the original channel-speaker correspondence is broken, and the trained reliance on channel information leads to misclassification.
Moreover, it has been shown that the speaker representations extracted from x-vector systems still contain speaker-unrelated information, such as channel, transcription, and utterance length. Such information could affect the verification performance, especially on mismatched evaluation data.
Bayesian neural networks, on the contrary, model the weights with a probability distribution, i.e., the weight posterior distribution, as shown in Figure 2. These probabilistic parameters could generalize better on unseen data.
To address the speaker population mismatch issue, Bayesian learning could help smooth the distributions of speaker representations for better generalization on unseen speaker data. To alleviate the mismatches caused by channel and environmental variations, the probabilistic parameter modeling could reduce the risk of overfitting on channel information, by smoothing the model parameters to consider other possible values that do not rely on channel information for speaker classification.
Motivated by this, our work incorporates Bayesian neural networks into the x-vector system by replacing its layers with Bayesian ones, to improve the system's generalization ability.
The x-vector system consists of two parts: a front end extracting the utterance-level speaker embedding, and a verification scoring back end. The front end compresses speech utterances of different durations into fixed-dimension speaker-related embeddings. Based on these embeddings, different scoring schemes can be used to predict whether two utterances belong to the same speaker or not. In this work, we focus on the front end, and choose probabilistic linear discriminant analysis (PLDA) and cosine similarity as the two given back ends for performance evaluation.
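To make the cosine back end concrete, here is a minimal illustrative sketch (not the system's actual code; the embedding dimension and decision threshold are placeholders):

```python
import torch
import torch.nn.functional as F

def cosine_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two speaker embeddings; higher means more likely the same speaker."""
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()

# A trial is accepted when the score exceeds a threshold tuned on development data.
score = cosine_score(torch.randn(512), torch.randn(512))
same_speaker = score > 0.5  # placeholder threshold
```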
The x-vector extractor is a neural network trained on the speaker discrimination task. As shown in Figure 3, it consists of frame-level and utterance-level structures. At the frame level, several time-delay neural network (TDNN) layers are used to capture the temporal characteristics of the acoustic features. Then, a statistics pooling layer aggregates all the frame-level outputs from the last TDNN layer, and computes their mean and standard deviation. The computed statistics are propagated through several embedding layers and finally a softmax output layer. The cross-entropy loss is used as the objective in the training stage. In the testing stage, given the acoustic features of an utterance, the output of the embedding layer is extracted as the x-vector.
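To make the front-end structure concrete, the following is a minimal sketch of a TDNN x-vector extractor with statistics pooling; the layer widths and feature dimension are illustrative defaults, not the exact configuration reported in Table 1:

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    """Minimal TDNN x-vector extractor: frame-level TDNNs -> stats pooling -> embedding."""
    def __init__(self, feat_dim=30, num_speakers=1000, emb_dim=512):
        super().__init__()
        # Frame level: dilated Conv1d acts as a time-delay (TDNN) layer
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(2 * 1500, emb_dim)  # stats pooling doubles the dim
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(emb_dim, num_speakers))

    def forward(self, feats):                  # feats: (batch, feat_dim, num_frames)
        h = self.frame_layers(feats)
        # Statistics pooling: concatenate per-utterance mean and std over frames
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.embedding(stats)           # extracted as the x-vector at test time
        return self.classifier(xvec), xvec

model = XVectorNet()
logits, xvec = model(torch.randn(4, 30, 300))  # 4 utterances, 300 frames each
```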
Bayesian neural networks learn the parameter posterior distribution p(w|D) to model the weight uncertainty, which naturally enables an infinite number of possible model parameters to fit the training data D. This uncertainty modeling smooths the model parameters and makes the model generalize well on unseen data. During the testing stage, the model computes the output y given the input x by taking the expectation over the weight posterior distribution p(w|D), as shown in Equation 1. However, directly evaluating this integral with the exact posterior is intractable for neural networks of any reasonable size, as the number of possible weight values could be infinite. So the variational approximation is commonly adopted to estimate the posterior distribution.
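For reference, Equation 1 and the sampled form it reduces to once an approximation q(w) replaces the intractable posterior can be reconstructed as follows (the standard Bayesian formulation; the numbering follows the talk):

```latex
% Eq. (1): the prediction marginalizes over the weight posterior
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w
% In practice, the integral is approximated by sampling N sets of weights
p(y \mid x, \mathcal{D}) \approx \frac{1}{N} \sum_{n=1}^{N} p(y \mid x, w^{(n)}),
\qquad w^{(n)} \sim q(w)
```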
The variational approximation aims to use a distribution q(w), with a set of parameters θ, to approximate the posterior distribution p(w|D). This is achieved by minimizing the Kullback-Leibler (KL) divergence between these two distributions, as shown in Equation 2.
From Equation 2 to Equation 4, we expand the posterior with Bayes' rule and drop the constant term log p(D), which does not affect the minimization. Equation 4 demonstrates that this objective can be decomposed into two parts. One is the KL divergence between the approximation distribution q(w) and the prior distribution p(w). The other is the expectation of the log-likelihood of the training data over the approximation distribution q(w).
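The derivation sketched in this step can be reconstructed as follows (standard variational inference; the equation numbers follow the talk):

```latex
% Eq. (2): fit the variational distribution to the true posterior
\theta^{*} = \arg\min_{\theta}\; \mathrm{KL}\!\left( q_{\theta}(w) \,\|\, p(w \mid \mathcal{D}) \right)
% Apply Bayes' rule p(w|D) = p(D|w)p(w)/p(D) and drop the constant log p(D):
= \arg\min_{\theta}\; \mathbb{E}_{q_{\theta}(w)}\!\left[ \log \frac{q_{\theta}(w)}{p(w)\, p(\mathcal{D} \mid w)} \right]
% Eq. (4): the objective decomposes into a KL term and an expected log-likelihood term
\mathcal{L}(\theta) = \mathrm{KL}\!\left( q_{\theta}(w) \,\|\, p(w) \right)
  - \mathbb{E}_{q_{\theta}(w)}\!\left[ \log p(\mathcal{D} \mid w) \right]
```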
This objective is used as the loss function to be minimized in the training process.
As commonly adopted, we assume that both the variational approximation q(w) and the prior distribution p(w) follow diagonal Gaussian distributions, whose parameters consist of the means μ_q and μ_p and the standard deviations σ_q and σ_p, respectively. The two parts of the loss function can then be formulated as Equations 7 and 8, respectively. Because the expectation in Equation 8 is intractable, we apply Monte Carlo sampling to approximate the integration process.
Finally, combining Equations 7 and 8, we have the final loss function as Equation 9, which can be directly optimized with backpropagation during training.
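As a concrete illustration of Equations 7 to 9, here is a minimal sketch (not the authors' implementation) of a Bayesian linear layer with a diagonal Gaussian posterior, trained with the reparameterization trick, a closed-form Gaussian KL term, and one Monte Carlo weight sample per forward pass:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with a diagonal Gaussian posterior q(w) = N(mu, sigma^2)."""
    def __init__(self, in_dim, out_dim, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(out_dim, in_dim).normal_(0, 0.1))
        # sigma is parameterized through softplus(rho) to stay positive
        self.rho = nn.Parameter(torch.full((out_dim, in_dim), -5.0))
        self.prior_std = prior_std

    def forward(self, x):
        sigma = F.softplus(self.rho)
        # Reparameterization trick: w = mu + sigma * eps, eps ~ N(0, I)
        w = self.mu + sigma * torch.randn_like(sigma)
        return F.linear(x, w)

    def kl(self):
        # Closed-form KL( N(mu, sigma^2) || N(0, prior_std^2) ), summed over weights
        sigma = F.softplus(self.rho)
        var_ratio = (sigma / self.prior_std) ** 2
        return 0.5 * (var_ratio + (self.mu / self.prior_std) ** 2
                      - 1.0 - torch.log(var_ratio)).sum()

# Training objective (analogous to Eq. 9): cross-entropy (a one-sample Monte Carlo
# estimate of the negative expected log-likelihood) plus the KL term, scaled per batch.
layer = BayesianLinear(512, 512)
logits = layer(torch.randn(8, 512))           # one weight sample per mini-batch
targets = torch.randint(0, 512, (8,))
loss = F.cross_entropy(logits, targets) + layer.kl() / 1000  # 1000 ~ #batches
loss.backward()
```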
In order to evaluate the effectiveness of Bayesian learning for speaker verification in both short and long utterance conditions, we performed experiments on two datasets. For the short utterance condition, we consider the Voxceleb1 dataset, with in total 148,642 utterances from 1,251 speakers. We adopt the 4,874 utterances from 40 speakers for evaluation, and the remaining utterances are used for training the ASV system parameters.
For the long utterance condition, the core test of the NIST speaker recognition evaluation (SRE) 2010 is used for benchmarking our models. For the training set, we adopt the previous SRE corpora since 2004. In total, there are around sixty thousand recordings from over six thousand speakers in this set.
We evaluate the generalization benefits of Bayesian learning under different mismatch degrees. Specifically, we performed both in-domain and out-of-domain evaluations: training and evaluation performed on the same dataset forms the in-domain evaluation, while that executed on different datasets is the out-of-domain evaluation.
Mel-frequency cepstral coefficients (MFCCs) are adopted as the acoustic features in our experiments. The extracted MFCCs undergo mean normalization, and then voice activity detection filters out all non-speech frames.
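A hypothetical sketch of this front-end processing, with a simple energy-based heuristic standing in for the actual voice activity detector (the file name, MFCC dimension, and threshold are placeholders):

```python
import torchaudio

# MFCC extraction followed by mean normalization and a simple energy-based VAD.
mfcc_fn = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=30)

waveform, sr = torchaudio.load("utterance.wav")   # placeholder file name
feats = mfcc_fn(waveform).squeeze(0)              # shape: (n_mfcc, num_frames)
feats = feats - feats.mean(dim=1, keepdim=True)   # cepstral mean normalization

# Keep frames whose zeroth coefficient (a log-energy proxy) is above average.
energy = feats[0]
feats = feats[:, energy > energy.mean()]          # drop non-speech frames
```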
The x-vector structure configuration is shown in Table 1. Linear discriminant analysis (LDA) is applied to reduce the x-vector dimension. To make a fair comparison, the Bayesian x-vector system is configured with the same architecture as the baseline system, except that the first TDNN layer is replaced by a Bayesian layer with the same number of units. Stochastic gradient descent is adopted as the optimizer.
The evaluation metrics adopted in this work are the commonly used equal error rate (EER) and minimum detection cost function (minDCF).
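For reference, a minimal sketch of how EER can be computed from trial scores (using scikit-learn's ROC utilities; minDCF additionally weights the miss and false-alarm rates by their costs and the target prior):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """EER: the operating point where false-alarm rate equals miss rate."""
    far, tpr, _ = roc_curve(labels, scores)   # far = false-alarm rate
    frr = 1.0 - tpr                           # frr = miss rate
    idx = np.nanargmin(np.abs(far - frr))     # closest operating point
    return (far[idx] + frr[idx]) / 2.0

scores = np.array([0.9, 0.8, 0.3, 0.2])       # higher = more likely same speaker
labels = np.array([1, 1, 0, 0])               # 1 = target (same-speaker) trial
print(compute_eer(scores, labels))            # 0.0 for perfectly separated scores
```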
Here are the in-domain evaluation results. We observe that the EERs consistently decrease after incorporating Bayesian learning on both datasets. On each dataset, we consider the average relative EER decrease across the cosine and PLDA back ends. On the Voxceleb1 dataset, the average relative EER decrease from the Bayesian x-vector system is around 2.6%, and the fusion system goes further to an average relative EER decrease of around 7.2%. On the NIST SRE10 dataset, the average relative EER decrease is around 1.3% for the Bayesian x-vector system and around 3.8% for the fusion system.
We also observe consistent improvement in the detection cost function performance after applying Bayesian learning. These observations verify the improved generalization ability of the system brought by Bayesian neural networks.
Figure 4 illustrates the detailed detection error tradeoff (DET) curves of all systems with the cosine back end, benchmarked on the Voxceleb1 dataset. It shows that the proposed Bayesian system outperforms the baseline for all operating points, and the fusion system shows further improvements due to the complementary advantages of the baseline and Bayesian systems.
Here are the out-of-domain evaluation results. The model trained on Voxceleb1 was evaluated on NIST SRE10, and vice versa. System performance degrades significantly due to the larger mismatch between the training and evaluation data. From the table, we observe that systems could benefit more from the generalization ability of Bayesian learning in this setting.
We also consider the average relative EER decrease across the cosine and PLDA back ends for performance evaluation. In the experiments evaluated on the NIST SRE10 dataset, the average relative EER decreases are around 4.2% and 6.0% over the baseline for the Bayesian x-vector system and the fusion system, respectively. For the experiments evaluated on the Voxceleb1 dataset, the average relative EER decrease is 3.07% for the Bayesian x-vector system, and the fusion system goes further, with an average relative EER decrease of 6.41%. These larger relative EER decreases, compared to those in the in-domain evaluations, suggest that Bayesian learning could be more beneficial when a larger mismatch exists between the training and evaluation data. The last column in the table shows the corresponding detection cost function performance, where we can also see consistent improvements by applying Bayesian learning and with the fusion system.
Similar to the observation in Figure 4, the detection error tradeoff curves in Figure 5 also show consistent improvements from applying Bayesian learning and from the fusion system, for all operating points.
In this work, we incorporated Bayesian neural network techniques into the x-vector system to improve the model's generalization ability. Our experimental results verify that Bayesian learning enables a consistent generalization ability improvement over the x-vector system in both short and long utterance conditions, and the fusion system achieves further improvements in the overall system score. The larger improvement observed in the out-of-domain evaluation results, compared with the in-domain ones, suggests that Bayesian learning is more beneficial when a larger mismatch exists between the training and evaluation data. Possible future research will focus on incorporating Bayesian learning to improve end-to-end speaker verification systems. Thanks for listening.