Hello. This is a presentation from the Indian Institute of Science, Bangalore, and I will be presenting our work on NPLDA, a neural PLDA backend model for speaker verification. This work was done jointly with my co-authors.
Let's look at the roadmap of this presentation. First, we look at what a speaker verification task consists of, and then at the motivation behind our work. I will then talk about the front-end model that we used, and discuss various approaches to backend modeling, before describing the proposed neural PLDA, or NPLDA, model. We then present some experiments and results before concluding the presentation.
Let's look at what a speaker verification task consists of. We are given an enrollment recording of a particular target speaker and a test segment. The objective of the speaker verification system is to identify whether the target speaker is speaking in the test segment, which is the alternative hypothesis, or is not speaking, which is the null hypothesis. As you can see here, the enrollment recording is denoted by x_e and the test recording by x_t. These are given as input to the speaker verification system, which outputs a log-likelihood ratio score. This score is used to decide whether the test segment belongs to the target speaker or to a non-target speaker.
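To make the decision rule concrete, the log-likelihood ratio can be written in the standard form below; the hypothesis notation here is generic and not taken from the slides.

$$ \mathrm{LLR}(\mathbf{x}_e, \mathbf{x}_t) \;=\; \log \frac{p(\mathbf{x}_e, \mathbf{x}_t \mid \mathcal{H}_1)}{p(\mathbf{x}_e, \mathbf{x}_t \mid \mathcal{H}_0)} $$

Here H_1 is the hypothesis that the enrollment and test segments come from the same target speaker and H_0 is the hypothesis that they come from different speakers; the score is compared against a threshold to accept or reject the target hypothesis.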
Most popular state-of-the-art systems for speaker verification consist of a neural embedding extractor; the most popular ones in the last few years have been the x-vector models. This is followed by a generative backend model such as probabilistic linear discriminant analysis, or PLDA. There are also some discriminative backend approaches, like the discriminative PLDA and SVM backends. What we propose is a neural network approach, which is discriminative as well as generative, for backend modeling in speaker recognition and speaker verification tasks.
Let's look at the front-end model that we used. As I mentioned, the most popular models in the last few years have been the x-vector extractors. We trained our x-vector extractor on the VoxCeleb corpus, which consisted of 7,323 speakers. The features were 13-dimensional MFCCs extracted from 25 millisecond frames shifted every 10 milliseconds, using a 20-channel mel-scale filterbank spanning the frequency range 20 Hz to 7600 Hz. A five-fold augmentation strategy was applied, which included augmenting the data with babble, noise, and music, to generate over 6.3 million training segments.
The architecture that we used to train the x-vector model was the extended TDNN architecture. This consists of twelve hidden layers with ReLU nonlinearities. The model is trained to discriminate among the training speakers. The first ten hidden layers operate at the frame level, while the last two layers operate at the segment level. After training, the embeddings are extracted from the 512-dimensional affine component of layer eleven, that is, the first segment-level layer after the statistics pooling. The embeddings extracted in this way are the x-vectors.
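To make the structure concrete, here is a minimal PyTorch sketch of an x-vector style extractor with frame-level TDNN layers, statistics pooling, and segment-level layers. The class name XVectorNet, the layer widths, and the kernel and dilation settings are illustrative assumptions, not the exact extended TDNN configuration used in our system.

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    """Illustrative x-vector style extractor: frame-level TDNN layers,
    statistics pooling, then segment-level layers (sizes are assumptions)."""
    def __init__(self, feat_dim=13, embed_dim=512, num_speakers=7323):
        super().__init__()
        # Frame-level layers: 1-D convolutions over time act as TDNN layers.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers after statistics pooling (mean + std over time).
        self.segment1 = nn.Linear(2 * 1500, embed_dim)   # embeddings are taken from this affine output
        self.segment2 = nn.Linear(embed_dim, embed_dim)
        self.output = nn.Linear(embed_dim, num_speakers)  # speaker classification head
        self.relu = nn.ReLU()

    def forward(self, feats):
        # feats: (batch, feat_dim, num_frames)
        h = self.frame_layers(feats)
        # Statistics pooling: mean and standard deviation over time.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.segment1(stats)          # x-vector embedding (pre-activation affine output)
        h = self.relu(self.segment2(self.relu(xvec)))
        return self.output(h), xvec
```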
Let's look at a few approaches to backend modeling. The most popular one in speaker verification systems is the generative Gaussian PLDA, or GPLDA. Once the x-vectors are extracted, a few preprocessing steps are applied to them: they are centered, that is, the mean is removed, then transformed using LDA, and finally unit length normalized.
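A minimal numpy sketch of these preprocessing steps, assuming the training-set mean mu and an LDA projection matrix lda have already been estimated (both names are placeholders):

```python
import numpy as np

def preprocess_xvector(x, mu, lda):
    """Center, LDA-transform, and unit length normalize an x-vector.
    x: (D,) raw x-vector; mu: (D,) training mean; lda: (D, d) projection."""
    x = x - mu                     # centering (mean removal)
    x = lda.T @ x                  # LDA dimensionality reduction
    return x / np.linalg.norm(x)   # unit length normalization
```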
The PLDA model on this processed x-vector for a particular recording is given in Equation 1, where eta_r is the x-vector of the particular recording, omega describes a latent speaker factor with a Gaussian prior, Phi characterizes the speaker subspace matrix, and epsilon_r is the Gaussian residual.
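In the notation just described, Equation 1 is the standard PLDA generative model:

$$ \boldsymbol{\eta}_r \;=\; \boldsymbol{\Phi}\,\boldsymbol{\omega} \;+\; \boldsymbol{\epsilon}_r, \qquad \boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0},\, \mathbf{I}), \qquad \boldsymbol{\epsilon}_r \sim \mathcal{N}(\mathbf{0},\, \boldsymbol{\Sigma}), $$

where Sigma is the covariance of the residual; the prior covariances shown here follow the usual PLDA convention and may be parameterized slightly differently on the slide.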
Now, for scoring, a pair of these x-vectors, one from the enrollment recording, denoted by eta_e, and one from the test recording, denoted by eta_t, is used with the trained PLDA model in order to compute the log-likelihood ratio score given in Equation 2. Equation 2 is derived from Equation 1, and P and Q are matrices derived from the PLDA model parameters.
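Equation 2, as described, has the standard GPLDA scoring form, quadratic in the enrollment and test x-vectors; the matrices P and Q and the constant are obtained in closed form from Phi and the residual covariance (the exact expressions are not repeated here):

$$ s(\boldsymbol{\eta}_e, \boldsymbol{\eta}_t) \;=\; \boldsymbol{\eta}_e^{\top} \mathbf{Q}\, \boldsymbol{\eta}_e \;+\; \boldsymbol{\eta}_t^{\top} \mathbf{Q}\, \boldsymbol{\eta}_t \;+\; 2\, \boldsymbol{\eta}_e^{\top} \mathbf{P}\, \boldsymbol{\eta}_t \;+\; \text{const}. $$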
Two other approaches to backend modeling are the discriminative PLDA and the pairwise Gaussian backend. The discriminative PLDA, or DPLDA, uses an expanded vector representation of the enrollment and test x-vector pair. This expansion is computed using a quadratic kernel, which is given in Equation 3. The final DPLDA log-likelihood ratio score is computed as the dot product of a weight vector and this expanded vector, phi of eta_e and eta_t.
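As a point of reference, one common choice for this quadratic expansion in the discriminative PLDA literature is the form below; the exact variant in Equation 3 may differ, so treat this only as a sketch:

$$ s = \mathbf{w}^{\top}\, \varphi(\boldsymbol{\eta}_e, \boldsymbol{\eta}_t), \qquad \varphi(\boldsymbol{\eta}_e, \boldsymbol{\eta}_t) = \begin{bmatrix} \operatorname{vec}\!\left(\boldsymbol{\eta}_e \boldsymbol{\eta}_t^{\top} + \boldsymbol{\eta}_t \boldsymbol{\eta}_e^{\top}\right)\\ \operatorname{vec}\!\left(\boldsymbol{\eta}_e \boldsymbol{\eta}_e^{\top} + \boldsymbol{\eta}_t \boldsymbol{\eta}_t^{\top}\right)\\ \boldsymbol{\eta}_e + \boldsymbol{\eta}_t\\ 1 \end{bmatrix}. $$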
The pairwise Gaussian backend models the pairs of enrollment and test x-vectors using Gaussian distributions, one for target trials and one for non-target trials. These parameters are estimated by computing the sample means and covariance matrices of the target and non-target trials in the training data. Along with the NPLDA model that we propose, we report results on the generative Gaussian PLDA, the DPLDA, and the pairwise Gaussian backend.
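A minimal scipy sketch of the pairwise Gaussian backend just described, assuming arrays of concatenated enrollment and test x-vectors for the target and non-target training trials (all names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_pairwise_gaussian(pairs_tar, pairs_non):
    """pairs_*: (N, 2D) arrays of concatenated [enroll; test] x-vectors.
    Fits one Gaussian per class from sample means and covariances."""
    g_tar = multivariate_normal(pairs_tar.mean(0), np.cov(pairs_tar, rowvar=False),
                                allow_singular=True)
    g_non = multivariate_normal(pairs_non.mean(0), np.cov(pairs_non, rowvar=False),
                                allow_singular=True)
    return g_tar, g_non

def score_trial(x_enroll, x_test, g_tar, g_non):
    """Log-likelihood ratio of the concatenated pair under the two Gaussians."""
    pair = np.concatenate([x_enroll, x_test])
    return g_tar.logpdf(pair) - g_non.logpdf(pair)
```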
Now let's look at the proposed neural PLDA backend architecture. What we have here is a pairwise, Siamese-style discriminative network. As you can see, the green portion of the network corresponds to the enrollment embeddings, and the blue portion of the network corresponds to the test embeddings. We cast the preprocessing steps of the generative approach as layers in the neural network: the LDA as the first affine layer, unit length normalization as a nonlinear activation, and then the PLDA centering and diagonalization as another affine transformation. The final PLDA pairwise scoring, which is given in Equation 2, is implemented as a quadratic layer.
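A minimal PyTorch sketch of this pairwise architecture; the class name NeuralPLDA and the dimensions are illustrative assumptions, not the exact configuration we used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralPLDA(nn.Module):
    """Illustrative pairwise (Siamese-style) backend: an LDA affine layer,
    unit length normalization, a centering/diagonalizing affine layer,
    and a quadratic scoring layer mirroring the GPLDA LLR of Equation 2."""
    def __init__(self, xvec_dim=512, lda_dim=170):
        super().__init__()
        self.lda = nn.Linear(xvec_dim, lda_dim)    # LDA as the first affine layer
        self.diag = nn.Linear(lda_dim, lda_dim)    # PLDA centering + diagonalization
        self.P = nn.Parameter(torch.eye(lda_dim))  # cross term of the quadratic layer
        self.Q = nn.Parameter(torch.eye(lda_dim))  # self term of the quadratic layer
        self.const = nn.Parameter(torch.zeros(1))

    def transform(self, x):
        # Both branches share these weights, which makes the network Siamese-style.
        x = self.lda(x)
        x = F.normalize(x, p=2, dim=-1)            # unit length norm as the nonlinearity
        return self.diag(x)

    def forward(self, x_enroll, x_test):
        e = self.transform(x_enroll)
        t = self.transform(x_test)
        # Quadratic scoring: eta_e' Q eta_e + eta_t' Q eta_t + 2 eta_e' P eta_t + const
        return ((e * (e @ self.Q)).sum(-1)
                + (t * (t @ self.Q)).sum(-1)
                + 2 * (e * (t @ self.P)).sum(-1)
                + self.const)
```

A natural initialization, given that these layers mirror the generative pipeline, is to start from the trained GPLDA preprocessing and scoring parameters before fine-tuning with the verification loss described next; that initialization code is omitted from this sketch.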
The parameters of this model are optimized using an approximation of the minimum detection cost function, which is known as the minDCF or C_min. As the model is optimized to minimize the detection cost function, we report results on the minDCF metric along with the EER metric.
The normalized detection cost function, or DCF, is defined as C_Norm of beta and theta, which is equal to P_Miss of theta plus beta times P_FA of theta, where beta is an application-dependent weight, and P_Miss and P_FA are the probabilities of miss and false alarm respectively. A miss is when the model predicts a target trial to be a non-target one, that is, the model believes that the enrollment and test come from different speakers, whereas a false alarm is when a non-target trial is wrongly predicted as a target one. P_Miss and P_FA are computed by applying a detection threshold theta to the log-likelihood ratios; how P_Miss and P_FA are computed is given in Equation 5.
Here, s_i is the score, or the log-likelihood ratio, output by the model for trial i; t_i is the ground truth variable for trial i, which is equal to zero if trial i is a target trial and equal to one if it is a non-target trial; and 1(.) is the indicator function.
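Written out with the quantities just defined (t_i = 0 for target trials and t_i = 1 for non-target trials), Equations 4 and 5 take the form:

$$ C_{\text{Norm}}(\beta, \theta) \;=\; P_{\text{Miss}}(\theta) \;+\; \beta\, P_{\text{FA}}(\theta), $$
$$ P_{\text{Miss}}(\theta) = \frac{\sum_i (1 - t_i)\, \mathbb{1}(s_i < \theta)}{\sum_i (1 - t_i)}, \qquad P_{\text{FA}}(\theta) = \frac{\sum_i t_i\, \mathbb{1}(s_i \geq \theta)}{\sum_i t_i}. $$

The indicator counts target trials that fall below the threshold as misses, and non-target trials at or above it as false alarms.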
The normalized detection cost function is not a smooth function of the parameters, due to the discontinuity introduced by the indicator function, and hence it cannot be used directly as an objective function in a neural network.
What we propose to work around this is a differentiable approximation of the normalized detection cost, obtained by approximating the indicator function with a sigmoid function. This is given in Equation 6. Here, the approximations of the normalized detection cost are given by P_Miss-soft and P_FA-soft, the soft detection costs; t_i is the ground truth for trial i, s_i is the system output score or the log-likelihood ratio, and sigma denotes the sigmoid function. By choosing a large enough value for the warping factor alpha, the approximation can be made arbitrarily close to the actual detection cost function for a wide range of thresholds.
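A minimal PyTorch sketch of this soft detection cost as a training loss; the function name soft_detection_cost, the default warping factor alpha, and the choice of beta (here corresponding to a target prior of 0.01) are illustrative assumptions:

```python
import torch

def soft_detection_cost(scores, labels, thresholds, beta=99.0, alpha=15.0):
    """Differentiable approximation of the normalized detection cost.
    scores: (N,) log-likelihood ratios; labels: (N,) with 0 = target, 1 = non-target.
    thresholds: 1-D tensor of detection thresholds theta; beta weights false alarms."""
    scores = scores.unsqueeze(1)             # (N, 1)
    labels = labels.unsqueeze(1).float()     # (N, 1)
    theta = thresholds.unsqueeze(0)          # (1, T)
    # The sigmoid replaces the indicator function; large alpha sharpens the step.
    p_miss = ((1 - labels) * torch.sigmoid(alpha * (theta - scores))).sum(0) / (1 - labels).sum()
    p_fa = (labels * torch.sigmoid(alpha * (scores - theta))).sum(0) / labels.sum()
    c_norm = p_miss + beta * p_fa            # soft C_Norm at each threshold
    return c_norm.mean()                     # average over the threshold set
```

The set of thresholds can either be fixed to cover the operating points of interest or treated as additional learnable parameters; either choice keeps the loss differentiable.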
Before we dive into the results, let's look at the datasets used in training and testing the backend model. We sampled about 6.6 million trials from the clean VoxCeleb set, and additional trials from the augmented VoxCeleb set. For testing, we report results on three datasets: the Speakers in the Wild (SITW) eval core test condition, which consists of around 800,000 trials; the VOiCES development set, which consists of about 4 million trials; and the VOiCES evaluation set.
Let us now look at the results. The table shows results on the SITW eval core, VOiCES development, and VOiCES evaluation sets for various models, like the Gaussian PLDA backend, the DPLDA approach, and the pairwise Gaussian backend, along with the proposed model trained with the soft detection cost. We also ran our experiments with binary cross-entropy as the loss, which is denoted in the table as BCE loss. We observe relative improvements in terms of minDCF of around 31 percent, 20 percent, and 11 percent for SITW, VOiCES development, and VOiCES evaluation respectively. The best score for SITW eval core is an EER of 2.05 percent with a minDCF of 0.2. For the VOiCES development set, the best numbers are 1.91 percent EER and a minDCF of about 0.2. For the VOiCES evaluation set, we get 6.01 percent as the best EER and 0.49 as the minDCF. The improvements observed with the neural backend are consistent with data augmentation as well as for the EER metric. We also note that the soft detection cost for the NPLDA performs even better than the binary cross-entropy, or BCE, loss.
To summarize, the proposed model is a step in exploring discriminative neural network models for the task of speaker verification. Using a single elegant backend model that is targeted to optimize the speaker verification loss, the NPLDA model uses the x-vector embeddings directly to generate the speaker verification score. This model shows significant performance gains on the SITW and VOiCES datasets.
We have also observed considerable improvements on other datasets, like the NIST SRE datasets. We have further extended this to an end-to-end model, where the model is optimized not just from the x-vector embeddings but directly from acoustic features like MFCCs.
This work was accepted at Interspeech 2020. These are some of the references for this work. Thank you.