Hi, this is a presentation from the LEAP lab, Indian Institute of Science, Bangalore. I will be presenting our paper, the LEAP system for the SRE19 CTS challenge: improvements and error analysis. Let's begin.
Here is the outline of this presentation. I will introduce a brief overview of how speaker recognition systems work, discuss the SRE19 challenge performance metrics, talk about the front-end and back-end modeling in our systems, discuss the results of these systems, and then present some analysis of post-evaluation results before concluding the presentation.
This is a brief overview of how speaker verification, or speaker recognition, systems work. In the first phase, we take the raw speech and extract features like MFCCs from it. These features are then processed with some voice activity detection and normalization. These features are then given as input to train the parameters of a deep neural network model. The most popular neural network based embedding extractors in the last few years have been the x-vector models.
Once the extractor training phase is done, we enter the PLDA training phase. The extracted x-vectors have some processing done on them, like centering and LDA, and they are then unit-length normalized before training the PLDA model. Most popular state-of-the-art systems use a generative Gaussian PLDA model for the back-end system.
In the verification phase, we have a trial which consists of an enrollment utterance and a test utterance, and the objective of the speaker recognition system is to decide whether the test utterance belongs to the target speaker or a non-target speaker. Thus, once we extract x-vector embeddings for the enrollment and test utterances, we compute log-likelihood ratio scores using the PLDA back-end model, and then, using these scores, we decide if the trial is a target one or a non-target one.
Let's look at the SRE19 performance metrics. The NIST SRE challenge in 2019 consisted of two tracks: the first one was speaker detection in conversational telephone speech, or CTS, and the second was multimedia speaker recognition. Our work was on the first track, the CTS challenge.
The normalized detection cost function, or DCF, is defined as in equation 1:

C_Norm(β, θ) = P_Miss(θ) + β × P_FA(θ)

Here, P_Miss and P_FA are the probabilities of miss and false alarm respectively. A miss is when the speaker recognition system declares a target trial as a non-target one, that is, the system wrongly decides that the enrollment and test utterances are not of the same speaker. A false alarm is when a non-target trial is erroneously declared as a target trial. P_Miss and P_FA are computed by applying a detection threshold of θ to the log-likelihood ratios.
The primary cost metric for the NIST SRE19 conversational telephone speech track is given by equation 2:

C_Primary = 0.5 × C_Norm(β1, θ) + 0.5 × C_Norm(β2, θ)

where β1 = 99 and β2 = 199. The minimum detection cost, known as minDCF or C_Min, is computed using the detection thresholds that minimize the detection cost, as in equation 3, which minimizes equation 2 over the thresholds θ1 and θ2. The equal error rate, or EER, is the value of P_FA and P_Miss computed at the threshold where P_FA and P_Miss are equal. We report the results in terms of EER, C_Min, and C_Primary for all of our systems. The SRE19 evaluation set consisted of over two and a half million trials from 14,561 segments.
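The metrics just described can be sketched in a few lines of numpy. This is a minimal illustration, not the official NIST scoring tool; sweeping the observed score values as candidate thresholds is an assumption for simplicity.

```python
import numpy as np

def detection_metrics(target_scores, nontarget_scores, betas=(99.0, 199.0)):
    """Compute EER, per-beta minDCF, and their average from raw trial scores."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # P_miss(t): fraction of target trials scored below threshold t
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    # P_fa(t): fraction of non-target trials scored at or above threshold t
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # EER: operating point where P_miss and P_fa are (approximately) equal
    i = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[i] + p_fa[i]) / 2
    # minDCF per beta: C_norm(beta, t) = P_miss(t) + beta * P_fa(t), minimized over t
    min_dcfs = [np.min(p_miss + b * p_fa) for b in betas]
    return eer, min_dcfs, sum(min_dcfs) / len(min_dcfs)
```

Note that each β gets its own minimizing threshold, matching the two operating points β1 = 99 and β2 = 199 averaged in the primary cost.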
Let's look at the front-end modeling in our systems. We trained three x-vector models with different subsets of the training data, which are described in the next slide. We used the extended time-delay neural network, or E-TDNN, architecture. The E-TDNN architecture consisted of twelve hidden layers with ReLU nonlinearities. The model is trained to discriminate among the speakers in the training set. The first ten hidden layers operate at the frame level, while the last two operate at the segment level. There is a 1500-dimensional statistics pooling layer between the frame-level and segment-level layers; it computes the mean and standard deviation. After training, embeddings are extracted from the 512-dimensional affine component of the eleventh layer, which is the first segment-level layer. These embeddings are the x-vectors we use.
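The statistics pooling step between the frame-level and segment-level layers can be sketched as follows; the frame dimension of 750 (so that mean plus standard deviation gives 1500) is an assumption for illustration.

```python
import numpy as np

def stats_pooling(frame_features):
    """Statistics pooling over a variable-length sequence of frame activations.

    frame_features: (num_frames, dim) array of frame-level activations.
    Returns a (2 * dim,) vector: per-dimension mean concatenated with
    per-dimension standard deviation, e.g. 2 x 750 = 1500 dimensions.
    """
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])
```

This is what turns a variable-length utterance into a fixed-dimensional input for the segment-level layers.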
This table describes the details of the training and development datasets used in the SRE19 evaluation systems. Xvec1, the first x-vector model, was trained only on the VoxCeleb datasets. Xvec2 used Mixer 6 and SRE datasets. Xvec3 was the full extended system, which was trained on both VoxCeleb and previous SRE datasets. The data partitions used in the back-end models of the individual systems submitted are indicated in Table 2.
Now let's look at the back-end model. Most of the popular systems in speaker verification use the generative Gaussian PLDA, or GPLDA, as the back-end modeling approach. Once the x-vectors are extracted, there is some preprocessing done on them: they are centered, that is, the mean is removed, then transformed using LDA, and then unit-length normalized. The GPLDA model on this processed x-vector of a particular recording is given by equation 4:

η_r = Φ ω + ε_r

Here, η_r is the x-vector for the particular recording r, ω is the latent speaker factor with a Gaussian prior, Φ characterizes the speaker subspace matrix, and ε_r is the Gaussian residual.
Now for the scoring: a pair of x-vectors, one from the enrollment recording, denoted η_e, and one from the test recording, denoted η_t, are used with the GPLDA model to compute the log-likelihood ratio score given in equation 5:

s(η_e, η_t) = η_e' Q η_e + η_t' Q η_t + 2 η_e' P η_t + const

Equation 5 is a quadratic function of the pair, where the matrices P and Q are derived from the GPLDA model parameters.
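A sketch of this pairwise scoring in numpy, assuming the standard two-covariance derivation of P and Q from the speaker subspace and residual covariance (the additive constant is dropped since it does not affect trial ranking):

```python
import numpy as np

def plda_pq(phi, sigma):
    """Derive the quadratic-form matrices P and Q of the pairwise PLDA
    log-likelihood ratio from the speaker subspace phi and the residual
    covariance sigma."""
    sigma_ac = phi @ phi.T                  # across-speaker covariance
    sigma_tot = sigma_ac + sigma            # total covariance
    tot_inv = np.linalg.inv(sigma_tot)
    schur = np.linalg.inv(sigma_tot - sigma_ac @ tot_inv @ sigma_ac)
    Q = tot_inv - schur
    P = tot_inv @ sigma_ac @ schur
    return P, Q

def plda_llr(eta_e, eta_t, P, Q):
    """Log-likelihood ratio score for an (enrollment, test) x-vector pair,
    up to an additive constant."""
    return eta_e @ Q @ eta_e + eta_t @ Q @ eta_t + 2 * eta_e @ P @ eta_t
```

On a toy 2-dimensional model, a same-speaker pair scores higher than a different-speaker pair, and the score is symmetric in its two arguments.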
In our submissions, along with the GPLDA approach, we propose a neural PLDA model, or NPLDA model, for back-end modeling. What we have here is a pairwise discriminative network. One portion of the network corresponds to the enrollment embeddings, and the pink portion of the network corresponds to the test embeddings. We construct the preprocessing steps of the generative GPLDA as layers in the neural network: LDA as the first affine layer, then unit-length normalization as a nonlinear activation, and then PLDA centering and diagonalization as another affine transformation. The final pairwise GPLDA scoring, which is given in equation 5 on the previous slide, is implemented as a quadratic layer. The parameters of this model are optimized using an approximation of the minimum detection cost function, or C_Min.
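The kind of smooth approximation of the detection cost used to train such a discriminative back end can be sketched by relaxing the step functions in P_Miss and P_FA with a sigmoid; the slope parameter alpha here is an assumed hyperparameter for illustration.

```python
import numpy as np

def soft_detection_cost(target_scores, nontarget_scores,
                        threshold, beta=99.0, alpha=10.0):
    """Differentiable surrogate of C_norm(beta, threshold).

    The indicator (score < threshold) in P_miss and (score >= threshold)
    in P_fa are replaced by sigmoids of slope alpha, so the cost can be
    minimized with gradient descent."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # soft P_miss: target scores falling below the threshold
    p_miss = sigmoid(alpha * (threshold - target_scores)).mean()
    # soft P_fa: non-target scores exceeding the threshold
    p_fa = sigmoid(alpha * (nontarget_scores - threshold)).mean()
    return p_miss + beta * p_fa
```

With well-separated scores, a threshold between the two score clusters gives a near-zero cost, while a badly placed threshold is heavily penalized by the beta-weighted false-alarm term.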
Now let's look at our submitted systems and the results. The table here shows details about the seven individual models that we submitted and a couple of fusion systems. The best individual system was the combination of Xvec3, which is the full x-vector extractor, with the proposed NPLDA model. For the SRE18 development set, it had a score of 5.31% EER and 0.32 C_Min, and the best scores for the SRE19 evaluation were 4.97% EER and 0.42 C_Min. The fusion systems offer some gains over the individual systems. Overall, the full x-vector system Xvec3 performs significantly better than the VoxCeleb-based Xvec1 and the Xvec2 systems, for any choice of back end. For the systems trained with the NPLDA back end, it is observed that this model fits the in-domain and out-of-domain data better than the Gaussian PLDA.
Let's talk about some post-evaluation experiments and analysis. One of the factors that we found we did not do optimally was calibration. In our previous work for SRE18, we proposed an alternative approach to calibration, where the target and non-target scores were modeled as Gaussian distributions with a shared variance. As SRE19 did not have an explicitly matched development dataset provided, the aforementioned calibration using the SRE18 development dataset, when applied to SRE19, turned out to be ineffective. This was done for all of our submitted systems, and thus the calibration was not as optimal as we wanted it to be. The graph on the right shows how the SRE18 development and SRE19 evaluation datasets are not matched, and the thresholds chosen during the evaluation were not optimal for our selected systems.
We performed some normalization techniques to improve our scores. We performed adaptive symmetric normalization, or AS-norm, using the SRE18 development and evaluation sets as the cohort, and we achieved a 24% relative improvement for Xvec1, which is the VoxCeleb x-vector system, and a 21% relative improvement for the full x-vector system Xvec3, on the SRE18 development set. We got comparatively lower but consistent improvements of about 14% on average across all of our systems on the SRE19 evaluation set.
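The AS-norm step can be sketched as follows. Each trial score is normalized by statistics from the most similar cohort utterances on both the enrollment and test side; the cohort size top_k is an assumed hyperparameter.

```python
import numpy as np

def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    """Adaptive symmetric score normalization of a single trial score.

    enroll_cohort_scores / test_cohort_scores: scores of the enrollment and
    test utterances against every cohort utterance. For each side, only the
    top_k highest cohort scores (the adaptive part) define the mean and
    standard deviation used for normalization."""
    def stats(cohort_scores):
        top = np.sort(cohort_scores)[-top_k:]   # most similar cohort subset
        return top.mean(), top.std()
    mu_e, sd_e = stats(enroll_cohort_scores)
    mu_t, sd_t = stats(test_cohort_scores)
    # symmetric: average of the enrollment-side and test-side z-norms
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)
```

Using only the closest cohort scores makes the normalization adapt to each utterance's own score distribution, which is what helps under the domain mismatch discussed above.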
The table shows the best values that we got for the SRE18 development and the SRE19 evaluation sets: an EER of 4.7% and a C_Min of 0.27 as the best scores for the SRE18 development set, and an EER of 4.51%, a C_Min of 0.36 and a C_Primary of 0.39 for the SRE19 evaluation systems.
To summarize: we trained three x-vector extractors and back-end models on different partitions of the available datasets. We also explored a novel discriminative back-end model called the NPLDA, which is inspired by neural network architectures and the generative Gaussian PLDA model. We observed that the NPLDA improves over the GPLDA system for various datasets. We discussed the errors that were caused by calibration with the mismatched development datasets, and also the significant performance gains that were achieved by using the cohort-based AS-norm adaptive score normalization technique for various systems.
These are some of the references that we used.
thank you