Hello. I am a researcher at the Computer Science Institute, which is affiliated with CONICET and the University of Buenos Aires. The work I'm going to talk about today was done in collaboration with Mitchell McLaren from the STAR Lab at SRI International.
So let me start by describing one of the most standard speaker verification pipelines these days. The pipeline is composed of three stages. First we have the speaker embedding extractor, which is meant to transform the input sequences for the two sides of a trial into fixed-length vectors, x1 and x2 here. Then we have a stage that does LDA followed by mean and variance normalization, and then length normalization. The resulting vectors x1 and x2 are then processed by the PLDA stage, which computes a score for the trial, and that score can then be thresholded to make the final decision.
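To make the structure concrete, here is a minimal sketch of the three stages in Python with NumPy; the parameter names (W, mu, sigma, and the PLDA parameters) are placeholders I chose for illustration, standing in for matrices that would be trained beforehand.

```python
import numpy as np

def preprocess(e, W, mu, sigma):
    """Stage 2: LDA projection, mean/variance normalization, length normalization."""
    x = W @ e                      # LDA projection of the embedding
    x = (x - mu) / sigma           # mean and variance normalization
    return x / np.linalg.norm(x)   # length normalization

def plda_score(x1, x2, Lmbd, Gmm, c, k):
    """Stage 3: closed-form PLDA score, a second-order polynomial in x1 and x2."""
    return (x1 @ Lmbd @ x2 + x2 @ Lmbd @ x1
            + x1 @ Gmm @ x1 + x2 @ Gmm @ x2
            + c @ (x1 + x2) + k)

# e1, e2 come out of the embedding extractor (stage 1); then:
# x1, x2 = preprocess(e1, W, mu, sigma), preprocess(e2, W, mu, sigma)
# decision = plda_score(x1, x2, Lmbd, Gmm, c, k) > threshold
```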
So, the PLDA scores are computed as LLRs, log-likelihood ratios, under certain Gaussian assumptions. The form of the LLR is this: it's the logarithm of the ratio between two probabilities, the probability of the two inputs given that the speakers are the same, and the probability of the inputs given that the speakers are different. This LLR, given the Gaussian assumptions in PLDA, can be computed with a closed form, which is a second-order polynomial in x1 and x2; you can find the formula in the paper.
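For reference, the standard closed form looks like this, where the matrices Λ and Γ, the vector c, and the constant k are all derived from the PLDA model's parameters:

$$
\mathrm{LLR}(x_1, x_2) = \log \frac{p(x_1, x_2 \mid \text{same speaker})}{p(x_1, x_2 \mid \text{different speakers})} = x_1^\top \Lambda x_2 + x_2^\top \Lambda x_1 + x_1^\top \Gamma x_1 + x_2^\top \Gamma x_2 + c^\top (x_1 + x_2) + k
$$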
So, the problem is that in most cases what comes out of PLDA are scores that are very badly calibrated. This means that the values we compute as LLRs really are not LLRs, and the cause of this mismatch is that the assumptions made by PLDA do not really match the real data.
So, miscalibrated scores have the problem that they have no probabilistic interpretation. This means that we cannot use their absolute values; we can only use them relative to each other. So we could rank examples of trials, but we cannot interpret the scores themselves. Let's say, for example, that you get a score of minus one from a certain system for a certain trial. You would only be able to tell what this minus one means after you've seen a distribution of scores from some development data that has gone through the system. Once you see this distribution, you can interpret this minus one properly, and you could actually threshold the score and decide whether the trial is a target sample.
Okay, so we would like scores to be well calibrated, because then they have this nice property that they are LLRs, so we can interpret their values, and we can also use Bayes' rule to make a decision on the threshold without having to see any development data.
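For example, for a decision with miss and false-alarm costs C_miss and C_fa, and a prior P_tar for the target hypothesis, the Bayes rule is to accept whenever the LLR exceeds a threshold that depends only on those quantities, not on any data:

$$
\text{accept if} \quad \mathrm{LLR} \ \geq\ \log \frac{C_{\mathrm{fa}} \, (1 - P_{\mathrm{tar}})}{C_{\mathrm{miss}} \, P_{\mathrm{tar}}}
$$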
So, calibration is generally done with an affine transformation that is trained using logistic regression. Let's say you know that your scores are miscalibrated. Then what you do is train this alpha and beta, which are the two parameters of the affine transformation, so that they minimize the cross-entropy, which is the logistic regression objective function, and then you get properly calibrated scores at the output.
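As an illustration, here is a minimal sketch of that global calibration in Python; `raw_scores` and `trial_labels` (1 for target trials, 0 for impostor trials) are assumed names for a supervised calibration set.

```python
import numpy as np

def train_calibration(scores, labels, lr=0.01, n_iters=5000):
    """Fit s_cal = alpha * s + beta by minimizing the binary cross-entropy
    (logistic regression), here with plain gradient descent."""
    alpha, beta = 1.0, 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(alpha * scores + beta)))  # sigmoid
        grad = p - labels            # d(cross-entropy) / d(logit)
        alpha -= lr * np.mean(grad * scores)
        beta -= lr * np.mean(grad)
    return alpha, beta

# alpha, beta = train_calibration(raw_scores, trial_labels)
# calibrated = alpha * raw_scores + beta   # now interpretable as LLRs
```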
Okay, so basically, what this means is that we take the pipeline we had and we just add one more stage: the global calibration.
Now, the problem is that this doesn't really solve the problem in general. With this global calibration we are only solving the problem for the exact set on which we trained the calibration parameters. If the calibration set doesn't match our test set, then we will still have a calibration problem.
These results illustrate this. I'll explain what the sets are later, but for now what's important is that I'm showing three different PLDA systems that are identical up to the calibration stage; what differs is which training data was used to train the calibration parameters. Those are the red bars. What's important here is to compare the height of each bar, which is the actual Cllr for each of the systems, with the black line, which is the min Cllr for that system. If the difference between the two is small, it means the system is well calibrated; if it's big, it means it is not well calibrated.
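For reference, Cllr is the usual cross-entropy metric between the scores, interpreted as LLRs, and the trial labels, averaged separately over target and impostor trials; min Cllr is the same quantity after an optimal monotonic transformation of the scores, so it reflects pure discrimination:

$$
C_{\mathrm{llr}} = \frac{1}{2} \left[ \frac{1}{N_{\mathrm{tar}}} \sum_{i \in \mathrm{tar}} \log_2\!\left(1 + e^{-s_i}\right) + \frac{1}{N_{\mathrm{imp}}} \sum_{j \in \mathrm{imp}} \log_2\!\left(1 + e^{s_j}\right) \right]
$$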
What we see here is that the performance, the actual Cllr, is very sensitive to which set was used to train the calibration model. So, for example, Vox, which is VoxCeleb plus Switchboard but mostly VoxCeleb, matches the Speakers in the Wild dataset very well, so it gives very good calibration there but horrible calibration for SRE. And similarly, the RATS data is a very good match for LASRS but is not so good for SRE16.
So basically, this means we cannot get a single global calibration model that works well across the board.
Alright, so the goal of this work is to design a system that doesn't require this recalibration for every new condition. It's quite an ambitious goal: we basically want a speaker verification system that can be used out of the box, without needing a development dataset.
Okay, so, going back to the pipeline: the standard approach, in the pipeline I showed, is to train each of the stages separately, freezing the previous stage and using the data that comes out of that stage to train the next stage, with a different objective for each. The first one, the speaker embedding extractor, is trained with a speaker classification objective; the LDA and the PLDA are trained to maximize the likelihood of the data; and finally, the calibration stage is trained to optimize binary cross-entropy, which is a speaker verification objective.
Now, one simple thing we can do is just integrate the three stages. We may think this is a solution to the calibration problem, and that it may actually solve our issue of needing recalibration across conditions. So what we do is basically keep the exact same functional form as in the standard pipeline, but instead of training the stages with different objectives, separately, we just train them jointly using stochastic gradient descent.
For this, of course, we need mini-batches that are made of trials, mini-batches of trials rather than samples. What we do is randomly select speakers, select two samples for each speaker, and then, from that list of samples, create all the possible trials, all the pairs across those samples. With that we can compute the binary cross-entropy, and that is what we optimize.
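Here is a minimal sketch of that batch-construction logic in Python; `samples_by_speaker`, mapping each speaker to their preprocessed embeddings, is an assumed structure for illustration:

```python
import random
from itertools import combinations

def make_trial_batch(samples_by_speaker, n_speakers):
    """Build a mini-batch of trials: pick speakers at random, take two
    samples per speaker, then form every possible pair across the samples."""
    speakers = random.sample(list(samples_by_speaker), n_speakers)
    samples = []
    for spk in speakers:
        samples += [(spk, x) for x in random.sample(samples_by_speaker[spk], 2)]
    trials = []
    for (spk_a, x_a), (spk_b, x_b) in combinations(samples, 2):
        trials.append((x_a, x_b, int(spk_a == spk_b)))  # label 1 = same speaker
    return trials  # the scores for these pairs feed the cross-entropy loss
```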
This is not the first time that something like this has been proposed, of course. About a decade ago, Burget and others proposed something very similar; at the time, they actually trained the backend with an SVM or with linear logistic regression instead of stochastic gradient descent, but the concept is the same.
And more recently, there have been a few papers on end-to-end speaker verification that use some flavor of this idea, where a backend, which usually has a form very similar to this standard one, is trained discriminatively; you can find the references in our paper. These papers actually report improved discrimination performance, but they don't usually report calibration performance, which is what we care about in this work.
And what we actually found in our previous paper is that this approach of just training the PLDA backend discriminatively is not sufficient to get good calibration across conditions. We know that from our previous papers, so it means that this architecture, trained jointly, is not enough.
So, what is the problem? In this basic form, as we showed before, the calibration stage is global, the same as in the standard pipeline, and it seems that this doesn't give the model enough flexibility to adapt to the different conditions in the data. Even if you train the model with a lot of different conditions, it will just adapt to the majority condition.
So, what we propose to do is to add a branch to this model. We keep the speaker verification branch the same, and we add a branch that is in charge of computing the calibration parameters as a function of the input vectors x1 and x2. The form of this branch starts out the same as the top one: an affine transformation (of course, the parameters of this affine transformation are different from the top one's), then length normalization. Then we do dimensionality reduction; we go to a very low dimension, in the paper we use dimension five, to compute what we call the side-information vectors. And then we use these vectors to compute an alpha and a beta using a very simple form, which is basically similar to the PLDA form here. So, in the end, we have two branches: one is in charge of computing the score, and the other one is in charge of computing the calibration parameters, for each pair of samples.
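Schematically, the branch looks something like this; this is a sketch of the idea rather than the exact parameterization from the paper, with all the parameter names being placeholders of my choosing:

```python
import numpy as np

def side_info(e, W_s, mu_s, sigma_s, D):
    """Side-information branch: its own affine transform and length
    normalization, then a projection to a very low dimension (5 in the paper)."""
    z = (W_s @ e - mu_s) / sigma_s
    z = z / np.linalg.norm(z)
    return D @ z                       # side-information vector

def calibration_params(s1, s2, La, ca, ka, Lb, cb, kb):
    """alpha and beta as second-order forms of the two side-info vectors,
    analogous to the PLDA score form."""
    alpha = s1 @ La @ s2 + s2 @ La @ s1 + ca @ (s1 + s2) + ka
    beta  = s1 @ Lb @ s2 + s2 @ Lb @ s1 + cb @ (s1 + s2) + kb
    return alpha, beta

# final score for a trial: alpha * plda_score(x1, x2, ...) + beta
```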
So, I'll show the results now, but first let me talk about the data. We have a whole lot of training data: we used VoxCeleb 1 and 2, SRE data (speaker recognition evaluation data from 2005 to 2012) plus Mixer 6, and Switchboard. All of that is actually shared with the embedding extractor training; for the backend we just use half of what we used for embedding extractor training, for expediency in the experimentation. Then we have two more sets: RATS source data, which is telephone data in several different non-English languages, and FVC Australian, which is forensic voice comparison data, a very clean set recorded with studio microphones, in Australian English.
Then, for testing, we use SRE16 and SRE18, Speakers in the Wild, LASRS, which is a bilingual set recorded over several different microphones, and the forensic voice comparison data, the Chinese version; the recording conditions of these last two sets are very similar, but the language is different. And as development sets, we use the dev parts of three of those sets, SRE16, SRE18, and Speakers in the Wild; with those we do all the parameter tuning, we choose the best iteration for each of the models, things like that.
Okay, so here we see the results again. The red bars are the same ones as in the previous figure I showed, and I've added the blue bar, which is the system we propose. As you can see, in most cases it is as good as or better than the best of the global calibration models. So we basically achieved what we wanted, which is to have a single model that kind of adapts to the test conditions without us telling it what the test conditions are.
The only exception is this FVC CMN case, which is not well calibrated at all. In fact, there is one global PLDA model that is better than the one we propose; it is still bad, but it is better than ours. The problem with that set is basically that it is a condition that is not seen, in combination, during training: we have clean data in training, but it is not in Chinese, and we have Chinese data in training, but it is not clean. So the model doesn't seem to be able to learn how to properly calibrate that data.
Unfortunately, this just means there's still work to be done; we haven't really achieved that ambitious goal that I mentioned before, which was to have a completely general, out-of-the-box system.
Okay, so before I finish, I'd like to describe a few details of how this model is trained, because they are essential to get good performance.
One important thing is to do a non-random initialization. What we do, and many of the end-to-end training papers do similar things, is initialize the speaker branch with the parameters of a standard PLDA baseline; that's the usual thing. Then, for this side-information branch, we initialize the first stage with the bottom components of the LDA transform that we trained for the speaker branch. That means that what comes out of there is basically the worst you could do for speaker ID, which should be close to the best you can do for condition ID; we're trying to extract the condition information from the input. Then this matrix here, which doesn't have any reasonable value to initialize from, we just initialize randomly anyway. And these last two components we initialize so that what comes out of the branch are the global calibration parameters at the first iteration of training. So, basically, at initialization, the scores that come out of the model are the same as those that would come out of a standard PLDA pipeline.
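Put as code, the initialization might look something like this; a sketch under the same placeholder names as before, where `lda_full` is assumed to hold the baseline's LDA directions sorted from most to least speaker-discriminative:

```python
import numpy as np

def initialize(lda_full, plda_params, alpha_global, beta_global, spk_dim, si_dim):
    """Non-random initialization: speaker branch from the PLDA baseline,
    side-info branch from the bottom LDA directions and global calibration."""
    # Speaker branch: copy everything from the trained PLDA baseline.
    W = lda_full[:spk_dim]           # top LDA directions: best for speaker ID
    Lmbd, Gmm, c, k = plda_params

    # Side-info branch, first stage: bottom LDA directions, i.e. the worst
    # directions for speaker ID, hopefully close to the best for condition ID.
    W_s = lda_full[-si_dim:]

    # The low-dimensional projection has no obvious counterpart: random init.
    D = 0.1 * np.random.randn(5, si_dim)

    # Last stage: zero out the dependence on the side-info vectors, so that
    # at iteration zero alpha and beta are exactly the global values.
    La, ca, ka = np.zeros((5, 5)), np.zeros(5), alpha_global
    Lb, cb, kb = np.zeros((5, 5)), np.zeros(5), beta_global
    return W, (Lmbd, Gmm, c, k), W_s, D, (La, ca, ka), (Lb, cb, kb)
```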
Here are the results comparing three different initialization approaches: random; then the one I call partial, which is what I described before but without initializing this side-information stage with the LDA bottom components, using random initialization there as well; and then the blue one, which is the full initialization. The blue one is the best of the three, so it means it's worth the trouble to take the time to find initial parameters that make sense.
So, another important thing is that we train the model in two stages. The first stage uses all the training data to train all the parameters, and then, in the second stage, we freeze the LDA and PLDA blocks and train only the rest of the parameters using domain-balanced data. This is important because, if the data is not balanced, then most of the trials in any given batch would be from one domain, and we would just be optimizing things for that one domain, the one that has more samples.
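A possible sketch of that domain balancing, reusing the batch construction from before; the structure `speakers_by_domain` is again an assumption for illustration:

```python
import random

def pick_balanced_speakers(speakers_by_domain, n_speakers):
    """Draw the speakers for a batch uniformly across domains, so that no
    single domain dominates the trials in the batch."""
    domains = list(speakers_by_domain)
    per_domain = n_speakers // len(domains)
    chosen = []
    for dom in domains:
        chosen += random.sample(speakers_by_domain[dom], per_domain)
    return chosen  # then build the trials from these speakers as before
```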
Finally, the convergence of the model is kind of a big issue: the validation performance jumps around a lot from batch to batch, so if you look at the optimization curve, it can change significantly from one batch to the next. What we do is basically choose the best iteration using the validation sets that I mentioned before, and the good thing is that this approach seems to generalize well to other sets, even to sets that are not very well matched to the validation sets. We tried a bunch of tricks to smooth out the validation performance, things like regularization and slower learning rates, and they do succeed in smoothing it out, but they actually make the minimum worse; so we keep the wild initial curves and just choose the minimum.
And one more thing: we have a GitHub repository with exactly this model implemented, for training and for evaluation. You just need to have pre-computed embeddings, and we provide an example with embeddings. Feel free to use and modify it, let me know if you find bugs, and I'll be happy to respond to questions and comments.
Okay, so, in conclusion: we developed a model that achieves excellent performance across a wide variety of conditions. It integrates the different stages of a speaker verification pipeline into one stage and trains the whole thing jointly. It also integrates an automatic extractor of side-information, which it then uses to condition the calibration parameters, and this achieves our goal of getting good performance across different conditions. Of course, there are many open issues, like the training convergence; I don't think we are done with that, and I would like to see an easier-to-optimize model. And of course, we'd like to plug this model in with the embedding extractor and train everything end-to-end.
Okay, thank you very much. If you have any questions or comments, please feel free to reach out through the conference platform so we can discuss in more detail. Thank you.