Just a very short introduction to the topic. We are working with probabilistic linear discriminant analysis (PLDA), which has previously been improved by discriminative training. Previous studies used loss functions that essentially focus on a very broad range of applications. In this work we try to train the PLDA model so that it becomes more suitable for a narrow range of applications, and we observe a small improvement in the minimum detection cost by doing so.
So, as background: when we use a speaker verification system, we would like to minimize the expected cost of our decisions, and this is very much reflected in the detection cost function that is typically used. We have a cost for false rejection and a cost for false alarm, and also a prior, which together constitute the operating point of our system, and which of course depend on the application.
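To make the operating point concrete, here is a minimal sketch in Python (the naming is mine; the default numbers are just the familiar NIST-style example values, not necessarily the exact ones used later):

    def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=10.0, c_fa=1.0):
        # Expected cost of decisions at the operating point
        # (p_target, c_miss, c_fa), given the system's miss and
        # false-alarm rates at the chosen threshold.
        return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

Minimizing this over all thresholds gives the minimum detection cost; the actual detection cost is what you get at the threshold your calibration implies.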
The target here is an application-specific system that is optimal for one or a few operating points, rather than overall.
This idea has already been explored for score calibration, in an Interspeech paper I will mention. However, with score calibration we can reduce the gap between the actual detection cost and the minimum detection cost, but we cannot reduce the minimum detection cost itself. By applying these ideas at an earlier stage of the speaker verification system, we could hope to reduce the minimum detection cost as well. So we apply them to discriminative PLDA training.
We use a method that has previously been developed for discriminative PLDA training. The key point is that the log-likelihood ratio score of the PLDA model is given by a formula that is quadratic in the i-vectors, so we can apply a discriminative training criterion to its parameters. We basically take all possible pairs of i-vectors in the training database and minimize some loss function L, possibly with per-trial weights, and we also apply a regularization term.
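Written from memory of the standard discriminative PLDA formulation rather than copied from the paper (the parameter names are mine), the score and objective look roughly like this:

    import numpy as np

    def llr_score(x1, x2, Lam, Gam, c, k):
        # PLDA log-likelihood ratio as a quadratic function of the
        # i-vector pair; Lam, Gam, c, k are functions of the PLDA
        # parameters and are what discriminative training adjusts.
        return (x1 @ Lam @ x2 + x2 @ Lam @ x1
                + x1 @ Gam @ x1 + x2 @ Gam @ x2
                + c @ (x1 + x2) + k)

    def objective(pairs, labels, weights, params, loss, reg_term, lam):
        # Weighted loss over all training pairs plus a regularizer.
        total = sum(w * loss(llr_score(x1, x2, *params), y)
                    for (x1, x2), y, w in zip(pairs, labels, weights))
        return total + lam * reg_term(params)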
When we talk about how to target the system at a certain operating point, we need to consider two things. First, the weights, beta, which differ per trial; in essence they will be different for target and non-target trials. Second, the loss function. To put it very simply, the weights decide which operating point we are targeting, whereas the choice of loss function decides how much emphasis we put on the surrounding operating points.
So, just a bit more about the weights beta. As several of you probably know, for an application given by these three parameters, the probability of a target trial and the two costs, we can rewrite the cost as an equivalent one with a single effective prior, whose loss in training or evaluation is proportional to that of the original application. So we can just as well consider this equivalent application and minimize for it. To do that, we essentially need to weight every trial so that the effective proportion of target trials in the training database matches the effective prior of the operating point we are looking at, rather than whatever proportion the training database happens to have.
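In code, the standard conversion and weighting would be as follows (the function names are mine):

    def effective_prior(p_target, c_miss, c_fa):
        # Collapse (p_target, c_miss, c_fa) into one equivalent parameter.
        return p_target * c_miss / (p_target * c_miss + (1.0 - p_target) * c_fa)

    def trial_weights(n_target, n_nontarget, p_eff):
        # Scale target and non-target trials so that their weighted
        # proportions match the effective prior of the targeted
        # operating point, regardless of the training-set proportions.
        return p_eff / n_target, (1.0 - p_eff) / n_nontarget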
So, regarding the choice of loss function: previous studies on discriminative PLDA training used the logistic regression loss or the SVM hinge loss. The logistic regression loss, which is essentially the same as the Cllr loss, is justified as an application-independent evaluation metric, so it is a suitable loss function if we want to target a very broad range of applications. What we want to see here is whether, by targeting a more narrow range of operating points, we can get better performance at those operating points. The loss that corresponds exactly to one detection cost is the zero-one loss. We will also consider a loss function that is a little bit broader than the zero-one loss but narrower than the logistic regression loss, namely the Brier loss.
To explain why this is the case, I can refer to that Interspeech paper, which is very interesting. Here I am showing a picture of how these different losses look; by the way, the plot is for a target trial. The blue one is the logistic regression loss, which is convex, but it can also be sensitive to outliers: for the zero-one loss we have some cost here, and once we pass the threshold, which is here, there is no cost at all, whereas the logistic regression loss can be very large for some unusual trials in our database. So our system may end up adjusted too much to a few extreme trials.
Then we have the Brier loss and the zero-one loss. As I said, there are a couple of approximations that we will use later: for the optimization we use a sigmoid approximation of the zero-one loss, which includes a parameter alpha that makes it more and more similar to the zero-one loss as you increase it; the figure shows it for alpha equal to one, ten, and one hundred.
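A minimal sketch of that approximation (my own notation; t is the score threshold implied by the targeted operating point):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_loss(score, is_target, t, alpha):
        # Smooth stand-in for the zero-one loss; alpha controls the
        # sharpness. A target trial is penalized for scoring below the
        # threshold t, a non-target trial for scoring above it.
        z = alpha * (score - t)
        return sigmoid(-z) if is_target else sigmoid(z)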
There are a couple of problems, though. The real zero-one loss is not differentiable, which is why we use the sigmoid function instead. Also, both the real zero-one loss and its sigmoid approximation are non-convex, so we take an approach where we gradually increase the non-convexity. For the sigmoid loss this means we start from the logistic regression model (we also tried starting from the ML model, but it is better to start from the logistic regression model) and then increase alpha gradually; there are other papers doing this for other applications. We do something similar for the Brier loss, where we start from the logistic regression model and then train with the Brier loss, with the sigmoid approximation of the zero-one loss trained last.
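As a toy illustration of this annealing idea (a one-parameter model on synthetic scores, not our actual setup):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(1, 1, 200), rng.normal(-1, 1, 200)])
    y = np.concatenate([np.ones(200), -np.ones(200)])  # +1 target, -1 non-target

    def logistic(w):
        # Convex logistic regression loss; its optimum is the starting point.
        return np.mean(np.log1p(np.exp(-y * w[0] * x)))

    def sig(w, alpha):
        # Sigmoid approximation of the zero-one loss, sharper as alpha grows.
        return np.mean(1.0 / (1.0 + np.exp(alpha * y * w[0] * x)))

    w = minimize(logistic, np.array([0.1])).x
    for alpha in (1.0, 10.0, 100.0):  # gradually increase the non-convexity
        w = minimize(lambda v: sig(v, alpha), w).x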
Regarding the experiments: we used the male telephone trials, and we used a couple of different databases. As development set we used SRE'06, and then SRE'08 and SRE'10 were intended for testing; for PLDA training we used the standard datasets. Here you can see the number of i-vectors and speakers with or without SRE'06 included: we used it as the development set, but sometimes we included it in the training set after we had decided on the parameters, to get a little bit better performance.
We conducted four experiments. I should also say that we target the operating point mentioned here, which has been the standard operating point in several NIST evaluations. The first experiment just considers a couple of different normalization and regularization techniques, because we were not sure what is best, although it is not really related to the topic of this paper. In the second experiment we compare the different loss functions discussed above. Then we analyze the effect of calibration. Finally, we try to investigate a little bit whether the choice of beta according to the formula given before is actually suitable or not.
For regularization there are two options that are popular, I guess, in this kind of work: we can do regularization towards zero, which would be the most common, or regularization towards the ML model, by which I mean the normal generatively trained model; logistic regression is also, in that sense, a maximum-likelihood approach. We also compare whitening with the within-class covariance against whitening with the total covariance. We found that, in terms of minDCF and EER, whitening with the total covariance and regularization towards the ML model led to better performance, so we used that in the remaining experiments.
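A sketch of what regularization towards the ML model means here (the notation is mine):

    import numpy as np

    def reg_towards_ml(theta, theta_ml):
        # Penalize distance from the generatively (ML) trained parameters
        # rather than from zero, so with little data the discriminative
        # model stays close to the ML solution.
        return np.sum((theta - theta_ml) ** 2)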
So, comparing loss functions. First we should say that all the discriminative training schemes give a better actual detection cost than standard maximum-likelihood training, but that is to be expected, because they do calibration at the same time; however, not perfect calibration, which we will discuss later. In terms of minimum detection cost, the ML model is very competitive, but we can see some improvement from the application-specific loss functions compared to logistic regression on one of the evaluation sets, while for the other there is no such gain.
The standard maximum-likelihood model has somewhat worse calibration, but since we can take care of that by applying calibration, we will also consider that here in order to get a fair comparison. To do so, we need to use some portion of the training data for calibration: here you see that we used fifty, seventy-five, ninety, or ninety-five percent of the training data for PLDA training and the rest for calibration. For calibration we used the Cllr loss, which is essentially the same as logistic regression, weighted for the operating point we are targeting. In these experiments SRE'06 is not included.
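For concreteness, a minimal sketch of such prior-weighted calibration training (a standard formulation written by me, not necessarily our exact recipe):

    import numpy as np
    from scipy.optimize import minimize

    def train_calibration(scores, labels, p_eff):
        # Affine calibration s' = a*s + b, trained with the logistic
        # regression (Cllr) loss weighted for the targeted operating
        # point; scores/labels are numpy arrays from the held-out part.
        tar, non = scores[labels == 1], scores[labels == 0]
        logit_p = np.log(p_eff / (1.0 - p_eff))
        def weighted_cllr(ab):
            a, b = ab
            return (p_eff * np.mean(np.log1p(np.exp(-(a * tar + b) - logit_p)))
                    + (1.0 - p_eff) * np.mean(np.log1p(np.exp(a * non + b + logit_p))))
        return minimize(weighted_cllr, np.array([1.0, 0.0])).x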
The results look like this. The first thing to say is that the ML model with calibration applied gives better results than discriminative training without calibration. The second thing is that the discriminative training here also benefited from calibration, which must be explained by the fact that we are using regularization. Overall we can say that using seventy-five percent of the data for discriminative training and the rest for calibration was optimal. We also notice that logistic regression performs quite badly with a very small amount of training data, that is, when using only fifty percent, whereas the Brier loss and the zero-one loss perform better there.
This is probably because of how those two loss functions behave. If I can go back to the figure here: the zero-one loss, for example, makes no use of a trial whose score is past the threshold, like this one, whereas the logistic regression loss would; that means the logistic regression loss uses more of the data. So what happens here, I think, is that since we regularize towards the ML model, the zero-one loss does not change the model so much, that is, it stays close to the ML model, when the amount of training data is small.
Also, the choice of beta according to the formula is optimal only under the assumption that the trials in the training database are all independent, which is not the case here, because we created the training trials by pairing all the i-vectors. It also assumes that the training database and the evaluation database have similar properties, which is probably also not really the case. So the optimal beta could be different from the one given by the formula.
This may look a bit strange, but basically we want to check a couple of different values for beta, which means different effective priors. We parametrize this with a parameter gamma: when gamma is 0.5 we use the standard effective prior calculated according to the formula; when gamma is equal to one we use an effective prior of one, which means all weight on the target trials; and when it is zero, the parametrization makes sure that all weight goes to the non-target trials. We use the Brier loss in this experiment, and again SRE'06 is not included.
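Since I only described the endpoints, here is one hypothetical mapping with those properties (the exact parametrization we used may differ; this one just hits the stated points):

    def swept_prior(gamma, p_eff):
        # gamma = 0   -> prior 0   (all weight on non-target trials)
        # gamma = 0.5 -> p_eff     (the effective prior from the formula)
        # gamma = 1   -> prior 1   (all weight on target trials)
        if gamma <= 0.5:
            return 2.0 * gamma * p_eff
        return p_eff + (2.0 * gamma - 1.0) * (1.0 - p_eff)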
This figure is a little bit interesting, I think. First, it seems that the choice matters much more for the actual detection cost than for the minimum detection cost, but remember that here we did not apply calibration afterwards. It is also clear that the best choice is not the one that was calculated with the formula; that is very interesting, and an area that should be explored more. I should probably also say that the actual cost goes up a little bit right there, which is very noticeable and I am not sure why, because that is actually the effective prior according to the formula, which we used in the other experiments.
Anyway, that pattern is with regularization towards the ML model. A very interesting thing is that the minimum detection cost actually goes down a little bit here, at the extremes, which are the cases where we give, in the training data, all weight to the target trials or all weight to the non-target trials. The way I interpret this result is that it should really not work, but because we regularize towards the ML model, the trained model just ends up very close to the ML model, and for this kind of system that can actually be something good. We have also included the results for regularization towards zero, where we can see that this is not the case.
So, in conclusion: we can see that application-specific loss functions can sometimes improve the performance, but quite often there is not so much difference. We tried different optimization strategies; what I should say about that is that starting from the logistic regression model is important, the starting point is important, but the scheme of gradually increasing the non-convexity was not so effective, actually, though I won't discuss the details here. So the optimization is something to consider. Also, since it seems that the weights beta do have some importance but the simple formula is not optimal, we should probably consider better estimates of beta, maybe something that depends on other factors than just whether it is a target or non-target trial. And since the discriminatively trained models needed calibration, I think it could be interesting to make the regularization target a parameter vector with the calibration built in: we do calibration of the ML model and then put the parameters from that calibration into the regularization, so that we actually regularize towards something that is calibrated.
Okay. Any questions?

Q: What optimizer did you use?
A: We used the BFGS algorithm.
Q: Okay, so, you mentioned there were some issues with the non-convexity of your objective. In the work that I presented this morning I also had issues with non-convexity, and it was a problem we had to deal with. BFGS basically forms a rough approximation to the inverse of the Hessian, right, and if you do BFGS, that Hessian approximation is going to be positive definite. So it cannot see the non-convexity, and it cannot do anything about it.
A: I think that is a good point, and we should consider some better optimization algorithm. We can confirm that we reduced the value of the objective function quite significantly, but maybe we could have done much better in that respect by using something more suitable, I think.
Q: In my case there was a simple solution: I could calculate the full Hessian and invert it without problems, because I had very few parameters, and I could do an eigenvalue analysis and then go down the steepest negative eigenvectors, which takes you out of the non-convex regions. For you the dimensionality is very high, so you could perhaps do some other things, but it is more difficult, I'm afraid. But thank you.
Q: [inaudible]
A: Okay, well, basically because we are not doing calibration; we are doing discriminative training of the PLDA. Okay.