and i work the
to discuss in this tutorial: we'll start with some background and definitions
and then discuss some alternative training procedures
and then talk about the motivation for end-to-end training
and continue with some discussion of end-to-end training
then we will review
existing work on end-to-end speaker recognition
and then we will wrap up with some summary
I would like to give some acknowledgement to my colleagues from
with whom I've discussed these topics a lot
so let's start with recognition
a typical recognition scenario
we assume we have some features x and some labels y
and we wish to find some function, parameterized by theta,
which, given the
features, predicts
some label
which should be close or equal to the true label
to be more precise
we would like the prediction to be such that some loss function, which compares the
predicted label with the true label, is as small as possible on unseen data
and the loss function, for example if we do classification, can be
zero if the predicted label is the same as the true label and one
otherwise - this is the error rate
of course ideally what we want to do is to
minimize the expected loss on unseen test data, which we could calculate like this
and here we use capital x and y to denote that they are unseen random variables
but since we don't know the probability distribution of
x and y we cannot do this
exactly or explicitly
in the supervised learning problem we have access to some training data which would be
many examples of features and labels, so we can instead
check the average loss on the training data, and that is what we try to minimize
and then we hope that
this procedure means that we will also get a low loss on
unseen test data
and this is called empirical risk minimisation
and it is expected to work if
the classifier that we use is not too powerful
to be precise, something called the vc dimension should be finite, and it
also requires that the distribution of the loss
does not have heavy tails, but for typical scenarios
empirical risk minimisation is expected to work
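To make the idea concrete, here is a minimal sketch of empirical risk minimisation with the zero-one loss. The one-parameter threshold classifier, the four training points and the candidate values of theta are all made up for illustration; they are not from the talk.

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0 if the predicted label equals the true label, 1 otherwise."""
    return (np.asarray(y_pred) != np.asarray(y_true)).astype(float)

def empirical_risk(predict, theta, x, y, loss=zero_one_loss):
    """Average loss of predict(x, theta) over the training examples."""
    return float(np.mean(loss(predict(x, theta), y)))

# Hypothetical classifier: predict label 1 if the feature exceeds theta.
def threshold_classifier(x, theta):
    return (np.asarray(x) > theta).astype(int)

# Toy training data (assumed, for illustration only).
x_train = np.array([0.1, 0.4, 0.6, 0.9])
y_train = np.array([0, 0, 1, 1])

# "Training": pick the candidate theta with the lowest average training loss.
candidates = [0.0, 0.5, 1.0]
risks = [empirical_risk(threshold_classifier, t, x_train, y_train)
         for t in candidates]
best_theta = candidates[int(np.argmin(risks))]
```

The hope expressed above is that the theta chosen this way also gives a low loss on unseen data drawn from the same distribution.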
so then let's talk about speaker recognition
as probably most
in the audience here know, we have these three subtasks of speaker recognition
there is speaker identification
which basically is to classify among a closed set of speakers; this is a very
typical recognition
scenario, and then we have speaker verification where we deal with an
open set, as we say
so the speakers that we may see in testing
are not the same as the ones we have access to in training when building the model
and our task is typically to say whether two segments or utterances are from the same
speaker or not
and then there's also speaker diarization, which is
to assign, basically, in a long recording, each time unit
to a speaker
so here i will focus on speaker verification because the speaker identification task is
quite easy you know at least conceptually
and speaker diarization is hard, and end-to-end approaches there are still at a very early stage
so although some great
work has been done
it's maybe too early to focus on that in a tutorial
often we would like a classifier to output
not a hard prediction, like it's this class or that class, but
rather probabilities of the different classes
so we would like a
classifier that gives us an estimate of the probability of some label given the data
in the case of speaker verification we would rather prefer it to output the log-likelihood ratio
because from that we can get
the probability of a class given the data, where the classes here are just target or non-target
but we can
do this based on a specified prior probability
so it gives us a bit more flexibility in how to use the system
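As a small illustration of that flexibility, the sketch below turns a log-likelihood ratio into a posterior probability of the target hypothesis for whatever prior one specifies, using Bayes' rule in log-odds form. The function name is made up for this example.

```python
import math

def posterior_from_llr(llr, p_target):
    """P(target | data) from a log-likelihood ratio and a prior P(target).

    Bayes' rule in log-odds form:
        posterior log-odds = llr + prior log-odds
    """
    log_prior_odds = math.log(p_target) - math.log(1.0 - p_target)
    log_posterior_odds = llr + log_prior_odds
    return 1.0 / (1.0 + math.exp(-log_posterior_odds))

# The same score gives different posteriors under different priors.
p_balanced = posterior_from_llr(2.0, 0.5)
p_rare_target = posterior_from_llr(2.0, 0.01)
```

The same system output can therefore be reused for applications with different target priors without retraining.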
so now let's talk about end-to-end training
and my impression is that it's not completely well defined in the literature
but it seems to involve
these two aspects
first, all parameters of the system
should be trained jointly, and that could be anything from feature extraction to producing some
speaker embedding
to the backend, the comparison of speaker embeddings and producing the score
the second aspect is that
the system should be trained specifically for the
intended task, which in our case would be verification
one could be even more strict and say that it should match the exact evaluation metric
that we are interested in
in this tutorial i will try to discuss
how important these criteria are, what the benefit can be
of imposing these criteria, and what it means if we don't do it
let's look at what a
typical end-to-end speaker verification architecture
would look like
as far as i know this was first attempted for speaker verification in two
thousand sixteen
in the paper mentioned here
so we start with some
enrollment utterances
here it's three, and we have some test utterance
all of these go through some embedding-extracting neural network
there are many different architectures for that
it produces embeddings, which are fixed-size
utterance representations
one for each utterance, so here three enrollment embeddings and one test embedding
and then we create one enrollment model by some kind of pooling, for
example taking the mean
of the enrollment embeddings
and then we have some similarity measure, and in the end
a score comes out that is
the log-likelihood ratio for
the hypothesis that the
test segment
is from the same speaker as the enrollment segments
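The scoring part of such an architecture can be sketched as follows. This is only an illustration under assumptions: the embedding extractor is left out (the embeddings are taken as given arrays), the pooling is a plain mean, and the similarity is an affine-mapped cosine similarity whose `scale` and `offset` are hypothetical parameters that would be trained jointly with the rest. The talk does not say which similarity measure is used.

```python
import numpy as np

def enrollment_model(enroll_embeddings):
    """Pool several enrollment embeddings into one model by taking the mean."""
    return np.mean(np.asarray(enroll_embeddings, dtype=float), axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verification_score(enroll_embeddings, test_embedding, scale=1.0, offset=0.0):
    """Affine-mapped cosine similarity, meant to play the role of an LLR.

    scale and offset are hypothetical trainable parameters (an assumption).
    """
    model = enrollment_model(enroll_embeddings)
    test = np.asarray(test_embedding, dtype=float)
    return scale * cosine_similarity(model, test) + offset

# Three enrollment embeddings and one test embedding (made-up 2-D values).
enroll = np.array([[1.0, 0.0], [0.8, 0.2], [0.9, -0.1]])
score_same_direction = verification_score(enroll, [1.0, 0.0])
score_other_direction = verification_score(enroll, [0.0, 1.0])
```

In an end-to-end system the gradient of the verification loss would flow through this scoring step back into the embedding extractor.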
and all these parts of the model should be trained jointly
to be fair, and maybe for historical interest, we should say that
this is
not a new idea
it was used already in nineteen ninety three, at least that's the earliest i'm aware of
one paper at the time was about
handwritten signature recognition and another paper was about fingerprint recognition
and they used exactly this idea
okay so we talked about end-to-end
training and modeling
so what would be the alternative
one thing would be
generative modeling, so we train a generative model
that means a model that can generate the data, both the observations x and
the labels y, and
it can also give us
the probability or probability density for such observations
we typically train it with maximum likelihood, and if the model is correctly specified, for example
if the data really comes from a normal distribution and we have assumed that
in our model, then
with enough training data we will find the correct parameters
that is well known
and it's maybe worth pointing out that
the llrs from such a model are the best
we can have
so if we have access to the log-likelihood ratios
from the model that really generated the data
then we can make optimal decisions for classification, verification and so on
no other classifier would do better
the problem with this is that when the model
assumptions are not correct then the parameters we find with maximum likelihood may not be
optimal for classification
and sometimes maximum likelihood training is also difficult
other approaches would be some type of discriminative training; end-to-end training can be
seen as one type of discriminative training, but as another discriminative
approach we can train the neural network, the embedding extractor, for speaker
identification, which seems to be the most
popular approach right now
and then we use the output of some intermediate layer as an embedding and train a
backend on top of that
then there is also the school of metric learning, in which
we train the embedding extractor together with a distance metric, which sometimes can
be simple
so in principle the embedding extractor and the distance metric or backend are
trained jointly
but typically not for the speaker verification task
so this is kind of end-to-end training according to the first criterion but not
according to the second
next we will discuss
why end-to-end training would be preferable
we had two things: one is that we should train the modules jointly, and the other
thing is that we should train for the
intended task
the case for joint training is actually quite obvious; let's consider a
system consisting of two modules a and b, and we have theta a, which
is the parameters of module a, and theta b, which is the
parameters of module b; if we just first train module a and then
module b
it is essentially like doing
one iteration of
coordinate descent or block coordinate descent
so we train module a
and we get here, then we train module b and we get here
but we will not get to the optimum, which would be here
so of course we could continue
and do a few more iterations
and we might end up in the
optimum, and this is in principle equivalent to joint optimization
when we have a non-convex model we may not actually
get the same
optimum as if we optimized
all the parameters in one go; what happens also depends on which optimizer
we use
so in principle
this is
why joint training would
really make sure that you find the optimum
of both
modules, and that's clearly better than just training
first one and then the other
so this part of end-to-end training is justified
the joint training of all modules
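A tiny numerical sketch of this argument: with two scalar parameters standing in for theta a and theta b, and a made-up convex loss that couples them, training a once and then b once is one round of block coordinate descent and stops short of the joint optimum, while repeating the rounds converges to it. The loss and its closed-form partial minimisers are assumptions chosen for illustration.

```python
def loss(a, b):
    """Hypothetical convex loss coupling the two 'modules' through a * b."""
    return (a - 1.0) ** 2 + (b - 2.0) ** 2 + a * b

def best_a_given_b(b):
    # Solves d loss / d a = 2 (a - 1) + b = 0 for a.
    return 1.0 - b / 2.0

def best_b_given_a(a):
    # Solves d loss / d b = 2 (b - 2) + a = 0 for b.
    return 2.0 - a / 2.0

def block_coordinate_descent(n_rounds, a=0.0, b=0.0):
    """Each round trains 'module a' with b fixed, then 'module b' with a fixed."""
    for _ in range(n_rounds):
        a = best_a_given_b(b)
        b = best_b_given_a(a)
    return a, b

one_pass = block_coordinate_descent(1)      # train a once, then b once
many_passes = block_coordinate_descent(30)  # keep alternating
joint_optimum = (0.0, 2.0)                  # where both partial derivatives vanish
```

Here `one_pass` ends at a higher loss than `joint_optimum`, while `many_passes` gets arbitrarily close to it. For a non-convex model there is no such guarantee, as noted above.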
then the task-specific training, the idea that we should train for
the intended task: so if in
our application we want to do speaker verification, why should we train for verification
and not for identification, for example
first we should say that
we have some guarantee that this idea of minimizing the loss on training data
will lead to good performance on test data, the empirical risk minimisation idea
but the guarantee we have there
only holds if we are training for
the metric that we are interested in, the task that we are interested in
if we
train for one task and
evaluate
on another one, we don't really have any guarantee that
we find the optimal model parameters for this task
but one can of course ask: shouldn't it really work anyway to train for identification
and use the model for verification, because they are kind of similar tasks
and it does, as we know
so but let's just discuss a little bit what could
go wrong
or why it wouldn't be optimal
so here is a kind of toy example
we are looking at one-dimensional embeddings,
or rather the distributions of one-dimensional embeddings
so the embedding space is here, and each of these colours represents the
distribution of embeddings for some speaker, so this is one speaker, this is
another speaker, and so on
of course this is a little bit unrealistic, the
shape of the distributions, but i chose it like this for simplicity
so in this example we assume that the means of the
speakers are equidistant, like this
what would be the identification error in this case
so whenever we observe an embedding we will assign it to the closest speaker
if we
observe an embedding in this region we will assign it to this speaker
if we observe it here
we will assign it to
the
green one
and of course it means that sometimes
an embedding sampled from the blue speaker will be here, but we will assign it
to the
neighbouring speaker
so we will have some error in this situation
if we consider only the neighbouring speakers the error rate will be
twelve point two percent in this example
what would be the verification error rate
if we consider it
for this type of data
we will assume that we
have speakers
whose means are equidistantly distributed
like
these stars
now for a target trial we will sample
two embeddings from one speaker
and see if they are closer to each other than some threshold
which here happens to be the optimal threshold for this situation
and if
they are farther apart than the threshold, we have an error
and similarly for non-target trials
and here in this example we can see
we would have an error rate of fourteen percent
again i'm only considering non-target trials from neighbouring speakers
that's why the error rate is high
now i'm changing the distributions a little bit
the within-speaker distributions, so
as before
the speaker means are at the same distance
like this
but we have made the within-speaker distribution a little bit more narrow here and a little
bit more broad here
the overall within-speaker variance is the same, only the shape is a little bit different
and we see that the identification error has increased to thirteen point seven percent
whereas the verification error is that there
in a more extreme situation we have made
the distributions equally sake or broad
do those two mixtures
again the speaker means are all at the same distance
like this
and the within-speaker variance
is also the same as before
and here it would actually get
identification error
but you will have worse
verification error than in any of the other examples, and it's because
if you sample a target trial you will very often have
embeddings that are far from each other, and similarly
for non-target trials you will very often have embeddings that are close to each other
so this
should illustrate that
the within-speaker distribution that is optimal for identification is not
necessarily the distribution that is optimal for verification
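One way to play with this kind of toy example is a small Monte Carlo simulation. The sketch below is not a reconstruction of the distributions on the slides, which are not specified here: it assumes five speakers with equidistant means and compares a Gaussian within-speaker distribution with a bimodal one of the same variance. The spacing, the mixture shape, the threshold grid and the equal weighting of miss and false-alarm rates are all assumptions, and it will not reproduce the percentages quoted above.

```python
import numpy as np

def sample_within(rng, n, kind, sigma=1.0):
    """Draw n zero-mean within-speaker offsets with variance sigma**2."""
    if kind == "gaussian":
        return rng.normal(0.0, sigma, n)
    # "bimodal": equal mixture of two narrow Gaussians at +/- a, with
    # a**2 + s**2 == sigma**2 so that the total variance is unchanged.
    a = 0.9 * sigma
    s = np.sqrt(sigma ** 2 - a ** 2)
    return rng.choice([-a, a], size=n) + rng.normal(0.0, s, n)

def identification_error(rng, means, kind, n=20000):
    """Fraction of embeddings whose nearest speaker mean is the wrong one."""
    errors = 0
    for k, mu in enumerate(means):
        x = mu + sample_within(rng, n, kind)
        nearest = np.abs(x[:, None] - means[None, :]).argmin(axis=1)
        errors += int(np.sum(nearest != k))
    return errors / (n * len(means))

def verification_error(rng, means, kind, n=20000):
    """Lowest average of miss and false-alarm rate over a grid of thresholds.

    Non-target trials are drawn from neighbouring speakers only.
    """
    d = means[1] - means[0]
    tar = np.abs(sample_within(rng, n, kind) - sample_within(rng, n, kind))
    non = np.abs(d + sample_within(rng, n, kind) - sample_within(rng, n, kind))
    thresholds = np.linspace(0.0, 2.0 * d, 201)
    miss = np.array([(tar > t).mean() for t in thresholds])
    false_alarm = np.array([(non <= t).mean() for t in thresholds])
    return float(np.min(0.5 * (miss + false_alarm)))

rng = np.random.default_rng(0)
means = np.arange(5) * 2.0  # equidistant speaker means (assumed spacing)
results = {
    kind: (identification_error(rng, means, kind),
           verification_error(rng, means, kind))
    for kind in ("gaussian", "bimodal")
}
```

Comparing the two entries of `results` shows how the shape of the within-speaker distribution, at fixed variance, can move the identification and verification errors in different ways.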
okay so
as another example
let us consider the triplet loss, which is another popular
criterion
it works like this:
for each training example you have
an embedding for some speaker, which we call the anchor embedding
and then you have an embedding from the same speaker, which we call the positive
example, and an embedding from another speaker, which we call the
negative example
and basically we want the distance between the anchor and the positive example to be small
and the distance between the anchor and the negative example
to be big
if this distance is bigger than
this one plus some margin
then the loss is going to be zero
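The description above corresponds to the usual hinge form of the triplet loss. A minimal sketch, assuming Euclidean distance and a margin of 0.2 (both choices are assumptions, the talk does not specify them):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    anchor = np.asarray(anchor, dtype=float)
    d_pos = np.linalg.norm(anchor - np.asarray(positive, dtype=float))
    d_neg = np.linalg.norm(anchor - np.asarray(negative, dtype=float))
    return float(max(0.0, d_pos - d_neg + margin))

# Negative far away: the margin is satisfied and the loss is zero.
loss_easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
# Negative almost as close as the positive: the loss is positive.
loss_hard = triplet_loss([0.0, 0.0], [0.1, 0.0], [0.2, 0.0])
```

Note that the loss only looks at distances relative to one anchor; it never compares distances against a single global threshold, which is the point of the figure discussed next.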
this is not
an ideal criterion for speaker verification, and to show this i have a
rather complicated figure here that illustrates
three speakers
and the embeddings of the three speakers in a
two-dimensional space
so we have
speaker a
with embeddings
distributed in this area
speaker b with embeddings in this area and speaker c with embeddings in
this area
if we are using some anchor from speaker a, the worst case would be
to have it here on the border
and then the biggest distance to a positive example would be to have it
here on the other side
and the smallest distance to a negative example would be to take
something here
so simply we want this
distance to the positive example
plus some margin to be smaller than the distance from the
anchor to the negative example
so it's okay
in this situation
consider then speaker c, which has a wider
distribution of data; now if we have an anchor here
we need the
distance to the closest speaker to be
bigger than the internal distance
plus some margin
and that's the case in this figure, so the triplet loss is completely fine with
this situation
but if we want to
do
verification on data that is distributed in this way, then
if we want to have good
performance on target trials from speaker c
we need to accept
trials as target trials whenever we have a smaller distance than this, otherwise we will
have some errors for target trials of speaker c
but this means that if we have a threshold like this, we will
have confusion between
speaker a and b
again, of course there could be ways to compensate for this in one way or another, but
it's just to show that optimizing this
metric is not
going to lead to optimal
performance for verification
so if we try to summarise a little bit the idea of task-specific training:
minimizing identification error doesn't necessarily minimize verification error
but of course i was showing this on toy examples, and the reality
is much more complicated
we usually don't optimize classification error but rather the cross entropy
or something like that
and we may use some loss to encourage a margin
between the speaker embeddings
and maybe the assumptions that i made about the
distributions here are
not very realistic at all
and it's maybe not completely clear
what would happen with new test speakers that were not in the training set
so what i want to say is that this should not be interpreted as
some kind of proof that other objectives would fail; maybe they would even be
really good
it's just to say that it's not really
completely justified to use them
and this is of course something that ideally should be studied much more
in the future
so we discussed that end-to-end training has some good motivation
but still it's not really the most popular strategy for building speaker recognition systems today
at least my impression is that multiclass training is
still the most popular
why is that? well, there are many difficulties with end-to-end training
it seems
to be more prone to overfitting
we have issues with statistical dependence of training
trials, which we will go into in more detail
it's also maybe questionable how we should train the system
when we want to
handle many enrollment utterances; i will also mention a bit about that
so
the issue
one of the issues with using a kind of verification objective, let's call it that
where we are comparing
two utterances and want to say whether it's the same speaker or not
is that
the data that
we use has
statistical dependences; let me explain why
so
generally, this idea of minimizing some training loss assumes that
the training data
are independent samples from whatever distribution they come from
and this is often the case, i mean we often have data that has been independently collected
but in speaker verification
the data is
a pair of an enrollment utterance and a test utterance, and the label is
indicating whether it's a target trial or a non-target trial
y equal to one for target trials and y equal to minus one for non-target trials
the issue here is that
typically, at least if we have a limited amount of training data
we create
many trials
from the same speaker and from the same utterance, so each of the speakers and utterances
is used in many different trials, and then
these
trials, which are the training data
are not
statistically independent
which is something that the training procedure assumes they are
this can be a problem; exactly how big the problem is
i think is still something that needs to be investigated more, but let's elaborate a little
bit on what happens
here i wrote down the training objective that we would use for a
kind of verification loss, when we train the system for verification
it looks
complicated maybe, but it's not really anything special; it's just the average training loss of
target trials here and the average training loss of
non-target trials here, and they are weighted with factors, the
probability of target trials and the probability of non-target trials, which are
parameters that we use to
steer the system to fit
better for the application that we are interested in
what we hope is that this would minimize the expected loss of
target trials and non-target trials
weighted with these
probabilities of target trials and non-target trials
on some unseen data
this loss function here is often the cross entropy but could be other things
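A minimal sketch of such an objective, assuming the per-trial loss is the binary cross entropy applied to the LLR plus the prior log-odds. That particular form of the loss is an assumption; the talk only says the loss is often the cross entropy. Labels follow the convention above, +1 for target and -1 for non-target trials.

```python
import numpy as np

def logistic_loss(llr, y, prior_log_odds=0.0):
    """Binary cross entropy for labels y in {+1, -1}.

    The posterior log-odds are taken to be llr + prior_log_odds (assumed form).
    """
    return np.logaddexp(0.0, -y * (llr + prior_log_odds))

def verification_objective(llr, y, p_target=0.5):
    """P_tar * mean loss over target trials + (1 - P_tar) * mean over non-targets."""
    llr = np.asarray(llr, dtype=float)
    y = np.asarray(y)
    prior_log_odds = np.log(p_target) - np.log(1.0 - p_target)
    losses = logistic_loss(llr, y, prior_log_odds)
    tar_loss = losses[y == 1].mean()
    non_loss = losses[y == -1].mean()
    return float(p_target * tar_loss + (1.0 - p_target) * non_loss)

# Uninformative scores (LLR = 0) versus well-separated scores.
objective_uninformative = verification_objective([0.0, 0.0], [1, -1])
objective_good = verification_objective([5.0, -5.0], [1, -1])
```

Changing `p_target` shifts how much the objective cares about target versus non-target errors, which is how the system is steered towards the intended application.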
so what are the desirable properties of a training objective
we have
r hat, which is the
empirical risk, the training loss
since the training data
can be assumed to be generated from some probability distribution, this r hat is also
a random variable
and we want it
to be close
to the
expected loss
where the expectation is calculated according to the true probability distribution of the data
and this for every value of theta
in this figure
the expected loss is this black line here
let's say
we have some training set, the blue one
and we check the average loss as a function of theta
it may look like this
for another training set it may look like this, the red line, and a third one
would be
the purple one; so the point is that it's a little bit random and
it's not going to be exactly like the expected loss
but ideally it should be close to it, because if we find a theta
that minimizes the training loss, for example here in the case of the
red training set
we know that it will also be a good value for the
expected loss, which means the loss on unseen test data
so we want the
training loss
as a function of the model parameters
to be close to the expected loss for all values of theta
in order to study the effect of
statistical dependences in the training data in this context
we write the
training objective slightly more generally than before
it's the same as before, but now for each trial
we have a weight beta
and if we set beta to one over n then it would be the
same as before, but now we consider that we can choose some other values of the
trial weights
for the
training trials
we want
the training objective, the weighted average training loss, to have an expected value which is the
same as the expected value
of the loss on test data, so it should be an unbiased estimator of
the test loss or the expected loss
and we also want it to be good in the sense that it has
a small variance
the expected value of the training loss is just calculated like this, so we
end up with the sum of the weights times the expected value of the loss
and this is exactly
what we usually denote r
so in order for this to be
unbiased we simply want the sum of the weights to be one
and of course this would be the case when we use the standard choice of
beta, which is one over n, the number of
trials in the training data
the variance
of this empirical loss
is gonna look like this
it's the
weight vector for all the trials
times some matrix
times the weight vector
and this matrix is the covariance matrix for the losses of all trials
with this little t, which is one for the target trials or
minus one for the non-target trials
and one could derive that
the optimal
choice of
betas that would minimize this variance
looks like this
so this is what we can call the blue training objective
best linear
unbiased estimate
that's the meaning of blue, so this is the best linear unbiased estimate of the
test loss
using the training data to estimate what
the test loss would be
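The formula on the slide is not reproduced in this transcript, but the standard minimum-variance solution under the constraint that the weights sum to one is proportional to the inverse covariance matrix times a vector of ones. A minimal sketch with a made-up three-trial covariance matrix (the numbers are assumptions for illustration):

```python
import numpy as np

def blue_weights(cov):
    """Weights minimising w^T cov w subject to sum(w) == 1.

    Closed form: w = cov^{-1} 1 / (1^T cov^{-1} 1).
    """
    ones = np.ones(cov.shape[0])
    v = np.linalg.solve(cov, ones)
    return v / v.sum()

def weighted_loss_variance(weights, cov):
    """Variance of the weighted average loss: w^T cov w."""
    return float(weights @ cov @ weights)

# Three trials: the first two are strongly correlated, the third is independent.
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
w_blue = blue_weights(cov)
w_uniform = np.full(3, 1.0 / 3.0)
var_blue = weighted_loss_variance(w_blue, cov)
var_uniform = weighted_loss_variance(w_uniform, cov)
```

The two correlated trials share their information, so each gets less weight than the independent one, and the resulting estimate has lower variance than the uniform average.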
a detail about this is that we don't really need the covariances between the losses but
rather the correlations
if we assume the diagonal elements in this matrix are equal
then it turns out like this
and in practice we would assume that the
elements in this covariance matrix do not depend on theta, which
could be questioned
the objective that we discussed is not really specific to speaker verification;
whenever you have
dependences in the training data you could
use this idea
but
the structure of the covariance matrix
which describes the covariances of the losses on the training data
depends on the specific problem that you're studying
so now we will look into how to
create such a matrix for speaker verification
so here
we will use
a notation with an index i to denote the
i-th utterance of speaker x
so we will assume that the
correlation coefficients
depend on what the trials have in common; so for example
here we have a
trial of speaker a utterance one against speaker a utterance two, and some loss of
that, and another trial of speaker a utterance one against speaker a
utterance three, and some loss of that
and they have some correlation
because
they involve the same speaker
so we assume there is a correlation
coefficient denoted c
at least eight here
so in total we have these kinds of situations in verification; if we consider target trials
you could have the situation that
well okay let's look here
two target trials which have one utterance in common; this is a target trial
of speaker a
and here we have utterance one against utterance two, and here you have utterance one against
utterance three, so utterance one is used in both
trials; there is some correlation between these trials
then there is the case where there is no common utterance but the speaker is still the same, and this is as
opposed to the situation where
you have
a trial of speaker a and a trial of speaker b; they have nothing in common
so we assume here the correlation is zero
for such trials
for the non-target trials you have a more complicated situation, but all possible situations are listed here
for example
you may have that
you have one
utterance in common
so we have this utterance in common and in addition to that
this speaker is in common; that's what i mean with this notation here
and so on
and if we have such
correlation coefficients, one can derive the optimal weights; for target trials of
a speaker with this many utterances
it's going to look like this
the exact form is maybe not so important, but
we should note that one can
derive
how to
give the weight to this speaker, and it depends on how many utterances
the speaker has
for the non-target trials the formula is more complex
if the trial involves speaker a and speaker b, it
depends on how many
utterances each speaker has
then comes the issue of how to estimate the correlation coefficients; one could look at the empirical correlations
of some trained model
or we could
learn them somehow
which we will mention briefly later, or we can just make some assumption and
tune it; so for example one simple assumption is to set
this first correlation coefficient of target trials to alpha, and this one, which we assume
should be smaller, to alpha squared
and then
tune alpha in this range, and similarly for the non-target trials
just to get some idea of how this would change the weights
for target trials
we see here, on the x-axis is the number of utterances for the speaker
on the y-axis here we have the corresponding weight
for different values of these correlations; so if the correlation is
small
even when we have many utterances, up to twenty here, we will still give reasonable
weight to each utterance
but if the correlation is large
then we will not give so much weight to
each utterance when a speaker has many utterances
which means that the total
weight for this speaker is not going to increase much even if it has a
lot of utterances
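The effect can be checked numerically without the closed-form expression from the slide. The sketch below builds the correlation matrix of all target trials for a few speakers under the assumption just described (alpha for trials sharing one utterance, alpha squared for trials of the same speaker sharing none, zero across speakers) and then solves for the minimum-variance weights. The function names and the example speaker sizes are made up for illustration.

```python
import itertools
import numpy as np

def target_trial_correlation(n_utts_per_speaker, c_shared_utt, c_same_spk):
    """List all target trials and their assumed loss correlation matrix."""
    trials = []
    for spk, n_utts in enumerate(n_utts_per_speaker):
        for i, j in itertools.combinations(range(n_utts), 2):
            trials.append((spk, i, j))
    corr = np.eye(len(trials))
    for p, (s1, i1, j1) in enumerate(trials):
        for q, (s2, i2, j2) in enumerate(trials):
            if p == q or s1 != s2:
                continue  # different speakers: assumed uncorrelated
            shared = len({i1, j1} & {i2, j2})
            corr[p, q] = c_shared_utt if shared == 1 else c_same_spk
    return trials, corr

def blue_weights(corr):
    """Minimum-variance weights that sum to one (equal loss variances assumed)."""
    v = np.linalg.solve(corr, np.ones(corr.shape[0]))
    return v / v.sum()

alpha = 0.5
# One speaker with 2 utterances (1 trial), one with 5 utterances (10 trials).
trials, corr = target_trial_correlation([2, 5], alpha, alpha ** 2)
weights = blue_weights(corr)
weight_small_speaker = weights[0]
weight_large_speaker = weights[1:].sum()
```

With these assumed values the speaker with ten trials receives only about twice the total weight of the speaker with one trial, which is the saturation effect described above.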
in the past i was exploring a little bit how large
these kinds of correlations really are
this was on an i-vector system with plda scores
here in the first
in this
column here
it's a
plda model trained with the em algorithm, and then the scores from that system
with affine calibration
and the other column here is for discriminatively trained plda
so the main thing to observe here is that we
do have
correlations between trials that have for example an utterance in common, and so on
and the correlations can be quite large in some situations
so these
problems seem to exist
and doing this kind of correlation compensation, and again this is on the
discriminative
plda
it does help a bit
so it's something to
possibly take into account
of course these results are for discriminative plda, where we train a
plda model
using all the trials
that can be constructed from the training set, but of course the same
problem with the dependences exists also in end-to-end systems
now, some problems that we could encounter if we try to do this
well, the
results or the
compensation formulas that we derived
were assuming that
all trials
that can be created from the training set are used equally often, which is the
case if you train a backend like plda
discriminatively and you use all the trials
but when
we train a kind of end-to-end system involving neural networks
we use mini-batches; one could achieve this situation by
making a
list of trials
and then we just sample trials from it: here is a trial, this speaker
compared to this one, another trial is this speaker compared to this one, and so on
and this is a
long list of all trials that can be formed, and then we just
select some of them into the mini-batch
the point is of course that if we have these speakers like this
in the mini-batch and we compare this one with this one
this one with this one and so on
we are not using all the trials that we have
we are for example not comparing this one with this one in the mini-batch
and that's maybe a bit of a waste, because we are anyway using this deep
neural network to produce the embeddings, and so once we have
produced the embeddings we can just as well use all of them in the scoring part
as well
but then
we will have a little bit different balance
of the trials
globally compared to what we had before
so the formulas that we derived wouldn't be exactly valid in this situation
the question then is: if we do decide that all the segments
that we extract embeddings for
that we have in the mini-batch, if we want to use all of them
in the scoring, how are we going to select
the data for the mini-batch
there can be different strategies here
we could consider for example
strategy a:
select some speakers
and then for each speaker we take all the segments that they have
let's say that this red speaker has
three segments and this yellow speaker has
four segments
and then
we can consider all the trials, so we can have
segment one of the red speaker scored against segment two, segment one scored against segment
three, and so on
we don't use the diagonal because we don't consider
segments scored against themselves
and the score here is just the same as here
scoring segment two
against segment one
this would be one way; another way would be strategy b:
select speakers but then just select two utterances for each speaker in the mini-batch
so you will have just one target trial for each speaker
the difference here is that
we have
we are going to have
fewer target trials
overall in the mini-batch, but all of them will be from different speakers, and
we will have target trials from more speakers
it's not exactly clear what would be the right thing, but some informal experiments
we have done
suggest that strategy b is better
then again, the formulas that we derived before for how to weight trials are not completely valid
they were not derived on the assumption that we are doing it like this
so they are not exact
and they need to be modified a bit, and i will come to that
in a minute
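The two strategies can be sketched as follows. The data layout (a dictionary from speaker to utterance ids), the function names and the use of Python's `random` module are assumptions for illustration; a real data loader would work on features or file lists.

```python
import itertools
import random

def sample_batch(utts_by_speaker, n_speakers, strategy, rng):
    """Return a list of (speaker, utterance) pairs for one mini-batch.

    Strategy "A": all utterances of each selected speaker.
    Strategy "B": exactly two utterances per selected speaker.
    """
    speakers = rng.sample(sorted(utts_by_speaker), n_speakers)
    batch = []
    for spk in speakers:
        utts = utts_by_speaker[spk]
        if strategy == "B":
            utts = rng.sample(utts, 2)
        batch.extend((spk, utt) for utt in utts)
    return batch

def all_trials(batch):
    """Score every segment against every other one.

    The diagonal (a segment against itself) and the symmetric duplicates
    are skipped; the label is +1 for target and -1 for non-target trials.
    """
    trials = []
    for (s1, u1), (s2, u2) in itertools.combinations(batch, 2):
        label = 1 if s1 == s2 else -1
        trials.append(((s1, u1), (s2, u2), label))
    return trials

# The example from the talk: a red speaker with three segments and a
# yellow speaker with four segments (utterance ids are made up).
utts_by_speaker = {"red": ["r1", "r2", "r3"],
                   "yellow": ["y1", "y2", "y3", "y4"]}
rng = random.Random(0)
trials_a = all_trials(sample_batch(utts_by_speaker, 2, "A", rng))
trials_b = all_trials(sample_batch(utts_by_speaker, 2, "B", rng))
n_target_a = sum(label == 1 for _, _, label in trials_a)
n_target_b = sum(label == 1 for _, _, label in trials_b)
```

For a fixed mini-batch size, strategy B fits more speakers into the batch, so its target trials come from more speakers even though there are fewer of them per speaker.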
the second problem that can occur in end-to-end training
with respect to these issues is that
we do want
to have
a system that can deal with multi-session enrollment
and
of course multi-session trials
can be handled with an end-to-end system, as we discussed in the beginning
by having some pooling over the enrollment utterances
but how to create the training data is again a little bit complicated
already in the case of single-session trials we had a complicated situation with many
different kinds of dependences that can occur and so on, and in the multi-session case
it's going to be even more
complicated, because you can have situations like
for example these two could be the enrollment and this is the test, and another
trial where
these two are the enrollment
and this is the test; then you have one utterance in common here
or you can have a more extreme situation where both enrollment utterances
in two trials are the same but the test utterance is different
so the number of possible dependences that can occur is way larger
and i think it's
very difficult to derive some kind of formula for how the trials should be weighted
so to deal both with the fact that we're using mini-batches
and with multi-session trials, and to estimate proper trial weights
for that, maybe one strategy can be to learn them; this is not something
i tried, i just think it's
something that maybe should be tried
so we can define
a training loss
again as an average of losses over the training data with some weights
and we also need to use a development loss
which is an average over
another set, the average of the loss over the development set
and these weights here should depend only on the number of utterances of the speaker
or speakers involved in that trial
then one can imagine some scheme like this
we send both training and development data through the neural
network and we get some
training loss and some
development loss
as usual we estimate the
gradient, here we take the gradient with respect to the model parameters
of the training loss
and
this gradient is now a function of the trial weights
and we can update
the model parameters, still keeping in mind that their new values are a function
of the
trial weights
the training trial weights
and then
we can calculate
on the development set
the gradient
with respect to these training weights
and then
use this to update
the training trial weights
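The scheme just described can be sketched in a few lines. This is only an illustration under stated assumptions, not something from the talk: the "network" is replaced by a plain logistic model on hypothetical trial features, the names `meta_step`, `beta` and `group_tr` are invented here, and each training trial gets the weight of its group, where the group index stands in for "number of utterances of the speakers involved". Because the updated model parameters are a linear function of the trial weights, the gradient of the development loss with respect to the weights can be written down directly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def per_trial_loss_and_grad(theta, x, y):
    """Logistic loss of every trial and its gradient w.r.t. theta.

    x: (n_trials, dim) trial features (stand-in for a real network),
    y: (n_trials,) labels, 1 = target trial, 0 = non-target trial.
    """
    p = sigmoid(x @ theta)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = (p - y)[:, None] * x            # one gradient row per trial
    return loss, grad

def meta_step(theta, beta, x_tr, y_tr, group_tr, x_dev, y_dev, lr=0.1, lr_w=0.1):
    """One model update, then one update of the training trial weights."""
    alpha = beta[group_tr]                 # weight of each training trial
    _, g_tr = per_trial_loss_and_grad(theta, x_tr, y_tr)
    # Gradient of the weighted training loss (1/n) * sum_i alpha_i * loss_i.
    grad_theta = (alpha[:, None] * g_tr).mean(axis=0)
    theta_new = theta - lr * grad_theta    # theta_new depends on the weights
    # Development loss at the updated parameters and its gradient.
    dev_loss, g_dev = per_trial_loss_and_grad(theta_new, x_dev, y_dev)
    grad_dev_theta = g_dev.mean(axis=0)
    # Chain rule: d theta_new / d alpha_i = -(lr / n) * g_tr[i].
    grad_alpha = -(lr / len(y_tr)) * (g_tr @ grad_dev_theta)
    # Trials in the same group share a weight, so their gradients add up.
    grad_beta = np.bincount(group_tr, weights=grad_alpha, minlength=len(beta))
    beta_new = beta - lr_w * grad_beta
    return theta_new, beta_new, dev_loss.mean(), grad_beta
```

A real version would probably also need to keep the weights positive and normalized, which this sketch does not attempt.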
a second thing
to explore
or like a final note on this
statistical dependence issue is that
we just
discussed some ideas for balancing the training trials for better optimization
but for example in the case when all speakers have the same
number of utterances
this rebalancing has no effect
still of course there are dependencies there, so one would think shouldn't we
do something more than just rebalance the training data
and one possibility that i think would be worth exploring
is to
assume the following
the covariance of
two scores of
trials of a speaker
which have
one utterance
in common should be bigger than
the covariance between two trials
of this
speaker which have
no utterance in common
which should be bigger than the covariance between
target trials of different speakers, this should be zero actually
so one could consider regularizing the model to behave in that way
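Written as a formula, with notation that is assumed here and not taken from the slides: let s_{ab} be the score of the target trial formed by utterances a and b, where a, b, c, d are distinct utterances of one speaker and a', b' are utterances of a different speaker. The assumed ordering is then

```latex
\operatorname{Cov}\left(s_{ab},\, s_{ac}\right)
\;>\;
\operatorname{Cov}\left(s_{ab},\, s_{cd}\right)
\;>\;
\operatorname{Cov}\left(s_{ab},\, s_{a'b'}\right) = 0
```

A regularizer along these lines would penalize a model whose score covariances, estimated on training trials, violate this ordering.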
so now
after discussing the issues with
end-to-end training
i will briefly mention some
papers
on end-to-end
training, and this should not be considered as any kind of literature review or
as describing the best architectures or anything like that
it is
just a few selected papers that illustrate some points
some of which have good take-away messages about end-to-end training
so this paper called end-to-end text-dependent speaker verification, as far as i know, was
the first paper on end-to-end training in speaker verification
and it uses a network like this, or some architecture like this, the features go in
through
a neural network and in the end
this network is going to say
is it the same
speaker or not
the important thing here is that
the input is of fixed size
so the input to the neural network has the feature dimension times the number of frames
that is, the duration
and there was no temporal pooling, which is
done in many other situations
and this is suitable
when you do text-dependent speaker verification as they did in this paper
because this means that
the network is kind of aware of the word and phoneme order
i would say that the main conclusion from this paper is that
the verification loss was better than the identification loss
especially when you have big amounts of training data, for small amounts of training
data there was
not as big a difference
and one can also say that t-norm could
to a large extent make
the models trained with these two losses more similar
but i would still say that this kind of suggests the verification loss is beneficial
if you have large amounts of training data
so this is another paper
they were doing
text-independent speaker verification and here
the difference from the other one is that they do have a temporal pooling layer
that would kind of remove the dependence on the order of the input
to some extent at least, and it is maybe a more suitable architecture for text-
independent speaker verification
and this was compared to an i-vector plda baseline, and it was found
that a really large amount of training data is needed even to beat something like an
i-vector plda system
some study that we did and
it was
use also again text independent speaker recognition or verification
but trained on smaller amount of data and to make it work we instead constrained
these neural network here this big and time system to behave
something like a
another i-vector and p lda baseline so we cannot constrain did not to be two
different from the
i-vector purely a baseline
we found there that training model blocks jointly with their verification also was improving
so as can be seen here
little bit regrettably we data as a separate
clearly whether that improvement came from the fact that we were doing joint training
or the fact that we were
using the verification loss
another interesting thing here is that
we found that
training with the verification loss requires very large batches
and this was an experiment done only on the
scoring part, the plda, discriminative plda
so if we train the discriminative plda with
bfgs using full batches
so not a mini-batch
training scheme
we achieve something
like this on the development set
this dashed
blue line
whereas if we train with adam with mini-batches of different sizes
up to five thousand
we see that we need really big batches to actually
get close to the bfgs-
trained model, which was trained on full batches
so that kind of suggests that you really need to have many trials
within the mini-batch for
training this kind of
system with a verification loss, which is a bit of a problem and maybe a
challenge to deal with
in the future
this is a more recent paper and the interesting point of this paper was that
they did train the whole system
all the way from the waveform instead of from features as the others did
but
i couldn't
understand completely whether the improvement came from the fact that they were
training from the waveform or if it was because of
the choice of architecture and so on
but it's interesting that
systems going
all the way from the waveform to the end
can work well
and this is a paper
from this year's
interspeech, it's interesting because
it's one of the more recent studies that really showed some good
performance from using the verification loss
here it was joint
training, so they were training using both the identification loss
and the verification loss
and that's actually something i have tried myself without seeing any
benefit from it, but one thing they did here was to
start with a large weight for the identification loss and gradually
increase the weight for the verification loss, which i think is interesting and maybe
actually the right way to go
i'm curious about it
now comes just a little bit of summary of this talk
we discussed the motivation for end-to-end training
we said that it has some good motivation
we showed some
or we referred to some
experimental results, our own and also those of others
which show that it seems to work quite well for text-dependent tasks with large amounts
of training data
in such a case it's probably preferable to preserve the temporal structure, to avoid
the temporal pooling
in text-independent benchmarks one would need strong regularization or a mixed
training objective in order to benefit from
end-to-end training, and typically we would want to do some temporal pooling there
one could guess that end-to-end training would be the preferable choice in scenarios where we
have many training speakers with few utterances, where we have less of the statistical dependence issue
there are some things that to me seem to be open questions, and which it would be
great if someone explored
it is difficult actually to train an end-to-end system, especially for the text-independent
task
is this because of overfitting, or training convergence, or the dependency issue we discussed
it's not really clear i would say
a practical question is how to adapt such systems, because in the more blockwise systems we
would often adapt the back end
or could we train the system in a way that we don't need adaptation
and also how could we input some human knowledge about speech into this training, if
we need it
something we know about the data distribution or the number of phonemes or so
and we discussed that maybe
training a model for speaker identification is not ideal for speaker verification, but is there
some way to
find embeddings that are good for all these tasks
another interesting question is
how well
the llrs that come from
end-to-end
architectures could simulate the true llr
so in other words what kind of
distributions could be
arbitrarily accurately simulated or modeled by these architectures
that is not completely clear either
okay so
thank you for your attention
hello, this is [name unclear] and now i will present the hands-on session for the
end-to-end speaker verification tutorial
i will talk about
two things, first
i will go through the code that i am using in
most of my experiments
after that i will go through a few tricks to solve various
implementation issues
that i have used
okay so
the code for the end-to-end system, so this is code that i started to work
on during my postdoc at BUT, from two thousand sixteen
initially it was in theano but it has now been in tensorflow for a while
and i think it is
time to switch to tensorflow two or pytorch or something
the link to the repository is here
and most of the stuff in this repository is
for multiclass training, mostly i use multiclass training, maybe in
combination with other stuff
but there is
not so much
that uses
pure end-to-end training with the verification loss
the papers that we wrote on end-to-end training were actually based on old code in
theano, and i think there is not so much point to
maintain that any more
but i do have one script here that uses the verification loss in
combination with the identification loss, so that's the script we will look at
and generally
well, first, i'm trying to point out things in this code that
i think are sensible and worked well, and also mention
what i would
do differently
to maybe give so
well at least i can say from like stressful as good an allpass time
small toolkit for speaker verification
i note that i didn't see any benefit here from adding the verification
loss to the identification loss
contrary to the paper i mentioned in the tutorial
it could be that their quite complicated scheme for changing the balance between the losses
throughout the training is really needed, this may be something i will look at some time
and this screen i think units
you want to try to instances where only you know little normal way
the in the local but you don't want running in the not here unique feel
a little bit with the intention because
cantonese in such a way that it's
here three but
some small adjustment might be needed if you actually want to run it here
i tried, when organising my experiments, to do it in the way that
there is one script where everything that is specific to the experiment is set, so that
includes which data to use and the configuration of the model and so on
i found it
inefficient to have
input arguments to this script for which data to use and so on, because anyway you
will almost always have to change something in this script
for a new experiment, so then you can just as well change the data there and
so on
as well
but other things that are a little bit more fixed across experiments, these are
just loaded from this code
such as models, different architectures and so on
so usually i use an underscore suffix to denote tensor variables, underscore v for placeholders
and so on
the kind of
models are
similar to keras models, maybe a little bit less
fancy
i didn't use keras initially because when i started with this years ago
keras was not flexible enough for what i wanted to do
with this, but now it is definitely flexible enough
for example here is the file where the features are, these are things that
maybe one would think should be input arguments, but as it
seems anyway necessary to change things in this script for every experiment i prefer to
define things here
so here are some settings for the training data
how long the shortest and the longest segments that we train on are
some other parameters related to training, batch size
maximum number of epochs
number of batches in an epoch, so i don't really define
an epoch as one pass over the data but by defining the number of batches that
it consists of, i will come to that in a minute
also patience, probably most of you are familiar with it
you train or
what it is score
so the next part of the script is the part for defining how to load
and prepare data
and here one important point is that
in the batches we will
want chunks of features from different utterances, so randomly selected segments
if you load data from a normal hard disk and randomly select different segments from
different utterances
this will be much too slow
so often
people prepare archives ahead of time that
contain the
mini-batches
another way, which i use in all my experiments, is to keep the
data on ssd, and then it can be loaded as you wish, feature chunks can
be loaded randomly fast enough for that
so this is
good because it allows for much more flexibility in experiments, for example sometimes
you may want to load two segments from the same utterance, that is what one
wants for some experiments
or sometimes you just want to change the duration of the segments
if you use archives then you have to prepare new archives for this
so i would say that
using ssds
and then just loading features as you go is
a very good thing, and an ssd is really a good thing to invest in if
you want to
do this kind of experiments
i define some functions, for example a function for loading training features, which
given some list of files will load the data
so if you want to load the data in some specific way and so on, again
you define that here, and if you want to do for example the thing
i mentioned, to load two segments from the same utterance, then you would
have to change the function here
so this was quite a
useful way of organising it, for me at least, in my experiments
another important thing in this part is that it creates dictionaries of mappings, for example an utterance-to-speaker mapping
and so on
and that's
created here
and these
mappings are used to create the mini-batches
and a little bit later down here i create a generator for mini-batches and
it takes this dictionary of
mappings, the utterance-to-speaker mapping and so on, and i have different generators depending
on what kind of mini-batches i want, for example you may want
randomly selected speakers and all their data, or you may want randomly selected speakers
and for example two utterances each or something like that
so that is changed by changing the generator
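As a rough illustration of such a generator, here is a minimal sketch of the "randomly selected speakers with a fixed number of utterances each" case. It is not the code from the repository: the function names, the dictionary layout and the `seed` argument are assumptions made for this example.

```python
import random
from collections import defaultdict

def speaker_to_utterances(utt2spk):
    """Invert an utterance-to-speaker mapping into speaker -> list of utterances."""
    spk2utt = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utt[spk].append(utt)
    return dict(spk2utt)

def batch_generator(spk2utt, n_speakers, n_utts_per_speaker, seed=0):
    """Yield mini-batches as lists of (utterance, speaker) pairs, forever.

    Each batch holds n_speakers randomly chosen speakers and
    n_utts_per_speaker randomly chosen utterances from each of them.
    """
    rng = random.Random(seed)
    # Only speakers with enough utterances can be used.
    eligible = sorted(s for s, utts in spk2utt.items()
                      if len(utts) >= n_utts_per_speaker)
    if len(eligible) < n_speakers:
        raise ValueError("not enough speakers with enough utterances")
    while True:
        batch = []
        for spk in rng.sample(eligible, n_speakers):
            for utt in rng.sample(spk2utt[spk], n_utts_per_speaker):
                batch.append((utt, spk))
        yield batch
```

Switching to another sampling strategy, such as all data of the selected speakers, would then only mean swapping in a different generator function while the rest of the training script stays the same.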
then the next step is to
set up the model
and here i'm using
a tdnn like the kaldi x-vector architecture
and i also have a plda model
on top, to score the embeddings from this
x-vector extractor
which we need to
do the verification
i should mention one minor difference from the kaldi architecture, which is that i found it necessary
to have some kind of normalization layer after the temporal pooling, batch norm or just
a fixed normalization estimated on the data at the beginning works fine
as well
i guess the reason this is needed here could be that we use a simpler optimizer
we use just stochastic gradient descent, as compared to kaldi which uses something more
advanced
so in this part comes the
definition of the architecture, like the number of layers, their sizes
activation functions
and so on
whether we should have normalization of the features, normalization after the pooling
and whether these should be updated
or only estimated on the data at initialization
options for regularization and so on
we initialize the model here and we provide
when we do this the iterator or the generator for the
training data, and this is used to initialize the model's normalization layers
this is something that creates a bit of a mess and i probably would do it
differently if i were writing it now
maybe some dummy initialization and then afterwards
before starting the training you just
initialize the normalization layers
if we apply the model to the data, which is in these placeholders
what comes out will be the embeddings and the classifications
and so on
the embeddings, in this particular script, we will send to
the plda model
basically here
we make some settings for the
probabilistic lda model, and we can get the scores
for all pairwise comparisons
in the batch, and also a loss for that if we provide
labels for it
so the next part is to define the loss and train functions and so on
we have the loss as a weighted combination of the classification loss and
the verification loss
well here's in binary and their average fits weights in the original one point five
and still one seventy five respectively
and maybe one important thing here is that these losses are normalized
by minus the
log of the probability, one over the number of speakers
i mean the probability for a random classification, a random guess
and the reason to do this is that
if the model is just initialized with random parameters then the loss
will be one or approximately one
and we do the same thing for the verification loss
this means that all these losses start at a similar value and it
becomes easier to choose how to interpolate between them
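To make the normalization concrete, here is a small sketch. It assumes the plain forms of the two losses: multiclass cross-entropy divided by log of the number of speakers, and binary cross-entropy on the trial scores divided by log 2. Both divisors are the loss of a uniform random guess. The function names and the interpolation weights `w_id` and `w_ver` are placeholders, not the values used in the script, and any special weighting of target versus non-target trials is left out.

```python
import numpy as np

def normalized_cross_entropy(logits, labels):
    """Multiclass cross-entropy divided by log(number of classes).

    logits: (n, n_classes) array, labels: (n,) integer class indices.
    """
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    return ce / np.log(logits.shape[1])

def normalized_verification_loss(llr, is_target):
    """Binary cross-entropy on trial scores divided by log(2).

    llr: (n_trials,) scores, is_target: (n_trials,) boolean array.
    """
    # log(1 + exp(-llr)) for target trials, log(1 + exp(llr)) for non-targets.
    signed = np.where(is_target, -llr, llr)
    bce = np.logaddexp(0.0, signed).mean()
    return bce / np.log(2.0)

def combined_loss(logits, labels, llr, is_target, w_id=0.5, w_ver=0.5):
    """Interpolate the two normalized losses; both are about 1 at a random start."""
    return (w_id * normalized_cross_entropy(logits, labels)
            + w_ver * normalized_verification_loss(llr, is_target))
```

With all-zero logits and all-zero scores, which is what a model that knows nothing would output, both normalized losses are exactly one regardless of the number of speakers.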
at the end of this part of the script we define a training function which takes
the data for a batch and does one update of
the model
the next part is for
defining functions for setting parameters and getting parameters of the model
and a function to
check some kind of validation loss after each epoch
so this part is just for setting parameters and getting parameters
and here we define the
function for checking the validation loss
finally the training is done by this
function here, which takes the
function
for checking the validation loss and takes many other parameters
and things that we defined
for example the function for training and so on
so the way we train here is basically
in epochs, which are defined as
a fixed number of batches
and this is because we don't really have epochs in the usual sense, we just continuously
draw random segments
so
there's no really clear idea of what an epoch
is
but anyway
we do training, and if the validation loss doesn't improve after an epoch we
try a few more times, up to patience number of times, and if it
still doesn't improve we will
reset the parameters to the best ones and halve the learning rate
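The schedule can be sketched roughly as follows. This is a generic outline, not the function from the repository: all argument names are made up for the example, `train_epoch(lr)` is assumed to run a fixed number of mini-batches, and `get_params` is assumed to return a copy of the parameters. The stopping rule based on `min_lr` is an addition for the sketch.

```python
def train_with_patience(train_epoch, validation_loss, get_params, set_params,
                        lr=0.1, patience=3, max_epochs=100,
                        lr_factor=0.5, min_lr=1e-4):
    """Train in 'epochs'; on repeated failure go back to the best model and
    reduce the learning rate."""
    best_loss = validation_loss()
    best_params = get_params()
    fails = 0
    for _ in range(max_epochs):
        train_epoch(lr)                       # a fixed number of mini-batches
        loss = validation_loss()
        if loss < best_loss:
            # Improvement: remember this model and reset the failure counter.
            best_loss, best_params, fails = loss, get_params(), 0
        else:
            fails += 1
            if fails > patience:
                # Give up on this learning rate: restore the best parameters
                # and continue from there with a smaller step.
                set_params(best_params)
                lr *= lr_factor
                fails = 0
                if lr < min_lr:
                    break
    set_params(best_params)
    return best_loss, lr
```

Passing the training and validation steps in as functions mirrors the organisation described above, where the experiment script defines them and hands them to one generic training routine.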
i don't know if this is the best
scheme but it has worked well enough for me
going on, i would like to mention a few tricks
not very complicated things
but they were maybe slightly difficult for me to figure out
they are related to back propagation and the things i wanted to modify there
so let's just first briefly review the back propagation algorithm
you know that a neural network is just
a series of affine transformations, each followed by a nonlinearity, then again an affine transformation and again a nonlinearity
and so on
so to some input we apply an affine transformation
the result is z here, and then we apply some nonlinearity and
the result is a, the activation, and we do that over and over
until we get the final
output, and then we have some cost function
on that, for example cross entropy
and if we use function composition notation, which basically means
that the composition of g and h is just
applying h on the data and then g, then we know that we can
write the whole neural network as
applying the first affine transformation to the input
next the first nonlinearity
all the way
to the output
it can be written like this
and it is also easy to write the
gradient of the
loss with respect to the input using the chain rule
so it's just
basically the derivative of c with respect to the input is just
a chain like this, the derivative of c with respect to a, times the derivative of a with respect to z, times the derivative of
this z with respect to the previous a, and so on
so i have these
funny brackets here just to denote that these are
jacobians, so the multivariate chain rule looks the
same as the scalar one, just that we need to use jacobians instead of
normal derivatives
so for example
the derivative of c with respect to a
this jacobian is really just a vector, because c is a scalar, so
it has elements like these here
the jacobian of a with respect to z
is just going to be a diagonal matrix, because f is a function applied
elementwise
and the other one, the jacobian of
z with respect to a
if we look at this thing here we will see
that this is just the weight matrix
so then back propagation is
okay we start by calculating the
derivative of c
with respect to the last z
and that's just these two factors
and then
we can
continue with
getting the derivative of c with respect to some earlier z by just taking the one
that we have and multiplying it with, for example, these two, and then we get the
next one and so on
so it's
a recursive process like that, and it gives us the derivative of the loss with
respect to the input, but of course what we want is the derivative with respect to the
model parameters, which are the
biases and the weights
which we have
here and here
those are given by these expressions here
for the biases it is just this
expression down here
for the weights we need to multiply with the corresponding part of the
activation a, so if we are interested in the derivative with respect to this weight
we need to multiply with the corresponding part of a
okay so that was back propagation in brief
and here are also two really good references for this if you want to
go further into it
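For readers who prefer code to slides, the recursion can be written out in a few lines of NumPy. This is a generic textbook sketch, not code from the repository. It assumes tanh hidden layers, a linear output layer and a squared-error cost, chosen only to keep the example short, and the names `forward`, `backward` and `cost` are made up here.

```python
import numpy as np

def forward(x, weights, biases):
    """Return the pre-activations z and the activations a (a[0] is the input)."""
    a, zs = [x], []
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a[-1] + b                       # affine transformation
        zs.append(z)
        # tanh on hidden layers, identity on the output layer
        a.append(np.tanh(z) if i < len(weights) - 1 else z)
    return zs, a

def cost(output, target):
    return 0.5 * np.sum((output - target) ** 2)

def backward(x, target, weights, biases):
    """Gradients of the cost w.r.t. all weight matrices and bias vectors."""
    zs, a = forward(x, weights, biases)
    delta = a[-1] - target                      # dC/dz for the last layer
    grads_W, grads_b = [], []
    for i in reversed(range(len(weights))):
        grads_b.append(delta)                   # dC/db is just dC/dz
        grads_W.append(np.outer(delta, a[i]))   # dC/dW needs the layer input a
        if i > 0:
            # dz_i/da_i is the weight matrix, da_i/dz_{i-1} is diagonal
            # with entries 1 - tanh(z)^2.
            delta = (weights[i].T @ delta) * (1.0 - np.tanh(zs[i - 1]) ** 2)
    return grads_W[::-1], grads_b[::-1]
```

Note that `backward` needs every activation in `a` from the forward pass, which is exactly the memory issue discussed next.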
now that we have
mentioned this
i would say
there are a few different issues that i ran into that require some
little bit of thinking in relation to this
the first thing is that you see here that in order to calculate the derivative with respect to the
weights you need the outputs of each layer, the a here
and so that means that we need to keep all of those in memory from the
forward pass, as they are needed in the backward pass, and if you
have big batches, many utterances, also long utterances, this can become too much
it can go up to many gigabytes for example
for larger batches
theano and tensorflow have a kind of built-in way of getting around this
and that is that you
loop over the data
and then they have the option, in the case of tensorflow or in the case of
theano, you have the option to discard the
intermediate outputs from the forward pass, and then in the backward pass you
will recalculate them when you need them, so you basically just have them
in memory for one part of the data at a time
tensorflow has the same thing but a little bit better, because instead of
discarding them you can move them to the cpu memory, which is generally bigger
than the gpu memory
in that case
to use this we can
loop over the inputs up until the pooling layer, and after the pooling layer we
put all these
outputs together so that we have now a kind of
tensor with all the pooled outputs
and then that can be processed normally
and then you would just calculate the loss and ask for the gradient
though you have to think carefully about it
this of course also has the advantage that we can have segments of different durations
in the batch, which otherwise
would be a bit complicated or maybe not even possible
i'm not showing the code
because it includes so many other things, which makes it very difficult
to see what's going on
i have it does not seventeen
i was hoping to write some small example but i didn't manage
to do it in time
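In that spirit, here is a small NumPy sketch of the idea. It is not the code from the repository and it does not use the looping constructs of theano or tensorflow. It assumes a toy model: one tanh frame-level layer `W1`, mean pooling over frames, and a linear scorer `v` with a squared-error loss, all invented for the example. The first pass keeps only the pooled vector of each utterance, and the second pass recomputes the frame-level activations one utterance at a time to finish the back propagation.

```python
import numpy as np

def pooled(x, W1):
    """Frame-level layer followed by mean pooling; x has shape (frames, dim)."""
    return np.tanh(x @ W1).mean(axis=0)

def two_pass_gradients(utts, y, W1, v):
    """Loss and gradients without storing frame-level activations for the batch.

    utts: list of (frames_u, dim) arrays, which may differ in length,
    y: (n_utts,) targets, W1: (dim, hidden), v: (hidden,).
    """
    n = len(utts)
    # Pass 1: forward up to the pooling layer, keep only the pooled vectors.
    M = np.stack([pooled(x, W1) for x in utts])         # (n_utts, hidden)
    # Everything after pooling is small and is processed normally.
    err = (M @ v - y) / n
    loss = 0.5 * np.sum((M @ v - y) ** 2) / n
    grad_v = M.T @ err
    grad_M = np.outer(err, v)                           # d loss / d pooled
    # Pass 2: recompute the activations per utterance and back-propagate.
    grad_W1 = np.zeros_like(W1)
    for x, g in zip(utts, grad_M):
        h = np.tanh(x @ W1)                             # recomputed, then dropped
        grad_z = (g / len(x)) * (1.0 - h ** 2)          # mean pooling, then tanh
        grad_W1 += x.T @ grad_z
    return loss, grad_W1, grad_v
```

Only one utterance's frame-level activations are alive at any time, at the price of computing the frame-level part twice, and utterances of different lengths need no padding.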
okay so that's one trick
a second
trick is related to parallelization
suppose that we have some architecture like this, with features going in, and then we
have pooling
and then we have some processing of the embeddings and finally scoring
now if we want to
well normally if we want to do parallelization when training with some multiclass loss
it isn't really a problem, because we just distribute the data to different
workers, each of them calculates some gradients and we can average the gradients
or we can average the updated models
but in this case, when we do use the verification loss
in the scoring part we would like to have a comparison of all trials, all
possible trials
so we need to do
the tdnn and these things on the individual workers, then send all the embeddings
to the master where we do the scoring
there we do back propagation down to the embeddings and then we send those
derivatives to each worker
and they can continue the
back propagation
the thing is this is not exactly equivalent to the normal case, when you
calculate the loss here and then you
do back propagation all the way to the input
here you basically have the derivative of the loss with respect to the embeddings
and the question is how to use that to
continue the back propagation on
the individual nodes
one simple trick to do this is to define a new loss
like this here, so i define a new loss which is just
the derivative of
the cost
with respect to the embedding elements, treated as constants
times the embeddings, just like a dot product like this
and if we now
optimize this loss we will get
what we want
let's consider the derivative of this loss with respect to some
model parameter of the neural network
okay we just
take the derivative in here
this factor here does not depend on the parameter
so we arrive at this here, and this is exactly the
derivative that we are
looking for, so
the derivative
of this loss with respect to the model parameters will be exactly the same as for
the loss that we are interested in
it is possible that some newer toolkits have
ways to do this without using such a trick, i'm not sure
this was how i did it to achieve this
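A small NumPy sketch may make the trick easier to follow. It is an illustration under invented assumptions, not the repository code: embeddings are `tanh(W x)`, the score of a trial is the dot product of two embeddings, and the cost is a logistic loss averaged over all pairs in the batch. The master computes the cost and its derivative `G` with respect to the embeddings. Each worker then differentiates the surrogate loss, the sum over its own embeddings of `G` times the embedding with `G` held constant, with respect to the shared parameters `W`.

```python
import numpy as np

def embed(W, x):
    """Worker side: x is (n, d_in), W is (d_emb, d_in); returns (n, d_emb)."""
    return np.tanh(x @ W.T)

def verification_cost_and_grad(E, spk):
    """Master side: cost over all pairs and its derivative w.r.t. the embeddings."""
    S = E @ E.T                                        # scores of all pairs
    t = np.where(spk[:, None] == spk[None, :], 1.0, -1.0)
    mask = np.triu(np.ones_like(S), k=1)               # each pair once, no self-trials
    n_trials = mask.sum()
    cost = (mask * np.logaddexp(0.0, -t * S)).sum() / n_trials
    dS = mask * (-t) / (1.0 + np.exp(t * S)) / n_trials   # d cost / d S
    G = (dS + dS.T) @ E                                # d cost / d E
    return cost, G

def worker_gradient(W, x, G):
    """Gradient w.r.t. W of the surrogate loss sum_j G[j] . embed(W, x)[j].

    G is the block of the master's gradient that belongs to this worker's
    utterances and is treated as a constant.
    """
    e = embed(W, x)
    return (G * (1.0 - e ** 2)).T @ x
```

Adding up the worker gradients over all shards gives the same result as differentiating the verification cost of the whole batch in one process, which is the point of the trick.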
the final trick is
related to
something that kaldi does to repair saturated relu units
relu is a common activation function, so let us remember we have an affine transformation
followed by an
activation function, and if it's the relu, the problem is one of the
following
whenever the input is below zero the relu will
output zero, so if all inputs are
below zero then this relu is basically never outputting anything, it is dead
and there is also the opposite problem, if the input is always above zero
then the relu is just a linear unit, so what we really want is that
the input tends to be
sometimes
positive and sometimes negative, then the relu is really behaving
nonlinearly and
the network is doing something interesting
so how kaldi handles this is that it checks if a relu unit has a problem
like this and in that case
they will add
some little offset
to
the derivative of c with respect to z
a problem with doing this in some of the standard neural network toolkits is that we
don't really have an easy way to manipulate this derivative
which is used in the back propagation
so we will instead manipulate the derivatives with respect to the model parameters directly
and from the expressions
that we had
the derivative with respect to b is just
the same as the derivative with respect to z, so there you can just
add the offset
to what we usually get from the toolkit
and similarly for the weights, just that we also need to multiply the
offset with a, because that's how the derivative is calculated
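A rough NumPy sketch of this manipulation follows. It is not Kaldi's implementation and not the code from the repository. The function name, the detection rule (a unit is repaired only if it is off or on for every example in the batch) and the offset size `eps` are assumptions made for the illustration. The sign is chosen so that a gradient descent step pushes the bias of a dead unit up and the bias of an always-on unit down.

```python
import numpy as np

def relu_repair_gradients(a_prev, z, grad_W, grad_b, eps=1e-3):
    """Add the effect of a small offset on dC/dz to the parameter gradients.

    a_prev: (batch, d_in) inputs to the layer,
    z:      (batch, d_out) pre-activations, z = a_prev @ W.T + b,
    grad_W: (d_out, d_in) and grad_b: (d_out,) gradients from the toolkit.
    """
    active = (z > 0).mean(axis=0)        # fraction of the batch where the unit is on
    offset = np.zeros(z.shape[1])
    offset[active == 0.0] = -eps         # dead unit: descent will increase z
    offset[active == 1.0] = +eps         # always on: descent will decrease z
    # The same offset is added to dC/dz for every example in the batch.
    d_z = np.tile(offset, (z.shape[0], 1))
    # dC/db sums dC/dz over the batch, dC/dW also multiplies by the input a.
    grad_b = grad_b + d_z.sum(axis=0)
    grad_W = grad_W + d_z.T @ a_prev
    return grad_W, grad_b
```

Units that are sometimes on and sometimes off get a zero offset, so their gradients are passed through unchanged.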
so these were some small tricks, and as a summary i can say
that it is quite helpful when you work with neural networks to
understand back propagation properly so that you know what's going on
and then you can easily do small fixes like these
so that's
all from the hands-on session, thank you for your attention and bye