okay, the last talk
I'm going to present our work on i-vector transformation and scaling for PLDA-based recognition. The goal of this work is to present a way to transform our i-vectors so that they better fit the PLDA assumptions, and at the same time to introduce a way to perform some sort of dataset mismatch compensation, similar to what length normalization does for PLDA.
So, as we all know, PLDA assumes that the latent variables are Gaussian, and therefore the resulting i-vectors, if we assume they are independently sampled, should follow a Gaussian distribution.
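For reference, a common simplified (Gaussian) PLDA formulation of that assumption, written in my own notation rather than necessarily the one used in the paper, is:

```latex
% i-vector j of speaker i; all latent variables are Gaussian
\phi_{ij} = \mu + V y_i + \epsilon_{ij},
\qquad y_i \sim \mathcal{N}(0, I),
\quad \epsilon_{ij} \sim \mathcal{N}(0, \Sigma_w)
```

so marginally the i-vectors themselves should look Gaussian, $\phi \sim \mathcal{N}(\mu,\, V V^{\top} + \Sigma_w)$.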
Now, we all know this is not really the case. Indeed, we have two main problems with this model. The first is that our i-vectors do not really look like they should if they were samples from a Gaussian distribution. For example, here on the right I'm plotting one dimension of the i-vectors, the dimension with the highest skewness; I plot its histogram, and it's quite clear that the histogram doesn't resemble anything like a Gaussian distribution; it's even almost multimodal. The other problem is that we have a quite evident mismatch between development and evaluation vectors. For example, on the left there is a plot of the histogram of the squared i-vector norms for both our development set, which is our SRE female set, and our evaluation set, which is the female part of condition 5 of SRE 2010. We can see two things: first of all, the distributions for the evaluation and development sets are quite different from each other, and second, neither of them resembles what we should expect if these i-vectors were really sampled from a standard normal distribution.
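To make the "what we should expect" concrete: if a d-dimensional vector really were standard normal, its squared norm would follow a chi-squared distribution with d degrees of freedom, concentrated around d. A minimal sketch, with a placeholder dimensionality and synthetic data standing in for real i-vectors:

```python
import numpy as np
from scipy import stats

d = 400                          # i-vector dimensionality (placeholder)
w = np.random.randn(10000, d)    # synthetic stand-in; real i-vectors are not Gaussian

sq_norms = np.sum(w ** 2, axis=1)
# For truly standard-normal vectors, ||w||^2 ~ chi2(d): mean d, variance 2d.
print(sq_norms.mean(), sq_norms.var())
print(stats.chi2(df=d).mean(), stats.chi2(df=d).var())
# Plotting this statistic for real development and evaluation i-vectors gives
# two histograms that differ both from each other and from chi2(d).
```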
Now, up to now there have been mainly two ways to approach the issues I've presented. The first one is heavy-tailed PLDA, presented yesterday by Patrick Kenny, which mainly tries to deal with the non-Gaussian behaviour: it drops the Gaussian assumptions and assumes that the i-vector distributions are heavy-tailed. The second one is length normalization, which in our opinion is not really making things more Gaussian; it is mainly dealing with the dataset mismatch that we have between evaluation and development i-vectors. Indeed, here I'm doing the same plot as before on the most skewed dimension of the i-vectors, before and after length normalization, and we can see that even if we apply length normalization, it cannot compensate for things like the multimodal distribution of the original i-vectors. It may well compensate for heavy-tailed behaviour, that's for sure, but we still don't get something which is really Gaussian-like.
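For reference, length normalization itself is a one-line operation; a minimal sketch of one common convention (the exact scaling convention varies across papers) is:

```python
import numpy as np

def length_normalize(W):
    """Project each row onto the sphere of radius sqrt(d).

    W: (n, d) array of (typically centered and whitened) i-vectors.
    Dividing by ||w|| alone, i.e. the unit sphere, is the other common convention.
    """
    d = W.shape[1]
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return np.sqrt(d) * W / norms
```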
Now, in this work we want to address both problems. On one side, we want to transform the i-vectors so that they better fit the PLDA assumptions, that is, we try to somewhat Gaussianize our i-vectors. At the same time, we propose a way to perform dataset compensation similar to length normalization, the difference being that this dataset compensation is tailored to our transformation, and we estimate both at the same time.
Okay, so how do we do this? Let's first focus on how we transform the i-vectors so that they better fit the Gaussian assumption. To do that, we assume that the i-vectors are sampled from a random variable phi whose pdf we don't know; however, we assume that we can express this random variable phi as a function f of a standard normal random variable. If we do it like this, then we can express the pdf of this random variable phi as the normal pdf evaluated on the samples transformed through the inverse of f, times a term involving the determinant of the Jacobian of the transformation (the log-determinant, if we work in the log domain). The good thing is that we can do two things with this model. First of all, we can estimate the function f so as to maximize the likelihood of our i-vectors, and in that way we obtain a pdf for the i-vectors which is not standard Gaussian anymore but depends on the transformation. Second, we can also employ this function to transform the i-vectors, so that samples which follow the distribution of phi are mapped to samples which follow a standard normal distribution.
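In formulas, this is the standard change of variables (my notation): with $\phi = f(z)$, $z \sim \mathcal{N}(0, I)$ and $f$ invertible,

```latex
\log p_{\phi}(w)
  = \log \mathcal{N}\!\left(f^{-1}(w);\, 0,\, I\right)
  + \log \left| \det J_{f^{-1}}(w) \right|
```

Maximizing this over the parameters of f gives the density model, while applying the inverse of f to the i-vectors maps them towards standard-normal samples.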
Now, to model these unknown functions we decided to follow a framework which is quite similar to the neural network framework. That is, we assume that we can express the transformation function as a composition of several simple functions, which can be interpreted as layers of a neural network. The only constraints we have with respect to a standard neural network are that our layers have the same input and output size, and the transformation they produce needs to be invertible. As we said, we perform maximum likelihood estimation of the parameters of the transformation, and then, instead of using the pdf directly, we use the transformation function to map our i-vectors back to, let's say, Gaussian-distributed i-vectors.
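To make this concrete, here is a minimal illustrative sketch of the composition framework (my own code, not the authors' implementation; only the affine layer is written out, and the paper's specific nonlinearity would be another layer with the same interface):

```python
import numpy as np

class AffineLayer:
    """Invertible affine layer x = A z + b (the 'weights' of the network)."""
    def __init__(self, A, b):
        self.A, self.b = A, b

    def forward(self, z):
        return z @ self.A.T + self.b

    def inverse(self, x):
        z = np.linalg.solve(self.A, (x - self.b).T).T
        # Jacobian of the inverse map is A^{-1}, so log|det| = -log|det A|
        logdet = -np.linalg.slogdet(self.A)[1]
        return z, np.full(x.shape[0], logdet)

class Flow:
    """f = L_k o ... o L_1; i-vectors are modelled as w = f(z), z ~ N(0, I)."""
    def __init__(self, layers):
        self.layers = layers

    def gaussianize(self, w):
        """Map i-vectors back towards standard-normal samples, collecting log-dets."""
        z, total_logdet = w, 0.0
        for layer in reversed(self.layers):
            z, logdet = layer.inverse(z)
            total_logdet = total_logdet + logdet
        return z, total_logdet

    def log_likelihood(self, w):
        """log p(w) = log N(f^{-1}(w); 0, I) + log|det J_{f^{-1}}(w)|, per sample."""
        z, total_logdet = self.gaussianize(w)
        d = w.shape[1]
        log_std_normal = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * d * np.log(2.0 * np.pi)
        return log_std_normal + total_logdet
```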
Here I have a small example on one-dimensional data. This is again the most skewed component of our training i-vectors. On the top left is the original histogram, and on the right I plot the transformation that we estimated. As you can see from the top left, if we directly use the transformation to evaluate the log pdf of the original i-vectors, we actually obtain a pdf which very closely matches the histogram of our i-vectors. Then, if we apply the inverse transformation to these data points, we obtain what you see in the bottom plot here. And what does that show? It shows that we managed to obtain a histogram of transformed i-vectors which very closely matches the Gaussian pdf, which is also plotted; I don't know if it's visible, but the pdf of the standard normal lies pretty much on top of the histogram of the transformed vectors.
Now, in this work we decided to use a simple selection for our layers. In particular, we have one kind of layer which does just an affine transformation, that is, we can interpret it as the weights of a neural network, and a second kind of layer which performs the nonlinearity. The reason we chose this particular kind of nonlinearity is that it has nice properties: for example, with just a single layer we can already represent pdfs of random variables which are, let's say, skewed and heavy-tailed. If we add more layers we increase the modelling capabilities of the approach, although this creates some problems of overfitting, as I will discuss later.
Now, on the other side, we use a maximum likelihood criterion to estimate the transformation, and the nice thing is that we can use a general-purpose optimizer to which we provide the objective function and the gradients. These gradients can be computed with an algorithm which resembles quite closely the backpropagation with mean squared error of a neural network. The main difference is that we also need to take into account the contribution of the log-determinant, which increases the complexity of the training, but the training time is pretty much the same as what we would have with a standard neural network.
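To give a flavour of that extra term: for an affine layer $x = Az + b$ the Jacobian is constant, so its contribution to the objective and the corresponding gradient (the part that plain MSE backpropagation does not have) are

```latex
\log\left|\det J\right| = \log\left|\det A\right|,
\qquad
\frac{\partial}{\partial A}\,\log\left|\det A\right| = A^{-\top}
```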
Now, this is a first set of experiments. Here we still don't couple the model with length normalization or any other kind of compensation approach; what I'm showing is what happens when we estimate this transformation on our training data and apply it to transform our i-vectors. On the left are the same histograms of the squared norms I was presenting before, and on the right are the squared norms of the transformed i-vectors; here I'm using a transformation with just one nonlinear layer. Of course, as we can see, the squared norm is still not exactly what we would expect from standard-normally distributed samples, but it matches our expectation more closely, and more importantly, we also somehow reduce the mismatch between the evaluation and development squared norms, which means that our i-vectors are more similar. This gets reflected in the results: on the first and second lines you have PLDA and the same PLDA trained with the transformed i-vectors, in both cases without any kind of length normalization, and we can see that our model achieves much better performance than standard PLDA. On the last line, though, we can still see that length normalization compensates for the dataset mismatch better, which allows PLDA with length-normalized i-vectors to perform better than our model.
Right. So, the next part is: how can we incorporate this kind of preprocessing into our model? Of course we could simply length-normalize the transformed i-vectors, but we can do better by cascading this kind of compensation directly into our model.
To this end, we first need to give a different interpretation of length normalization. In particular, we need to see length normalization as the maximum likelihood solution of a quite simple model, where our i-vectors are not i.i.d. anymore, in the sense that we assume that each i-vector is sampled from a different random variable whose distribution is normal: they all share the same mean and the same covariance matrix, but this covariance matrix is scaled for each i-vector by a scalar. This is quite similar to a heavy-tailed distribution, but instead of putting priors on these scaling terms we just optimize them at their maximum likelihood solution. Now, if we perform a two-step optimization, where we first estimate the shared parameters assuming that the alpha terms are one, and then fix the shared parameters and estimate the optimal alpha terms, we end up with something which is very similar to length normalization: indeed, the estimated scaling is the norm of the whitened i-vector divided by the square root of the dimensionality of the i-vectors. Now, why is this interesting? Because this random variable can be represented as a transformation of a standard normal random variable, where the transformation has a parameter which is i-vector dependent. If we estimate this parameter using an iterative strategy, where we first estimate the Sigma and then the alphas, and we then apply the inverse transformation, we recover exactly what we are doing right now with length normalization.
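Written out (in my notation, which may differ from the paper's), the model and the closed-form scale estimate are

```latex
w_i \sim \mathcal{N}\!\left(\mu,\ \alpha_i \Sigma\right),
\qquad
\hat{\alpha}_i = \frac{\lVert \tilde{w}_i \rVert^{2}}{d},
\qquad
\tilde{w}_i = \Sigma^{-1/2}\,(w_i - \mu)
```

and dividing the whitened i-vector by $\sqrt{\hat{\alpha}_i}$ gives $\sqrt{d}\,\tilde{w}_i / \lVert \tilde{w}_i \rVert$, which is exactly a length-normalized i-vector.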
So this tells us how to implement a similar strategy in our model. We introduce what we call the scaling layer, which has a single parameter, and this parameter is i-vector dependent, so for each i-vector we have to estimate its maximum likelihood solution. Our transformation is then the cascade of this scaling layer and what we were proposing before, that is, the composition of affine and nonlinearity layers. There is one comment here: in order to efficiently train this model we still resort to a sort of alternating training, that is, we first estimate the shared parameters, then we fix the shared parameters and optimize the alphas. One more thing we need to take into account is that, at test time, while with the original model we don't need to do anything other than transform the i-vectors, with this model we also need to estimate, for each i-vector, the optimal scaling factor.
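As an illustration of this test-time estimation, here is a hypothetical sketch that simply does a one-dimensional maximum likelihood search for each i-vector, reusing the Flow object sketched earlier (the paper uses its own iterative ML updates, which are not spelled out here):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_log_alpha(flow, w):
    """Per-i-vector ML estimate of the scaling parameter (illustrative only).

    The scaling layer divides the i-vector by sqrt(alpha) before the shared
    transformation; we search over log(alpha) for the value that maximizes
    the change-of-variables log-likelihood of this single i-vector.
    """
    d = w.shape[0]

    def neg_loglik(log_alpha):
        scaled = (w * np.exp(-0.5 * log_alpha))[None, :]
        # log-likelihood of the scaled vector, plus the log-det contribution
        # of the scaling itself, which is -d/2 * log(alpha)
        return -(flow.log_likelihood(scaled)[0] - 0.5 * d * log_alpha)

    res = minimize_scalar(neg_loglik, bounds=(-5.0, 5.0), method="bounded")
    return res.x
```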
However, this gives us a great improvement. As you can see, the first line is the same I was presenting before, and the last three lines are PLDA with length normalization, then PLDA with our transformation and the alpha scaling estimated with one iteration, and then with three iterations of the alpha estimation. As you can see, the model with three iterations clearly outperforms PLDA with length normalization in all conditions on the SRE 2010 female dataset.
Now, to get to the conclusions. We investigated an approach to estimate a transformation which allows us to modify our i-vectors so that they better fit the PLDA assumptions: if we apply this transformation, we obtain i-vectors which are more Gaussian-like. We also incorporated into the model a way to perform a compensation which is similar in spirit to length normalization, but is tailored to the particular layers that we are using in the transformation. The transformation is estimated using a maximum likelihood criterion, and the transformation function itself is implemented using a framework which is very similar to that of neural networks, as we said, with some constraints, because we want our layers to be invertible and such that we can compute, or at least guarantee the existence of, the log-determinant of their Jacobians. This approach allows us to improve the results, as I showed, in terms of performance on the SRE 2010 data; we also have experiments in the paper, which I don't report here, showing that it also works on NIST SRE 2012 data.
There is one caveat, as I said before: here we are using a single-layer transformation. The reason is that this kind of model tends to overfit quite easily, so our first experiments with more than one nonlinear layer were not very satisfactory, in the sense that they were decreasing the performance. We are now managing to get interesting results by changing things in two ways. The first is changing the kind of nonlinearity, adding some constraints inside the function itself which limit this overfitting behaviour. On the other hand, we are also trying some structures where we impose constraints on the parameters of the transformation, which again reduces the overfitting behaviour and allows us to train models with more layers. Although, up to now we have obtained weird results, in the sense that we manage to train transformations which behave much better if we don't use the scaling term, but as soon as we insert the scaling term into the framework, we end up more or less with what was shown here. So we are still working to understand why we have this strange behaviour, where we can improve the performance of the transformation itself but cannot improve any further when we add the scaling term.
So, if there are some questions, we have time for a few.
How does this compare to just straight Gaussianization?
Okay, the question is how you would implement Gaussianization with these high-dimensional vectors; I mean, would you Gaussianize each dimension on its own? We tried something like that with this model, since the function itself can produce that kind of marginal Gaussianization, and when working with one-dimensional synthetic data drawn from many different kinds of distributions, the results were already much worse. So my guess is that it would not be sufficient to independently Gaussianize each dimension on its own.
But how do you know, sorry, unless you tried it and it didn't work?
No, I didn't try exactly that. What I tried is the same model I'm presenting here, with a transformation applied independently to each component, and in my experience, when I'm working on single one-dimensional data points, it Gaussianizes very well and does not show the overfitting problem, even when I model data coming from several different kinds of distributions.
Right, but exact Gaussianization does the inverse function; it's not an approximation to it.
No, but this makes, let's say, quite a close approximation to it, and what I get here doesn't work, so my guess is that approximating the real thing with marginal Gaussianization would still not work.
This approach does not use a common activation function for DNNs; what is the justification for the nonlinearity you have chosen?
First of all, the original transformation I was using is the one on the last slide, which it can be shown can be split into several layers, but it has several nice properties. First of all, it can represent the identity transformation, so if our data are already Gaussian, they are kept like that. Then it has some other nice properties, which are shown in some references in our paper: this kind of single-layer function can represent a whole set of distributions which are both skewed and heavy-tailed. So the reason we chose this kind of layer is essentially that it was already shown that it can model quite a broad family of distributions.
Thank you. I have two somewhat strange questions. The first one: is it possible to look at the estimated parameters and try to understand the characteristics of your training set, in terms of, let's say, the mismatch or dataset effects, for instance telephone effects?
What do you mean exactly?
Look at your transformation and try to understand, for example, the mismatch between or inside your training sets due to the presence of, say, different telephone channels.
Okay, so that it could be applied separately on different sets: if you have some way to model, to see, what the difference in your distributions is before and after the transformation, you could apply the same technique, transform two different sets independently, and see whether this captures the differences or not.
What I have here is that, pretty much, it looks like, at least if we consider that evaluation and development are two different sets with different distributions, the model is somehow able to partly compensate for that. The transformation itself is partly responsible for this, because, with heavy-tailed behaviour, it allows us to rescale the vectors whose norms are far from what we would expect and move them towards the middle of the distribution. On the other hand, the other thing which performs this compensation is the scaling, and that scaling is very similar to length normalization, except that it is not a separate transformation that I apply beforehand, blindly: I'm learning the transformation on my i-vectors, and I'm estimating the transformation and the scaling at the same time. That is the part which is, in my opinion, really responsible for compensating the mismatch between the datasets.
Then, another thing I can note is that it would be much better if we were using a richer model, with the speaker factors and the channel factors, a PLDA-like one, for example. The problem is that, already like this, it takes several hours, if not days, to train the transformation function; at test time it's very fast, but training is quite slow, and if we moved to a PLDA-style model, or if we wanted to train it differently, the training time would really explode in terms of computation, because we would need to consider the cases where the i-vectors are from the same speaker or not, and in that case the cost would grow: we would have something similar to what we have with uncertainty propagation, where you have to do that kind of computation for everything, but much worse.
Okay, it's just that, since the training needs to be done anyway, I wanted to see whether the parameters could be exploited as much as possible. My second question is related to the first one.
Is it possible somehow to use this approach to determine whether one single i-vector is in-domain or out-of-domain? So you could use it to detect, say, okay, my operating condition is such and such.
Probably not, really. Length normalization is not affected that much by this, but the problem with this approach is that if I have a really huge mismatch, then it gets amplified by the transformation itself, because the data points I'm transforming are not where they should be, so the way I apply the nonlinear function is probably going to increase my mismatch instead of reducing it. So up to some point this still works better than standard length normalization, but past some point, with a big mismatch between datasets, it does nothing but get worse.
Thank you.
Okay, let's thank the speaker again.