Hello, my name is Ville Vestman, and in this video I describe our work on neural i-vectors. This work was co-authored by Kong Aik Lee and Tomi Kinnunen. Tomi and I are from the University of Eastern Finland, and Kong Aik was, at the time of writing, with NEC.
Our study proposes a new way of combining Gaussian mixture model based generative i-vector models with discriminatively trained x-vector type speaker embeddings for the speaker verification task. Our aim is to improve upon existing i-vector systems, and we also hope to gain some insight into what causes the performance differences between generative i-vector systems and discriminatively trained DNN speaker embeddings. Our study also establishes a stronger connection between Gaussian mixture models and some of the existing DNN pooling layers.
As background for our work, let's look at four different constructs. The last three of the constructs shown here combine ideas from both i-vectors and DNNs. We pay special attention to the roles of the universal background models and i-vector extractors in all of these constructs.
Let's start with the standard i-vector construct. The key components here are the two generative models: the Gaussian mixture model based universal background model (UBM) and the i-vector extractor. The UBM is used together with the acoustic features to compute the sufficient statistics for the i-vector extractor, which then extracts the i-vectors. So here the features are rule-based, and the rest of the components are generatively trained.
Then, in the DNN i-vector construct, the universal background model is replaced by a DNN that takes acoustic features as input and produces senone posteriors as output. These posteriors are used together with the acoustic features to compute the sufficient statistics for the i-vector extractor. So this construct differs from the standard i-vector in that the model playing the role of the universal background model is discriminatively trained, with senone targets.
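To make this shared step concrete, here is a minimal NumPy sketch of how per-frame posteriors are turned into the zeroth- and first-order sufficient statistics; in the standard construct the posteriors come from the GMM-UBM, and in the DNN i-vector construct the same code would take the senone posteriors instead (illustrative code, not the actual implementation):

```python
import numpy as np

def sufficient_stats(X, posteriors):
    """X: (T, D) acoustic features; posteriors: (T, C) per-frame component
    posteriors, from either a GMM-UBM or a senone-classifier DNN."""
    N = posteriors.sum(axis=0)   # zeroth-order stats: soft frame counts, (C,)
    F = posteriors.T @ X         # first-order stats, (C, D)
    return N, F

def gmm_posteriors(X, weights, means, variances):
    """Responsibilities of a diagonal-covariance GMM-UBM, (T, C)."""
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)   # (C,)
    diff = X[:, None, :] - means[None, :, :]                          # (T, C, D)
    log_lik = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, C)
    log_post = np.log(weights) + log_lik
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)  # normalize
    return np.exp(log_post)
```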
The third construct is the end-to-end i-vector system. This system combines the three modules of an i-vector system into one neural network: a features-to-statistics module, a statistics-to-i-vectors module, and a third module that is responsible for scoring pairs of i-vectors.
The training of this kind of network goes as follows. First, the corresponding generative models are trained, and these are then used to initialize the individual modules, so the modules can benefit from the knowledge in the generative models. After the modules have been trained separately, they can be combined and then trained jointly. So this construct utilizes generative models in the initialization stage, while the final model uses discriminative training of the whole network.
The fourth and last background construct is a DNN with a mixture factor analysis (MFA) pooling layer. In that work, the authors used the network to extract utterance-level speaker embeddings. What is special about this construct is that they use their own pooling layer: this pooling layer is basically an i-vector extractor implemented inside the DNN. The MFA layer is based on a pooling layer known as the learned dictionary encoder (LDE), which I will describe later in this presentation. All the components of this last construct are discriminatively trained with speaker targets.
Okay, next we move on to the proposed neural i-vectors. Before explaining the construct itself, we need to go through some prerequisites for our model: the NetVLAD and LDE pooling layers. I will describe these two pooling layers and show how they relate to the standard GMM. So the next slides will be quite heavy on math.
First, NetVLAD. We will study the posterior computation formula of a standard GMM, and we will see how we get the NetVLAD formulation out of this equation. Here, C is the number of Gaussian components, and each component has a covariance matrix, a mean vector, and an associated weight. NetVLAD assumes shared covariance matrices for all Gaussian components. We can rewrite this formula into the following form by expanding the normal distributions. Then, by denoting the inverse covariance matrix times the mean vector by w_c, and by collecting the remaining component-specific terms into b_c, we get this. And this happens to be exactly the formulation used in the NetVLAD paper from 2016. So basically, when we share the covariance matrices across the components of a GMM, we get the same formulation as in NetVLAD.
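In symbols (my reconstruction of the slide, in standard GMM notation, with the shared covariance Σ):

```latex
\gamma_c(\mathbf{x}_t)
 = \frac{\pi_c\,\mathcal{N}(\mathbf{x}_t;\boldsymbol{\mu}_c,\boldsymbol{\Sigma})}
        {\sum_{k=1}^{C}\pi_k\,\mathcal{N}(\mathbf{x}_t;\boldsymbol{\mu}_k,\boldsymbol{\Sigma})}
 = \frac{e^{\mathbf{w}_c^{\top}\mathbf{x}_t + b_c}}
        {\sum_{k=1}^{C} e^{\mathbf{w}_k^{\top}\mathbf{x}_t + b_k}},
\qquad
\mathbf{w}_c = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c,\quad
b_c = \log\pi_c - \tfrac{1}{2}\boldsymbol{\mu}_c^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c
```

The quadratic term in x_t and the Gaussian normalizing constant are shared by all components, so they cancel in the ratio; that is what makes the softmax form possible.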
Okay, what about the learnable parameters? In NetVLAD, these are the vectors w_c, the biases b_c, and the mean vectors. When estimating these, there is one noteworthy detail: we see from the posterior computation formula that it does not depend on the mean vectors at all, which is quite an interesting difference compared to standard GMMs.
But anyway, after we have computed the posteriors for the input feature vectors, we can compute the component-wise outputs of the NetVLAD layer; the formulation is shown on the right side of the slide. In the numerator, we have the first-order centered sufficient statistics, and the denominator just length-normalizes them. So for each Gaussian component, we get one output vector. Finally, the NetVLAD layer concatenates the component-wise outputs to form a supervector. This is very similar to standard GMM supervectors and how they are formed.
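Written out, the per-component output and the supervector take this form (my notation for the slide content):

```latex
\mathbf{v}_c
 = \frac{\sum_{t=1}^{T}\gamma_c(\mathbf{x}_t)\,(\mathbf{x}_t-\boldsymbol{\mu}_c)}
        {\Bigl\lVert \sum_{t=1}^{T}\gamma_c(\mathbf{x}_t)\,(\mathbf{x}_t-\boldsymbol{\mu}_c) \Bigr\rVert_2},
\qquad
\mathbf{s} = \bigl[\mathbf{v}_1^{\top},\dots,\mathbf{v}_C^{\top}\bigr]^{\top}
```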
Okay, next we do the same for the learned dictionary encoder (LDE) pooling layer. So we start again from the GMM posterior computation formula. This time we rewrite it using one scalar term; we get this by expanding the normal distributions. Now, if we assume isotropic, or spherical, covariance matrices, this formula simplifies into this form, and this is the formulation used in the learned dictionary encoder pooling layer. Although, in the original LDE publication, the bias term b was not included; it was added later on by other authors. So the key point here was that by assuming isotropic covariance matrices, the LDE formulation follows from the standard GMM formulation.
The learnable parameters of this LDE layer are the scaling factors s_c for the covariances, the mean vectors, and the bias terms b_c. Similarly as with NetVLAD, we can then compute the component-wise outputs of the layer. Again, in the numerator we directly have the first-order centered sufficient statistics, but this time the denominator is different: it is the sum of the posteriors of each component. So this normalization is the same as in the traditional maximum likelihood update formula of GMM means. The component-wise outputs are then concatenated to form a supervector.
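Again in symbols (my reconstruction): the LDE soft assignment under isotropic covariances, with learnable scales s_c and biases b_c, and the per-component output whose denominator is exactly the zeroth-order statistic, as in the EM update of GMM means:

```latex
\gamma_c(\mathbf{x}_t)
 = \frac{e^{-s_c\lVert\mathbf{x}_t-\boldsymbol{\mu}_c\rVert^{2} + b_c}}
        {\sum_{k=1}^{C} e^{-s_k\lVert\mathbf{x}_t-\boldsymbol{\mu}_k\rVert^{2} + b_k}},
\qquad
\mathbf{e}_c
 = \frac{\sum_{t=1}^{T}\gamma_c(\mathbf{x}_t)\,(\mathbf{x}_t-\boldsymbol{\mu}_c)}
        {\sum_{t=1}^{T}\gamma_c(\mathbf{x}_t)}
```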
Okay, so now we have explained the necessary background constructs, and we can present the proposed neural i-vectors. We start with a standard deep embedding extractor architecture, and we replace its pooling layer with either the NetVLAD or the LDE encoder layer. As we saw on the previous slides, we can use these pooling layers to extract sufficient statistics. So we do that, and by using these sufficient statistics, we can train a regular i-vector extractor and then extract i-vectors from these statistics. So that is the idea.
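To make the last step concrete, here is a minimal NumPy sketch (illustrative, not the actual code) of the classical i-vector point estimate that such a generative extractor computes from the pooled statistics; T_mat is the total variability matrix, and diagonal covariances are assumed:

```python
import numpy as np

def extract_ivector(N, F, means, variances, T_mat):
    """N: (C,) zeroth-order stats; F: (C, D) first-order stats;
    means, variances: (C, D) component parameters; T_mat: (C, D, R)."""
    C, D, R = T_mat.shape
    L = np.eye(R)                               # posterior precision of the factor
    b = np.zeros(R)
    for c in range(C):
        TcSinv = T_mat[c].T / variances[c]      # T_c^T Sigma_c^{-1}, shape (R, D)
        L += N[c] * TcSinv @ T_mat[c]
        b += TcSinv @ (F[c] - N[c] * means[c])  # uses centered first-order stats
    return np.linalg.solve(L, b)                # posterior mean = the i-vector
```

With NetVLAD or LDE pooling, N and F come from the layer's soft assignments and its pre-pooling features rather than from a UBM.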
Now we can compare the constructs. How does our proposed construct differ from the MFA construct? The difference in the roles is that our i-vector extractor is generative; otherwise the construct is the same. If we compare the proposed neural i-vectors with the DNN i-vectors, we can see that the i-vector part is the same, but ours uses a DNN that has been trained with speaker targets, and also the features are obtained from the last layer before the pooling layer.
Next, let's move on to the experiments and results. We conducted speaker verification experiments on the Speakers in the Wild evaluation. First, we compare our results to the other i-vector systems' results that we found from the literature; these are some of the best ones. On the first line, we have a standard i-vector system, and on the second line, an i-vector system that uses perceptual linear prediction features together with additional features and a dereverberation frontend. We can see from these results that the neural i-vectors perform the best.
Okay. Let's next compare our results to the DNN speaker embeddings. We can use the same DNN either to extract the sufficient statistics for the neural i-vectors, or to extract the speaker embeddings directly from the DNN. Here, all of these are our own results.
On the first line, we have the DNN with the learned dictionary encoder pooling layer, with which we got a 2.102% equal error rate. Then come the corresponding neural i-vectors: that is, we used the same DNN to extract the sufficient statistics and then trained the generative i-vector extractor. With the neural i-vectors we got 2.193%, so not far behind.
On the third line, we have a modification of the learned dictionary encoder: it uses diagonal covariance matrices instead of isotropic covariance matrices. We got some improvement by making this modification. The last two lines show the corresponding results for the NetVLAD layer.
The interesting thing here, as I mentioned in the beginning, is what causes the performance difference between the generative i-vectors and the DNN embeddings, because these are using the same DNN. There are two possible sources for this difference. The first one is the difference between the generatively trained i-vector extractor and the layers after the pooling layer. Because after the pooling layer there is only one layer left, even for the DNN embeddings, only this small part seems to explain the difference in the equal error rates. So it seems that the discriminative training objective is better.
Okay, there is another possible reason for this performance difference: there is a slight mismatch between how we trained the DNN pooling layer and how we use it in the i-vector approach. You can see that in the DNN we explicitly form a supervector, while in the i-vector approach we do not. The i-vector approach utilizes the zeroth-order statistics, so it knows how many frames are aligned to each of the Gaussian components; this information is missing from the supervector approach. So one piece of future work is to adjust the DNN pooling layer so that it resembles the i-vector approach more closely, so that this mismatch will be gone.
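The mismatch can be seen from the formulas: the supervector keeps only the normalized first-order information, while the i-vector point estimate also weights each component by its soft frame count N_c (below, f̃_c denotes the centered first-order statistics and T_c the component's block of the total variability matrix):

```latex
\mathbf{w}
 = \Bigl(\mathbf{I} + \sum_{c=1}^{C} N_c\,\mathbf{T}_c^{\top}\boldsymbol{\Sigma}_c^{-1}\mathbf{T}_c\Bigr)^{-1}
   \sum_{c=1}^{C}\mathbf{T}_c^{\top}\boldsymbol{\Sigma}_c^{-1}\tilde{\mathbf{f}}_c,
\qquad
N_c = \sum_{t=1}^{T}\gamma_c(\mathbf{x}_t)
```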
Another idea for future work is explained here. Instead of extracting the sufficient statistics, we could let the DNN take the role of the universal background model by using the posteriors from the pooling layer. By doing this, we would have a neural GMM-UBM system with frame-based scoring. This might be useful for some special applications in speaker verification.
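For reference, the classical GMM-UBM verification score is the average per-frame log-likelihood ratio between the target-speaker model and the UBM; a neural GMM-UBM would compute the same quantity using the pooling-layer mixture:

```latex
\text{score}
 = \frac{1}{T}\sum_{t=1}^{T}
   \Bigl[\log p(\mathbf{x}_t \mid \lambda_{\text{target}})
       - \log p(\mathbf{x}_t \mid \lambda_{\text{UBM}})\Bigr]
```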
Before I finish, I have two related announcements. The first one is that the program codes are available. We have the i-vector extractor and the PLDA systems, and in addition to the Speakers in the Wild setup, we have also the other recipes available. The code is Python and PyTorch based, and we hope it can be useful for further research. The second announcement is that this study was also included in my doctoral dissertation, and I will be defending it in the coming weeks. Anyone who wants to join is welcome to do so; the details can be found here. See you there!