Hello, my name is Ville Vestman, and in this video I describe our work on neural i-vectors. This work was co-authored with Kong Aik Lee and Tomi Kinnunen. Tomi and I are from the University of Eastern Finland, and Kong Aik was with NEC at the time of writing.

Our study proposes a new way of combining generative, Gaussian mixture model (GMM) based i-vector models with discriminatively trained deep neural network (DNN) speaker embeddings for the speaker verification task. Our aim is to improve upon existing i-vector systems, and we also hope to gain some insight into what causes the performance differences between generative i-vector speaker embeddings and discriminative DNN speaker embeddings.

Our study also establishes the connections between Gaussian mixture models and some of the existing DNN pooling layers.

As background for our work, let's look at four different i-vector-related constructs. The last three constructs presented here combine ideas from both i-vectors and DNNs. We pay special attention to the roles of the universal background models and the i-vector extractors in all of these constructs.

Let's start with the standard i-vector. The key components here are the two generative models: the GMM-based universal background model (UBM) and the i-vector extractor. The UBM is used together with the MFCC features to compute the sufficient statistics, and the i-vector extractor then extracts i-vectors from these statistics. Note that the features here are rule-based (hand-crafted), while the rest of the components are generatively trained.
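To make this concrete, here is a minimal NumPy sketch (illustrative only, not our released code) of how the sufficient statistics are formed once some model has produced per-frame component posteriors:

```python
import numpy as np

def sufficient_stats(X, posteriors):
    """Baum-Welch sufficient statistics for one utterance.

    X          : (T, D) acoustic feature frames (e.g., MFCCs)
    posteriors : (T, C) per-frame component posteriors; in the standard
                 i-vector these come from the GMM-based UBM
    """
    N = posteriors.sum(axis=0)  # zeroth-order stats: soft frame counts, (C,)
    F = posteriors.T @ X        # first-order stats: soft frame sums, (C, D)
    return N, F
```

The i-vector extractor sees an utterance only through these statistics, which is why swapping the model that produces the posteriors, as the following constructs do, leaves the rest of the pipeline unchanged.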

Then, in the DNN i-vectors construct, the universal background model is replaced by a DNN that takes acoustic features as input and produces senone posteriors as output. These posteriors are used together with the MFCC features to compute the sufficient statistics for the i-vector extractor. So this construct differs from the standard i-vector in that the universal background model is replaced by a discriminatively trained DNN.

The third construct is an end-to-end DNN i-vector system. This system combines three modules, each implemented as a neural network: the first maps features to sufficient statistics, the second maps the statistics to i-vectors, and the third module is responsible for scoring pairs of i-vectors. Training of this kind of network goes as follows. The corresponding generative models are first used to train the individual modules; in short, each module is trained to mimic the output of its generative counterpart. After the modules have been trained separately, they can be combined and trained jointly. So this construct utilizes the generative models in the initialization stage, while the final, discriminative training covers the whole network.

The fourth and last background construct is a DNN with a mixture factor analysis (MFA) pooling layer. In this construct, the authors used a TDNN-based architecture to extract speaker embeddings. What is special about it is the MFA pooling layer: this pooling layer is basically an i-vector extractor implemented inside the DNN. The MFA-based pooling layer resembles the learned dictionary encoder (LDE) pooling layer that we will describe later in these slides. All of the components of this last construct are discriminatively trained with speaker targets.

Okay, next we move on to the proposed neural i-vectors. Before explaining the construct itself, we need to go through two prerequisites for our model: the NetVLAD and the LDE pooling layers. I will describe these two pooling layers by showing how they relate to the standard GMM. The next slides will be quite formula-heavy.

So, first, NetVLAD. We will start from the posterior computation formula of a standard GMM and see how we can get the NetVLAD formulation out of it. Here, C denotes the number of Gaussian components, and each Gaussian component has a covariance matrix, a mean vector, and an associated weight. NetVLAD assumes shared covariance matrices across the Gaussian components. We rewrite this formula in the following form by expanding the normal distributions.

Then, by denoting the inverse-covariance-times-mean term by a weight vector and the remaining scalar terms by a bias, we get this, and this happens to be exactly the formulation used in the NetVLAD paper from 2016. So we have basically shown that if we tie the covariance matrices of the GMM components, we get the same formulation as in NetVLAD.
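In symbols (my notation here, which may differ slightly from the slide), the derivation is:

```latex
\gamma_c(\mathbf{x})
  = \frac{\pi_c\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_c,\boldsymbol{\Sigma})}
         {\sum_{k=1}^{C}\pi_k\,\mathcal{N}(\mathbf{x};\boldsymbol{\mu}_k,\boldsymbol{\Sigma})}
  = \frac{\exp\!\big(\mathbf{w}_c^{\mathsf{T}}\mathbf{x} + b_c\big)}
         {\sum_{k=1}^{C}\exp\!\big(\mathbf{w}_k^{\mathsf{T}}\mathbf{x} + b_k\big)},
\qquad
\mathbf{w}_c = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c,
\quad
b_c = \log\pi_c - \tfrac{1}{2}\boldsymbol{\mu}_c^{\mathsf{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c .
```

The shared quadratic term cancels between the numerator and the denominator, leaving a softmax of an affine function of the input, which is exactly NetVLAD's soft assignment.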

Okay, what about the learnable parameters? In NetVLAD, these are the weight vectors, the biases, and the mean vectors, and the estimation of the weights and biases is decoupled from the means. We see from the posterior computation formula that it does not depend on the mean vectors at all, which is quite an interesting difference compared to standard GMMs.

But anyway, once the posteriors of the input feature vectors have been computed, we can compute the component-wise outputs of the NetVLAD layer; the formulation is shown on the right side of the screen. In the numerator, we have the first-order centered sufficient statistics, and the denominator just length-normalizes them. So for each Gaussian component we get one output vector, and finally the NetVLAD layer concatenates the component-wise outputs to form a supervector. This is very similar to how standard GMM supervectors are formed.
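As a rough PyTorch-style sketch of this aggregation (a minimal illustration of the idea, not our released implementation):

```python
import torch
import torch.nn.functional as F

def netvlad_pool(X, W, b, mu):
    """Minimal NetVLAD pooling over one utterance.

    X  : (T, D) frame-level features
    W  : (D, C) assignment weights (Sigma^{-1} mu_c in the GMM reading)
    b  : (C,)   assignment biases
    mu : (C, D) component means
    """
    gamma = torch.softmax(X @ W + b, dim=1)                  # (T, C) soft assignments
    # First-order centered statistics per component:
    V = gamma.T @ X - gamma.sum(dim=0, keepdim=True).T * mu  # (C, D)
    V = F.normalize(V, dim=1)        # length-normalize each component's vector
    return V.flatten()               # concatenate into a (C*D,) supervector
```

The original NetVLAD additionally length-normalizes the concatenated supervector as a whole; the per-component normalization is the part described above.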

Okay, next let's do the same for the learned dictionary encoder pooling layer. So again we start from the GMM posterior computation formula. By expanding the normal distributions, we get this form, where this time the per-component scalar terms remain.

Now, if we assume isotropic, or spherical, covariance matrices, this formula simplifies into this form, and this is the formulation used in the learned dictionary encoder pooling layer. Although, in the original LDE publication the bias term was not included; it was added later on by other authors. So the key point here is that by assuming isotropic covariance matrices, the LDE formulation follows from the standard GMM.

The learnable parameters of this LDE layer are the scaling factors of the covariances, the mean vectors, and the bias terms.
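Written out in the same notation as before (again mine, not necessarily the slide's), with per-component isotropic covariances:

```latex
\gamma_c(\mathbf{x})
  = \frac{\exp\!\big(-s_c\,\lVert\mathbf{x}-\boldsymbol{\mu}_c\rVert^{2} + b_c\big)}
         {\sum_{k=1}^{C}\exp\!\big(-s_k\,\lVert\mathbf{x}-\boldsymbol{\mu}_k\rVert^{2} + b_k\big)},
\qquad
s_c = \frac{1}{2\sigma_c^{2}},
\quad
b_c = \log\pi_c - \frac{D}{2}\log\!\big(2\pi\sigma_c^{2}\big).
```

Here D is the feature dimension, so the scaling factors, means, and biases of the LDE layer correspond exactly to the variances, means, and weight/normalization terms of an isotropic-covariance GMM.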

Similarly as with NetVLAD, we can then compute the component-wise outputs of the layer. Again, the numerator directly contains the first-order sufficient statistics, but this time the denominator is different: it is the sum of the posteriors of each component. So the output resembles the traditional maximum-likelihood estimate of a GMM component mean, taken relative to the current mean, and the LDE layer then concatenates these component-wise outputs to form a supervector.
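A matching PyTorch-style sketch of the LDE aggregation (again illustrative only):

```python
import torch

def lde_pool(X, s, b, mu, eps=1e-8):
    """Minimal learned dictionary encoder (LDE) pooling for one utterance.

    X  : (T, D) frame-level features
    s  : (C,)   positive scaling factors (1/(2*sigma_c^2) in the GMM reading)
    b  : (C,)   biases (absent in the original LDE, added by later authors)
    mu : (C, D) component means
    """
    d2 = torch.cdist(X, mu).pow(2)              # (T, C) squared distances
    gamma = torch.softmax(-s * d2 + b, dim=1)   # (T, C) soft assignments
    N = gamma.sum(dim=0)                        # (C,) sums of posteriors
    F1 = gamma.T @ X                            # (C, D) first-order statistics
    E = F1 / (N.unsqueeze(1) + eps) - mu        # posterior-weighted mean residuals
    return E.flatten()                          # (C*D,) supervector
```

Note how `E` mirrors the maximum-likelihood mean update of a GMM, expressed relative to the current means.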

Okay, so now we have the necessary constructs explained, and I can present the proposed neural i-vectors. We start with a standard x-vector architecture and replace its pooling layer with either the NetVLAD or the LDE encoder. As we saw in the previous slides, we can use these pooling layers to extract sufficient statistics. So we do that, and by using these sufficient statistics, we can train a regular i-vector extractor and then extract i-vectors from the statistics. So that's the idea.
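Schematically, the recipe looks as follows; this sketch assumes the LDE variant, and `dnn` and the helper names are illustrative rather than our toolkit's API:

```python
import torch

def neural_sufficient_stats(frames, dnn, s, b, mu):
    """Sufficient statistics from a trained DNN with an LDE-like pooling layer.

    frames : (T, F) acoustic features of one utterance
    dnn    : maps frames to deep frame-level features (T, D),
             i.e., the output of the last layer before pooling
    """
    X = dnn(frames)
    d2 = torch.cdist(X, mu).pow(2)
    gamma = torch.softmax(-s * d2 + b, dim=1)  # alignments from the pooling layer
    N = gamma.sum(dim=0)                       # zeroth-order statistics
    F1 = gamma.T @ X                           # first-order statistics
    return N, F1

# The (N, F1) pairs play exactly the role of the UBM-based Baum-Welch
# statistics from earlier: a generative i-vector extractor is trained on
# them, and neural i-vectors are then extracted in the usual way.
```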

So now we can complete the table. Our proposed construct differs from the previous one, the DNN with the MFA layer, in that the i-vector extractor is generatively trained; otherwise the construct is the same.

If we compare our proposed neural i-vectors with the DNN i-vectors, we can see that the i-vector extractor part is the same, but ours uses a DNN that is trained with speaker targets rather than senone targets, and the features are obtained from the last layer before the pooling layer instead of being hand-crafted.

Next, let's move on to the experiments and results. We conducted speaker verification experiments on the Speakers in the Wild (SITW) evaluation. First, we compare our results with other i-vector systems; the comparison systems are from the literature, and these are some of the best published ones.

On the first line, we have a standard MFCC-based i-vector system, and on the second line, an i-vector system that uses perceptual linear prediction features together with additional features and a dereverberation frontend. We can see from these results that the neural i-vectors perform the best.

Okay, so far so good. Let's next compare our results with the DNN speaker embeddings. We can use the very same DNNs either to extract the sufficient statistics for the neural i-vectors, or to extract the speaker embeddings directly from the DNNs.

Here, all of the results are ours. On the first line, we have the DNN with the learned dictionary encoder pooling; with it, we get a 1.02% equal error rate. Then we have the corresponding neural i-vectors; that is, we used the same DNN to extract the sufficient statistics and then trained the generative i-vector extractor on them. With those, we get 1.93%, so not quite as good.

On the third line, we have our modification of the learned dictionary encoder: it uses diagonal covariance matrices instead of isotropic ones. We got a small improvement by making this modification.

The last two lines show the corresponding results for the NetVLAD layer.

The interesting question here is what causes the performance difference between the generative neural i-vectors and the discriminative DNN embeddings, given that both use the very same DNN. There are two possible sources for this difference. The first one is the difference between the generatively trained i-vector extractor and the discriminatively trained layers after pooling. Because there is only one layer between the pooling layer and the embedding in the DNN, this small part of the network alone seems to account for rather large differences in equal error rate. So it seems that the discriminative training objective is better.

Okay, there is also another possible reason for this performance difference. There is a kind of mismatch between how we train the DNN pooling layer and how we use it in the i-vector approach. In the DNN, we explicitly form a supervector, whereas the i-vector extraction does not: it additionally takes into account how many frames are aligned to each of the Gaussian components, and this information is missing from the supervector approach.
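To see what the fixed supervector discards, recall the standard i-vector point estimate (written here in textbook notation, which does not appear on the slide):

```latex
\mathbf{w}
 = \Big(\mathbf{I} + \sum_{c=1}^{C} N_c\,\mathbf{T}_c^{\mathsf{T}}\boldsymbol{\Sigma}_c^{-1}\mathbf{T}_c\Big)^{-1}
   \sum_{c=1}^{C} \mathbf{T}_c^{\mathsf{T}}\boldsymbol{\Sigma}_c^{-1}\,\tilde{\mathbf{f}}_c ,
```

where T_c and Sigma_c are the per-component blocks of the total variability model, f~_c are the centered first-order statistics, and N_c are the zeroth-order frame counts. The counts enter the precision term, so components with few aligned frames are shrunk toward the prior; a supervector built from normalized first-order terms alone has no such mechanism.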

So one piece of future work is to modify the DNN pooling layer so that it resembles the i-vector approach more closely; then this mismatch would be gone.

Another idea for future work is explained here. Instead of using the pooling layer to extract sufficient statistics for the i-vector extractor, we would use the DNN as the universal background model, taking the frame posteriors directly from the pooling layer. By using these posteriors, we would then have a neural GMM-UBM system with frame-based scoring.

This might be useful for some special applications of speaker verification, for example, with very short utterances.
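As a sketch of how such frame-based scoring could look (hypothetical code for this future-work idea, not something implemented in the paper; it reads the pooling-layer parameters as the isotropic GMM from the LDE derivation, and the speaker model would be, for example, a mean-adapted copy of those parameters):

```python
import torch

def gmm_loglik(X, s, b, mu):
    """Per-frame log-likelihood of the isotropic GMM defined by the
    pooling-layer parameters (see the LDE derivation above)."""
    d2 = torch.cdist(X, mu).pow(2)
    return torch.logsumexp(-s * d2 + b, dim=1)   # (T,)

def gmm_ubm_score(X_test, ubm, spk):
    """Classical GMM-UBM scoring: average per-frame log-likelihood ratio.
    ubm and spk are (s, b, mu) parameter tuples."""
    return (gmm_loglik(X_test, *spk) - gmm_loglik(X_test, *ubm)).mean()
```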

Before I finish, I have two related announcements. The first one is that the program codes are available. We have there the i-vector extractor and the scoring systems, in addition to the Speakers in the Wild recipe. The code is Python and PyTorch based, and we hope it can be of use for further research.

The second announcement is that this study is also included in my dissertation, and the public defense of the dissertation is coming up in a few weeks. Anyone who wants to join is welcome to do so; the details can be found here. See you there!