Hello everyone. I am a student in computer science, and I am glad to show you our study of the effects of intrinsic variation, using i-vectors, in text-independent speaker verification.
First I will introduce the main challenge in speaker verification, then the related research and our proposal. After that I will introduce the i-vector framework and the methods for compensating the intrinsic variation of the speech signal. Then I will describe the intrinsic variation corpus, the speaker verification systems, and the experimental results, and finally draw the conclusions.
The variability in speaker verification comes from two aspects. The first one is the extrinsic variability, and the other one is the intrinsic variability. The extrinsic variability is associated with factors that come from outside the speakers, such as mismatched channels or environmental noise. The intrinsic variability is associated with factors that come from the speakers themselves, such as the speaking style, the emotion, the speech rate, and the physical state. There has been a lot of research focusing on the extrinsic variability; for example, many methods for channel compensation have been proposed.
In this paper we focus on the intrinsic variability. The first task is to assess the effect of the intrinsic variability on the performance of speaker verification. So there are two questions. The first one is: how does the speaker verification system perform when enrollment and testing are under matched and mismatched conditions between the intrinsic variations? And the second one is: how can we compensate for, or model, the intrinsic variability?
To address the effects of intrinsic variation in speaker verification, our proposal is to model the intrinsic variability within the i-vector framework and then to compensate for it. First we have to define the variation forms, because the intrinsic variability comes from the data associated with the speakers themselves, but it still has regular patterns. So first we define the base form, which is neutral spontaneous speech at a normal rate and a normal effort, since this is the most common case.
Based on that, we define the variation forms from six aspects, including the speech rate, the physical state, the speaking volume, the emotional state, the speaking style, and the speaking language. For example, for the speaking rate we have fast speech or slow speech.
For the physical state, for example, the mouth-full speech means the speakers hold something, like a candy, in the mouth and talk in that way. For the speaking volume, we have the normal speech, the shout, and the whisper.
For the emotional state, we have the happy emotion and the angry emotion. For the speaking style, we have the reading style, in contrast to the spontaneous base form. For the speaking language, we have, for example, Mandarin Chinese. So from these six aspects we obtain the variation forms, and we recorded the data accordingly for the experiments.
Then we use the i-vector framework to model these variations, since i-vector modeling has been applied successfully to channel compensation. The i-vector framework is composed of two parts. The first is that we project the supervector into the i-vector in the total variability space, which is a low-dimensional space. The second part is the scoring: we use the cosine similarity score to evaluate the similarity between a test utterance and an enrollment utterance.
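The cosine scoring step can be sketched as follows. This is a minimal illustration, not the talk's actual code; the function and variable names are assumptions, and the i-vectors are assumed to be already extracted.

```python
import numpy as np

def cosine_score(w_test, w_enroll):
    """Cosine similarity between a test and an enrollment i-vector.

    In the i-vector model the GMM supervector is assumed to decompose
    as M = m + T w, where w is the low-dimensional i-vector; here we
    only score two already-extracted i-vectors.
    """
    w_test = np.asarray(w_test, dtype=float)
    w_enroll = np.asarray(w_enroll, dtype=float)
    return float(np.dot(w_test, w_enroll) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_enroll)))
```

In a verification trial the score would then be compared against a threshold to accept or reject the claimed identity.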
The reason we adopt the i-vector framework is that there have already been studies using it for channel compensation, so we want to see whether it is also suitable for the intrinsic variation we are interested in. Second, to handle the effects of the intrinsic variation, we use a set of techniques which were originally used to compensate for channels.
Namely, we use LDA and WCCN. The idea behind LDA is to minimize the within-speaker variability while maximizing the between-speaker variability. We define the within-class and between-class covariance matrices, and the LDA projection matrix is composed of the eigenvectors corresponding to the largest eigenvalues of the resulting generalized eigenvalue equation.
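As a sketch of the LDA step just described (illustrative only; the names and the eigen-solver choice are assumptions, not from the talk), the projection matrix can be built from the leading eigenvectors of the within-scatter-normalized between-scatter matrix:

```python
import numpy as np

def lda_projection(ivectors, labels, n_dims):
    """LDA projection matrix from labelled i-vectors (illustrative).

    Maximizes between-speaker scatter S_b relative to within-speaker
    scatter S_w; columns are the leading eigenvectors of S_w^{-1} S_b.
    """
    X = np.asarray(ivectors, dtype=float)
    y = np.asarray(labels)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for spk in np.unique(y):
        Xs = X[y == spk]
        mu_s = Xs.mean(axis=0)
        S_w += (Xs - mu_s).T @ (Xs - mu_s)
        diff = (mu_s - mu)[:, None]
        S_b += len(Xs) * (diff @ diff.T)
    # generalized eigenvalue problem S_b v = lambda S_w v
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:n_dims]].real  # d x n_dims projection
```

I-vectors are then projected as `A.T @ w` before scoring.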
The second technique is within-class covariance normalization, WCCN. The idea is to de-emphasize the directions of high intra-speaker variability. The WCCN projection matrix is obtained by a Cholesky decomposition: we estimate the within-class covariance matrix W, and the projection matrix B satisfies B B^T = W^{-1}, so B is composed from the Cholesky factor of the inverse within-class covariance matrix.
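The WCCN step can be sketched like this (a minimal illustration under the usual formulation B B^T = W^{-1}; the function name and data layout are assumptions):

```python
import numpy as np

def wccn_projection(ivectors, labels):
    """WCCN projection matrix B such that B B^T = W^{-1} (sketch).

    W is the within-speaker covariance averaged over speakers; B is
    obtained by Cholesky decomposition of W^{-1} and de-emphasizes
    directions of high intra-speaker variability.
    """
    X = np.asarray(ivectors, dtype=float)
    y = np.asarray(labels)
    d = X.shape[1]
    speakers = np.unique(y)
    W = np.zeros((d, d))
    for spk in speakers:
        Xs = X[y == spk]
        mu_s = Xs.mean(axis=0)
        W += (Xs - mu_s).T @ (Xs - mu_s) / len(Xs)
    W /= len(speakers)
    return np.linalg.cholesky(np.linalg.inv(W))  # lower-triangular B
```

After the transform `B.T @ w`, the within-speaker covariance of the projected i-vectors is approximately the identity, which is what "normalization" refers to here.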
Now I will introduce the experiments on how these methods perform under the intrinsic variations. First I will introduce the intrinsic variation corpus that we recorded, and how we divided it into the training and the test sets. Then I will describe the speaker verification systems: we use a GMM-UBM system as the baseline, and i-vector based speaker verification systems with different intrinsic variation compensation methods. Finally I will show the experimental results.
Let me first introduce the intrinsic variation corpus that we use. It contains about one hundred speakers; they are all students and native speakers of Chinese, around eighteen years old. Each student recorded all the variation forms, speaking for about three minutes for each variation form. Each recording is then divided into about ten parts, and each part lasts about eighteen seconds; these parts are used for enrollment and testing.
We divided the data in the intrinsic variation corpus as follows. For UBM training we used thirty speakers, both male and female, to train gender-dependent and gender-independent UBMs; this data lasts about eighteen hours and covers all the variation forms. Then we used another thirty speakers to train the total variability space, that is, the T matrix; this data also lasts about eighteen hours and of course covers all the variation forms. We also have to train the intrinsic variation compensation methods, LDA and WCCN, so we used forty speakers to train the projection matrices. Lastly, we used the remaining speakers, comprising two thousand four hundred utterances, for the test, again over all the variation forms. Then we built five speaker verification systems.
We use the GMM-UBM speaker verification system as the baseline system. The feature vectors are thirteen-dimensional MFCCs, and the UBM is composed of a large number of Gaussian mixture components. The i-vector based speaker verification systems use LDA, WCCN, and also the combination of LDA and WCCN, and the i-vector dimension is two hundred.
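For context, a GMM-UBM baseline scores a trial as the average log-likelihood ratio between the speaker's adapted GMM and the UBM. The following is a hedged sketch with hand-specified diagonal-covariance models (names, shapes, and the single-component example are illustrative assumptions, not the talk's configuration):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average log-likelihood of feature frames X (n x d) under a
    diagonal-covariance GMM (illustrative sketch)."""
    X = np.asarray(X, dtype=float)[:, None, :]        # n x 1 x d
    means = np.asarray(means, dtype=float)[None]      # 1 x m x d
    var = np.asarray(variances, dtype=float)[None]    # 1 x m x d
    log_comp = (-0.5 * np.sum((X - means) ** 2 / var
                              + np.log(2 * np.pi * var), axis=2)
                + np.log(np.asarray(weights))[None])  # n x m
    # log-sum-exp over mixture components, averaged over frames
    mx = log_comp.max(axis=1, keepdims=True)
    return float(np.mean(mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))))

def gmm_ubm_score(X, speaker_gmm, ubm):
    """Log-likelihood ratio between the speaker model and the UBM."""
    return gmm_loglik(X, *speaker_gmm) - gmm_loglik(X, *ubm)
```

A positive score means the frames fit the speaker model better than the background model.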
This table shows the EER for each enrollment condition, where the testing utterances cover all the variation forms. For the speaking style, we choose the spontaneous speech as the base case. Then we have the six aspects, including the speaking style, the speaking volume, the speaking rate, the emotional state, the physical state, and the speaking language, which give the variation forms; each variation form in turn is used as the enrollment condition and tested against all of them.
Here the test covers all the variation forms, and we can see that the i-vector based systems perform much better than the GMM-UBM baseline system. The best results are obtained by the combination of LDA and WCCN. Also, looking at the different variation forms, we found that if the whisper is used for enrollment, the EER is the worst, so that form performs the worst.
Then we calculated the average EER for each speaker verification system over all the variation forms. From this table we can see that the i-vector based speaker verification systems are better than the GMM-UBM system in reducing the effect of the intrinsic variations, and the best results are obtained by the i-vector based system with the combination of LDA and WCCN.
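The EER used in these comparisons is the operating point where the false-rejection and false-acceptance rates are equal. A minimal sketch of how it can be computed from trial scores (a simple threshold sweep; the function name is an assumption):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER from genuine (target) and impostor trial scores (sketch).

    Sweeps the threshold over all observed scores and returns the point
    where the false-rejection and false-acceptance rates are closest.
    """
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, imp]))
    frr = np.array([(tgt < t).mean() for t in thresholds])   # false rejections
    far = np.array([(imp >= t).mean() for t in thresholds])  # false acceptances
    i = np.argmin(np.abs(frr - far))
    return float((frr[i] + far[i]) / 2)
```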
Similarly, this figure shows the DET curves of the speaker verification systems. Comparing the GMM-UBM baseline with the i-vector based system with LDA and WCCN, we can see a clear improvement in the performance.
This figure shows the comparison between the GMM-UBM system and the i-vector systems under matched and mismatched conditions. The first two columns are for the matched conditions, and the last two are for the mismatched conditions; we computed them for each variation form. First, we can see that for each variation form, the EER under the mismatched conditions is much bigger than under the matched conditions. Second, comparing the GMM-UBM system and the i-vector based system, for example on the spontaneous form, one set of bars is for the GMM-UBM and the yellow ones are for the i-vector systems, and on the whisper form the i-vector based system shows a significant improvement.
This table shows, for each testing condition, the results when the spontaneous utterances are used for enrollment, because in most cases people enroll with spontaneous speech. Since the enrollment form is spontaneous, if we also test with the spontaneous speech, the EER should be small, and indeed the best results are obtained in this matched condition. On the other hand, when we enroll with the spontaneous speech but test with the whisper utterances, we found that the EER is the largest and the overall performance drops sharply.
Since the whisper variation is so different from the other variation forms, we also present this table, which shows the EER for each testing condition when the enrollment uses the whisper utterances. We can see that the results become much worse; for the GMM-UBM system the EER becomes very high, and even the best result, obtained in the matched condition, is seventeen percent. Still, from the whole picture we can see that the i-vector based speaker verification systems perform better than the GMM-UBM system, and the combination of LDA and WCCN again performs best.
So we have the following conclusions. First, the mismatch between the intrinsic variation forms, just like the channel variation, degrades the speaker verification performance considerably. Second, the i-vector framework works better than the GMM-UBM in modeling the intrinsic variations, and especially the combination of LDA and WCCN gets the best results. Third, the whisper utterances are much different from the other variation forms, which brings down the speaker verification performance even in the matched condition.
For the future work, in the model domain we will try more methods for the intrinsic variation compensation. Also, we will look into the whisper variation, which gives the worst results. Maybe the first reason is that after the VAD, the whisper utterances become much shorter than the normal-rate speech. The second is that the whisper speech is much different from the other speech sounds, so we can do some work in the feature domain to improve the performance of the speaker verification system.
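The VAD effect mentioned here can be illustrated with a minimal energy-based VAD. This is purely an assumed sketch, since the talk does not specify which VAD was used: a fixed energy threshold tuned for normal speech discards most low-energy whispered frames, leaving less usable speech.

```python
import numpy as np

def energy_vad(frames, threshold_db=-30.0):
    """Keep frames whose log energy is above a fixed threshold (sketch).

    A threshold tuned on normal speech can discard much of a whispered
    utterance, which is one plausible reason whisper trials end up short.
    """
    frames = np.asarray(frames, dtype=float)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return frames[energy_db > threshold_db]

# loud frames survive, quiet (whisper-like) frames are removed
loud = 0.5 * np.ones((5, 160))
quiet = 0.001 * np.ones((5, 160))
kept = energy_vad(np.vstack([loud, quiet]))
```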
That's all. Thank you.
[Audience question]

Yes. We recorded this database ourselves, and the speakers are all students. For the emotional forms, they had to act the emotions; we told them the target emotion and how to act it.
[Audience question]

Yes, for example, if you change the speech rate, maybe the emotional state also changes at the same time. So when we recorded the database, we tried to change only one variation at a time and keep the other aspects fixed, so that we could separate the different variations. We will have more investigation of this in the future work. Thank you.