Well, this is work on source normalization, and I should say right away that the first author did all the work here; he also made the slides. The general idea is source normalization. So if after this talk you think, "this is fantastic, I'm going to implement this tomorrow or next week", then all you need to know is that it has already been done, and I'm very happy with that. If afterwards you think, "why didn't I think of this before", then that is probably due to me not being able to convey the message. And if you had already thought of the same thing beforehand, then you're sort of ahead of what we have. Right. Anyway.
So, this is the sort of automatically generated summary of my presentation today, which I think is kind of pointless in this particular case, because it contains lots of acronyms that can only be explained later. So let me move on.
The motivation for this work: the idea is that in speaker recognition we all follow the NIST evaluations, where things change from year to year. We often get into the situation where we get new data of a kind we haven't seen before — new types of data, new kinds of noise maybe, new people — and most of us don't know beforehand how we're going to deal with that.
I sometimes have to announce well in advance what I am going to talk about, and every once in a while that turns out to be rubbish, I guess, because I haven't seen the data it is supposed to apply to.
But anyway, the basic idea is that if the conditions in training and test are of a different kind, you would like to have seen that before. But what do you do if you know that you won't have seen it? One way of dealing with that is this idea of source normalization, and I'll try to explain the basic idea. Here are some slides about i-vectors, which I think I'll skip; you probably understand these much better than I do anyway.
The basic idea, for this particular presentation, is that the i-vector is a very low-dimensional representation of the entire utterance, containing, apart from the speaker information, other information as well.
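As a reminder — the standard total-variability model behind the i-vector, not spelled out in the talk itself:

```latex
% Utterance GMM mean supervector M, UBM mean supervector m,
% low-rank total variability matrix T; the latent factor w is the i-vector.
M = m + T w
```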
Essential to the idea of source normalization is one step that we do in the standard approach: within-class covariance normalization, WCCN, before the PLDA. That is what needs to be changed. With the data in the training, the within-class and between-class scatter matrices are computed, and that's where the source normalization takes place. So note that we actually need to estimate those scatter matrices.
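For reference — this is not spelled out in the talk — WCCN in its standard form (Hatch et al.) whitens the within-class covariance $W$ estimated from the training i-vectors: a transform $A$ is obtained from the Cholesky decomposition of $W^{-1}$ and applied to every i-vector:

```latex
A A^{\mathsf{T}} = W^{-1}, \qquad \hat{w} = A^{\mathsf{T}} w
```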
Now some mathematics, just to stay in line with the previous talks, so that we have at least some mathematics on the screen. This is the expression for the within-speaker scatter matrix, and this is what source normalization is going to try to estimate in a better way.
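The expression itself is only on the slide, but the standard form of the within-speaker scatter matrix being referred to would be:

```latex
S_W = \sum_{s=1}^{S} \sum_{i=1}^{n_s}
      (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^{\mathsf{T}},
\qquad
\bar{w}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} w_i^s
```

with $w_i^s$ the $i$-th i-vector of speaker $s$, summed over all $S$ training speakers.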
Because what is the problem with WCCN in this particular matter? The issue is that not all relevant kinds of variation are observed in the training data, and this happens all the more often if you don't have much data.
So here is another graphical representation of what typically happens. Here we look at a specific kind of data, where the label of the data, in this case, is the language. You have a lot of English-language data, and every once in a while we get some tests where the language is not English. That happened in 2006 and before, and also the 2008 set contained non-English data. So when 2012 comes, who knows what you will get. Maybe language itself is not so relevant for the current evaluation, but it is a good example of where things change. An important point here is that even if we have some training data for these languages, we will not have all speakers in all the different languages. Typically, the speakers are decoupled from the language: for some languages you have some speakers, and for other languages you have other speakers. So you get the problem that, in the end, in your recognition you have to compare a segment in one language with a segment in another language, where the case might be that it is actually the same speaker.
So what I want to show now is why this kind of difference in language labels is going to influence this within-class scatter matrix of the speakers. This is one way of viewing how the i-vectors might be distributed. These three big circles denote the different sources — in this case a source might be a language — each with its own mean, and there is a global mean, which would be the mean of the means, I guess. Then we have some speakers: for a speaker you have a little bit of variability, and he comes from one source; another speaker — she — comes from another source; and we also have a few speakers from the last source. You can imagine that if you are going to compute the between-speaker variation, you actually include a lot of between-source variation, and that is probably not a good thing, because what you want to model is the variation between different speakers, not between sources.
So WCCN is going to do its transformation based on this information. And related to this is that the source variance is not correctly observed: the variance of the sources is not explicitly modelled. So that's another problem for WCCN. That summarises, again, what the problems are.
Now let's move to the solution, which I think is much more interesting: how do we tackle the problem that these sources hanging around have globally different means in this i-vector space? The solution is very simple: compute these means for every source. So here you look at the scatter matrix conditioned on the source: we simply compute the mean for every source, and before computing the scatter, we subtract these means. The effect is basically that all three sources — this is still the picture from before, whether it is, say, microphone and telephone data, or the languages — have their mean subtracted per label, per language, and then this scatter matrix will be estimated better. The mathematics then says: okay, that is a very nice fit for the within-class variation, but we still have the between-class variation. Well, we will just take that as the difference from the total variability — or the other way around. The idea is that you can compensate for one scatter matrix, and because you have the total variability, you can compute the other as the difference from the total variability.
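A minimal sketch of this recipe as I understand it — my own illustration in numpy, with hypothetical names, not the authors' code: subtract a per-source mean before accumulating the within-speaker scatter, and take the between-speaker scatter as the difference from the total variability.

```python
import numpy as np

def source_normalized_scatter(ivecs, spk_labels, src_labels):
    """Source-normalized within- and between-speaker scatter (sketch).

    ivecs: (N, D) array of i-vectors; spk_labels, src_labels: length-N lists.
    """
    ivecs = ivecs - ivecs.mean(axis=0)              # center globally
    normed = ivecs.copy()
    # Subtract the mean of each source (e.g. each language) from its i-vectors.
    for src in set(src_labels):
        idx = [i for i, s in enumerate(src_labels) if s == src]
        normed[idx] -= normed[idx].mean(axis=0)
    # Within-speaker scatter, accumulated on the source-normalized i-vectors.
    dim = ivecs.shape[1]
    S_w = np.zeros((dim, dim))
    for spk in set(spk_labels):
        idx = [i for i, s in enumerate(spk_labels) if s == spk]
        dev = normed[idx] - normed[idx].mean(axis=0)
        S_w += dev.T @ dev
    # Total variability of the (centered) i-vectors; the between-speaker
    # scatter is then taken as the difference, as described in the talk.
    S_t = ivecs.T @ ivecs
    S_b = S_t - S_w
    return S_w, S_b
```

Note that the source labels only enter this training-time computation, which is exactly the point made next.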
The idea I want to stress is that you only need the language labels — here it is applied to language — for the development set. So your languages are in your development data: when you are training your system, you have all kinds of labels on your data, and in this case we use the language label. But in applying the system you do not need the languages, because they are only used to make a better transform for this WCCN.
How can you actually see that it works? Well, one way of doing that is to look at the distribution of i-vectors after WCCN when you do not apply this source normalization technique — that is shown on the left. Here, in different colours, you see encoded the label that we want to normalize away, in this case the language; these languages might be familiar to the people here who did language recognition. What you see is that the languages seem to occupy different places. This is after a dimension reduction to two dimensions, just for viewing purposes. And you see that with this language normalization — source normalization by language — all these different labels become much more similar, so the basic assumptions that i-vector systems are based on should hold a little better.
Okay, now our own system results, because we need to have some tables in the presentation. First, what kind of experiment did we do? We used the usual NIST databases for the training that the i-vector systems make use of, but we added one specific database, CallFriend. That is a very old database, used at the start of the first language recognition evaluations, so it contains a variety of languages — twelve languages, I believe. As for the evaluation data, we used two data sets from NIST: the 2010 dataset and the 2008 one. For 2010 you might think: why would you do that? There wasn't actually much language different from English in that set. But we use it for two purposes: one is for training the calibration, and the other is to see whether what we do doesn't hurt the basic English performance too much. The 2008 set, of course, is going to be used as the test data, where there are trials from different languages; and there is also a condition, English only, so that we can compare whether we actually hurt ourselves. The durations are the simple, standard ones — you have seen these kinds of numbers before, I'd say, so there is nothing new here. And these are the breakdown numbers per language for the training data.
And these, finally, are the results. I'll try to explain the table: red means this is the new system — it doesn't mean it is better; the bold figures mean better. The first condition shows the performance on all trials for SRE'08, measured in error rate; there is no calibration involved here. You see these numbers go down, so for SRE'08 it works when we see some other languages, I believe. If we look at English only, then we lose a little bit, so it does hurt our system there, but it doesn't hurt it much. And the same for SRE'10: the system gets hurt a bit. That is the basic conclusion there.
Here we have a breakdown where we look at the non-English languages from SRE'08, and where we look at different conditions in the trials: whether it is the same language or a different language, and whether English is involved. The top row, which has to show the best performance because it still contains the — yeah — many English trials, is where the baseline system works best. But this includes both English and non-English. If you break it down, for instance by saying, I want a different language in the trial — so the target trials have a language difference — we see that the new figures, the ones on the right, are slightly better than the ones on the left, the baseline. And the same holds with respect to the other conditions. In addition, you can specifically look at the non-English trials, where there is otherwise no restriction: there it helps. For the same-language trials, where you actually restrict the trials by saying "the same language, but not English", it still helps, except that there is one condition where, for whatever reason, it does not help, and there the difference is big. This is something we don't understand, I suppose; that's for the non-English trials where you specify different-language trials. So generally it seems to work, except for that one particular place where it doesn't. But I should say that there are actually not that many trials there — it doesn't show in the graph very nicely — so I don't know how accurate this measure is.
Now let's also look at calibration. Our goal in this kind of experiment is, after all, to make things more robust for languages.
And we use a different measure for that — the measure used by the keynote speaker, the Cllr. One way of looking at how good your calibration is, is to look at the difference between the Cllr and the minimum attainable Cllr, the Cllr-min, of this set of scores.
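The standard definitions behind this, for reference — they are not spelled out in the talk. For log-likelihood-ratio scores $s$ of target and non-target trials:

```latex
C_{\mathrm{llr}} = \frac{1}{2} \left(
  \frac{1}{N_{\mathrm{tar}}} \sum_{i \in \mathrm{tar}} \log_2\!\left(1 + e^{-s_i}\right)
  + \frac{1}{N_{\mathrm{non}}} \sum_{j \in \mathrm{non}} \log_2\!\left(1 + e^{s_j}\right)
\right),
\qquad
C_{\mathrm{llr}}^{\mathrm{cal}} = C_{\mathrm{llr}} - C_{\mathrm{llr}}^{\mathrm{min}}
```

where $C_{\mathrm{llr}}^{\mathrm{min}}$ is the Cllr after an optimal monotonic recalibration of the scores, so the difference isolates the calibration loss.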
All right, so we have these two columns, mismatched and matched. I was actually thinking we might make a play on "mismatched and matched" here, but that might be too hard for you guys. Anyway.
Red is the new thing that we try to promote here, and black is the old approach; bold means the better figures. We show this separately for female and male speakers. Generally, for the mismatched condition — by mismatch we mean that we calibrate on English only: we used the 2010 set for calibration and applied that to SRE'08, or it may have been the other way around; it was set up so as to be able to calibrate on English and test on the rest — for this particular condition it works always. In the matched condition, that is, looking at English scores only, so really well-calibrated English scores, you see that it doesn't always help; there is only one condition where it does. The miscalibration itself, though — so the loss you incur in calibration — becomes less. So you see that for calibration there is still some gain, even English-only, but in the final figures it doesn't show.
All right, I hope that explains the numbers well enough. For the managers amongst us, it is easier to draw this, so here is the same data as a graph. This is just the miscalibration, the amount of information you lose by not being able to produce proper likelihood ratios. It goes down for the conditions where we applied the language normalization, but for English-only trials you don't notice the difference.
So I have a slide with conclusions here. We used source normalization, which is a general framework, and I have to say it has been applied before — you should be able to find three or four conference proceedings papers about this technique — with the definition of source being microphone, or interview versus telephone speech. And we even applied it ourselves, I should say, to the source being the sex of the speaker. Even though speakers generally don't change sex — and they didn't in these evaluations — you can use this approach to compensate for situations where you might not have enough data. For the telephone conditions this didn't make much difference, but for the conditions where there wasn't really much data, it did help to pool the male and female i-vectors, make a single gender-independent recognition system, and apply source normalization, with the speaker's sex as the label of the i-vector, normalizing that way. And then in your recognition you no longer need the labelling, so you can basically drop the second column of your trial list. Okay.
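In terms of the earlier sketch, this use of the technique would amount to nothing more than swapping the label column — again hypothetical names, just to make the point concrete:

```python
# Speaker sex as the source label; one pooled, gender-independent model.
S_w, S_b = source_normalized_scatter(ivecs, spk_labels, sex_labels)
```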
But here we applied it to languages: it seems to work decently, and it doesn't hurt the English trials too much either, which is basically the message.
We did not try to use language as a feature for discriminating between speakers. In this research you could of course very well do that, but we would rather take it as a challenge: you should be able to recognize speakers even if the speaker speaks a different language than seen before in the training. Of course, you could make it easier for yourself by saying: a different language, so these are different speakers.
Yeah, and I remember calibration was one of the major problems in 2006, where if you had non-English data, the discrimination performance was actually still reasonable, but the calibration was poor. I'm not so sure that that even holds for the systems of nowadays, though; systems nowadays are generally behaving better.
No, I don't think that that is what we want to say. What I think it says is that the between-channel variation, estimated as part of the total variance, is influenced by the fact that segments come in different languages, and you don't observe that in the within-speaker variability. So the language variability gets attributed to the channel variability, and that is not adequate in this case, where the language can differ for the same speaker.