Okay. The talk I am going to give is about how to make uncertainty propagation run fast and also consume less memory.
My name is M. W. Mak, and I am from The Hong Kong Polytechnic University.
So here is the outline of my presentation. I will first give an overview of i-vector PLDA and explain how uncertainty propagation can model the uncertainty of the i-vectors, and how to make uncertainty propagation run faster and possibly use less memory. Then we evaluate the proposed approach on the NIST 2010 SRE, and finally we conclude. Okay, so here is the i-vector PLDA framework. Probably you all already know this, so I will go through it very quickly.
We use the posterior mean of the latent factor as a low-dimensional representation of the speaker. Given the MFCC vectors of an utterance, we compute the posterior mean of the latent factor, and we call this the i-vector. Here T is the total variability matrix that defines the channel and speaker subspace; that is, it represents the subspace in which the i-vectors vary.
So here is the procedure for i-vector extraction. Given a sequence of MFCC vectors, we extract the i-vector as the posterior mean of the latent factor. And because we would like to use Gaussian PLDA, we need to suppress the non-Gaussian behavior of the i-vectors through some preprocessing, for example whitening and also length normalization.
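As an illustration, here is a minimal sketch of what this preprocessing step could look like, assuming the whitening transform is estimated from a pool of development i-vectors (the function names are illustrative, not from the talk):

```python
import numpy as np

def train_whitener(dev_ivectors):
    """Estimate a whitening transform from development i-vectors."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    # Eigendecomposition of the covariance gives W such that
    # W @ (x - mu) has identity covariance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return mu, W

def preprocess(ivector, mu, W):
    """Whiten, then length-normalize an i-vector."""
    x = W @ (ivector - mu)
    return x / np.linalg.norm(x)   # project onto the unit sphere
```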
After this preprocessing step, we obtain the preprocessed i-vectors, and these preprocessed i-vectors can be modeled by PLDA. In this model, V represents the speaker subspace and h_i is the speaker factor. As you can see, for the j-th session of the i-th speaker we only have one latent factor h_i, and epsilon_ij represents the variability that cannot be represented by the speaker subspace.
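In symbols, this is the standard simplified Gaussian PLDA model:

```latex
\mathbf{w}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij},
\qquad
\mathbf{h}_i \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\quad
\boldsymbol{\epsilon}_{ij} \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}),
```

where w_ij is the j-th preprocessed i-vector of the i-th speaker, m is the global mean, V spans the speaker subspace, and Sigma is the residual covariance.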
Now let us turn to the scoring. At test time we have a test i-vector w_t, and we also have the target-speaker i-vector w_s. We compute the likelihood ratio, assuming under one hypothesis that w_s and w_t come from the same speaker, and under the alternative hypothesis that w_s and w_t come from different speakers.
After some mathematical manipulation, the score becomes this very nice equation. In this equation we only have matrix and vector multiplications, and the nice thing is that the matrices can all be pre-computed, as you can see from this set of equations at the bottom. All these terms, Sigma_ac, Sigma_tot, and so on, can be pre-computed from the PLDA model parameters. That explains why PLDA scoring is very fast.
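For reference, one standard closed form of this likelihood-ratio score, consistent with the model above, is:

```latex
S(\mathbf{w}_s,\mathbf{w}_t)
  = \mathbf{w}_s^{\mathsf T}\mathbf{Q}\,\mathbf{w}_s
  + \mathbf{w}_t^{\mathsf T}\mathbf{Q}\,\mathbf{w}_t
  + 2\,\mathbf{w}_s^{\mathsf T}\mathbf{P}\,\mathbf{w}_t + \mathrm{const},
```

where, with the across-class and total covariances

```latex
\boldsymbol{\Sigma}_{\mathrm{ac}} = \mathbf{V}\mathbf{V}^{\mathsf T},\qquad
\boldsymbol{\Sigma}_{\mathrm{tot}} = \mathbf{V}\mathbf{V}^{\mathsf T} + \boldsymbol{\Sigma},
```

the scoring matrices are

```latex
\mathbf{Q} = \boldsymbol{\Sigma}_{\mathrm{tot}}^{-1}
 - \bigl(\boldsymbol{\Sigma}_{\mathrm{tot}}
 - \boldsymbol{\Sigma}_{\mathrm{ac}}\boldsymbol{\Sigma}_{\mathrm{tot}}^{-1}\boldsymbol{\Sigma}_{\mathrm{ac}}\bigr)^{-1},
\qquad
\mathbf{P} = \boldsymbol{\Sigma}_{\mathrm{tot}}^{-1}\boldsymbol{\Sigma}_{\mathrm{ac}}
 \bigl(\boldsymbol{\Sigma}_{\mathrm{tot}}
 - \boldsymbol{\Sigma}_{\mathrm{ac}}\boldsymbol{\Sigma}_{\mathrm{tot}}^{-1}\boldsymbol{\Sigma}_{\mathrm{ac}}\bigr)^{-1}.
```

Both P and Q depend only on the model parameters, so they can be computed once before scoring.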
But one problem of this conventional i-vector PLDA framework is that it does not have the ability to represent the reliability of the i-vector. Whether the utterance is very long or very short, we still use a low-dimensional i-vector to represent the speaker characteristics of the whole utterance. This poses a problem for short-utterance speaker verification. It is not a problem for very long utterances, say a few minutes of speech. But if the utterance is only about ten seconds or three seconds, then the variability, or uncertainty, of the i-vector will be so high that the PLDA score will favor the same-speaker hypothesis even if the test utterance is given by an impostor. The reason is that if the utterance is very short, we will not have enough acoustic vectors for the MAP estimation; in other words, we do not have enough acoustic vectors to compute the posterior mean of the latent factor in the factor analysis model.
So the idea of uncertainty propagation is that we not only extract the i-vector but also compute the posterior covariance matrix of the latent factor. This diagram illustrates the idea. This Gaussian represents the posterior density of the latent factor, and the i-vector is its mean, so it is a point estimate; this equation shows the procedure for computing it. Here T_c is the c-th partition of the total variability matrix. As you can see, if the variance of this Gaussian is very large, then the point estimate will not be very accurate, and this happens when the utterance is not long enough. If the utterance is very short, N_c, which is the zeroth-order sufficient statistic, will be very small, so the posterior covariance matrix, L inverse, will be very big. That means the variance will be large, and as a result the point estimate will not be very reliable.
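Written out, these are the standard i-vector posterior formulas. With N_c and f-tilde_c the zeroth- and first-order (centered) sufficient statistics for mixture component c:

```latex
\mathbf{L} = \mathbf{I} + \sum_{c=1}^{C} N_c\,\mathbf{T}_c^{\mathsf T}\boldsymbol{\Sigma}_c^{-1}\mathbf{T}_c,
\qquad
\mathbf{w} = \mathbf{L}^{-1}\sum_{c=1}^{C}\mathbf{T}_c^{\mathsf T}\boldsymbol{\Sigma}_c^{-1}\tilde{\mathbf{f}}_c,
\qquad
\operatorname{cov}(\mathbf{w}) = \mathbf{L}^{-1}.
```

When the N_c are small, L stays close to the identity and L inverse stays large, which is exactly the unreliable short-utterance case just described.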
That is why, in 2013, Kenny proposed the idea of PLDA with uncertainty propagation. In addition to extracting the i-vector, we also extract the posterior covariance matrix of the latent factor, and that represents the uncertainty of the i-vector.
And with some preprocessing, as I have mentioned, because we want to use Gaussian PLDA as the final stage for the modeling and scoring, we also need to preprocess this matrix, which gives a preprocessed version of the posterior covariance matrix, and with that we can do the PLDA modeling. Now, where does this uncertainty propagation come from? It comes from the generative model.
In the generative model we have m plus V h_i plus, as you can see, this extra term. It looks like the conventional PLDA model with eigenchannels, so this plays the role of my eigenchannel matrix, but instead of being fixed, it depends on the i-th speaker and on the j-th session of the i-th speaker. As a result, the latent factor z also depends on i and j.
Now, the trouble with this is that for every test utterance we also need to compute this U_ij. Unlike the eigenchannel situation, where we only need to pre-compute the matrix once and make use of it during scoring, in uncertainty propagation this U_ij has to be computed during scoring time, because it is session dependent. To compute this U_ij, we perform a Cholesky decomposition of the posterior covariance matrix, and that is why we have this intra-speaker covariance matrix, like this.
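In symbols, following Kenny's formulation, the generative model with uncertainty propagation is:

```latex
\mathbf{w}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}_{ij}\mathbf{z}_{ij} + \boldsymbol{\epsilon}_{ij},
\qquad
\mathbf{U}_{ij}\mathbf{U}_{ij}^{\mathsf T} = \operatorname{cov}(\mathbf{w}_{ij}),
\qquad
\mathbf{z}_{ij} \sim \mathcal{N}(\mathbf{0},\mathbf{I}),
```

where U_ij is the Cholesky factor of the (preprocessed) posterior covariance matrix of session j of speaker i. The intra-speaker covariance for that session therefore becomes U_ij U_ij^T + Sigma, which is session dependent.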
So finally, during scoring with PLDA with UP, we have this equation, which is very similar to the scoring equation of the conventional PLDA; as you can see, it is still matrix and vector multiplications. But the difference is that this time the matrices A, B, C, and D all depend on the test utterance. As you can see from this set of equations, A_st, B_st, C_st, and D_st all depend on the test utterance. That means they cannot be pre-computed: only a very small number of matrices can be computed before scoring time, while this set has to be computed during scoring time. So we cannot save much computation, because of these session-dependent covariance matrices.
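Schematically (the exact expressions are in the paper; the point here is only where the session subscripts appear), the UP score keeps the quadratic structure of the PLDA score, but with session-dependent matrices:

```latex
S_{\mathrm{UP}}(\mathbf{w}_s,\mathbf{w}_t)
  = \mathbf{w}_s^{\mathsf T}\mathbf{A}_{st}\,\mathbf{w}_s
  + \mathbf{w}_t^{\mathsf T}\mathbf{B}_{st}\,\mathbf{w}_t
  + 2\,\mathbf{w}_s^{\mathsf T}\mathbf{C}_{st}\,\mathbf{w}_t + D_{st},
```

where A_st, B_st, C_st, and D_st are built from Sigma plus the session-dependent terms U_s U_s^T and U_t U_t^T, so none of them can be fully pre-computed.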
This slide summarizes the computations that need to be performed. For conventional PLDA we have almost nothing to compute; we only need to compute these matrix and vector multiplications. But for PLDA with UP, we have to compute this whole set of matrices on the right. As you can see, this increases the computational complexity a lot, and it also increases the memory requirement, because we need to store these matrices for every target speaker.
So we propose a way of speeding up the computation and also reducing the memory consumption. The whole idea comes from this equation. As you can see, the posterior covariance matrix depends only on N_c, and at testing time N_c will be the zeroth-order sufficient statistic of the test utterance.
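As a sketch of this dependence (names are illustrative), the posterior precision can be assembled from the zeroth-order statistics alone:

```python
import numpy as np

def posterior_covariance(N, T_blocks, Sigma_inv):
    # N[c]: zeroth-order statistic (soft frame count) of component c.
    # T_blocks[c]: F x R block of the total variability matrix T.
    # Sigma_inv[c]: inverse covariance of UBM component c.
    R = T_blocks[0].shape[1]
    L = np.eye(R)                       # posterior precision L
    for c, Tc in enumerate(T_blocks):
        L += N[c] * (Tc.T @ Sigma_inv[c] @ Tc)
    return np.linalg.inv(L)             # posterior covariance L^{-1}
```

Two utterances with similar frame counts N therefore have similar posterior covariances, which is exactly the hypothesis used next.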
well
you okay so you the two i-vectors are also meeting that's integration
we assume that all we
i think okay
the composed here are covariance matrix a similar because as you can see we plot
and the mfcc audible acoustic that only
the zero all the sufficient statistic
so having this hypothesis
we and
proposals
to roll direct a according to their be activity
now can be
we find w happy that we you we used a scalar to define the we're
not be by facing for each scroll the i-vector reliability is modeled by performance vehicle
right matrix
and we obtain the posterior covariance matrix from the development data
Okay, so here the index k stands for the k-th group, and this U_k is independent of the session. If you look at the bottom of the slide, we have U_ij, which depends on the session. But now, if you look here, we have successfully turned U_ij, which was session dependent, into U_k, which is session independent. And with U_k being session independent, we can do a lot of pre-computation.
We group the i-vectors using three approaches. The first is based on the utterance duration, which is intuitive, because we believe the duration is related to the uncertainty, or the reliability, of the i-vector. We have also tried using the mean of the diagonal elements of the posterior covariance matrix; this is a nice thing to do because the mean of the diagonal elements is a scalar, so the grouping becomes very easy. And the last one we have tried is the largest eigenvalue of the posterior covariance matrix. This slide basically shows how we perform the grouping. For example, if this is the duration axis, then this group corresponds to extremely short utterances, this one to medium-length utterances, and the last one to very long utterances, and for each group we find one representative matrix, from U_1 to U_K, that represents the whole group.
So U_1 represents the posterior covariance matrix, or uncertainty, of extremely short utterances, and U_K correspondingly represents the posterior covariance matrix, or uncertainty, of very long utterances.
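Here is a minimal sketch of this development-time grouping, using the second criterion (mean of the diagonal of the posterior covariance) as the scalar reliability measure; the equal-occupancy binning and the group-average representative are illustrative assumptions, not necessarily the exact recipe in the paper:

```python
import numpy as np

def reliability(post_cov):
    # Criterion 2 from the talk: mean of the diagonal elements of the
    # posterior covariance matrix (a scalar, so grouping is easy).
    return float(np.mean(np.diag(post_cov)))

def build_groups(dev_post_covs, K):
    # Quantize development utterances into K reliability groups and
    # keep one representative covariance per group.
    scores = np.array([reliability(C) for C in dev_post_covs])
    edges = np.quantile(scores, np.linspace(0.0, 1.0, K + 1)[1:-1])
    labels = np.digitize(scores, edges)          # group index 0..K-1
    reps = [np.mean([C for C, g in zip(dev_post_covs, labels) if g == k],
                    axis=0) for k in range(K)]
    return edges, reps    # U_k = Cholesky factor of reps[k]
```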
Now, during scoring, all we need to do is find the reliability. By using one of the three approaches to quantify the reliability of the i-vector, we can find which group it belongs to, so that we replace all the session-dependent matrices with A_mn, B_mn, C_mn, and D_mn. Compare this with the conventional PLDA with UP: there, the matrices are all session dependent, because t stands for the test utterance and s stands for the target-speaker utterance. Now we switch to A_mn, B_mn, C_mn, and D_mn, and all of these have been pre-computed already using the development data. So, as you can see, there will be a large computation saving, because we use the pre-computed matrices rather than computing the covariance matrices on the fly.
This slide shows in more detail the computation savings that we can achieve. For PLDA with UP using the proposed fast scoring, we only need to determine the group IDs m and n, but for the conventional PLDA with uncertainty propagation, we have to compute all of these matrices during scoring.
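Continuing the sketch above, scoring then reduces to two group lookups plus the same quadratic form as conventional PLDA; here A, B, C, and D are the pre-computed K-by-K tables of scoring matrices (illustrative names):

```python
def group_id(post_cov, edges):
    # Map a new utterance to its reliability group (criterion 2),
    # reusing reliability() and edges from the grouping sketch above.
    return int(np.digitize(reliability(post_cov), edges))

def fast_score(w_s, w_t, m, n, A, B, C, D):
    # m, n: group IDs of the target-speaker and test utterances.
    # All matrices were pre-computed on development data, so scoring
    # involves only matrix-vector products, as in conventional PLDA.
    return (w_s @ A[m][n] @ w_s + w_t @ B[m][n] @ w_t
            + 2.0 * (w_s @ C[m][n] @ w_t) + D[m][n])
```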
We performed experiments on the NIST SRE 2010 common condition 2, using classical 60-dimensional MFCC vectors, 1024 Gaussians, and 500 total factors in the total variability matrix. And we tried the three different ways of grouping the i-vectors, that is, of quantizing the posterior covariance matrices.
Okay, so this table summarizes the results. The conventional PLDA is of course ultra fast; this figure represents the scoring time, by which I mean the total time for the whole evaluation on common condition 2. But unfortunately the performance is not very good. The reason the EER is not very good is that we used utterances of arbitrary duration: we did segmentation, cutting the utterances into short, medium-length, and long segments. So we did not use the original data for training and testing; instead, some of the utterances are very short, some are of medium length, and some are very long. In this way we created a situation with a big variety of durations in both the training and the test utterances.
Now, PLDA with UP performs extremely well, but unfortunately the scoring time is also very high. With our fast scoring approach, we successfully reduce the scoring time from here to here, with only a very small increase in the EER. If we use more groups, that is, if we make the number of groups larger, we can make the EER almost the same as the one achieved by PLDA with full uncertainty propagation. So what happens is that we successfully reduce the computation time without increasing the EER much. The same situation occurs for the minimum DCF; the details are in the paper. Also, we show only system two here, because the performance of system two and system three is very similar.
System one is based on the utterance duration. Now let us look at the memory consumption. The memory consumption shows a similar trend: PLDA uses a very small amount of memory, and PLDA with UP uses a much larger amount of memory, because we need to store all of the posterior covariance matrices of the utterances.
We are talking about gigabytes here. System one reduces the memory consumption almost by half, and systems two and three have about the same memory consumption. If we increase the number of groups, obviously the memory requirement will increase, but even with a number of groups as large as forty-five, it still uses less memory than the original PLDA with uncertainty propagation.
Here is the DET curve. As you can see, this line corresponds to the conventional PLDA, which reports poor performance, and all the other systems, systems one, two, and three, and also the one with UP, are much better, because with uncertainty propagation you can deal with utterances of arbitrary duration. And what we have found is that system one is slightly poorer than systems two and three, but system one has the largest reduction in terms of computation time.
So, in conclusion, we propose a very fast scoring method for PLDA with uncertainty propagation. The whole idea is to quantize the posterior covariance matrices, or the loading matrices representing the reliability of the i-vectors, and to pre-compute all of them as much as possible. To do this pre-computation, we need to perform the grouping first, at development time, and we defined three ways of performing the grouping. All of these groupings are based on a scalar; just like the k-means algorithm, where you use the distance to the centroid as the criterion for assigning a cluster, here we use the mean of the diagonal elements of the posterior covariance matrix, or the maximum eigenvalue of the posterior covariance matrix, or the utterance duration, as the criterion for the grouping. All of these are computationally light. As a result, the proposed fast scoring performs very similarly to the standard UP, but requires only 2.3 percent of the scoring time.
Thank you.
[Session chair] We have time for questions.
[Speaker] Yes. We do not truncate them at random points; rather, we use a fixed set of durations at one-second intervals, so three seconds, four seconds, five seconds, and so on, and we randomly extract segments of those durations from the speech data.
[Audience] So the durations range between three seconds and how much?
[Speaker] Up to the length of the original test utterance; some utterances are longer than others, so for different utterances we will have a different upper limit.
[Audience] I wonder if you could just comment on this. My experience with this method is that I found that it works well in situations other than the specific problem for which it was intended. If there is a gross mismatch between enrollment and test, such as telephone enrollment and microphone test channels, or a huge mismatch in the duration, then I found that this works well; but I was a bit disappointed with the performance on the specific problem that you are addressing here, which is just the problem of duration variability.
[Speaker] In fact, in our experiments we also have duration mismatch, because we deliberately generated duration mismatch in order to create a situation with utterances of arbitrary duration. Therefore the test utterance and the target-speaker utterance will have different group indices k. Of course, if we use a very small number of groups, say only U_1 and U_2, the utterances within each group will still have varied durations. But because everything is random, there will be a lot of utterances of various durations in both enrollment and test, so there will be duration mismatch in the trials.
[Audience] I would be very interested to see what happens in the upcoming NIST evaluation, where this problem is going to be at the forefront.
[Speaker] Excellent, thanks.
[Audience] In that evaluation, you know, the durations will be truncated to between ten seconds and sixty seconds. So I think we are all looking at up to five percent equal error rate, even before we move to the Cantonese and Tagalog verification trials.
[Session chair] Okay, then let us thank the speaker.