The next presentation is on factor analysis of acoustic features using a mixture of probabilistic principal component analyzers, for robust speaker verification.
So, that is, factor analysis of acoustic features using a mixture of probabilistic principal component analyzers for robust speaker verification.
So, in the introduction, what I want to say is that factor analysis is a very popular technique when applied to GMM supervectors.
The main assumption there is that, for a randomly chosen speaker, the GMM supervector lies in a low-dimensional subspace.
What is often overlooked is that the acoustic features themselves also lie in low-dimensional subspaces, and this phenomenon is not really taken into consideration in GMM-supervector-based factor analysis.
So we propose to see what happens if we do factor analysis on the acoustic features, in addition to the i-vector-based approach.
So, just to say more about the motivation: we know that speech spectral components are highly correlated, so in the MFCC features we have a DCT to decorrelate them, and there has been a lot of work on trying to decorrelate the features.
It has been shown that the first few eigen-directions of the feature covariance matrix are more speaker-dependent.
So what we believe is that retaining all the eigen-directions of the features might actually be harmful; there might be some directions that are not beneficial.
We also have evidence from full-covariance-based i-vector systems, which work better than diagonal-covariance systems, and that motivates us to investigate this further.
So if you look at the covariance matrix of a full-covariance UBM, this is how it typically looks.
And if you look at the eigenvalue distribution, you can see that most of the energy is compressed in the first, say, thirty-two eigenvalues in this case, so it is pretty compact.
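Roughly, the check behind this observation can be sketched as follows; this is a minimal NumPy sketch with a stand-in random covariance (the real matrix would come from a trained UBM component), assuming sixty-dimensional features as in the experiments later:

    import numpy as np

    # Stand-in covariance for one full-covariance UBM component;
    # in practice Sigma_c would be taken from the trained UBM.
    D = 60
    rng = np.random.default_rng(0)
    A = rng.standard_normal((D, D))
    Sigma_c = A @ A.T / D

    eigvals = np.linalg.eigvalsh(Sigma_c)[::-1]        # sorted, largest first
    cum_energy = np.cumsum(eigvals) / eigvals.sum()
    # Number of leading eigenvalues needed to capture 95% of the energy;
    # for the real UBM covariances most of the energy sat in roughly the first 32.
    print(np.argmax(cum_energy >= 0.95) + 1)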
So there is reason to believe that there are some components in the features which are not really contributing.
So we use factor analysis on the acoustic features, and this is the basic formulation, very simple: x = W y + mu + epsilon.
Here x is the feature vector, W is the factor loading matrix, y is the acoustic factors, which are basically the hidden variables, mu is the mean vector, and epsilon is the isotropic noise.
So this is basically PPCA, and the interpretation is that the covariance of the acoustic features is now modeled by the hidden variables, while the residual variance is modeled by the noise model.
The pdf of the model is then x ~ N(mu, W W^T + sigma^2 I).
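As a minimal sketch of that density (assuming NumPy/SciPy; illustrative only, not the exact code used in the system):

    import numpy as np
    from scipy.stats import multivariate_normal

    def ppca_logpdf(x, W, mu, sigma2):
        """Log-density of the PPCA model: x ~ N(mu, W W^T + sigma2 * I)."""
        D = mu.shape[0]
        C = W @ W.T + sigma2 * np.eye(D)
        return multivariate_normal.logpdf(x, mean=mu, cov=C)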
and so what we try to do here is we want to place the acoustic
features by the acoustic factors basically the or the estimation of the acoustic factors
and try to use them as the features
believing that these acoustic factors
have more speaker-dependent information and the full feature vector might have some nuisance components
So a transformation matrix is derived; it comes from Tipping and Bishop's paper.
First you have to select the number of components you want to keep: suppose we have sixty features and we want to keep forty, then Q would be forty.
The noise variance estimate is given by the average of the remaining components of the sorted eigenvalues, sigma^2 = (1 / (D - Q)) * sum_{j=Q+1}^{D} lambda_j, where lambda_j is the j-th eigenvalue of the covariance matrix of x.
And this is the maximum likelihood estimate of the factor loading matrix, W = U_Q (Lambda_Q - sigma^2 I)^{1/2}, which is also from Tipping and Bishop's paper.
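A minimal NumPy sketch of these closed-form estimates (illustrative only):

    import numpy as np

    def ppca_ml_params(Sigma, Q):
        """ML PPCA parameters from a D x D feature covariance, keeping Q factors."""
        eigvals, eigvecs = np.linalg.eigh(Sigma)
        order = np.argsort(eigvals)[::-1]                  # largest eigenvalues first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        sigma2 = eigvals[Q:].mean()                        # average of discarded eigenvalues
        W = eigvecs[:, :Q] * np.sqrt(np.maximum(eigvals[:Q] - sigma2, 0.0))
        return W, sigma2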
So this is how we estimate the acoustic factors: we basically take the expected value, the posterior mean, of the acoustic factors.
It can be shown to be given by the expression y_hat = M^{-1} W^T (x - mu), with M = W^T W + sigma^2 I, so it is basically removal of the mean followed by a transformation with this matrix; it is just a linear transformation.
This is what we would like to call the transformed feature vector, and if you look at its mean and covariance, it is zero-mean Gaussian distributed with a diagonal covariance matrix, which is given in the paper.
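Sketched in the same NumPy style (again illustrative, not the exact implementation):

    import numpy as np

    def acoustic_factors(x, W, mu, sigma2):
        """Posterior mean of the acoustic factors: E[y | x] = M^{-1} W^T (x - mu)."""
        Q = W.shape[1]
        M = W.T @ W + sigma2 * np.eye(Q)                   # Q x Q matrix
        return np.linalg.solve(M, W.T @ (x - mu))          # mean removal + linear transform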
So what we do is a mixture of these models, which is basically the mixture-of-PPCA formulation.
It is essentially like a Gaussian mixture model, but the nice thing about it is that you can compute the FA parameters directly from the full-covariance UBM, and that becomes really handy, as you will see.
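That "directly from the UBM" step can be sketched as below: since each UBM component already provides a mean and a full covariance, the per-mixture PPCA parameters follow in closed form, with no EM (illustrative NumPy, with hypothetical array names ubm_means and ubm_covs):

    import numpy as np

    def mixture_ppca_from_ubm(ubm_means, ubm_covs, Q):
        """Per-mixture PPCA parameters read directly off a full-covariance UBM.
        ubm_means: (C, D) component means; ubm_covs: (C, D, D) full covariances."""
        params = []
        for mu_c, Sigma_c in zip(ubm_means, ubm_covs):
            lam, U = np.linalg.eigh(Sigma_c)
            order = np.argsort(lam)[::-1]
            lam, U = lam[order], U[:, order]
            sigma2_c = lam[Q:].mean()                      # discarded-eigenvalue average
            W_c = U[:, :Q] * np.sqrt(np.maximum(lam[:Q] - sigma2_c, 0.0))
            params.append((mu_c, W_c, sigma2_c))           # closed form, no EM needed
        return params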
Next I would like to talk about how we use the transformations.
You have a thousand and twenty-four mixtures, and each mixture has a transformation.
So what you could do is take a feature vector, find the most likely mixture, transform the feature with that mixture's transformation, and then replace the original vector, right?
But what we saw is that this is actually not quite the optimal way of doing it.
If you look at the top-scoring mixture posterior over, say, your development data, this is roughly the distribution you get.
What this tells you is that it is very rare for an acoustic feature to be unquestionably aligned to one mixture; most of the time the top posterior is something like 0.4 or 0.5.
So that means you cannot really say that a particular feature vector comes from one particular mixture; it kind of belongs to a lot of mixtures, or at least more than one.
So what we want to do is keep all the transformations from all the mixtures.
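The quantity behind that plot is just the frame-level mixture posterior; a minimal sketch of how it can be computed (assumed NumPy/SciPy, with hypothetical weights, means, and covs arrays for the UBM):

    import numpy as np
    from scipy.stats import multivariate_normal

    def frame_posteriors(x, weights, means, covs):
        """Mixture occupation probabilities gamma_c(x) for a single frame x."""
        logp = np.array([np.log(w) + multivariate_normal.logpdf(x, m, S)
                         for w, m, S in zip(weights, means, covs)])
        logp -= logp.max()                                 # numerical stability
        p = np.exp(logp)
        return p / p.sum()
    # In the development data the largest gamma_c was typically around 0.4-0.5,
    # not close to 1, which is why a hard assignment discards information.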
So this is how we do it, basically by integrating the process within the total variability model, that is, within the i-vector system.
First we train a full-covariance UBM.
Then we compute the parameters: we set the value of Q to, say, fifty, we find the noise variances, which are different for each mixture, and for each mixture we find a factor loading matrix and the corresponding transformation.
The way it is applied is basically directly on the first-order statistics; you do not actually have to do it frame by frame.
You compute the statistics and you can just apply the transformation to them, so it becomes very simple: you just transform the first-order statistics, and the transformation is completely integrated within the statistics, as sketched below.
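Because the per-mixture transform is linear, applying it to the centered first-order statistics is the same as transforming every frame and then accumulating; a minimal sketch under the same assumed parameter layout as above:

    import numpy as np

    def transform_first_order_stats(N, F, params):
        """Apply each mixture's PPCA transform to its centered first-order statistics.
        N: (C,) zeroth-order stats; F: (C, D) first-order stats;
        params: list of (mu_c, W_c, sigma2_c) per mixture."""
        F_tilde = []
        for N_c, F_c, (mu_c, W_c, sigma2_c) in zip(N, F, params):
            Q = W_c.shape[1]
            M_c = W_c.T @ W_c + sigma2_c * np.eye(Q)
            # sum_t gamma_c(t) M_c^{-1} W_c^T (x_t - mu_c) = M_c^{-1} W_c^T (F_c - N_c mu_c)
            F_tilde.append(np.linalg.solve(M_c, W_c.T @ (F_c - N_c * mu_c)))
        return np.stack(F_tilde)                           # (C, Q) transformed statistics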
So these are the differences with respect to conventional T-matrix training.
The feature size becomes Q instead of D, the supervector size becomes the number of mixtures times Q, and the total variability matrix becomes smaller.
Most importantly, the UBM gets replaced by the distribution of the transformed features: since we are not using the original features in the subsequent processing, this is not really the UBM anymore, it is basically how its parameters get replaced.
The i-vector extraction procedure is similar.
For the system, we have a phone-recognizer-based voice activity detector and sixty-dimensional MFCC features with cepstral mean normalization.
We have a gender-dependent UBM with a thousand and twenty-four mixtures.
We train the full-covariance UBM with variance flooring; the flooring parameter constrains the minimum value of the covariance matrix to a fixed value.
The i-vector size was four hundred, and we used five iterations.
We have a PLDA backend with a full-covariance noise model, and the only free parameter there is the eigenvoice size.
On top of that we have the FA step which I just talked about, and we derive all its parameters from the UBM directly.
We performed experiments on SRE 2010, conditions one to five, using the male trials.
So these are the initial results.
We varied the eigenvoice size, and we used Q equal to fifty-four, forty-eight, and forty-two; our feature size is sixty, so you can see we are taking off six components at a time.
You can see that we get a nice improvement using the proposed technique.
So here is a table showing some of the systems that we fused.
The baseline is sitting here, and we are getting a nice improvement in all three metrics for a couple of values of Q.
It is kind of hard to say which Q would work best; that is a challenge.
Also, a single Q may not be optimal: it can have a different value in each mixture, depending on how the covariance structure looks in that mixture.
I also did some work on that, which you will probably see at Interspeech.
So anyway, when we fuse the systems, it is late (score-level) fusion, and we can see that we still get a pretty nice improvement by fusing different combinations.
So these systems do carry complementary information.
These are actually extra experiments performed after this paper was submitted, so they are not shown in the paper; they cover the other conditions.
In condition one, Q equal to forty-eight works nicely; in condition two, Q equal to forty-two works; in condition three, Q equal to forty-eight and fifty-four work.
In condition four the new DCF did not improve, but in the other conditions you can see clearly that the proposed technique works well: it reduces all three performance measures.
And after fusion you can actually see a nice improvement in all three metrics.
So here is the DET curve for one of the conditions, and we just picked the Q equal to forty-two system.
You can see that almost everywhere the FA system is better than the baseline, and with fusion we get a further improvement.
To summarize, we have proposed a factor analysis framework for acoustic features with a mixture-dependent feature transformation, giving a compact representation.
We also proposed a probabilistic feature alignment method instead of hard-clustering a feature vector to a mixture, and we showed that it provides better performance when we integrate it with the i-vector system.
As a kind of nice side effect, it also makes the system faster, because reducing the feature vector dimensionality in turn reduces the supervector size and the total variability matrix size; as discussed in the paper, the computational complexity is proportional to the supervector size.
As for future work, the value of Q does not have to be global; it can be mixture-dependent.
Here we used a common feature dimension, say forty-eight, for all the mixtures, but it could be different per mixture, and one of my papers submitted to Interspeech deals with trying to optimize this parameter in each mixture.
Some other future work will be using the iterative EM techniques proposed in Tipping and Bishop's method for the mixture of PPCA.
Most of all, this actually opens up the possibility of using other transformations mixture-wise, which might also be interesting: instead of conventional transformations such as NAP or other techniques applied globally, you take a transformation in each mixture and then basically integrate it with the i-vectors.
That is all I have, thank you.
Sorry, can you go back to the acoustic features?
Yeah. Yeah.
Do you need to train the UBM from scratch?
Oh yeah, I did try that; I have seen some papers on that too. I think the way I did it here, I thought, was reasonable. Sure.
To assign a feature vector to a mixture you have to have some kind of measure; usually you find the mixture that gives you the highest posterior probability.
But in the distribution I am showing, it is not always a one-to-one assignment, because sometimes the maximum value of the posterior probability of a mixture is, say, only 0.2, which means there are other mixtures at 0.1-something.
If you take the 0.2 mixture as the maximum and use only that mixture's transformation, it will be suboptimal.
So yes, we could do it that way, but I tried this because, having seen this distribution, I thought it would be nicer to integrate things and keep all the mixtures together.
And the number of trials?
Yeah. I think I normalized the variance there.
Although... right, yes, maybe what you are saying is true. In some conditions, maybe; I do not know if I have that problem, I believe. Well, yeah, I think that is it.