okay so i'm Hagai, and the title of my talk is Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition
but I hope the presentation is much simpler than this long title
okay so the problem setup is quite similar to what we have already heard today, but with some twist
so, as we have already heard, state-of-the-art speaker recognition can be quite accurate
when trained on a lot of matched data
but in practice, many times we don't have the ability to do that
so we have a lot of data from a source domain
namely NIST data
and we may have limited data, or no data at all, from our target domain, which may be very different
in many senses from the source domain
and this work addresses the setup where we don't have any data at all
from the target domain, okay, so it's a bit different from what we heard
so far
and of course there are related applications. one is when you do have some small labeled dataset to adapt your model
but the problem is that that's not always the case, because many times you don't have data at all
or you may have very limited amounts of data, and you may want to be able to adapt properly using such scarce data
now the second related setup is when you have lots of unlabeled data to adapt the models, and that's what we already heard today
but for many applications, for example text-dependent ones, and in other cases,
you don't have a lot of unlabeled data either
and also, one of the problems with the methods that we already heard today is that most of them are based on clustering
and clustering may be a tricky thing to do, especially when you don't have the luxury
of seeing the results of your clustering algorithm, and you cannot tune it if you don't really have labeled data from your target domain
so we would rather not use clustering
okay so
so this work was don it started to doing dodge h u a recent jhu
speaker verification a workshop
and this is a variant of the domain adaptation challenge in
i quite that domain robustness shannon's
so that we steer is that we do not use adaptation data at all
that is we don't use the mixer for a at all even for a centring
that they dial whitening the data
and again we can see here are some based on the results we see that
if we trained our system on dexter we get the any collate of two point
four if we trained only on switchboard without any a use the of a mix
of for centring
we get the eight point two
and the challenge is to try to bridge disguise
okay so we are all familiar with PLDA modeling; just for notation, we parameterize the PLDA model by three hyper-parameters: mu, which is the center, or the
mean of the distribution of all i-vectors; B, which is the
between-speaker covariance matrix, or across-class covariance matrix; and W, which is
the within-speaker covariance matrix
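To make the notation concrete, here is a minimal sketch, assuming simple moment-based estimates of the three hyper-parameters from labeled i-vectors; real PLDA training is usually done with EM, so this is only illustrative, and all names here are mine:

```python
import numpy as np

def estimate_plda_hyperparams(ivectors, speaker_ids):
    """Moment-based estimates of the three PLDA hyper-parameters:
    mu (center), B (between-speaker cov), W (within-speaker cov)."""
    mu = ivectors.mean(axis=0)
    speakers = np.unique(speaker_ids)
    spk_means = np.array([ivectors[speaker_ids == s].mean(axis=0)
                          for s in speakers])
    diff = spk_means - mu
    B = diff.T @ diff / len(speakers)      # scatter of the speaker means
    resid = ivectors - spk_means[np.searchsorted(speakers, speaker_ids)]
    W = resid.T @ resid / len(ivectors)    # scatter around each speaker mean
    return mu, B, W
```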
okay so just before describing the method, I want to present some interesting
experiments
on the data
so if we estimate a PLDA model from Switchboard and we also do
the centering on Switchboard, then I get an equal error rate of
8.2
and using Mixer to center the data helps, but not that much
okay, if we just use the NIST 2010
data to center
the evaluation data, then we get a very large improvement, so from here
we can see that centering is a big issue
now however, if we do the same experiment with Mixer, we see here that centering
with the NIST 2010 data doesn't help that much
so
the conclusions are quite mixed
so basically what we can say is that centering can really account
for some of the mismatch, but in quite a complicated way, and it's not really clear when centering will help or hurt
because we can see that centering with Mixer is not good enough for Switchboard,
but it is quite good enough for the Mixer build
okay so the proposed method, which is only partly related to the experiments I just showed;
the upshot is the following. the basic idea here is that we hypothesize
that some
directions in the i-vector space are more important, and actually account for most
of the dataset mismatch
this is the underlying assumption
and we want to find these directions and then remove this subspace
using a projection which we will call P
and we call this method inter-dataset variability compensation, or IDVC
we actually presented this in a recent ICASSP paper, but what we
did there was focus only on the center hyper-parameter and try to
estimate everything from the center hyper-parameter
and what we do here is also use the other hyper-parameters,
B and W. so basically, what we want to do is
find a projection that minimizes the variability of these hyper-parameters when
trained over different datasets
that's the basic idea
okay so how can we find this projection P?
well, given a set of datasets, we represent each one
by a vector in the hyper-parameter space, namely (mu, B, W)
so we can think of, for example, Switchboard as one point in
this space and Mixer as another point, or we can also look at different
components of Switchboard and Mixer, different years, different deliveries, and each one can
be represented as a point in this space
and the dataset mismatch problem is actually illustrated by the
fact that the points
have some variation; they are not the same point. if we had a very
robust system, then all the points would collapse into a single point. that's
the goal
so the idea is to try to find some projection, like it's
done in the NAP
approach, but here the projection is in the i-vector space,
that would effectively reduce the variability in this PLDA hyper-parameter space
and what we do in this work, we do it independently for each one
of the hyper-parameters mu, B and W, so for each one we
just compute this projection, or subspace, and then we just combine them all
into a single one, as sketched below
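The talk doesn't spell out the combination step; a natural reading, and this is only my sketch rather than necessarily the exact recipe, is to take the union of the three nuisance bases and remove its span with a single projection:

```python
import numpy as np

def combine_subspaces(bases):
    """Combine the nuisance bases found for mu, B and W (each a (d, k_i)
    matrix of directions) into one removal projection P = I - U U^T."""
    stacked = np.hstack(bases)
    U, s, _ = np.linalg.svd(stacked, full_matrices=False)
    r = int(np.sum(s > 1e-10 * s[0]))   # numerical rank of the union
    U = U[:, :r]                        # orthonormal basis of the joint subspace
    d = stacked.shape[0]
    return np.eye(d) - U @ U.T
```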
okay
so one question is: how can we do this if
we don't have the target data?
well, what we actually do is use the data that we do have, and
we hope that what we find generalizes to unseen data
of course, if we do have some data, we can also apply it
to that data as well
so what we do is observe our development data, and the main
point here is that, contrary to what we generally
believed in the past, our data is not homogeneous; if it were homogeneous, then
this would not work
so we exploit the fact that the data is not homogeneous, and we
try to divide it, to partition it, into distinct subsets
and the goal is that each subset is supposed to be quite homogeneous,
and the different subsets should be as far from
each other as possible. so in practice, what we did is observe
that Switchboard consists of six different deliveries. we
didn't even look at the labels
we just said, okay, we have six deliveries, so let's make a partition
into six subsets. we also tried to partition according to gender
and we also tried to see what happens if we have only two
partitions, but here we selected them intentionally, one to
be the landline data and one to be the cellular data
so we have these three different partitions
so now the method is the following. first we estimate the projection
P
what we do is take our development data and divide it into distinct subsets
now, in general,
this doesn't have to be the actual development data that we
intend to train the system on; we can also try to collect other sources of data
which we believe represent many types of
mismatch, and we can apply the method
to the collection of all the datasets that we managed to get our hands on
so once we have these subsets, we can estimate a PLDA model for each
one
and then we find the projection in the way I will show in the
next slides
now, once we have found this projection P, we just apply it to
our i-vectors as a preprocessing step. we do it before everything else: before
centering, whitening, length normalization and everything
and then we just retrain our whole PLDA model
so the hope is that this projection just cleans up,
in some sense,
some of the dataset mismatch
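Just to make the ordering concrete, here is a minimal sketch of that preprocessing chain, assuming the projection P, the center mu and a whitening matrix have already been estimated (all names here are hypothetical):

```python
import numpy as np

def idvc_preprocess(ivectors, P, mu, whitener):
    """Apply the IDVC projection first, then the usual preprocessing
    steps, in the order described in the talk."""
    x = ivectors @ P.T                              # remove the mismatch subspace
    x = (x - mu) @ whitener.T                       # center and whiten
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # length-normalize
    return x
```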
okay so first, how do we do it for the mu hyper-parameter? it's very
simple: we just take the collection of centers that we gathered from the different datasets,
we apply PCA to it, and we construct the projection from the top
eigenvectors
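A minimal sketch of that step, with hypothetical names:

```python
import numpy as np

def mu_subspace(centers, k):
    """PCA on the collection of per-subset centers: return the top-k
    eigenvectors of their scatter as the nuisance directions for mu.

    centers: (m, d) array, one mean i-vector per subset."""
    c = centers - centers.mean(axis=0)
    vals, vecs = np.linalg.eigh(c.T @ c)   # eigh returns ascending order
    return vecs[:, -k:]                    # at most m - 1 useful directions
```

Note that with m subsets there are at most m - 1 such directions, which is exactly the limitation mentioned later in the talk.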
now for the matrices B and W
it's a bit different, and I will show how it's done for W;
the same can be done for B
so basically, given a set of covariance matrices W_i, we have one for
each dataset
we also define the mean covariance, W-bar
and now let us define a unit vector v, which is a direction in the
i-vector space
now what we can do is compute the variance
of a given covariance matrix W_i along this direction, or the
projection of the covariance onto this direction
this is
v^T W_i v
now the goal that we define here is that we want to find such
directions v
that maximize the variance, over the datasets, of this quantity normalized by v^T W-bar
v; so we normalize the variance along each direction by the
average variance along it
and we want to find directions that maximize this, because a direction
that maximizes this quantity means that the different PLDA models for different datasets
behave very differently along this direction in the i-vector space, and
this is what we want to remove, or maybe model in the future
but for the moment we want to remove it
okay so the algorithm to find this is quite straightforward: we first whiten
the i-vector space
with respect to W-bar
and then we just compute the sum of the squares of the whitened W_i,
and again find the top eigenvectors, and
we construct the projection P to remove these eigenvectors
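Here is a minimal sketch of that recipe, under my reading of it; the returned directions live in the whitened space, so the removal would be applied to whitened i-vectors, and the paper's exact handling may differ:

```python
import numpy as np

def w_subspace(W_list, k):
    """Whiten with respect to the mean covariance W_bar (assumed positive
    definite), sum the squares of the whitened per-subset covariances,
    and keep the top-k eigenvectors as the nuisance directions."""
    W_bar = np.mean(W_list, axis=0)
    vals, vecs = np.linalg.eigh(W_bar)
    T = vecs @ np.diag(vals ** -0.5) @ vecs.T   # whitener: T W_bar T^T = I
    S = np.zeros_like(W_bar)
    for W in W_list:
        Wh = T @ W @ T.T                        # subset covariance, whitened
        S += Wh @ Wh                            # matrix square
    _, evecs = np.linalg.eigh(S)
    return evecs[:, -k:]                        # top-k directions (whitened space)
```

Since the whitened covariances average to the identity, summing Wh @ Wh gives the same top eigenvectors as summing (Wh - I)^2, which is the quantity the normalized-variance objective suggests.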
the proof that this is actually the right thing to do is quite simple,
but I won't go over it because we have lunch
so
it's in the paper, and it's very simple
it's immediate
okay so now just one more thing that may
be quite important: what happens if we want to use other data sources to
model the mismatch, but maybe for those data sources we
don't have speaker labels?
so, to estimate the mu hyper-parameter it's quite easy, we don't need
any speaker labels at all, but for W and B we do need them, so
what we can do in those cases is just replace
W and B with T, which is the total covariance matrix, and of course we
can estimate that without speaker labels
and what we can see here is that for typical datasets, where
there is a large number of speakers, T can be approximated
by W plus B
okay so
if that is the case, it means that if we have high inter-dataset
variability along some directions for T
then the same directions v would actually be optimal,
or a good choice, to model also for either W or B, and
vice versa. so instead of finding this subspace for W and for B,
we can just find it for T, and it will be practically almost as good
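As a sketch, reusing the recipe above with hypothetical names:

```python
import numpy as np

def t_subspace(subsets, k):
    """No speaker labels needed: per subset, estimate the total covariance
    T (roughly B + W when there are many speakers) and reuse the same
    eigenvector recipe sketched above for W."""
    T_list = [np.cov(x, rowvar=False, bias=True) for x in subsets]
    return w_subspace(T_list, k)   # w_subspace from the earlier sketch
```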
okay so now, results
first, here are a few results using only the mu hyper-parameter,
the center hyper-parameter. we can see here, in the blue curve,
the results using this approach for a PLDA system trained on Switchboard
we see that we started with 8.2; at first we
have a slight degradation, because the first
dimension that we remove
is actually the gender information
but then we start to get gains
and we get an equal error rate of 3.8
now, what we see here in
the red curve and the black curve is what happens if we
apply the same IDVC method to a Mixer-based build
so
the red and black curves are when we train the
PLDA model on Mixer, and we still want to apply IDVC to see
if maybe we're getting some gains, or at least we're not losing anything. so what we
can see is that, in general,
nothing much happens here when you apply IDVC to the
Mixer build; at least we're not losing
okay so now the same thing is done, but this time for the
W hyper-parameter and for the B hyper-parameter
we can see here, in blue
and light blue, what happens when we train a system on Switchboard and apply IDVC
to either one of these hyper-parameters
we see that we get very large improvements, even larger than we get for the
center hyper-parameter
up to around, I don't know, one hundred dimensions, and then
we start to get degradation
so if we remove
too many dimensions, then we start getting degradation
note that
for the center hyper-parameter we cannot remove too much anyway, because if we have, for
example, only twelve subsets, then we can remove at most eleven dimensions
but for W and B we can actually remove up to four hundred
so it's a bit different
and okay, now again in black and red, we see what happens when
we apply the same method to the Mixer-based PLDA system
we see here again that we get a very slight, not really significant, improvement, and
then we start getting degradation
and what we see here that is quite interesting is that after a dimension of around, let's
say, one hundred and fifty
all the systems actually behave the same. so my interpretation is that
we actually managed to remove most or all of the dataset mismatch, but we also
removed some of the good information from the system, and therefore we
get some degradation, but
the systems actually behave roughly the same
after we remove a dimension of one hundred and fifty
okay so now, what happens when we combine everything together? we started
from 2.4 for the Mixer build and 8.2 for the Switchboard
build, and for the different partitions we get slightly different results, between an equal error rate of
3 and 3.3
and if we just use the simplistic partition
into only two subsets, and we use only the hyper-parameters
that we can estimate without speaker labels, mu and T, we
get 3.5, so
the conclusion is that it actually works also without speaker labels
so to conclude, we have shown that IDVC can effectively reduce
the influence of dataset variability
at least for this particular setup, the domain robustness challenge,
where we actually managed to recover roughly ninety percent of the
error
compared to a purely Switchboard-based build
and also, this IDVC method works well even when trained on two subsets only
and without speaker labels
okay
thank you
i wonder if you happen to know what would happen if you simply
projected away the leading, sorry, one hundred eigenvectors from the total covariance
of the training set corpus
without bothering to train the individual W and B matrices
the ones from the subsets, so just taking the total covariance
that's a method we recall as
we call that applying NAP to the total variability subspace, total variability subspace removal. we have
this in the ICASSP paper; we've done that, and you also get gains, but
not as large
not nearly as large
if i understand correctly, you're saying the main
benefit comes from trying to
optimize the within-class covariance matrices across datasets
trying to make those look the same
so the W matrix is
okay so basically, yes, you could say that
this assumption is in some sense
reasonable: some directions in the i-vector space are more sensitive to mismatch,
to dataset mismatch, and some are not
and
it's hard to observe that from the data unless you do something like we did
did you try something simpler, like just doing per-dataset whitening?
so rather than the projection: you did centering, but if you did only
whitening, did it change the performance?
right
well, that's contrary to other sites, right?
i think whitening is generally useful
so one question would be: if you did whitening per
dataset
would you get the same effect in a soft form, versus the
projecting away?
i tried. i didn't run it many times, I just tried a very quick
experiment, and it didn't help at all. i don't know, maybe it's
something I have to do more carefully, but
just a question: if you do the projection on the data you use to train PLDA, do your within
and between covariance matrices become singular or something?
that's like the fifth time I've heard this question, and that's why I'm prepared
so basically, it's the same as when you apply LDA, for example, before
PLDA, as is usually done
so either you can just move to the lower dimension: you
can actually remove these dimensions and do everything in the
lower dimension
or you can do some tricks to fix it
like adding some
small quantities to
the covariance matrices
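For instance, one common trick of that kind, and this is only my sketch of what such a fix might look like, not necessarily what was used, is a small ridge term:

```python
import numpy as np

def ridge_fix(C, eps=1e-6):
    """Hypothetical fix: add a small ridge so a covariance matrix that
    became singular after the projection stays invertible."""
    d = C.shape[0]
    return C + eps * (np.trace(C) / d) * np.eye(d)
```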
can i just ask: in your paper you
contrasted against source normalization, and that was the comparison presented in the ICASSP paper
unfortunately I can't access it here; I was trying to look it up as we went along
and I'd also be keen to see source normalization extended further
the reason I bring this up is
that in this context the datasets are speaker-disjoint across the different datasets. what about the
context where perhaps I have a system trained on telephone speech for a
certain number of
speakers, and then you suddenly acquire data from those
same speakers in a different channel?
what's going to happen, I mean, on the
testing side of things, if you acquire more data from those speakers
from a microphone channel perhaps
and then you also acquire microphone data from a different set of speakers for training, for adapting
the system?
it appears that at this point the within-class variations are estimated independently on each
dataset
so
does that mean that the difference between those datasets is going to be suppressed
or is it actually maintained
under this framework?
i think it's not very sensitive to that
because it looks at very broad hyper-parameters, so it doesn't really matter if it's
the same speaker or not
okay, maybe we can discuss it offline a bit
the final question
so
in the past
we started dealing with the channel problem
by projecting stuff away, as you do
and then the development led to softer approaches: we started modeling the stuff instead of
explicitly projecting it away, like JFA, so
can you imagine a probabilistic model which
includes a variability for the dataset? actually, thanks for the question, I was thinking of maybe
adding another slide on that
another slide on that i actually tried to
two where extend the p lda model with another like plus the to add another
component which is the dataset mismatch which
behaves the be different now do not alright and a components and i meant some
experiments i got something any improvement
compared to the baseline but not as good as i got to using this nap
approach
and i have some any ideas why that is that what this is the case
but i will not be supplied someone just the that's it in different way and
gets better result
okay