Okay, so I'm Hagai, and the title of my talk is "Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition".

I hope the presentation is much simpler than this long title.

Okay, so the problem setup is quite similar to what we have already heard today, but with a twist.

As we already heard, state-of-the-art speaker recognition can be quite accurate when trained on a lot of matched data.

But in practice, many times we don't have the luxury to do that. We have a lot of data from a source domain, namely NIST data, and we may have limited data, or no data at all, from our target domain, which may be very different, in many senses, from the source domain. This work addresses the setup where we don't have data at all from the target domain, so it's a bit different from what we heard so far.

Of course there are related applications. The first one is when you do have some small labeled dataset to adapt your model. The problem is that that's not always the case: for many applications you don't have data at all, or you may have a very limited amount of data, and you want to be able to adapt properly using such scarce data.

The second related setup is when you have a lot of unlabeled data to adapt the models, and that's what we already heard today. But for many applications, for example text-dependent recognition, and in other cases, you cannot expect to have a lot of unlabeled data.

Also, one of the problems with the methods we already heard today is that most of them are based on clustering, and clustering may be a tricky thing to do, especially when you don't have the luxury of seeing the results of your clustering algorithm: you cannot evaluate it if you don't have labeled data from your target domain. So we would rather not use clustering.

Okay, so this work was done at the recent JHU speaker verification workshop, and it is a variant of the domain adaptation challenge that I call the domain robustness challenge. The twist here is that we do not use adaptation data at all; that is, we don't use the Mixer data at all, not even for centering and whitening the data.

Here are some baseline results. We can see that if we train our system on Mixer we get an equal error rate of 2.4; if we train only on Switchboard, without any use of Mixer even for centering, we get 8.2. The challenge is to try to bridge this gap.

Okay, so we are all familiar here with PLDA modeling; just for notation, we parameterize the PLDA model by three hyper-parameters: mu, which is the center, or the mean of the distribution of all i-vectors; B, which is the between-speaker covariance matrix, or across-class covariance matrix; and W, which is the within-speaker covariance matrix.
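For concreteness, here is a minimal sketch of how these three hyper-parameters could be estimated from labeled development i-vectors. This is a generic moment-based estimate, not necessarily the exact recipe used in the talk, and the function name is mine.

```python
import numpy as np

def estimate_plda_hyperparams(ivectors, speaker_ids):
    """Moment-based estimates of the PLDA hyper-parameters (mu, B, W).

    ivectors: (n, d) array of i-vectors; speaker_ids: length-n label array.
    """
    mu = ivectors.mean(axis=0)              # center of all i-vectors
    d = ivectors.shape[1]
    B = np.zeros((d, d))                    # between-speaker covariance
    W = np.zeros((d, d))                    # within-speaker covariance
    speakers = np.unique(speaker_ids)
    for spk in speakers:
        x = ivectors[speaker_ids == spk]
        m = x.mean(axis=0)
        B += np.outer(m - mu, m - mu)       # spread of speaker means
        W += (x - m).T @ (x - m)            # spread around each speaker mean
    return mu, B / len(speakers), W / len(ivectors)
```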

Okay, so just before describing the method, I want to present some interesting experiments on the data.

If we estimate a PLDA model from Switchboard and we also do the centering on Switchboard, then we get an equal error rate of 8.2. Using Mixer to center the data helps, but not that much. If we instead use the NIST '10 data to center the evaluation data, then we get a very large improvement, so from here we can see that centering is a big issue.

However, if we do the same experiment on Mixer, we see that centering with the NIST '10 data doesn't help that much.

So the conclusions are quite mixed. Basically, what we can say is that centering can account for some of the mismatch, but in a quite complicated way, and it's not really clear when centering will hurt or not, because we can see that centering with Mixer is not good enough for the Switchboard build, but it is quite good enough for the Mixer build.

Okay, so the proposed method, which is partly motivated by the experiments I just showed, is the following. The basic idea here is that we hypothesize that some directions in the i-vector space are more important, and actually account for most of the dataset mismatch; this is the underlying assumption. We want to find these directions and remove this subspace using a projection P, which we will find. We call this method inter-dataset variability compensation, or IDVC.

We actually presented this at the recent ICASSP, but what we did there was focus only on the center hyper-parameter and try to estimate everything from the center hyper-parameter. What we do here is also address the other hyper-parameters, B and W. So basically, what we want to do is find a projection that minimizes the variability of these hyper-parameters when trained over different datasets. That's the basic idea.

Okay, so how can we find this projection P? Well, given a set of datasets, we represent each one by a vector in the hyper-parameter space, i.e., by mu, B and W. So we can think of, for example, Switchboard as one point in this space and Mixer as another point; or we can also look at different components of Switchboard and Mixer, different years, different deliveries, and each one can be represented as a point in this space.

The dataset mismatch problem is actually illustrated by the fact that the points have some variation; they are not the same point. If we had a very robust system, then all the points would collapse into a single point. That's the goal.

So the idea is to try to find a projection, like it's done in the NAP approach, but here the projection is in the i-vector space, that would effectively reduce the variability in this PLDA hyper-parameter space. In this work we do it independently for each one of the hyper-parameters mu, B and W: for each one we compute the projection, or subspace, and then we combine them all into a single one, as sketched below.
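How exactly the per-hyper-parameter subspaces get combined isn't spelled out here, so the following is just one plausible reading: stack the directions found for mu, B and W, orthonormalize them, and build a single projection that removes their joint span. The function name is mine.

```python
import numpy as np

def combine_subspaces(direction_sets):
    """Combine per-hyper-parameter nuisance directions into one projection.

    direction_sets: list of (d, k_i) arrays, e.g. one set each for mu, B, W.
    Returns P = I - U U^T, which projects the joint subspace away.
    """
    stacked = np.hstack(direction_sets)              # all directions side by side
    U, s, _ = np.linalg.svd(stacked, full_matrices=False)
    U = U[:, s > 1e-10 * s.max()]                    # orthonormal basis of the span
    return np.eye(stacked.shape[0]) - U @ U.T
```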

Okay.

So the question is: how can we do this if we don't have the target data? Well, what we actually do is use the data that we do have, and we hope that what we find generalizes to unseen data. Of course, if we do have some data, we can apply it there too, as well as on unseen data.

So what we do is observe our development data, and the main point here, quite contrary to what we generally believed in the past, is that our data is not homogeneous; if it were homogeneous, this would not work.

So we exploit the fact that the data is not homogeneous, and we try to divide it, to partition it, into distinct subsets. The goal is that each subset should be quite homogeneous, and the different subsets should be as far from each other as possible.

In practice, what we did is observe that the Switchboard data consists of six different deliveries; we didn't even look at the labels. We just said, okay, we have six deliveries, so let's make a partition into six subsets. We also tried to partition according to gender. And we also tried to see what happens if we have only two partitions, but here we selected them intentionally: one to be the landline data and one to be the cellular data. So we have these three different partitions.

So now, the outline of the method is the following. First, we estimate the projection P: we take our development data and divide it into distinct subsets. In general, this doesn't have to be the actual development data that we intend to use to train the system; we can collect other sources of data which we believe represent many types of mismatch, and apply the method on the collection of all the datasets that we managed to get our hands on.

Once we have these subsets, we can estimate the PLDA model for each one, and then we find the projection as shown in the next slides.

Now, once we have found this projection P, we just apply it on all our i-vectors as a preprocessing step. We do it before everything else: before centering, whitening, length normalization and everything. And then we can just retrain our PLDA model. The hope is that this projection, in some sense, cleans up some of the dataset mismatch.
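As a small sketch of where the projection sits in the pipeline (the helper names and the exact form of the whitening transform are my assumptions):

```python
import numpy as np

def preprocess_ivectors(ivectors, P, center, whitener):
    """Apply the IDVC projection before the usual i-vector preprocessing.

    P: (d, d) IDVC projection; `center` and `whitener` are estimated on the
    projected training data; length normalization comes last.
    """
    x = ivectors @ P.T                              # IDVC projection first
    x = (x - center) @ whitener.T                   # then centering + whitening
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # then length normalization
    return x
```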

Okay, so first, how do we do it for the mu hyper-parameter? It's very simple: we just take the collection of centers that we gathered from the different datasets, apply PCA on them, and construct the projection from the top eigenvectors.
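A minimal sketch of that step, assuming plain PCA on the per-dataset means (the function name is mine):

```python
import numpy as np

def mu_directions(dataset_means, k):
    """Top-k PCA directions of the per-dataset centers (mu hyper-parameter).

    dataset_means: (m, d) array, one mean i-vector per dataset.
    With m datasets there are at most m - 1 meaningful directions.
    """
    centered = dataset_means - dataset_means.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
    return Vt[:k].T                                 # (d, k) nuisance directions
```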

Now, for the matrices B and W it's a bit different. I will show how it's done for W; the same can be done for B.

So, basically, we are given a set of covariance matrices W_i, one for each dataset. We also define the mean covariance W-bar. Now let us define a unit vector v, which is a direction in the i-vector space.

What we can do is compute the variance of a given covariance matrix W_i along this direction, that is, the projection of the covariance on this direction: this is v^T W_i v.

Now, the goal we define here is to find directions v that maximize the variance of this quantity, normalized by v^T W-bar v; that is, we normalize the variance along each direction by the average variance.
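Written out, the objective as I understand it from this description is (my notation, with the variance taken over the datasets i):

```latex
% For one nuisance direction v (unit norm): the variance across datasets
% of the per-dataset variance along v, normalized by the average variance.
\[
  v^{*} \;=\; \arg\max_{\|v\|=1}\;
  \operatorname{Var}_{i}\!\left( \frac{v^{\top} W_{i}\, v}{v^{\top} \bar{W}\, v} \right)
\]
```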

We want to find directions that maximize this, because a direction that maximizes this quantity means that the PLDA models for different datasets behave very differently along that direction in the i-vector space, and this is what we want to remove, or maybe model in the future; but for the moment we want to remove it.

Okay, so the algorithm to find this is quite straightforward: we first whiten the i-vector space with respect to W-bar, then we compute the sum of the squares of the whitened W_i, again find the top eigenvectors, and construct the projection P to remove these eigenvectors.
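A minimal numerical sketch of that algorithm as just described; whitening via the inverse symmetric square root of W-bar and the final mapping back to the original space are my implementation choices, so treat this as an illustration rather than the paper's exact recipe.

```python
import numpy as np

def covariance_directions(W_list, k):
    """Top-k directions of inter-dataset variability for W (or B).

    W_list: per-dataset covariance matrices; assumes W-bar is positive definite.
    """
    W_bar = np.mean(W_list, axis=0)
    # Whitening transform: inverse symmetric square root of the mean covariance.
    evals, evecs = np.linalg.eigh(W_bar)
    T = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # Sum of squares of the whitened covariances.
    S = np.zeros_like(W_bar)
    for W in W_list:
        Wt = T @ W @ T                  # covariance expressed in whitened space
        S += Wt @ Wt
    _, vecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    top = vecs[:, -k:]                  # top-k eigenvectors in whitened space
    dirs = T @ top                      # map directions back to i-vector space
    return dirs / np.linalg.norm(dirs, axis=0)
```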

The proof that this is actually the right thing to do is in the paper; I won't go over it here because we have lunch soon. It's very simple; it's immediate.

Okay, now one thing that may be quite important: what happens if we want to use other data sources to model the mismatch, but for those sources we don't have speaker labels?

Well, to estimate the mu hyper-parameter it's quite easy: we don't need speaker labels at all. But for W and B we do need them. So what we can do in those cases is just replace W and B with T, the total covariance matrix, which of course we can estimate without speaker labels. And what we can see here is that for typical datasets, where there is a large number of speakers, T can be approximated by W plus B.

So if that is the case, it means that if we have high inter-dataset variability along some direction for T, then the same direction would actually be optimal, or appropriate, to model also for either W or B, and vice versa. So instead of finding this subspace for W and for B, we can just find it for T, and it will be practically almost as good.
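A small sketch of the label-free variant under the T ≈ W + B approximation (function name mine):

```python
import numpy as np

def total_covariance(ivectors):
    """Total covariance T of one dataset, estimated without speaker labels."""
    centered = ivectors - ivectors.mean(axis=0)
    return centered.T @ centered / len(ivectors)
```

The per-dataset T_i can then be fed to the same eigen-analysis sketched above for the W_i, so the whole pipeline can run without speaker labels.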

Okay, so now the results.

First, a few results using only the mu hyper-parameter, the center. We can see here, in the blue curve, the results using this approach for a PLDA system trained on Switchboard. We started with 8.2; at first we have a slight degradation, because it turns out that the first dimension that we remove is actually the gender information, but then we start to get gains, and we reach an equal error rate of 3.8.

We also get nice improvements. Now, what we see here in the red curve and the black curve is what happens if we apply the same IDVC method to Mixer-based builds. The red and black curves are when we train a PLDA model on Mixer; we still want to apply IDVC to see whether we get some gains, or at least don't lose anything. What we can see is that, in general, nothing much happens here when you apply IDVC on the Mixer build; at least we're not losing.

Okay, so now the same thing is done, but for the W hyper-parameter and for the B hyper-parameter. We can see here, in blue and light blue, what happens when we train a system on Switchboard and apply IDVC on either of these hyper-parameters: we get very large improvements, even larger than we got for the center hyper-parameter.

This holds up to around, I don't know, one hundred dimensions, and then we start to get degradation. So if we remove too many dimensions, we start getting degradation. For the center hyper-parameter we could not remove too much, because if we have, for example, only twelve subsets, then we can remove only eleven dimensions; but for W and B we can actually remove up to four hundred, so it's a bit different.

Okay, now, again in black and red we see what happens when we apply the same method on a Mixer-based PLDA system. We see here again that we get a very slight, and probably not significant, improvement, and then we start getting degradation.

What is quite interesting here is that after a dimension of around, let's say, one hundred and fifty, all the systems actually behave the same. So my interpretation is that we actually managed to remove most or all of the dataset mismatch, but we also removed some of the good information from the system, and therefore we get some degradation; but the systems behave roughly the same after we remove a subspace of dimension one hundred and fifty.

Okay, so now, what happens when we combine everything together? We started from 2.4 for the Mixer build and 8.2 for the Switchboard build, and for the different partitions we get slightly different results, between an equal error rate of 3 and 3.3. If we just use the simplistic partition into only two subsets, and we use only the hyper-parameters we can estimate without speaker labels, mu and T, we get 3.5. So the conclusion is that it actually works also without speaker labels.

So, to conclude: we have shown that IDVC can effectively reduce the influence of dataset variability, at least for this particular setup, the domain robustness challenge. We actually managed to recover roughly ninety percent of the error compared to a pure Switchboard build. Also, the IDVC system works well even when trained on two subsets only, and without speaker labels.

Okay, thank you.

I wonder if you happen to know what would happen if you simply projected away the, say, one hundred leading eigenvectors of the covariance of the training corpus, without bothering to train the individual W and B matrices.

So, just taking the top eigenvectors of the total covariance matrix? That's a method we could call NAP on the total variability subspace; we have this in the ICASSP paper. We've done that, and you also get gains, but not as large; it was not nearly as good.

If I understand correctly, you're saying the main benefit comes from trying to optimize the within-class covariance matrices across datasets, trying to make them all look the same.

So, about the W matrix: basically, yes. The assumption, which is in some sense reasonable, is that some directions in the i-vector space are more sensitive to dataset mismatch and some are not.

And it's hard to observe it from the data unless you do something like we did.

Did you look at simply doing per-dataset whitening?

So, in addition to the centering, did you do whitening? We did try whitening alone, and it didn't change the performance.

Right.

Well, that's contrary to other sites' results, right? I think whitening is generally useful.

So one question would be, and maybe you did that: if you did whitening per dataset, would that give the same effect, in a soft form, versus the projecting away?

I tried; I didn't try it in many configurations, just a very quick experiment, and it didn't work at all. I don't know, maybe it's something I have to do more carefully.

Just a question: if you do the projection on the data you use to train PLDA, don't your within and between covariance matrices become singular or something?

That's like the fifth time I've heard this question here.

So basically, it's the same as when you apply LDA, for example, before you build the PLDA, as is often done. Either you can actually move to the lower dimension, that is, remove these dimensions and do everything in the lower dimension, or you can do some tricks to fix it, like adding some quantities to the covariance matrices.

I noticed in your paper that you contrasted against source normalization, which was presented in parallel in the ICASSP paper; unfortunately I can't access it here, I was trying to look it up as we went along. IDVC can also be taken as source normalization extended a bit further.

The reason I bring this up is: in this context the datasets are speaker-disjoint across the different datasets. What about the context where, perhaps, I have a system trained on telephone speech for a certain number of speakers, and then you suddenly acquire data from those same speakers on a different channel? What's going to happen? I mean, in terms of the testing side of things, you acquire more data from the same speakers, from a microphone channel perhaps, and then you also acquire microphone data from a different set of speakers for training and adapting the system.

It appears that at this point the within-class variations are estimated independently on each dataset. So does that mean that the difference between those datasets is going to be suppressed, or is it actually maintained, under this framework?

I think it's not very sensitive, because it looks at very broad hyper-parameters, so it doesn't really matter if it's the same speakers or not.

Okay, maybe we can discuss it offline a bit. The final question?

So, in the past we started dealing with the channel problem by projecting away stuff, and then we moved to softer approaches: we started modeling the stuff instead of explicitly projecting it away, like JFA.

Can you imagine a probabilistic model which includes a variability component for the dataset?

Actually, thanks for the question; I was thinking of maybe adding another slide on that. I actually tried to extend the PLDA model with another component, the dataset mismatch, which behaves differently from the other components, and I ran some experiments. I got some improvement compared to the baseline, but not as good as what I got using this NAP approach. I have some ideas about why that is the case, but I would not be surprised if someone does it in a different way and gets a better result.

Okay.