And so what I'm going to talk about is also what we did last summer at Hopkins, and again, this is joint work between myself, Doug Reynolds, Daniel, and Alan.
So, four years ago was my first Odyssey ever, and at my first conference presentation I made a joke at the start of the slides that turned out to be far more memorable than the presentation itself. I wanted to do the same this time, but I couldn't come up with any good stories, so instead I'm going to give you this picture, which I hope none of you will look like by the end of my presentation. All right.
What I mean to talk about is unsupervised clustering approaches for domain adaptation in speaker recognition systems. First off, I guess the title is a bit of a handful, so I'm going to break it down and explain each piece one at a time.
So, domain adaptation. Most current statistical learning techniques assume, somewhat incorrectly, that the training and test data come from the same underlying distribution. What we know in general is that labeled data may exist in one domain, but what we want is a model that can also perform well in a related, but not necessarily identical, domain. And labeling data in this new domain may be difficult and/or expensive. So what can we do to leverage the original labeled out-of-domain data when building a model to work with this in-domain data?
There's nothing new here, everything we've heard before in the previous presentation. For speaker recognition systems, I'll blast through this once again since we're all rather familiar with the i-vector approach: it's just your standard segment-length-independent, low-dimensional vector-based representation of the audio. What the i-vector allows us to do is use large amounts of previously collected and labeled audio to characterize and exploit speaker and channel variability, and usually that entails the use of thousands of speakers making tens of calls each.
Unfortunately, it is a bit unrealistic to expect that most applications will have access to such a large set of labeled data from a matched condition.
So here's the anatomy of the standard i-vector system, very similar, almost identical, to what's been shown already. The thing to note is that your UBM, your i-vector extractor, and the resulting mean subtraction and length normalization do not require the use of labels. What does require labels are your within-class and across-class covariance matrices, and that's where the labels come in.
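As a minimal sketch of that label-free whitening step (assuming i-vectors are already extracted; the function and variable names here are illustrative, not from the talk):

```python
import numpy as np

def whiten_and_length_normalize(ivectors, mu=None):
    """Mean subtraction plus length normalization; requires no speaker labels.

    ivectors: (N, D) array of raw i-vectors.
    mu: optional precomputed mean, e.g. estimated from unlabeled in-domain data.
    """
    if mu is None:
        mu = ivectors.mean(axis=0)  # estimated from unlabeled audio
    centered = ivectors - mu
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms, mu
```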
So that's what we've got. Now, the first thing I'd like to do, sort of to paint the larger picture of what we've done, is to demonstrate the mismatch between our two domains.
Similar to the DET curve plot shown earlier, we start by enrolling and scoring on SRE 2010. What we denote as the in-domain set is the SRE data, meaning all the telephone calls from the Mixer collections used in the 2004, 2005, 2006, and 2008 evaluations. The mismatched out-of-domain data is all the Switchboard data, that is, all the calls from those collections.
In terms of summary statistics, what we're basically looking at is that the number of speakers, the number of calls, the average number of calls per speaker, and the number of channels each speaker spoke on are all relatively the same. To help with that as a visualization, here's a normalized histogram of the distribution of the number of utterances per speaker between the two sets of data; they pretty much all overlap, but blue is Switchboard and red is the SRE.
So what we can say is that we would not expect a large performance gap between these two sets of data if indeed our training were dataset independent and robust across datasets. What we found, obviously, is that this is not the case, which is why we ended up having a summer workshop on it.
To give a summary of equal error rate results (the rest of the talk will just use equal error rate to provide a summary set of results): in red is denoted the portion of the system that actually requires labels, and I've shown both what we had at Hopkins over the summer and what we replicated at MIT. You can see that if we use all Switchboard to train everything, we get a set of results around seven percent equal error rate, and if we just use all of the SRE, we get around two and a half percent.
Now, if we start varying the ingredients that we use to actually train these systems, in particular if we just switch the whitening parameters (that's the mean subtraction, et cetera) from Switchboard to SRE, you get a little bit of a gain: you go down from seven percent to five.
Subsequently, if you stick with Switchboard to do your UBM and i-vector extraction, keep the SRE for your whitening, and also use the SRE labels, then you get down to under two and a half percent, which is actually better than the last row here. I'm not going to try to explain what happens there, but what we decided from then on is that we would obviously focus on the performance gap between the use of SRE labels and the use of Switchboard labels for our within-class and across-class covariance matrices.
So that's what we'll continue on with: this will basically be the baseline that we've got, and this will be the benchmark that we're trying to hit, or even to better.
The rules for what we call the domain adaptation challenge task are that we're allowed to use Switchboard, all the data and all of the labels; we're allowed to use the SRE data, but not its labels; and obviously we're going to evaluate on the 2010 SRE.
Before we actually jump into that, though, what we'd like to do is examine the domain mismatch. I got a lot of questions like: what actually is the difference between these two datasets that might cause such a gap? And so we began and did a little bit of a rudimentary analysis of what was actually going on.
Some of the clear questions that you might think of are: is it the speaker age? Or is it perhaps the languages spoken? In particular, Switchboard contains only English, and it was collected over a decade, a decade that preceded that of the SRE, while the SRE contains more than twenty different languages. So the question is whether or not that might have caused some of the shift in variabilities that we see, or the difference in performance. Some of this was also explored in previous work, I believe.
What we found, however, was that there was really no effect of either age or language spoken. So with that, the next step was to look at something else, which was Switchboard itself. What we realized first off is that Switchboard was collected in different phases over approximately a decade. So what happens when we use different subsets, when we just use different subsets to build our models?
What we ended up finding was the following: if you take Switchboard Cellular, both parts, which are the most recent ones, you actually get a better starting baseline. The previous starting baseline was five and a half percent; you now get a starting baseline of four point six percent, which is a little bit better. And if you also add in Switchboard Phase 3, you can actually start all the way down at three and a half percent. But then, as you keep adding in the older portions of Switchboard, you start actually doing a bit worse. That's what we found, and I think similar work was also done and presented by Hagai during the summer, albeit ours is a slightly different take on it.
That's kind of what we noticed as we were trying to analyze the mismatch: basically, the differences within Switchboard itself, selecting out some of those particular subsets, might actually affect the baseline performance.
So then the next question is: all right, should we actually just continue with that three and a half percent subset? And secondly, can you actually find some automatic way of selecting the out-of-domain data that you actually want to end up using to do your initial domain adaptation, or even just to select the labeled data that best matches the in-domain data that you have?
So what we did was just a couple of naive, first exploratory experiments on automatic subset selection. In particular, this line is the three and a half percent equal error rate that you get from the Cellular and Phase data; that's the best we did.
And this here, the five and a half percent, is approximately what you get if you use all of the data, all of Switchboard, and start off there. Instead, consider these two lines; let's focus on the blue for a second. That's if you select the proportion of i-vectors with the highest probability density function value with respect to the SRE: you automatically select the subset of Switchboard that is closest in likelihood to the SRE marginal distribution, and as you increase the proportion, you see how you do in terms of the baseline performance.
Similarly, the LDA line is if you took Switchboard and SRE and tried to learn just a simple one-dimensional linear separator between the two; then I take the Switchboard i-vectors that are closest to the SRE data and rank them that way.
How well can I do then? Basically, what we can see is that if you use all of the scores, you've done nothing different, but as you use just some proportion of these top-ranking scores, you can actually do a little bit better than our baseline. However, you never approach the three and a half percent that seemed to be set by this particular magical subset.
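A minimal sketch of that likelihood-based selection, under the assumption that the SRE marginal is modeled as a single Gaussian (the helper names here are mine, not from the talk):

```python
import numpy as np
from scipy.stats import multivariate_normal

def select_closest_subset(swb_ivectors, sre_ivectors, proportion=0.5):
    """Rank out-of-domain (Switchboard) i-vectors by their likelihood under
    a Gaussian fit to the unlabeled in-domain (SRE) marginal, and keep the
    top-scoring proportion."""
    mu = sre_ivectors.mean(axis=0)
    cov = np.cov(sre_ivectors, rowvar=False)
    loglik = multivariate_normal.logpdf(swb_ivectors, mean=mu, cov=cov,
                                        allow_singular=True)
    n_keep = int(proportion * len(swb_ivectors))
    top = np.argsort(loglik)[::-1][:n_keep]  # highest-likelihood first
    return swb_ivectors[top], top
```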
So that was the initial exploration of the domain mismatch that we did. Now that I've covered most of the setup and most of the problem, I can continue on with the rest of the work.
So, the bootstrap recipe, which I'm going to go over one more time, is pretty standard for domain adaptation. We begin with our prior across-class and within-class hyperparameters, and then we use PLDA to compute a pairwise affinity matrix on the SRE data. Subsequently, we do some form of clustering on that pairwise affinity matrix to obtain some hypothesized cluster labels; we use these labels to obtain another set of hyperparameters, and then we linearly interpolate, as Alan showed, and then potentially we iterate. (And just to make this look better: the slide got garbled between Mac and Windows, so that's actually how the slide is supposed to look.)
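As a rough sketch of one pass of that recipe (the `plda_affinity` scoring function is a hypothetical placeholder, hierarchical clustering stands in for whichever clustering algorithm you choose, and the covariance estimates are deliberately simplified):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def adapt_one_iteration(ivectors, W_out, B_out, plda_affinity,
                        alpha, distance_threshold):
    # 1) Pairwise affinity matrix on the unlabeled in-domain data.
    A = plda_affinity(ivectors)           # (N, N) symmetric similarity scores
    D = A.max() - A                       # convert similarity to distance
    np.fill_diagonal(D, 0.0)

    # 2) Cluster the affinity matrix to get hypothesized speaker labels.
    Z = linkage(squareform(D, checks=False), method='average')
    labels = fcluster(Z, t=distance_threshold, criterion='distance')

    # 3) Re-estimate hyperparameters from the hypothesized labels
    #    (assumes more than one cluster was found).
    means = {c: ivectors[labels == c].mean(axis=0) for c in set(labels)}
    centered = ivectors - np.array([means[c] for c in labels])
    W_in = np.cov(centered, rowvar=False)                        # within-class
    B_in = np.cov(np.array(list(means.values())), rowvar=False)  # across-class

    # 4) Linearly interpolate in-domain and out-of-domain hyperparameters.
    W = alpha * W_in + (1 - alpha) * W_out
    B = alpha * B_in + (1 - alpha) * B_out
    return W, B, labels
```

In practice you would iterate: re-score with the interpolated hyperparameters, re-cluster, and repeat until the labels stabilize.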
Basically, that's the setup, and we'll just run through some clustering algorithms. I put "unsupervised" in parentheses because, you know, all clustering algorithms have at least some parameter that you can tune, right?
What we'll find later on is that hierarchical clustering really does do the best. Granted, the stopping criterion and the cluster-merging criterion are kind of up to the user to choose, but we find that with some reasonably appropriate choice, hierarchical clustering does do the best. The two algorithms that we also explored pretty extensively were some graph-based random walk algorithms, known as Infomap and Markov clustering. I'm not going to go into the details about those, but feel free to ask me offline or at the end of the presentation. For those, you basically have a graph where each node is an i-vector, you have some edges connecting them, and then you do some clustering on that graph.
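For the graph-based methods, a sketch of the graph construction step (cosine similarity and a k-nearest-neighbor structure are my assumptions; the talk does not specify how the edges were chosen):

```python
import numpy as np

def build_knn_graph(ivectors, k=10):
    """Build a k-nearest-neighbor similarity graph over i-vectors; algorithms
    such as Infomap or Markov clustering would then operate on this graph."""
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    S = X @ X.T                                 # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)                # no self-edges
    adjacency = {}
    for i in range(len(X)):
        neighbors = np.argsort(S[i])[::-1][:k]  # k most similar nodes
        adjacency[i] = [(int(j), float(S[i, j])) for j in neighbors]
    return adjacency
```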
So, our initial findings. This is really no different from what Alan showed previously, but what's mainly true is that, in the presence of interpolation, imperfect clustering is in fact forgivable.
This here is just a plot where we took a thousand-speaker subset, and it shows a measure of cluster error. The solid lines in green and red are if you knew the cluster labels, if the cluster labels were pure and you didn't have to do any automatic clustering. And then these two dotted lines are basically what you would get if you stopped your clustering at different points of the hierarchical tree. Basically, the thing to see is that this bowl is incredibly flat.
And the last thing is that alpha star itself is basically the best adaptation parameter, much like what was just talked about.
However, one thing that we've kind of glossed over so far is that alpha itself needs to be estimated. You can do it in a more principled way, via the counts of the relative dataset sizes, or you can look at it empirically, and you can set your alpha for the within-class differently from the alpha for your across-class. Empirically, that seems to be the case: the better settings are done this way. So you can see we range across the alphas on both sides, for the within-class and the across-class, and find that this is approximately the best for one particular subset of a thousand speakers.
It seems like alpha star itself is an open, unsolved problem, but actually it's not so bad, because if we rescale this plot to within ten percent of the optimal equal error rate, we can actually find that there's a range of values for alpha that would yield pretty good results.
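For concreteness, the interpolation just described can be written as follows; the count-based rule is one way to realize the "more principled" choice mentioned above, and the notation here is mine:

```latex
% Separate interpolation weights for within-class (wc) and across-class (ac):
\Sigma_{wc} = \alpha_{wc}\,\Sigma_{wc}^{\mathrm{in}} + (1-\alpha_{wc})\,\Sigma_{wc}^{\mathrm{out}},
\qquad
\Sigma_{ac} = \alpha_{ac}\,\Sigma_{ac}^{\mathrm{in}} + (1-\alpha_{ac})\,\Sigma_{ac}^{\mathrm{out}}
% A count-based choice driven by the relative dataset sizes:
\alpha = \frac{N_{\mathrm{in}}}{N_{\mathrm{in}} + N_{\mathrm{out}}}
```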
So, results so far. I won't parse through them all, since I'm running a bit out of time, but basically the best we can do with automatic methods is within roughly fifteen percent of the absolute best; that is, we close the gap by about eighty-five percent.
So the take-home ideas for now are that, given interpolation, an imprecise estimate of the number of clusters is okay; there is a range of adaptation parameters that will yield reasonable results; and the best automatic system gives us within fifteen percent of a system that has access to all speaker labels.
Now, going forward from both Alan's talk and mine, we wonder: for this telephone domain mismatch, simple solutions work already, and what we've been working on is to explicitly identify the sources of this mismatch; that's kind of ongoing work at the moment. But the question, just like Mitch brought up a couple of seconds ago at the end of Alan's talk, is: what can we do about telephone-to-microphone domain mismatch? I did this work independently and actually did not know that Alan and Daniel had done it too; what I'm about to show is not in the paper itself, but it's just a little add-on.
And lastly, what else we can talk about is out-of-domain detection: when does a system know that it actually needs some additional, albeit unlabeled, data, that it cannot perform at the level it usually does? That's perhaps an instance of outlier detection or something like that, which we will also look into; that's sort of a future work kind of thing.
So what I will really quickly show is a quick visualization using a low-dimensional embedding. Basically, we're going to start with Switchboard and SRE; these are all the i-vectors in there, and I'm going to collapse a lot of i-vectors into a very low-dimensional space, which is why it looks very cloudy at the moment; it's hard to fit a lot of points into a small space and still have them preserve their relative distances.
This is if I try to learn, first off, an unsupervised embedding that just takes all the data and learns some low-dimensional visualization, and then I apply the colouring afterwards.
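A sketch of how such a plot might be produced; the talk does not name the embedding method, so t-SNE here is an assumption (suggested by a question at the end):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_domain_embedding(swb_ivectors, sre_ivectors):
    """Embed both datasets jointly (unsupervised), then colour by domain."""
    X = np.vstack([swb_ivectors, sre_ivectors])
    Y = TSNE(n_components=2, perplexity=30).fit_transform(X)
    n = len(swb_ivectors)
    plt.scatter(Y[:n, 0], Y[:n, 1], s=2, c='blue', label='Switchboard')
    plt.scatter(Y[n:, 0], Y[n:, 1], s=2, c='red', label='SRE')
    plt.legend()
    plt.show()
```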
What it shows here is that we have Switchboard in blue and the SRE data in red, and you can kind of see that there is perhaps a little bit of separation, but they're also a little bit on top of each other. Now, to revisit one other point I talked about earlier: if we just took that magical subset of Switchboard that gave us the three and a half percent, we get this in green, with the SRE in red as well, and they're pretty uniformly distributed around the SRE data itself.
On the other hand, if you take just the remaining data, the old Switchboard stuff, left in blue, it's actually a little farther away from the rest of the SRE itself. So maybe that gives some idea of why the performance was the way it was.
However, if you take a look at telephone and microphone data and do the same kind of embedding, then you get a completely different, much more separated sort of visualization. That sort of illustrates that I think telephone and microphone can be a harder problem; however, initial results have also shown that it is actually not as bad as maybe this visualization suggests. So I'll stop there and take any questions.
You said that you found that language is not the cause of this domain mismatch; how did you find that?
Let me think. Basically, I held the various different languages out of the SRE data and tried to see whether that would be distinctly different from the rest. Sorry, no: what I basically did was look at whether or not the different languages clustered together, in a sense. In general, that's how we went about trying to tease apart whether or not the languages were a source of the domain mismatch.
So you looked at the t-SNE and could just see that?
No. Let's talk offline about that; I'm actually forgetting some of the details of the language experiment exactly at the moment, but we can sort it out offline. Sorry.
In the beginning of the talk, in the table you showed: what did you use for training the UBM? And also, did you try putting both Switchboard and SRE into the training?

Yes, we did that originally, and it wasn't terribly different; it was just about the same. There's really no difference.
Switchboard and the Mixer data were collected over a wide range of years, so maybe this year-dependent variability shows the evolution of the telephone network and how speech is transmitted over the telephone now compared to nineteen ninety-nine.

Absolutely, that's almost exactly a sentence that we wrote, and yes, that's a potential hypothesis that I'm certainly willing to believe. Thanks.
A related question: PLDA has the within- and between-speaker covariance parameters, so which of those most needs to be adapted when moving from Switchboard to the Mixer data? I think that was shown.
Let me go to this slide, right. The one that most needs to be adapted would be the within-class variability, relative to the across-class, as is shown here; the speakers, the speaker distribution, stay more or less constant, but it's the channels that shift, which is why you need more of the weight on the within-class.