And so what I'm going to talk about is also what we did last summer at Hopkins, and again, this is joint work between myself, Doug Reynolds, Daniel, and Alan.
So, four years ago was my first Odyssey ever, and at my first conference presentation I made a joke at the start of the slides that turned out to be far more memorable than the presentation itself. I wanted to do the same this time, but I couldn't come up with any good stories, so instead I'm going to give you this picture, which I hope none of you will look like by the end of my presentation. All right.
What I mean to talk about is unsupervised clustering approaches for domain adaptation in speaker recognition systems. First off, I guess the title is a bit of a handful, so I'm going to break it down and explain each piece one at a time.
So, domain adaptation. Most current statistical learning techniques assume, somewhat incorrectly, that the training and test data come from the same underlying distribution. What we know in general is that labeled data may exist in one domain, but what we want is a model that can also perform well in a related, but not necessarily identical, domain. And labeling data in this new domain may be difficult and/or expensive. So what can we do to leverage the original labeled out-of-domain data when building a model to work with this in-domain data?
There's nothing new here, everything we've heard before in the previous presentation. For speaker recognition systems, I'll blast through this once again since we're all rather familiar with the i-vector approach: it's just your standard segment-length-independent, low-dimensional vector-based representation of the audio. What the i-vector allows us to do is use large amounts of previously collected and labeled audio to characterize and exploit speaker and channel variability, and usually that entails the use of thousands of speakers making tens of calls each.
Unfortunately, it is a bit unrealistic to expect that most applications will have access to such a large set of labeled data from a matched condition.
So here's the anatomy of the standard i-vector system, very similar, almost identical, to what's been shown already. The thing to note is that your UBM, your i-vector extractor, and the resulting mean subtraction and length normalization do not require the use of labels. What does require labels are your within-class and across-class covariance matrices, and that's where the labels come in.
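As a minimal sketch of that label-free whitening step (assuming i-vectors are already extracted; the function and variable names here are illustrative, not from the talk):

```python
import numpy as np

def whiten_and_length_normalize(ivectors, mu=None):
    """Mean subtraction plus length normalization; requires no speaker labels.

    ivectors: (N, D) array of raw i-vectors.
    mu: optional precomputed mean, e.g. estimated from unlabeled in-domain data.
    """
    if mu is None:
        mu = ivectors.mean(axis=0)  # estimated from unlabeled audio
    centered = ivectors - mu
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms, mu
```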
So that's what we've got. Now, the first thing I'd like to do, sort of to paint the larger picture of what we've done, is to demonstrate the mismatch between our two domains.
Similar to the DET curve plot shown earlier, we start by enrolling and scoring on SRE 2010. What we denote as the in-domain set is the SRE data, meaning all the telephone calls from the Mixer collections used in the 2004, 2005, 2006, and 2008 evaluations. The mismatched out-of-domain data is all the Switchboard data, that is, all the calls from those collections.
In terms of summary statistics, what we're basically looking at is that the number of speakers, the number of calls, the average number of calls per speaker, and the number of channels each speaker spoke on are all relatively the same. To help with that as a visualization, here's a normalized histogram of the distribution of the number of utterances per speaker between the two sets of data; they pretty much all overlap, but blue is Switchboard and red is the SRE.
So what we can say is that we would not expect a large performance gap between these two sets of data if indeed our training were dataset independent and robust across datasets. What we found, obviously, is that this is not the case, which is why we ended up having a summer workshop on it.
To give a summary of equal error rate results (the rest of the talk will just use equal error rate to provide a summary set of results): in red is denoted the portion of the system that actually requires labels, and I've shown both what we had at Hopkins over the summer and what we replicated at MIT. You can see that if we use all Switchboard to train everything, we get a set of results around seven percent equal error rate, and if we just use all of the SRE, we get around two and a half percent.
Now, if we start varying the ingredients that we use to actually train these systems, in particular if we just switch the whitening parameters (that's the mean subtraction, et cetera) from Switchboard to SRE, you get a little bit of a gain: you go down from seven percent to five.
Subsequently, if you stick with Switchboard to do your UBM and i-vector extraction, keep the SRE for your whitening, and also use the SRE labels, then you get down to under two and a half percent, which is actually better than the last row here. I'm not going to try to explain what happens there, but what we decided from then on is that we would obviously focus on the performance gap between the use of SRE labels and the use of Switchboard labels for our within-class and across-class covariance matrices.
So that's what we'll continue on with: this will basically be the baseline that we've got, and this will be the benchmark that we're trying to hit, or even to better.
The rules for what we call the domain adaptation challenge task are that we're allowed to use Switchboard, all the data and all of the labels; we're allowed to use the SRE data, but not its labels; and obviously we're going to evaluate on the 2010 SRE.
Before we actually jump into that, though, what we'd like to do is examine the domain mismatch. I got a lot of questions like: what actually is the difference between these two datasets that might cause such a gap? And so we began and did a little bit of a rudimentary analysis of what was actually going on.
Some of the clear questions that you might think of are: is it the speaker age? Or is it perhaps the languages spoken? In particular, Switchboard contains only English, and it was collected over a decade, a decade that preceded that of the SRE, while the SRE contains more than twenty different languages. So the question is whether or not that might have caused some of the shift in variabilities that we see, or the difference in performance. Some of this was also explored in previous work, I believe.
What we found, however, was that there was really no effect of either age or language spoken. So with that, the next step was to look at something else, which was Switchboard itself. What we realized first off is that Switchboard was collected in different phases over approximately a decade. So what happens when we use different subsets, when we just use different subsets to build our models?
What we ended up finding was the following: if you take Switchboard Cellular, both parts, which are the most recent ones, you actually get a better starting baseline. The previous starting baseline was five and a half percent; you now get a starting baseline of four point six percent, which is a little bit better. And if you also add in Switchboard Phase 3, you can actually start all the way down at three and a half percent. But then, as you keep adding in the older portions of Switchboard, you start actually doing a bit worse. That's what we found, and I think similar work was also done and presented by Hagai during the summer, albeit ours is a slightly different take on it.
That's kind of what we noticed as we were trying to analyze the mismatch: basically, the differences within Switchboard itself, selecting out some of those particular subsets, might actually affect the baseline performance.
So then the next question is: all right, should we actually just continue with that three and a half percent subset? And secondly, can you actually find some automatic way of selecting the out-of-domain data that you actually want to end up using to do your initial domain adaptation, or even just to select the labeled data that best matches the in-domain data that you have?
So what we did was just a couple of naive, first exploratory experiments on automatic subset selection. In particular, this line is the three and a half percent equal error rate that you get from the Cellular and Phase data; that's the best we did.
And this here, the five and a half percent, is approximately what you get if you use all of the data, all of Switchboard, and start off there. Instead, consider these two lines; let's focus on the blue for a second. That's if you select the proportion of i-vectors with the highest probability density function value with respect to the SRE: you automatically select the subset of Switchboard that is closest in likelihood to the SRE marginal distribution, and as you increase the proportion, you see how you do in terms of the baseline performance.
Similarly, the LDA line is if you took Switchboard and SRE and tried to learn just a simple one-dimensional linear separator between the two; then I take the Switchboard i-vectors that are closest to the SRE data and rank them that way.
How well can I do then? Basically, what we can see is that if you use all of the scores, you've done nothing different, but as you use just some proportion of these top-ranking scores, you can actually do a little bit better than our baseline. However, you never approach the three and a half percent that seemed to be set by this particular magical subset.
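A minimal sketch of that likelihood-based selection, under the assumption that the SRE marginal is modeled as a single Gaussian (the helper names here are mine, not from the talk):

```python
import numpy as np
from scipy.stats import multivariate_normal

def select_closest_subset(swb_ivectors, sre_ivectors, proportion=0.5):
    """Rank out-of-domain (Switchboard) i-vectors by their likelihood under
    a Gaussian fit to the unlabeled in-domain (SRE) marginal, and keep the
    top-scoring proportion."""
    mu = sre_ivectors.mean(axis=0)
    cov = np.cov(sre_ivectors, rowvar=False)
    loglik = multivariate_normal.logpdf(swb_ivectors, mean=mu, cov=cov,
                                        allow_singular=True)
    n_keep = int(proportion * len(swb_ivectors))
    top = np.argsort(loglik)[::-1][:n_keep]  # highest-likelihood first
    return swb_ivectors[top], top
```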
So that was the initial exploration of the domain mismatch that we did. Now that I've covered most of the setup and most of the problem, I can continue on with the rest of the work.
So, the bootstrap recipe, which I'm going to go over one more time, is pretty standard for domain adaptation. We begin with our prior across-class and within-class hyperparameters, and then we use PLDA to compute a pairwise affinity matrix on the SRE data. Subsequently, we do some form of clustering on that pairwise affinity matrix to obtain some hypothesized cluster labels; we use these labels to obtain another set of hyperparameters, and then we linearly interpolate, as Alan showed, and then potentially we iterate. (And just to make this look better: the slide got garbled between Mac and Windows, so that's actually how the slide is supposed to look.)
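As a rough sketch of one pass of that recipe (the `plda_affinity` scoring function is a hypothetical placeholder, hierarchical clustering stands in for whichever clustering algorithm you choose, and the covariance estimates are deliberately simplified):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def adapt_one_iteration(ivectors, W_out, B_out, plda_affinity,
                        alpha, distance_threshold):
    # 1) Pairwise affinity matrix on the unlabeled in-domain data.
    A = plda_affinity(ivectors)           # (N, N) symmetric similarity scores
    D = A.max() - A                       # convert similarity to distance
    np.fill_diagonal(D, 0.0)

    # 2) Cluster the affinity matrix to get hypothesized speaker labels.
    Z = linkage(squareform(D, checks=False), method='average')
    labels = fcluster(Z, t=distance_threshold, criterion='distance')

    # 3) Re-estimate hyperparameters from the hypothesized labels
    #    (assumes more than one cluster was found).
    means = {c: ivectors[labels == c].mean(axis=0) for c in set(labels)}
    centered = ivectors - np.array([means[c] for c in labels])
    W_in = np.cov(centered, rowvar=False)                        # within-class
    B_in = np.cov(np.array(list(means.values())), rowvar=False)  # across-class

    # 4) Linearly interpolate in-domain and out-of-domain hyperparameters.
    W = alpha * W_in + (1 - alpha) * W_out
    B = alpha * B_in + (1 - alpha) * B_out
    return W, B, labels
```

In practice you would iterate: re-score with the interpolated hyperparameters, re-cluster, and repeat until the labels stabilize.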
Basically, that's the setup, and we'll just run through some clustering algorithms. I put "unsupervised" in parentheses because, you know, all clustering algorithms have at least some parameter that you can tune, right?
What we'll find later on is that hierarchical clustering really does do the best. Granted, the stopping criterion and the cluster-merging criterion are kind of up to the user to choose, but we find that with some reasonably appropriate choice, hierarchical clustering does do the best. The two algorithms that we also explored pretty extensively were some graph-based random walk algorithms, known as Infomap and Markov clustering. I'm not going to go into the details about those, but feel free to ask me offline or at the end of the presentation. For those, you basically have a graph where each node is an i-vector, you have some edges connecting them, and then you do some clustering on that graph.
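For the graph-based methods, a sketch of the graph construction step (cosine similarity and a k-nearest-neighbor structure are my assumptions; the talk does not specify how the edges were chosen):

```python
import numpy as np

def build_knn_graph(ivectors, k=10):
    """Build a k-nearest-neighbor similarity graph over i-vectors; algorithms
    such as Infomap or Markov clustering would then operate on this graph."""
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    S = X @ X.T                                 # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)                # no self-edges
    adjacency = {}
    for i in range(len(X)):
        neighbors = np.argsort(S[i])[::-1][:k]  # k most similar nodes
        adjacency[i] = [(int(j), float(S[i, j])) for j in neighbors]
    return adjacency
```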
So, our initial findings. This is really no different from what Alan showed previously, but what's mainly true is that, in the presence of interpolation, imperfect clustering is in fact forgivable.
This here is just a plot where we took a thousand-speaker subset, and it shows a measure of cluster error. The solid lines in green and red are if you knew the cluster labels, if the cluster labels were pure and you didn't have to do any automatic clustering. And then these two dotted lines are basically what you would get if you stopped your clustering at different points of the hierarchical tree. Basically, the thing to see is that this bowl is incredibly flat.
And the last thing is that alpha star itself is basically the best adaptation parameter, much like what was just talked about.
However, one thing that we've kind of glossed over so far is that alpha itself needs to be estimated. You can do it in a more principled way, via the counts of the relative dataset sizes, or you can look at it empirically, and you can set your alpha for the within-class differently from the alpha for your across-class. Empirically, that seems to be the case: the better settings are done this way. So you can see we range across the alphas on both sides, for the within-class and the across-class, and find that this is approximately the best for one particular subset of a thousand speakers.
It seems like alpha star itself is an open, unsolved problem, but actually it's not so bad, because if we rescale this plot to within ten percent of the optimal equal error rate, we can actually find that there's a range of values for alpha that would yield pretty good results.
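For concreteness, the interpolation just described can be written as follows; the count-based rule is one way to realize the "more principled" choice mentioned above, and the notation here is mine:

```latex
% Separate interpolation weights for within-class (wc) and across-class (ac):
\Sigma_{wc} = \alpha_{wc}\,\Sigma_{wc}^{\mathrm{in}} + (1-\alpha_{wc})\,\Sigma_{wc}^{\mathrm{out}},
\qquad
\Sigma_{ac} = \alpha_{ac}\,\Sigma_{ac}^{\mathrm{in}} + (1-\alpha_{ac})\,\Sigma_{ac}^{\mathrm{out}}
% A count-based choice driven by the relative dataset sizes:
\alpha = \frac{N_{\mathrm{in}}}{N_{\mathrm{in}} + N_{\mathrm{out}}}
```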
So, results so far. I won't parse through them all, since I'm running a bit out of time, but basically the best we can do with automatic methods is within roughly fifteen percent of the absolute best; that is, we close the gap by about eighty-five percent.
So the take-home ideas for now are that, given interpolation, an imprecise estimate of the number of clusters is okay; there is a range of adaptation parameters that will yield reasonable results; and the best automatic system gives us within fifteen percent of a system that has access to all speaker labels.
Now, going forward from both Alan's talk and mine, we wonder: for this telephone domain mismatch, simple solutions work already, and what we've been working on is to explicitly identify the sources of this mismatch; that's kind of ongoing work at the moment. But the question, just like Mitch brought up a couple of seconds ago at the end of Alan's talk, is: what can we do about telephone-to-microphone domain mismatch? I did this work independently and actually did not know that Alan and Daniel had done it too; what I'm about to show is not in the paper itself, but it's just a little add-on.
And lastly, what else we can talk about is out-of-domain detection: when does a system know that it actually needs some additional, albeit unlabeled, data, that it cannot perform at the level it usually does? That's perhaps an instance of outlier detection or something like that, which we will also look into; that's sort of a future work kind of thing.
So what I will really quickly show is a quick visualization using a low-dimensional embedding. Basically, we're going to start with Switchboard and SRE; these are all the i-vectors in there, and I'm going to collapse a lot of i-vectors into a very low-dimensional space, which is why it looks very cloudy at the moment; it's hard to fit a lot of points into a small space and still have them preserve their relative distances.
This is if I try to learn, first off, an unsupervised embedding that just takes all the data and learns some low-dimensional visualization, and then I apply the colouring afterwards.
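A sketch of how such a plot might be produced; the talk does not name the embedding method, so t-SNE here is an assumption (suggested by a question at the end):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_domain_embedding(swb_ivectors, sre_ivectors):
    """Embed both datasets jointly (unsupervised), then colour by domain."""
    X = np.vstack([swb_ivectors, sre_ivectors])
    Y = TSNE(n_components=2, perplexity=30).fit_transform(X)
    n = len(swb_ivectors)
    plt.scatter(Y[:n, 0], Y[:n, 1], s=2, c='blue', label='Switchboard')
    plt.scatter(Y[n:, 0], Y[n:, 1], s=2, c='red', label='SRE')
    plt.legend()
    plt.show()
```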
What it shows here is that we have Switchboard in blue and the SRE data in red, and you can kind of see that there is perhaps a little bit of separation, but they're also a little bit on top of each other. Now, to revisit one other point I talked about earlier: if we just took that magical subset of Switchboard that gave us the three and a half percent, we get this in green, with the SRE in red as well, and they're pretty uniformly distributed around the SRE data itself.
On the other hand, if you take just the remaining data, the old Switchboard stuff, left in blue, it's actually a little farther away from the rest of the SRE itself. So maybe that gives some idea of why the performance was the way it was.
However, if you take a look at telephone and microphone data and do the same kind of embedding, then you get a completely different, much more separated sort of visualization. That sort of illustrates that I think telephone and microphone can be a harder problem; however, initial results have also shown that it is actually not as bad as maybe this visualization suggests. So I'll stop there and take any questions.
You said that you found that language is not the cause of this domain mismatch; how did you find that?
Let me think. Basically, I held the various different languages out of the SRE data and tried to see whether that would be distinctly different from the rest. Sorry, no: what I basically did was look at whether or not the different languages clustered together, in a sense. In general, that's how we went about trying to tease apart whether or not the languages were a source of the domain mismatch.
So you looked at the t-SNE and could just see that?
No. Let's talk offline about that; I'm actually forgetting some of the details of the language experiment exactly at the moment, but we can sort it out offline. Sorry.
In the beginning of the talk, in the table you showed: what did you use for training the UBM? And also, did you try putting both Switchboard and SRE into the training?

Yes, we did that originally, and it wasn't terribly different; it was just about the same. There's really no difference.
Switchboard and the Mixer data were collected over a wide range of years, so maybe this year-dependent variability shows the evolution of the telephone network and how speech is transmitted over the telephone now compared to nineteen ninety-nine.

Absolutely, that's almost exactly a sentence that we wrote, and yes, that's a potential hypothesis that I'm certainly willing to believe. Thanks.
A related question: PLDA has the within- and between-speaker covariance parameters, so which of those most needs to be adapted when moving from Switchboard to the Mixer data? I think that was shown.
Let me go to this slide, right. The one that most needs to be adapted would be the within-class variability, relative to the across-class, as is shown here; the speakers, the speaker distribution, stay more or less constant, but it's the channels that shift, which is why you need more of the weight on the within-class.