So, as was just mentioned, this is work mainly from last summer, continuing on after the end of last summer, primarily by Daniel, my colleague at Johns Hopkins, but also with Stephen, who will be talking next about a slightly different flavour, and with other colleagues including Carlos. So bear with me: Daniel, as you'll see, put a lot of animation into the slides, and I usually take it out, but there is so much in these that I really couldn't, so I will try to do it with an animation style which is not natural to me.
So we're trying to build a speaker recognition system which is state-of-the-art. How are we going to do that? Well, it depends what kind of an evaluation we're going to run; we want to know what the data we're actually going to be working on looks like.
Since normally, for example in an SRE, we know what that data is going to look like, we go to our big pile of previous data that the LDC has kindly generated for us. We use this development data, typically very many speakers with very many labeled cuts, to learn our system parameters, in particular what we call the across-class and within-class covariance matrices, the key things we need to make the PLDA work correctly. And then we are ready to score our system and see what happens.
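As an illustration of that parameter-learning step (not code from the talk, and the array names are hypothetical), here is a minimal numpy sketch of estimating across-class and within-class covariance matrices from a labeled pile of development i-vectors:

```python
import numpy as np

def estimate_plda_covariances(ivectors, labels):
    """ivectors: (N, D) array; labels: length-N array of speaker ids."""
    labels = np.asarray(labels)
    mu = ivectors.mean(axis=0)
    D = ivectors.shape[1]
    within = np.zeros((D, D))
    across = np.zeros((D, D))
    for spk in np.unique(labels):
        x = ivectors[labels == spk]          # all cuts from this speaker
        m = x.mean(axis=0)                   # speaker mean
        xc = x - m
        within += xc.T @ xc                  # scatter around the speaker mean
        d = (m - mu)[:, None]
        across += len(x) * (d @ d.T)         # scatter of the speaker means
    return across / len(ivectors), within / len(ivectors)
```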
The thought for this workshop was: what if we have this state-of-the-art system, which we have built for our SRE10 or SRE12, and someone comes to us with a pile of data which doesn't look like an SRE? What are we going to do?
The first thing, on this corpus that Doug put together (it is available; there are links to the lists from the JHU website), is that we found there is in fact a big performance gap with the PLDA system, even with what seems like a fairly simple mismatch, namely training your parameters on Switchboard and testing on Mixer, that is, SRE10. You can see that the green line, which is a pure SRE system designed for SRE10, works extremely well, and that the same algorithm trained only on Switchboard has three times the error rate.
So in the supervised domain adaptation, which we attacked first and which Daniel presented at ICASSP, we are given an additional data set: we have the out-of-domain Switchboard data, and we have an in-domain Mixer set which is labeled but may not be very big. How can we combine these two datasets to accomplish good performance on SRE data?
The setup that we have used for these experiments is a typical i-vector system. I think some people may do different things in this back end, but Daniel has convinced me that length normalization with total-covariance whitening is in fact the best, most consistent way to do it. These are typical system parameters; the LDA is typically four hundred or six hundred dimensional in our experiments. The important point I want to emphasize here is that the i-vector extractor doesn't need any labeled data, so we call that unsupervised training. The length normalization is also unsupervised; the PLDA parameters are the ones where we need the speaker labels, and that is the harder data to find. In these experiments we found that we can always use Switchboard for the i-vector extractor itself; we don't need to retrain that every time we go to a new domain, which is a tremendous practical advantage. The whitening parameters can be trained specifically for whatever domain you're working in, which is not so hard to do either, because you only need an unlabeled pile of data to accomplish that.
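Here is a minimal sketch of the length normalization with total-covariance whitening just described, assuming only an unlabeled (N, D) matrix of i-vectors from the target domain; an illustration rather than the actual system code:

```python
import numpy as np

def train_whitener(ivectors):
    """Total-covariance whitening; needs no speaker labels."""
    mu = ivectors.mean(axis=0)
    cov = np.cov(ivectors, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # ZCA-style whitening matrix
    return mu, W

def length_norm(ivectors, mu, W):
    """Whiten, then project every i-vector onto the unit sphere."""
    x = (ivectors - mu) @ W
    return x / np.linalg.norm(x, axis=1, keepdims=True)
```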
Now I want to focus on the adaptation of the covariance matrices; that was the biggest challenge for us.
In principle, at least with slightly simplistic math, if we have known covariance matrices we can do the MAP adaptation that Doug has been doing in GMMs for a long time. The original MAP behind that is a conjugate prior for a covariance matrix, and if you configure your prior in a certain tricky way you end up with a sort of count-based regularization: a very simple formula that pulls a new-data sample covariance matrix back toward an initial matrix. That's what is shown here: this is the in-domain covariance matrix, and we're smoothing it back toward the out-of-domain covariance.
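In code, the count-based smoothing amounts to an interpolation between the two sample covariances, applied to both the across-class and within-class matrices; the relevance-factor weighting below is a hypothetical MAP-style choice, not necessarily the exact formula on the slide:

```python
import numpy as np

def adapt_covariance(cov_in, cov_out, alpha):
    """alpha = 0 keeps the out-of-domain matrix; alpha = 1 trusts only the in-domain one."""
    return alpha * cov_in + (1.0 - alpha) * cov_out

def count_based_alpha(n_in, relevance=16.0):
    """Hypothetical MAP-style weighting (relevance value is illustrative, as in GMM MAP):
    the more in-domain data we have, the more we trust its sample covariance."""
    return n_in / (n_in + relevance)
```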
What we showed earlier, in the supervised adaptation, is that we can get very good performance. Let's get used to this graph; I'm going to show a couple more like it. The red line at the top is the out-of-domain system, trained purely on Switchboard, which has the bad performance. The green line at the bottom is the matched in-domain system; that's our target if we had all of the in-domain data. What we're doing is taking various amounts of in-domain data to see how well we can exploit it, and even with a hundred speakers we can cut seventy percent of that gap with this adaptation process. If we use the entire set, we get the same performance, actually slightly better by using both sets than just the in-domain set.
One of the questions with this is how to set the alpha parameter. In theory, if we knew the prior exactly, that would tell us theoretically what it should be, but empirically the main point of this graph is that we're not very sensitive to it. If alpha is zero we are entirely the out-of-domain system, and it's always pretty bad. If alpha is one we are entirely trying to build an in-domain system: if we have almost no in-domain data we get very bad performance, but as soon as we start to have data that system is pretty good. Still, we're always better off staying somewhere in the middle and using both datasets, using a combination.
Now, the theme of this work is unsupervised adaptation, which means we no longer have labels for this pile of in-domain data. So it's the same setup, but now we don't have labels, which means we want to do some kind of clustering. We found empirically, as I think people in the i-vector challenge seem to have found as well, that AHC (agglomerative hierarchical clustering) is a particularly good algorithm for this task, for whatever reason.
You can measure clustering performance if you actually have the truth labels: you can evaluate a clustering algorithm by purity and fragmentation, purity being how pure your clusters are, and fragmentation being how much a speaker was accidentally distributed into other clusters.
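One plausible way to compute those two diagnostics, given truth labels for the pile (the exact definitions used in this work may differ slightly):

```python
import numpy as np
from collections import Counter

def purity(cluster_ids, true_labels):
    """Fraction of cuts that belong to the dominant speaker of their cluster."""
    correct = 0
    for c in set(cluster_ids):
        members = [t for ci, t in zip(cluster_ids, true_labels) if ci == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)

def fragmentation(cluster_ids, true_labels):
    """Average number of extra clusters each true speaker was spread across."""
    extras = []
    for spk in set(true_labels):
        clusters = {c for c, t in zip(cluster_ids, true_labels) if t == spk}
        extras.append(len(clusters) - 1)
    return float(np.mean(extras))
```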
One of the things we spent quite a bit of time on (in fact Daniel spent a lot of time making an i-vector averaging system) is what metric to use for the clustering. You're going to do hierarchical clustering, working your way up from the bottom, but what's the definition of whether two clusters should be merged?
PLDA theory gives an answer: a speaker hypothesis test that these two clusters are the same speaker. That's something we have worked with in the past, and as soon as we started it up this year we saw that it really doesn't work well at all, which is a little disappointing from a theoretical point of view. But we have found that in SREs as well: when we have multiple cuts, using the correct formula doesn't always work as well as we would like.
What we traditionally do in an SRE is i-vector averaging, which is pretending we have a single cut. Daniel spent a lot of time on that this summer, and then we found that in fact the simplest thing to do, which is to compute the score between every pair of cuts, get a matrix of scores, and then never recompute any metrics, just average the scores, is the best-performing system. It's also much easier, because you don't have to get inside the algorithm at all; you just pre-compute this distance matrix and feed it into off-the-shelf clustering software.
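A sketch of that recipe with off-the-shelf tools: score every pair once with whatever PLDA scorer is already available (plda_score below is a stand-in), convert scores to distances, and let average-linkage clustering do the score averaging implicitly:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def pairwise_scores(ivectors, plda_score):
    """Score every pair of cuts once; never recompute anything later."""
    n = len(ivectors)
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            scores[i, j] = scores[j, i] = plda_score(ivectors[i], ivectors[j])
    return scores

def cluster_by_scores(scores, score_threshold):
    """Average-linkage AHC on (max_score - score) distances, cut at the score threshold."""
    dist = scores.max() - scores
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=scores.max() - score_threshold, criterion='distance')
```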
Just as a baseline, we compared against k-means clustering with this purity and fragmentation, and the main point is that AHC with this scoring metric was in fact quite a bit better than k-means, so we're comfortable that it seems to be clustering in an intelligent way.
Now we want to move toward doing this for adaptation, but the other thing we need to know is how to decide when to stop clustering, that is, how to decide how many speakers are really there, because nobody has told us. To do this you eventually have to make a decision that you're going to stop merging: basically you look at the two most similar clusters, and you have to decide whether they are from different speakers or the same one, and make a hard decision.
This is one of the nice contributions of this work, which was really done after the summer, I think: we just treat these scores as speaker recognition scores and do calibration the way we normally do. In particular, the unsupervised calibration method that Daniel presented at ICASSP can be used exactly in this situation. We can take our unlabeled pile of data, look at all the scores across it to learn a calibration from that, and then we actually know a threshold and can make a decision about when to stop.
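A hedged sketch of how the stopping rule could look once unsupervised calibration has produced an affine score-to-LLR mapping (the parameters a and b below are assumed to come from that step; this is a reconstruction, not the authors' code):

```python
import numpy as np

def estimate_num_speakers(Z, max_score, a, b, llr_threshold=0.0):
    """Z: scipy linkage built on (max_score - score) distances, as sketched above.
    Convert each merge height back to a score, calibrate it to an LLR, and accept
    only merges that still look like same-speaker trials; what is left is the
    estimated number of speakers."""
    merge_scores = max_score - Z[:, 2]
    llrs = a * merge_scores + b
    accepted = int(np.sum(llrs >= llr_threshold))
    n_cuts = Z.shape[0] + 1
    return n_cuts - accepted
```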
So how well does that work? This is across our unlabeled pile as we introduce bigger and bigger piles. The dashed line is the correct number of clusters. These are five random draws, where we draw random subsets and average the performance; the blue line is the average, which is the easiest one to see. You can see that in general this technique works pretty well. It always underestimates, typically by about twenty percent, so you think there are a few fewer speakers than there really are, but you're pretty close. Getting an automated and reliable way to actually figure out how many speakers there are: we're pretty excited to do even this well, because it's a very hard task.
To actually do the adaptation, then, the recipe is: we use our out-of-domain PLDA to compute the similarity matrix of all pairs; we then cluster the data using that distance metric; we estimate how many speakers there are and the speaker labels; we generate another set of covariance matrices from this self-labeled data; and then we apply our adaptation formulas on this data.
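Putting the recipe together, reusing the hypothetical helpers sketched earlier (pairwise_scores, cluster_by_scores, estimate_plda_covariances, adapt_covariance); a composition sketch under those assumptions, not the actual pipeline:

```python
def unsupervised_adaptation(ivecs_in, plda_score_out, across_out, within_out,
                            score_threshold, alpha):
    # 1) similarity matrix of all pairs, using the out-of-domain PLDA
    scores = pairwise_scores(ivecs_in, plda_score_out)
    # 2) cluster with that metric; the calibrated threshold doubles as the stopping criterion
    self_labels = cluster_by_scores(scores, score_threshold)
    # 3) re-estimate the covariances from the self-labeled in-domain pile
    across_in, within_in = estimate_plda_covariances(ivecs_in, self_labels)
    # 4) smooth them back toward the out-of-domain matrices
    return (adapt_covariance(across_in, across_out, alpha),
            adapt_covariance(within_in, within_out, alpha))
```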
Here is a similar curve to the one I showed before: the red at the top is the out-of-domain system, and the in-domain system is in green at the bottom. What we're showing here is the AHC adaptation performance and the supervised adaptation, which assumes the speaker labels are known; no, sorry, the supervised adaptation is the one I showed before, excuse me. That is what you get if you have labels for all of the data, which is what we accomplished the first time; now we are self-labeling. Of course we're not as good, but we are in fact much better than we ever thought we could be, because when we first set up this task we really didn't think it would work. In fact Daniel and I had a little bet, and he was convinced that this was never going to work, because how are you going to learn your parameters from a system that doesn't know what your parameters are? But in fact you can. So we've done surprisingly well with self-labeling: we're still able to get eighty-five percent of the performance we'd get if we had all the data, and even though it is unlabeled we're still able to recover almost all the performance.
Now, what if we did know the number of clusters? If we had an oracle that told us exactly how many speakers there are, would that make our system perform better? That's the additional bar here, and in fact our estimation of the number of speakers is good enough, because even had we known it exactly we would get almost the same performance. So even though we didn't get the exactly correct number of speakers, the hyper-parameters that we estimated still work just as well.
That is illustrated this way, which shows the sensitivity to knowing the number of clusters. Here we're using all the data; the actual number of speakers is here, and this is what we estimated with our stopping criterion. You can see, as we sweep across all of these different stopping points, deciding at each one that that was how many speakers there were, there is not a tremendous sensitivity. If we massively over-cluster we take a big hit in performance, and if we massively under-cluster it is bad, but there's a pretty big fat region where we get almost the same performance with our hyper-parameters if we had stopped our clustering at that point.
So, in conclusion: domain mismatch can be a surprisingly difficult problem in state-of-the-art systems using PLDA. We had already demonstrated that supervised adaptation can work quite well, but in fact unsupervised adaptation also works extremely well: we can close eighty-five percent of the performance gap due to the domain mismatch. In order to do that, we need to do this adaptation using both the out-of-domain parameters and the in-domain parameters, not just self-label the in-domain data. And this unsupervised calibration trick in fact gives us a useful and meaningful stopping criterion for figuring out how many speakers are in our data.
Thank you. I'm happy to take questions.
Just wondering: I can imagine that the distribution of speakers, basically the number of segments per speaker in your unsupervised set, will make a difference, right? I guess you get this from these SRE or Switchboard data, so it will be relatively homogeneous; is that correct?
I think yes and no. These are not homogeneous, but this is a good pile of unlabeled data, because in fact it's the same pile that we used as the labeled data set. So it's pretty much everything we could find from these speakers: some of them have very many phone calls, some of them have fewer, but all of them have quite a few in order to be in this pile. Obviously, for example, you couldn't learn any within-class covariance if you only had one example from each speaker hidden in that pile. So you're absolutely right: it's not just that we do the labeling, it's also that the pile itself has some richness for us to discover.
Before we pass the microphone on, I have a related question. When you train the i-vector extractor, the nice thing is that you can do it unsupervised, but again, how many cuts per speaker? If we had only one speaker with many cuts, obviously that's not good, because we don't get the speaker variability. The converse situation is where you have every speaker only once; would that give a good i-vector extractor?
I don't think that's something we looked at, but I completely agree it would make me uncomfortable. As I said, in this effort we were just able to show that the out-of-domain data (we assume we do have a good labeled set somewhere, in some domain, that we can use) can be reused the rest of the time, so we're comfortable with where it came from. I don't think I've ever run an experiment like what you describe, and that is interesting; I suspect it would not work so well. You'd get both kinds of variability, since the data comes from a variety of channels, the speaker variability and the channel variability, just not quite in the same proportions as you get in this data. If you collect data in the wild, in a situation where there are very many speakers, you might have data like that, so I think that's an interesting question too.
Thank you.
Very impressive work and a nice set of results; thank you for that. The question I have is: this is all telephone speech, and it works very well with that. Have you considered what would happen if the out-of-domain data were from different channels, such as microphone? And is that even realistic; would you have a pre-trained microphone system that you try to adapt?
Right, so yes, we have looked at microphone. The very first work on this task, a few years ago, was adapting from telephone to microphone, and Daniel revisited it early in the summer, when we were debating with Doug whether we trusted this dataset; he did a similar experiment with the SRE telephone and microphone data and actually got similar results. That does sound a bit surprising, but we have seen in the SREs that telephone versus microphone is not nearly as hard as it ought to be. I don't know the reason for that, but yes, we have worked with telephone and microphone mismatch, and it's not shockingly different from what's shown here.
Just a short answer to the previous question: we trained an i-vector extractor on a database where there is only one utterance per speaker, and it works about the same as using the 2004 and 2005 data or so.
Okay, thank you; that's good to know. Thank you.
So, maybe a really stupid question: yesterday people were mentioning how well the mean shift clustering algorithm is working. You don't seem to use that; you use agglomerative clustering. Is there a reason why?
I believe over the course of the summer we and other people looked at quite a few different algorithms. I know mean shift is used in diarization, and we have looked at it in diarization; I cannot remember if we looked at it for this task. We did look at others; Stephen is going to talk about some other clustering algorithms, but I don't think he's going to talk about mean shift, so I'm not sure how it compares. It clearly is also useful.
I just want to know if these splits and protocols are available. Yes, they are on the JHU website; the link is in the paper. You will have to get the speech data yourself, but the lists are there, and I encourage you to work on this task.
One question: suppose that you are asked to do not speaker clustering but gender clustering, and you don't have any prior on how many genders there are as input. Would the stopping criterion be the same? I mean, would you find the genders? I'm not sure whether the clustering might accidentally find gender.

Well, let me say one thing first: I think I forgot to mention that this is a gender-independent system.

Right, but suppose the classes it happens to cluster are not the speakers, so that none of the final clusters come out correctly.
Well, this is exactly why Daniel thought this wouldn't work: who knows what you're going to cluster by? We're just using the metric, and we're hoping that the out-of-domain PLDA metric encourages the clustering to focus on speaker differences, but we cannot guarantee that except through the results.
I think more so than gender: if, for example, the language were different, and there might be some different-language data in there, you might think the same speaker speaking multiple languages would confuse our clustering.
I would like to pick up on one aspect that I think is very important, especially in the forensic framework. Could you show slide five, please?
So, what you have neglected here is the decision threshold.

Yes, we have neglected calibration for the final task.

So it could possibly be that a factor of three becomes a factor of one hundred; the two-to-three-times degradation could actually be a factor of one hundred.

It could be, yes; that is something we simply neglected.
You are right, George. When you look at that, we would think that with the unsupervised calibration we could accomplish calibration as well.

What I would like you to do when you get home, yes, is to annotate this slide with the decision points. All of these systems are not even calibrated, so...

Well, we always have to run a separate calibration process; that is easy enough to do when you go home.

Go ahead and do that work. You are only going to have to do this for the in-domain system, and then you apply a threshold, put the dots on those two curves, and send me a copy.
Thank you very much for the assignment, then. That question is already partially answered by our unsupervised score calibration paper, which was published at ICASSP, so that is true.
Okay, so we thank the speaker.