Hi, I'm presenting work from SRI International on adaptive mean normalization for unsupervised adaptation of speaker embeddings.
We'll start with the problem statement: what are we actually trying to tackle with this work?
We'll then look at the role of mean normalization and how it applies to state-of-the-art speaker recognition systems.
Then we'll go through the proposed technique, adaptive mean normalization, and have a look at some experiments to see how it performs.
The problem statement. Variability is well known as one of the biggest challenges to the practical use of speaker recognition systems. Variability refers to changes in effects between the training of a system and successive detection attempts. It comes in two types. One is extrinsic, that is, something separate from the speaker: this includes things like microphones, the acoustic environment, and the transmission channel. The other is intrinsic, which has to do with the speaker: speakers vary over time, and the factors that introduce variability here include health, stress, emotional state, and speaking style. These differences are collectively referred to as domain mismatch when we look at the differences between the system training data and the detection attempts.
As many of us know, domain mismatch typically results in degraded performance. This is performance with respect to what we expect: once the system is trained on a domain, we have a certain estimate of how it will perform. If the domain it's used in then changes, we see a performance loss due to this domain mismatch. That loss comes from two different things. One is discrimination loss, meaning less ability of the system to separate speakers. The other is miscalibration: when a system is miscalibrated, it gives a score that may mislead the user into believing something was detected that shouldn't have been.
Domain adaptation can be used to cope with this problem, and there are two different ways of approaching it. One is supervised: this is where we have labeled data, and we often get reliable improvement, but there is the high cost of labeling the data that you end up needing to improve the system. The alternative is unsupervised adaptation. It has a very low cost, since there's no end-user labeling at all, there's plenty of data available, and that data is ideally matched to the deployed conditions. The downside is that we have no ground-truth labels to rely on. The focus of this work is the unsupervised adaptation scenario.
There are some shortcomings of unsupervised adaptation. One is a lack of generalization: quite often, assumptions have to be made that a supervised approach wouldn't need. For instance, if we're going to retrain the PLDA of our best system, we end up needing to make some kind of assumption about which clusters, in other words which speakers, the different audio segments belong to. It can also overfit to the data being adapted with. Trustworthiness is another issue: when can we get guarantees of improvement once it's deployed? They're limited, you see. Then there's complexity: some approaches have a high computational cost, and that makes it more difficult to hand over to clients or users once the system goes out the door. So the question we're trying to answer is: where is the best place to apply adaptation in the unsupervised scenario, such that it can be fast and reliable once deployed?
On screen here we have a diagram of the different stages of a speaker recognition pipeline, and we can look at what would happen if we applied adaptation at each of these stages.
Take the feature extraction, the MFCCs or power-normalized cepstral coefficients, and the speaker embedding network. If adaptation were applied there, it typically requires full retraining of both the DNN and the backend modules, and you need a lot of data on hand to do that, so it's hard to explore.
What about speech activity detection? There are approaches that adapt this stage for different scenarios, and this is useful when that is actually where the domain mismatch sits, but it's only a partial solution: it doesn't really help the discrimination in the rest of the pipeline.
PLDA and calibration are the core backend processes, but adapting them tends to require labels, or predictions from clustering, and that can be error-prone.
Length normalization has no actual parameters to adapt, so it's not applicable.
Which leads us to mean normalization.
Mean normalization is simple to adapt: it's a parameter set of, in general, typically about two hundred numbers, the dimensionality of the embedding. At first glance it may not seem like it would help much, but consider the role of mean normalization in a system.
PLDA is a strong model when its assumptions are fulfilled, namely that the distribution of the data going into it fits a standard normal distribution. In training, mean normalization and length normalization together achieve this, so the assumptions are fulfilled when the system is trained. Length normalization actually projects the embeddings onto the unit hypersphere, and there's a diagram right here that demonstrates this: the embeddings are evenly spread around the sphere. This assumes a zero mean, which works well during training. If the mean has shifted, due to a domain shift outside of training, such as with evaluation data, as in this diagram here, then projecting onto the unit hypersphere produces a distribution that is not evenly spread. The assumptions of the PLDA model are then no longer fulfilled, and we actually reduce the discrimination ability of the model.
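To make that concrete, here is a minimal sketch in Python, my own illustration rather than SRI's code, with an arbitrary shift size and dimensionality, showing how a stale mean skews the spread on the unit hypersphere:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 200                                   # typical embedding dimensionality

train = rng.normal(size=(1000, dim))        # in-domain training embeddings
evalset = train + 0.5                       # evaluation embeddings after a domain shift

def mean_length_norm(x, mu):
    """Subtract the mean, then project onto the unit hypersphere."""
    centered = x - mu
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

# Matched mean: normalized embeddings spread evenly around the sphere, so
# their average lands near the origin. Stale training mean: the shifted
# data piles up on one side of the sphere.
matched = mean_length_norm(evalset, evalset.mean(axis=0))
stale = mean_length_norm(evalset, train.mean(axis=0))
print(np.linalg.norm(matched.mean(axis=0)))   # ~0: PLDA assumptions hold
print(np.linalg.norm(stale.mean(axis=0)))     # clearly > 0: skewed spread
```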
Let's look at the actual performance impact. Here we compare using the system-based mean, where we have taken the mean from the system's actual training data, against adjusting the mean of the system to a held-out dataset of conditions relevant to the data we benchmark on. There are more details on the evaluation protocol and the datasets used later in the presentation, but for now this is a quick snapshot of what happens if you simply update the mean to a held-out dataset. We actually see the equal error rate improve by up to nineteen percent, really helping discrimination. Even more impressive is the fact that Cllr, the cost of the log-likelihood ratio, which is an indication of both discrimination and calibration performance, improves by up to sixty percent.
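For reference, Cllr is the standard cost of the log-likelihood ratio, computed over the scores $s$ of target and non-target trials:

$$
C_{\mathrm{llr}} = \frac{1}{2\log 2}\left[\frac{1}{N_{\mathrm{tar}}}\sum_{i\in\mathrm{tar}}\log\!\left(1+e^{-s_i}\right) + \frac{1}{N_{\mathrm{non}}}\sum_{j\in\mathrm{non}}\log\!\left(1+e^{s_j}\right)\right]
$$

A perfectly calibrated and discriminating system approaches 0, while a system that carries no information (always outputting $s = 0$) scores 1.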
And this is despite holding the calibration model fixed and mismatched to the other conditions. In particular, this calibration model is trained on the RATS source data, which is clean telephone data, and yet on the SRE datasets and Speakers in the Wild it is dramatically helping calibration. So having a relevant mean really is crucial.
Now, what's the catch with mean normalization? A single mean is suitable when the evaluation conditions are homogeneous: if we deploy a system and we generally know what audio we'll be dealing with, and it's not going to vary much from that, then we're okay. The problem comes when conditions vary over time or between trials. Think, for instance, of radio broadcasts, where conditions change over time depending on the signal or the time of day, or of a system being used for both telephone and microphone calls. Then we end up with the distribution shown on the bottom right, where we have several different means, and you can see how that projects onto the unit hypersphere. Ideally, what we would love to be able to do is adapt the mean depending on the conditions of the trial at hand. In other words, we want to dynamically define the mean as trials come into the system. That's what we aim to address with the proposed method of adaptive mean normalization.
So what is it? This process actually stemmed from our earlier idea of trial-based calibration. What trial-based calibration does is define the system parameters, in particular the calibration model parameters, at the time a trial comes in. It looks at the conditions at hand on both sides of the trial, the enrollment and test audio, defines subsets of held-out data for those conditions, and then trains a calibration model on the fly using that held-out data. The goal is to make the system model as general and reliable as possible. One extra advantage is that over time, as the system sees more and more conditions, or more relevant data, it can accommodate the new conditions as they arise.
Here's the process. Taking the pipeline flow from before, the embedding goes through mean normalization, length normalization, PLDA and calibration, shown on the left-hand side. What we've inserted is the adaptive process around mean normalization, where there used to be a static system mean. We're taking the good ideas from trial-based calibration, but this is an embedding-specific process, not a trial-specific process, which is a bit of a benefit in terms of computation. For each embedding, we compare its condition against a pool of candidate embeddings. We want to find those embeddings from the candidate pool that are similar in condition to the embedding coming in to be adaptively mean normalized. We make a selection of that subset and then compute a condition mean based on it. Depending on how many samples we found and how many we would like to find, we do a weighting process, and then we use that mean as the adapted mean for that embedding, which follows on through the rest of the pipeline. What we're trying to do here is make this happen on the fly, and in fact it has very little overhead.
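As a minimal sketch of that per-embedding procedure, here is a reconstruction under assumptions rather than SRI's implementation; `condition_similarity`, `threshold` and `max_n` are hypothetical names for the condition comparison and the two parameters introduced below:

```python
import numpy as np

def adaptive_mean(cond_vec, pool_embeddings, pool_cond_vecs,
                  system_mean, condition_similarity, threshold, max_n):
    """Estimate an adapted normalization mean for one incoming embedding."""
    # Score every candidate in the pool for condition similarity.
    sims = np.array([condition_similarity(cond_vec, c) for c in pool_cond_vecs])

    # Keep only candidates whose condition is similar enough.
    keep = np.flatnonzero(sims >= threshold)

    # If more candidates survive than allowed, keep the max_n most similar.
    if keep.size > max_n:
        keep = keep[np.argsort(sims[keep])[-max_n:]]

    n = keep.size
    if n == 0:
        return system_mean   # no relevant samples: fall back to the system mean

    cond_mean = pool_embeddings[keep].mean(axis=0)

    # Weighted average: the closer n is to max_n, the more we trust the
    # dynamically estimated condition mean over the static system mean.
    w = n / max_n
    return w * cond_mean + (1.0 - w) * system_mean

# The adapted mean then replaces the static mean before length normalization:
#   centered = embedding - adapted_mean; normalized = centered / norm(centered)
```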
There are some ingredients that we need for adaptive mean normalization. To compare an embedding against a candidate, we need something that can tell us whether the conditions of those embeddings are similar or not. For this we use a condition DNN. This is very similar to what we use for speaker recognition, except that instead of discriminating speakers, it's trained to discriminate different conditions. The conditions include compression type and rate, noise type, language and gender. When we combine those factors together, we end up with around eleven thousand unique conditions, so these are very fine-grained slices that we're dealing with.
Secondly, we need a pool of candidate embeddings. This is just a large mixture of conditions, anything we can get really, and ideally it includes some examples of the evaluation conditions. If that's not the case, what the system could actually do after it's deployed is add test data along the way to grow the candidate embedding pool so that it becomes better suited to the conditions. This pool is what's used to dynamically estimate the means of conditions.
Finally, there are two parameters. One is the condition similarity threshold: we don't want everything from the candidate pool coming through, so we determine how similar each candidate is and make sure it's similar enough to pass to the next stage of mean estimation. The second thing we want to set is the maximum number of candidates to select, N. If everything in the candidate pool were above the threshold, everything would pass through, and we might get no benefit over the default system mean. So we want to limit how much comes through and just select the top N candidates.
If we then go back to our picture, we can fill in a few things. The comparison is now done with the condition similarity. We then do the selection process, where n is the number of candidates above the similarity threshold. However, if n is more than the maximum we allow, which is N, we keep the N candidates with the highest similarity, thereby making sure we keep the most relevant ones for our mean estimate. Once we estimate the mean, we take a weighted average with the system mean. That weighted average means that the closer we get to N selected samples, the more we rely on the dynamically estimated mean, whereas we default back to the system mean in the case that no relevant samples could be found.
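The exact weighting function is a detail of the paper; the simplest form consistent with this description is a linear blend:

$$
\hat{\mu} \;=\; \frac{n}{N}\,\mu_{\mathrm{cond}} \;+\; \left(1-\frac{n}{N}\right)\mu_{\mathrm{sys}}, \qquad 0 \le n \le N,
$$

where $n$ is the number of selected candidates and $N$ the maximum: $n = 0$ reduces to the system mean, and $n = N$ to the pure condition mean.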
Let me run over a few of the benefits of adaptive mean normalization. As I said, it has very minimal overhead, and that overhead is defined by the number of candidate examples it has to compare against. It is also applied per embedding instead of per trial, which is what's done in trial-based calibration, and that does a lot in terms of reducing computation. It caters for the case of no relevant examples, where it simply reverts to the system mean through the weighted average. New enrollment audio or test audio can be collected into the candidate pool over time, and this allows the pool to remain relevant to changing conditions once the system has been deployed. It's a simple process, with the parameters being under two hundred numbers that are changing here, and it weights against the system mean, which makes it less prone to overfitting, a real benefit. And finally, what we find quite impressive is that it allows a single static calibration model to be applied across domains. That's exactly the problem that trial-based calibration was trying to solve by adapting the calibration model. We go a step further by adapting the mean of the system to the domain, so the calibration model can stay static: because adaptive mean normalization allows the PLDA assumptions to be fulfilled, the calibration model applied after PLDA scoring remains suitable as well.
Let's take a look at the experiments. First of all, the baseline system we adapt here is the SRI submission to the recent NIST SRE evaluation. It involves sixteen kilohertz power-normalized cepstral coefficients and multi-band DNN embeddings. Multi-band means that we trained the embedding system with both eight kilohertz and sixteen kilohertz data: any time we had sixteen kilohertz audio, we also downsampled it to eight kilohertz, so that the DNN was exposed to both eight and sixteen kilohertz versions of the same audio segments. That tended to help bridge the gap between eight kilohertz and sixteen kilohertz evaluation data.
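As a minimal sketch of that augmentation idea (assumed details, not the actual SRI pipeline):

```python
import numpy as np
from scipy.signal import resample_poly

def add_downsampled_copy(segment_16k):
    """Return the 16 kHz segment plus an 8 kHz copy for multi-band training."""
    segment_8k = resample_poly(segment_16k, up=1, down=2)   # 16 kHz -> 8 kHz
    return segment_16k, segment_8k

wide, narrow = add_downsampled_copy(np.random.randn(16000))  # 1 s of dummy audio
print(len(wide), len(narrow))                                # 16000 8000
```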
We trained on the standard datasets; the references for those datasets are in the paper, and we applied the standard augmentation process. As mentioned before, the calibration model here is trained on the RATS source data, from the DARPA RATS program. This is the relatively clean telephone data, not the retransmitted data, which is heavily degraded.
In terms of evaluation, we split our data into evaluation and norm sets. The NIST SRE corpora from 2016 and 2019 have their own norm sets available, known as the unlabeled data, as you can see in the table. For Speakers in the Wild, we used the dev portion as the norm set; note that, again, the speakers are disjoint from the evaluation set. For the RATS source data, we split it ourselves into two disjoint speaker pools: one for evaluation and one for the normalization step.
In terms of the adaptive mean normalization parameters, we set the condition similarity threshold to ten, and the maximum number of candidates N to half the number of candidate samples in the dataset's pool. These values were found by searching on the RATS norm set. You can see how many segments were available for each norm set, including the pools, which we use initially. And remember, that value of N, the number of candidates we're trying to get, also drives the weighted average: when we do the dynamic mean estimate, the closer we get to N samples, the more we rely on the adapted mean.
Let's look at the out-of-the-box performance. We've got four different datasets here: SRE16, SRE19, Speakers in the Wild, and the RATS clean telephone data. In the baseline system, the mean for normalization is simply the mean that was estimated during the training of the system, and the calibration model is trained on RATS. Now, what happens if we instead adapt the calibration model to the actual eval set? This is a cheating experiment, on the right-hand side: essentially, we're replacing the RATS calibration model with an eval-set calibration model. What we can see is that we're getting much better calibration performance. For some datasets whose acoustics aren't far from the RATS model, it doesn't change too much. The equal error rates tend to vary widely between these different datasets, but the calibration is considerably better. Note the low Cllr for the RATS set, because it is matched to the RATS data used in the calibration model.
Now let's look at the impact of relevant mean normalization. Earlier in this presentation we showed the first two columns here: the baseline, and the condition-based mean normalization. The third column is the adaptive mean normalization. What we're doing here is using adaptive mean normalization with the candidate pool formed from the held-out data of all sets together: the SRE16, SRE19, Speakers in the Wild and RATS data combined as one big pool. Adaptive mean normalization was able to outperform the condition-specific mean normalization in the heterogeneous conditions, in particular Speakers in the Wild and the SRE datasets. The calibration performance there improves quite significantly in some cases, and the equal error rate of SRE16 also improves quite reasonably. What's interesting is that the adaptive process didn't hurt the matched conditions either, so that's an added benefit as well.
Now to data requirements: how much data do we actually need in the candidate set for adaptive mean normalization to work? What we've done on this slide is look at the Cllr, which, remember, measures both discrimination performance and calibration performance. The dashed lines are the baseline performance across the four different datasets, and the solid lines show what happens as we vary the number of candidate segments. Remember, the full norm sets have at least one thousand two hundred samples. We're doing this in a dataset-specific scenario, where, for instance, for SRE16 the candidate pool is the actual unlabeled data from SRE16, so it is suited to the conditions, and we randomly select subsets from that held-out pool. What we see is that quite rapidly, from thirty-two relevant segments, we're already ahead of the baseline regardless of the dataset. So thirty-two segments are already sufficient for a significant Cllr improvement. We see the same trend for equal error rate, though not quite so much of a relative gain. Again, thirty-two relevant segments from the target domain was enough for this process to get a good gain.
Now, importantly, what happens when we apply adaptive mean normalization and the data in the candidate pool is mismatched to the conditions that are going to be evaluated? We wanted to see what happens in this case, so for each dataset we benchmark here, we excluded the relevant data from the candidate pool. For instance, for the RATS dataset, the last row of the table, we excluded RATS from the pool and retained just Speakers in the Wild and the two SRE datasets, and that's all the system had to select from in order to estimate the mean. And remember, when it can't actually find anything it deems relevant, it defaults to the system mean, so we would hope that the performance is the same as the baseline system, or better. What we can see is that Speakers in the Wild and RATS actually perform reasonably well; there was still an improvement for Speakers in the Wild, which is surprising. With RATS, the Cllr degraded just a little bit. SRE16, however, degraded quite a bit with respect to the baseline without any relevant data for the mean. We tried varying the selection threshold in the hope that a higher threshold would restrict the subset selection to only the very closest candidates, but this didn't help. What this indicates is that the condition similarity obtained from the audio wasn't quite optimal for selection in this mismatched scenario.
So, in summary, we proposed adaptive mean normalization. It's simple and effective, leveraging data from the test domain where possible. It's useful with just a handful of samples: in fact, thirty-two samples of speech were enough. For discrimination, we saw improvements of up to twenty-six percent, and for calibration, measured through the Cllr, we saw improvements of up to sixty-six percent relative over the baseline system. What's important here is that it actually allows a single static calibration model to remain suitable for varying conditions, and that's a real benefit once your system goes out the door. In terms of future work, we identified a couple of things. We want to enhance the selection method to be robust when relevant matches are missing from the pool, which I showed in that very last experiment. We also want to do experiments on how active learning over time can improve the calibration pool, sorry, not calibration, the candidate pool, by collecting viable test data over time that's relevant to the incoming examples and retaining a recent history. That ends the presentation. I'd be happy to hear any remarks or questions from anyone. Thank you.