I'll give a bit of the background, briefly describe the system, talk about the data, which has fourteen distinct conditions we are trying to calibrate, cover a little of the existing calibration methods, and then propose trial-based calibration.
So, we've had a very good background already from the other talks, but even very accurate SID systems may not be well calibrated. This means that you might have very low equal error rates for conditions evaluated independently, but once you pool those together, a single threshold fails to reach that operating point when applied.
The plot here illustrates the problem: the blue are distributions of target trials, the red ones are impostor trials. You can see the yellow threshold at the bottom suits some conditions quite well, but not others. So calibrating correctly for each condition helps us to reduce this threshold variability, among many other benefits.
You probably don't need a refresher after the other talks, but what we want, essentially, is calibrated scores that indicate the weight of evidence for a given trial; that is, the likelihood ratio for the hypothesis that the speakers are the same versus different. A typical application is forensic evidence in court.
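To put that in symbols (standard notation, not anything specific to our system), the calibrated score for a trial should behave as the likelihood ratio, and with calibrated log likelihood ratios the Bayes threshold follows from the costs and the prior:

```latex
\mathrm{LR}(s) = \frac{p(s \mid H_{\mathrm{ss}})}{p(s \mid H_{\mathrm{ds}})},
\qquad
\tau = \log\frac{C_{\mathrm{fa}}\,(1 - P_{\mathrm{tar}})}{C_{\mathrm{miss}}\,P_{\mathrm{tar}}}
```

Here H_ss and H_ds are the same-speaker and different-speaker hypotheses, and tau is the threshold applied to the log likelihood ratio; in the equal-cost, equal-prior setting it is simply zero.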
Subsequently, if we have calibrated scores, we can make confident threshold-based Bayes decisions. This isn't trivial, as we've heard, without a representative calibration set, and it's difficult to handle the various conditions with a single calibration model; we'll see that later in this talk. We'll be measuring system performance with a number of metrics, mainly focusing on calibration loss.
Calibration loss indicates how close we are to performing the best we can for a particular operating point, and in this work we're focusing on equal costs between misses and false alarms, around the equal-error point. The other metric we're using is Cllr; this is a more stringent criterion, looking at how well calibrated we are across all points on the operating curve. We're also looking at the average equal error rate across the fourteen conditions, since we want to make sure that in calibrating the system we're not losing speaker discriminability. For all metrics, lower is better.
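For reference, here is a minimal sketch of the standard Cllr computation from scored trials; the arrays tar_llrs and imp_llrs are hypothetical, and this is the textbook formula rather than code from our system:

```python
import numpy as np

def cllr(tar_llrs, imp_llrs):
    """Cllr: average proper-scoring cost of log-likelihood-ratio scores
    over target and impostor trials, in bits; lower is better."""
    tar = np.asarray(tar_llrs, dtype=float)
    imp = np.asarray(imp_llrs, dtype=float)
    c_tar = np.mean(np.logaddexp(0.0, -tar))  # penalizes low target LLRs
    c_imp = np.mean(np.logaddexp(0.0, imp))   # penalizes high impostor LLRs
    return (c_tar + c_imp) / (2.0 * np.log(2.0))
```

A perfectly calibrated, perfectly discriminating system drives this toward zero; miscalibration inflates it even when discrimination is good.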
Our aim here is to calibrate scores across those fourteen conditions such that the calibration loss is minimal, with a single system.
Here is a brief flow diagram of the system we're using in this study. It's a typical i-vector PLDA system, with a large UBM and i-vector extractor; you can look at the paper for references on that. The two areas we're focusing on in this work are the orange boxes: calibration, obviously, and a box called universal audio characterization (UAC), which is a way of extracting meta-information, or side-information, automatically from your i-vectors.
The evaluation dataset is a pooled-condition dataset given to us by the FBI, sourced from a number of different corpora; I won't go across it in detail. There are a number of different raw conditions here, grouped into attribute conditions: cross-language, cross-channel, a mix of both, clean and noisy speech, and a variety of durations. There are more details in the paper in terms of the speaker breakdown and the language breakdown. On the right-hand side you've got the equal error rates for my baseline system, just to show you that the difficulty increases as we go through the conditions.
Since we wanted calibration datasets, we put together three different datasets for this study. The first one is called Oracle. This is essentially taking that FBI dataset and doing cross-validation: training the calibration model with one half, testing on the other half, swapping them around and doing it again, then pooling the results to get the metrics.
The second dataset we labeled Matched. This isn't exactly matched to the evaluation data, but we've done the best we can from the SRE and Fisher data: trying to align languages, and trying to get cross-channel and cross-language trials. However, we were lacking in cross-language cross-channel trials, the mixture of both, and a few languages weren't in there either.
Finally, for the large variability dataset, we actually didn't put emphasis on trying to collect a dedicated set. We simply took a nice variation of SRE data, noised and reverberated SRE data, and RATS clean data from the DARPA RATS project, which gives five languages of interest from that program; you can look at the paper for details. The large variability dataset was meant to be a "let's just throw what we can at the calibration model" set, to see how it holds up in the evaluation.
We're going to be looking at three different calibration training schemes. The first is global, which is generative calibration or logistic regression, the standard approach many of us probably use. Then there's metadata-based calibration, here implemented with discriminative PLDA and universal audio characterization; this is something that's been very prominent in past SRI evaluation systems, including for the DARPA RATS program, and very useful there. And finally, we propose trial-based calibration, which is also based on universal audio characterization to provide metadata.
Let's talk about the existing methods, look at some results, and discuss the shortcomings.
Global, or generative, calibration learns a single shift and scale for converting a raw score to a likelihood ratio. On the side here you can see what happens when you take the score distributions without calibration and apply global calibration across the fourteen conditions: we're condensing the score distributions around the threshold and improving the calibration loss, because that's the region we're targeting. This calibration technique, as was explained earlier, is effective for a single known condition, but once you put multiple conditions in, you're not actually reducing the variability of your threshold, and that's a problem when you've got pooled-condition data.
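Since we'll keep referring back to global calibration, here's a minimal sketch of the idea, not our exact implementation: learn a single scale a and shift b so that a*s + b behaves as a log likelihood ratio, by minimizing a logistic-regression cross-entropy over pooled target and impostor scores. The arrays are hypothetical, and the equal weighting of the two classes below corresponds to the equal-cost, equal-prior operating point we focus on:

```python
import numpy as np
from scipy.optimize import minimize

def train_global_calibration(tar_scores, imp_scores):
    """Learn scale a and shift b so that a*s + b acts as a log-LR,
    by minimizing a balanced logistic-regression cross-entropy."""
    tar = np.asarray(tar_scores, dtype=float)
    imp = np.asarray(imp_scores, dtype=float)

    def xent(params):
        a, b = params
        # log(1 + exp(-llr)) for targets, log(1 + exp(llr)) for impostors
        c_tar = np.mean(np.logaddexp(0.0, -(a * tar + b)))
        c_imp = np.mean(np.logaddexp(0.0, a * imp + b))
        return 0.5 * (c_tar + c_imp)

    a, b = minimize(xent, x0=[1.0, 0.0], method="Nelder-Mead").x
    return a, b

# usage: a, b = train_global_calibration(tar, imp); llr = a * raw_score + b
```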
A quick description of metadata-based calibration: this takes into account side-information, or metadata, from each side of the trial, that is, the enrollment side and the test side. The big formula at the bottom shows how we map that to a likelihood ratio; you can look at the paper for more details on that one. It uses discriminative PLDA to jointly minimize a cross-entropy objective over the parameters at the bottom, where m and t represent the UAC vectors, which I'll describe on the next slide.
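The exact parameterization is in the paper; purely to illustrate the general idea (this schematic form is my own, not necessarily the one used in the system), side-information can enter the score-to-LLR map by letting the offset depend on the UAC vectors of the two sides:

```latex
\mathrm{llr}(s, \mathbf{m}, \mathbf{t}) = a\,s + \mathbf{w}^{\top}(\mathbf{m} + \mathbf{t}) + b
```

with the scalar a, the weight vector w, and the offset b trained jointly under the cross-entropy objective, and m and t the enrollment-side and test-side UAC vectors.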
Universal audio characterization is something we proposed, I think, at a conference a few years back, and it's a very simple approach. Take a training dataset, divide it into classes of interest, in our case language, channel, SNR, and gender, and model each of those classes with a Gaussian, so it's a Gaussian backend. When a test sample comes in, you calculate the posteriors from each of those Gaussians and end up with a vector, shown on the right-hand side. Say, for instance, that you train the system on French and English to distinguish those two languages, and you get a Spanish test segment coming in. Our hypothesis is that the system will say this sounds like, say, eighty percent French and twenty percent English, and reflect that in the posteriors. That's the idea.
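A minimal sketch of such a Gaussian backend, under the assumption of one Gaussian per class with a shared covariance and a flat class prior (class names and shapes here are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

class GaussianBackend:
    """One Gaussian per class over i-vectors; a test i-vector is mapped
    to its vector of class posteriors (one slice of the UAC vector)."""

    def fit(self, ivecs, labels):
        ivecs = np.asarray(ivecs)
        self.classes = sorted(set(labels))
        self.cov = np.cov(ivecs.T)  # covariance shared across classes
        self.means = {c: ivecs[[l == c for l in labels]].mean(axis=0)
                      for c in self.classes}
        return self

    def posteriors(self, ivec):
        loglik = np.array([multivariate_normal.logpdf(ivec, self.means[c], self.cov)
                           for c in self.classes])
        loglik -= loglik.max()      # numerical stability
        p = np.exp(loglik)
        return p / p.sum()          # flat prior over classes
```

One backend would be trained per attribute (language, channel, SNR, gender), and concatenating their posterior vectors gives the UAC vector.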
Let's take a look at the definition of the classes. What we want to do here is take the oracle experiment, the FBI data via cross-validation, and see what we can get from universal audio characterization. So we picked out three different classes: SNR, language, and channel, and we asked what calibration loss improvement we're going to get over global calibration; that's what's listed here. At the bottom of the table is what happens if you take each of those fourteen conditions, calibrate each one independently, and simply pool the results into the scores the table shows. That's essentially the best we could do.
So what we've done here is shown the potential of metadata-based calibration on our conditions; again, this depends on the source of the training data. We've chosen language and channel from here on for this evaluation.
Let's look at the sensitivity to the universal audio characterization and the training set used for the calibration model. The top two lines are what happens when we use the oracle experiment again, as detailed before, and we're comparing global and metadata-based calibration. Basically, what you can see here is that metadata-based calibration improves the Cllr, we're getting a slight reduction in equal error rate, and the calibration loss is improving a little bit as well.
If we then look at what happens when we bring in the matched dataset (remember, this is SRE and Fisher data that's meant to be similar to the FBI data conditions), we see something interesting. With global calibration, if we train the model on the matched data, we're actually degrading calibration quite severely compared to the oracle; I guess that's expected, in the sense that we don't always have the data that we're evaluating on. But once we look at metadata-based calibration: if we use the matched data to train the universal audio characterization and then use the actual FBI data to train the calibration model, we're not doing too badly.
We're getting a subtle improvement in Cllr. The problem occurs once we start using the matched data for the calibration model itself, the discriminative PLDA part: we start to really degrade calibration performance, and then the average equal error rate starts to blow up. So we've got a high sensitivity to the calibration training set here. One hypothesis, which we put in the paper, is that this may be due to the lack of cross-language and cross-channel conditions in the linear discriminant space.
So how do we handle unseen trial conditions? We asked how forensic experts do it, and whether we can implement that: they select a representative calibration training set for each individual trial. Now, with the 2.8 million trials in this database, that's not something that can be done by hand, and this brings us to trial-based calibration. It's modeled on the approach of forensic experts, and it isn't meant to replace them by any means, but that was the motivation.
The idea is that the system delays the choice of calibration training data until it knows the conditions of the trial. Given a trial, we select a representative dataset for the enrollment sample; then we construct trials against data that's representative of the test sample as well. The challenge here is: how do we define representative?
I'm going to work through the boxes here, showing the process we use for selecting, for each individual trial, a small subset of a thousand target trials, plus however many impostor trials come with them.
The first thing we do is extract the UAC vectors from the enrollment side and the test side; this is essentially predicting the conditions on both sides of the trial. Then we rank-normalize those UAC vectors against the calibration UAC vectors. We've got this candidate calibration dataset, which could be any of the three sets I explained earlier, and we've extracted UACs for each of those, so we already know the conditions of the calibration data as the system predicts them. Then we do rank normalization.
For those who don't know, rank normalization is a very simple process where you replace the actual value in a given dimension of the vector with its rank against everything in the calibration set; you need a set to rank against. There's more detail in the paper.
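As a concrete picture of that step, here's a minimal sketch of rank normalization (hypothetical arrays; each dimension's value is replaced by its rank among the calibration set's values in that dimension, scaled to [0, 1]):

```python
import numpy as np

def rank_normalize(vectors, calib_vectors):
    """Replace each value with its rank against the calibration set,
    per dimension, scaled to [0, 1]."""
    vectors = np.asarray(vectors, dtype=float)
    calib_sorted = np.sort(np.asarray(calib_vectors, dtype=float), axis=0)
    n = calib_sorted.shape[0]
    ranked = np.empty_like(vectors)
    for d in range(vectors.shape[1]):
        ranked[:, d] = np.searchsorted(calib_sorted[:, d], vectors[:, d]) / n
    return ranked
```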
The similarity measure is a very simple Euclidean distance from the rank-normalized calibration vectors; the trial's vectors here have also been rank-normalized against the same set. This allows us to find the most representative calibration segments for both the enrollment and the test.
Then there's the sorting process. What we've done before we get to this point is take the candidate calibration segments and do an exhaustive score comparison using the same system, so we get a calibration score matrix. What we do now is sort the rows by similarity to the enrollment, and then the columns by similarity to the test. What we end up with is the upper-left corner of the matrix being most representative of the trial we've been given.
Selection then involves trying to get a thousand target trials. We simply add to the selected calibration set by finding the next most representative segment from the enrollment side or the test side, whichever scores closer based on the similarity measure.
One thing to note here is that segments without target trials are excluded as you go through this process; otherwise you might have cross-database impostor trials, which are actually quite easy, and that could bias the calibration model. That's something to note about this approach.
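Tying the last few slides together, here is a condensed sketch of the per-trial selection under the assumptions above. All names are hypothetical: the UAC vectors are assumed already rank-normalized, score_matrix holds the precomputed exhaustive scores among calibration segments, and the exclusion of segments lacking target trials is omitted for brevity.

```python
import numpy as np

def select_calibration_trials(enr_uac, tst_uac, calib_uac,
                              score_matrix, is_target, n_target=1000):
    """Grow a per-trial calibration set from the most representative
    candidate segments until it contains n_target target trials."""
    # Euclidean distance in rank-normalized UAC space, per trial side
    d_enr = np.linalg.norm(calib_uac - enr_uac, axis=1)
    d_tst = np.linalg.norm(calib_uac - tst_uac, axis=1)
    enr_order, tst_order = np.argsort(d_enr), np.argsort(d_tst)

    rows, cols = [enr_order[0]], [tst_order[0]]
    e = t = 1
    while is_target[np.ix_(rows, cols)].sum() < n_target:
        # expand whichever side offers the closer next candidate
        if e < len(enr_order) and (t >= len(tst_order)
                                   or d_enr[enr_order[e]] <= d_tst[tst_order[t]]):
            rows.append(enr_order[e]); e += 1
        elif t < len(tst_order):
            cols.append(tst_order[t]); t += 1
        else:
            break  # candidate pool exhausted; fall back to what we have

    block = score_matrix[np.ix_(rows, cols)]
    mask = is_target[np.ix_(rows, cols)]
    return block[mask], block[~mask]   # target scores, impostor scores

# usage: tar, imp = select_calibration_trials(...), then train a per-trial
# calibration (e.g. the scale-and-shift sketch earlier) on tar and imp.
```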
To recap, the intention is to overcome the shortcomings of metadata-based calibration by selecting the most representative trials and then learning the calibration model from those. Representativeness is not guaranteed: it's not saying that within this pool of data we can always find something representative; finding a matched, representative set is not always possible. In the case that there is nothing like the trial coming across, it would probably revert to something more like a general, randomly selected calibration model, and we suppose that's better than overfitting to a poor match.
This is suitable for evaluation scenarios where you've got to have a decision for everything: you've got speech from both sides of the trial, and you need to produce a score for the evaluation. That does not represent what forensic experts would do; if, for instance, they lacked data for a given trial, they might simply say this is impossible and not make the call. Those are just a few things to keep in mind.
Let's look at the results, on the matched data here, compared first against the global calibration technique. Across all fourteen conditions we're getting a nice improvement: an average thirty-five percent reduction in calibration loss. Not shown on the slide, but in the paper, is a twenty percent reduction in Cllr, the more stringent metric.
If we compare the three approaches now on the large variability data, the set pooled from many different sources just to throw at the system, we see that metadata-based calibration actually reduces the average calibration loss at the given operating point, but unfortunately increases Cllr and the equal error rate as well. Again, this probably comes down to the overfitting issue, or the lack of trials in certain conditions: if, for instance, a condition came into the metadata-based calibration technique for which it had only a few trials, or few errors, it might be quite confident that this is the way we should calibrate, when in fact it's quite mismatched to the data that's coming in.
Trial-based calibration, however, improved the calibration metrics in both cases and also improved the discrimination power of the system. This, again, is probably something that should be expected, given that otherwise you're trying to apply a single threshold to hit the equal error rate point across fourteen conditions.
Pictorially, and I found this kind of interesting, here's the threshold varying between the different conditions when trained on the large variability data. You can see that metadata-based calibration and global calibration both had a wide spread across the thresholds, whereas trial-based calibration, shown on the same scale down the bottom, is starting to cluster them close to zero. Obviously it's not perfect; we haven't succeeded in getting to where we need to be, but it's a step in the right direction, I suppose.
In conclusion, we can say that it's difficult to calibrate over a wide range of conditions. We showed that metadata-based calibration was struggling when we hadn't seen one of the trial conditions in training, or had seen very few, so we proposed trial-based calibration to address that shortcoming. What it does is select the calibration training set at test time. It avoids overfitting to limited trials by requiring a minimum number of target trials, the one thousand target trials we worked with, and it reverts to a more general calibration model if the conditions aren't seen.
There's a lot of future work here. First, remove the computational bottleneck of calibrating 2.8 million trials independently; one option might be a closed-form solution like the one presented previously, which I believe was actually mentioned earlier. Second, some indication of how representative the selected calibration set is for that trial: if, for instance, we failed to select a matched, representative set, and that set is in fact something forensic experts wouldn't have chosen, the user would want to know. Third, can we incorporate phonetic information, which is relevant to an earlier talk; is that suitable in an i-vector framework? And finally, can we actually learn a way of approximating the calibration shift and scale using just the UAC vectors?
And that concludes my talk; I'll just leave you with the flow diagram in case there are questions.
Question: In your trial-based calibration there are two components you have to train, the universal audio characterization and the trial-based calibration model, correct? In the results you showed, did both of those use the matched dataset?
Answer: Yes, that's correct.
Question: So with both matched it was the worst case, quite bad actually. But in one of your tables, I think, there's a line where you mixed the two. So do you think the real issue is just the UAC? Because when you used the matched dataset for the UAC but the evaluation data for the calibration set, there was still a mismatch the system needed to see. Would you say it's just the UAC that's the issue?
Answer: When we look at this, the UAC is obviously playing a part in the Cllr and also the average equal error rate. But if you use the matched data for training the UAC and the actual evaluation data for the calibration set, we perform comparably to global in the same condition, so we're actually not losing too much from having that sort of mismatch.
Question: I also saw in your future work that you're still thinking about measuring how representative the selected set really is. Do you have some ideas there? Because, to my limited mathematical mind, I would think some sort of outlier detection for that case.
Answer: We haven't really thought about it at this point, to be honest, but we know it would be something definitely of interest. Equally, this is a tool to go alongside a forensic expert; we know that automatic tools can be used to aid in certain decisions, and to have a system that can dynamically calibrate and provide a better decision to the expert is already a benefit. But having confidence in the system's calibration is also important.