I'll give a bit of the background, briefly describe the system, talk about the data, which has fourteen distinct conditions we are trying to calibrate, cover a little of the existing calibration methods, and then propose trial-based calibration.
So, we've had a very good background already from the other talks, but even very accurate SID systems may not be well calibrated. This means that you might have very low equal error rates for conditions evaluated independently, but once you pool those together, a single threshold fails to reach that operating point when applied.
The plot here illustrates the problem: the blue are distributions of target trials, the red ones are impostor trials. You can see the yellow threshold at the bottom suits some conditions quite well, but not others. So calibrating correctly for each condition helps us to reduce this threshold variability, among many other benefits.
You probably don't need a refresher after the other talks, but what we want, essentially, is calibrated scores that indicate the weight of evidence for a given trial; that is, the likelihood ratio for the hypothesis that the speakers are the same versus different. A typical application is forensic evidence in court.
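To put that in symbols (standard notation, not anything specific to our system), the calibrated score for a trial should behave as the likelihood ratio, and with calibrated log likelihood ratios the Bayes threshold follows from the costs and the prior:

```latex
\mathrm{LR}(s) = \frac{p(s \mid H_{\mathrm{ss}})}{p(s \mid H_{\mathrm{ds}})},
\qquad
\tau = \log\frac{C_{\mathrm{fa}}\,(1 - P_{\mathrm{tar}})}{C_{\mathrm{miss}}\,P_{\mathrm{tar}}}
```

Here H_ss and H_ds are the same-speaker and different-speaker hypotheses, and tau is the threshold applied to the log likelihood ratio; in the equal-cost, equal-prior setting it is simply zero.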
Subsequently, if we have calibrated scores, we can make confident threshold-based Bayes decisions. This isn't trivial, as we've heard, without a representative calibration set, and it's difficult to handle the various conditions with a single calibration model; we'll see that later in this talk. We'll be measuring system performance with a number of metrics, mainly focusing on calibration loss.
Calibration loss indicates how close we are to performing the best we can for a particular operating point, and in this work we're focusing on equal costs between misses and false alarms, around the equal-error point. The other metric we're using is Cllr; this is a more stringent criterion, looking at how well calibrated we are across all points on the operating curve. We're also looking at the average equal error rate across the fourteen conditions, since we want to make sure that in calibrating the system we're not losing speaker discriminability. For all metrics, lower is better.
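For reference, here is a minimal sketch of the standard Cllr computation from scored trials; the arrays tar_llrs and imp_llrs are hypothetical, and this is the textbook formula rather than code from our system:

```python
import numpy as np

def cllr(tar_llrs, imp_llrs):
    """Cllr: average proper-scoring cost of log-likelihood-ratio scores
    over target and impostor trials, in bits; lower is better."""
    tar = np.asarray(tar_llrs, dtype=float)
    imp = np.asarray(imp_llrs, dtype=float)
    c_tar = np.mean(np.logaddexp(0.0, -tar))  # penalizes low target LLRs
    c_imp = np.mean(np.logaddexp(0.0, imp))   # penalizes high impostor LLRs
    return (c_tar + c_imp) / (2.0 * np.log(2.0))
```

A perfectly calibrated, perfectly discriminating system drives this toward zero; miscalibration inflates it even when discrimination is good.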
Our aim here is to calibrate scores across those fourteen conditions such that the calibration loss is minimal, with a single system.
Here is a brief flow diagram of the system we're using in this study. It's a typical i-vector PLDA system, with a large UBM and i-vector extractor; you can look at the paper for references on that. The two areas we're focusing on in this work are the orange boxes: calibration, obviously, and a box called universal audio characterization (UAC), which is a way of extracting meta-information, or side-information, automatically from your i-vectors.
The evaluation dataset is a pooled-condition dataset given to us by the FBI, sourced from a number of different corpora; I won't go across it in detail. There are a number of different raw conditions here, grouped into attribute conditions: cross-language, cross-channel, a mix of both, clean and noisy speech, and a variety of durations. There are more details in the paper in terms of the speaker breakdown and the language breakdown. On the right-hand side you've got the equal error rates for my baseline system, just to show you that the difficulty increases as we go through the conditions.
Since we wanted calibration datasets, we put together three different datasets for this study. The first one is called Oracle. This is essentially taking that FBI dataset and doing cross-validation: training the calibration model with one half, testing on the other half, swapping them around and doing it again, then pooling the results to get the metrics.
The second dataset we labeled Matched. This isn't exactly matched to the evaluation data, but we've done the best we can from the SRE and Fisher data: trying to align languages, and trying to get cross-channel and cross-language trials. However, we were lacking in cross-language cross-channel trials, the mixture of both, and a few languages weren't in there either.
Finally, for the large variability dataset, we actually didn't put emphasis on trying to collect a dedicated set. We simply took a nice variation of SRE data, noised and reverberated SRE data, and RATS clean data from the DARPA RATS project, which gives five languages of interest from that program; you can look at the paper for details. The large variability dataset was meant to be a "let's just throw what we can at the calibration model" set, to see how it holds up in the evaluation.
We're going to be looking at three different calibration training schemes. The first is global, which is generative calibration or logistic regression, the standard approach many of us probably use. Then there's metadata-based calibration, here implemented with discriminative PLDA and universal audio characterization; this is something that's been very prominent in past SRI evaluation systems, including for the DARPA RATS program, and very useful there. And finally, we propose trial-based calibration, which is also based on universal audio characterization to provide metadata.
Let's talk about the existing methods, look at some results, and discuss the shortcomings.
Global, or generative, calibration learns a single shift and scale for converting a raw score to a likelihood ratio. On the side here you can see what happens when you take the score distributions without calibration and apply global calibration across the fourteen conditions: we're condensing the score distributions around the threshold and improving the calibration loss, because that's the region we're targeting. This calibration technique, as was explained earlier, is effective for a single known condition, but once you put multiple conditions in, you're not actually reducing the variability of your threshold, and that's a problem when you've got pooled-condition data.
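Since we'll keep referring back to global calibration, here's a minimal sketch of the idea, not our exact implementation: learn a single scale a and shift b so that a*s + b behaves as a log likelihood ratio, by minimizing a logistic-regression cross-entropy over pooled target and impostor scores. The arrays are hypothetical, and the equal weighting of the two classes below corresponds to the equal-cost, equal-prior operating point we focus on:

```python
import numpy as np
from scipy.optimize import minimize

def train_global_calibration(tar_scores, imp_scores):
    """Learn scale a and shift b so that a*s + b acts as a log-LR,
    by minimizing a balanced logistic-regression cross-entropy."""
    tar = np.asarray(tar_scores, dtype=float)
    imp = np.asarray(imp_scores, dtype=float)

    def xent(params):
        a, b = params
        # log(1 + exp(-llr)) for targets, log(1 + exp(llr)) for impostors
        c_tar = np.mean(np.logaddexp(0.0, -(a * tar + b)))
        c_imp = np.mean(np.logaddexp(0.0, a * imp + b))
        return 0.5 * (c_tar + c_imp)

    a, b = minimize(xent, x0=[1.0, 0.0], method="Nelder-Mead").x
    return a, b

# usage: a, b = train_global_calibration(tar, imp); llr = a * raw_score + b
```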
A quick description of metadata-based calibration: this takes into account side-information, or metadata, from each side of the trial, that is, the enrollment side and the test side. The big formula at the bottom shows how we map that to a likelihood ratio; you can look at the paper for more details on that one. It uses discriminative PLDA to jointly minimize a cross-entropy objective over the parameters at the bottom, where m and t represent the UAC vectors, which I'll describe on the next slide.
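The exact parameterization is in the paper; purely to illustrate the general idea (this schematic form is my own, not necessarily the one used in the system), side-information can enter the score-to-LLR map by letting the offset depend on the UAC vectors of the two sides:

```latex
\mathrm{llr}(s, \mathbf{m}, \mathbf{t}) = a\,s + \mathbf{w}^{\top}(\mathbf{m} + \mathbf{t}) + b
```

with the scalar a, the weight vector w, and the offset b trained jointly under the cross-entropy objective, and m and t the enrollment-side and test-side UAC vectors.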
Universal audio characterization is something we proposed, I think, at a conference a few years back, and it's a very simple approach. Take a training dataset, divide it into classes of interest, in our case language, channel, SNR, and gender, and model each of those classes with a Gaussian, so it's a Gaussian backend. When a test sample comes in, you calculate the posteriors from each of those Gaussians and end up with a vector, shown on the right-hand side. Say, for instance, that you train the system on French and English to distinguish those two languages, and you get a Spanish test segment coming in. Our hypothesis is that the system will say this sounds like, say, eighty percent French and twenty percent English, and reflect that in the posteriors. That's the idea.
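A minimal sketch of such a Gaussian backend, under the assumption of one Gaussian per class with a shared covariance and a flat class prior (class names and shapes here are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

class GaussianBackend:
    """One Gaussian per class over i-vectors; a test i-vector is mapped
    to its vector of class posteriors (one slice of the UAC vector)."""

    def fit(self, ivecs, labels):
        ivecs = np.asarray(ivecs)
        self.classes = sorted(set(labels))
        self.cov = np.cov(ivecs.T)  # covariance shared across classes
        self.means = {c: ivecs[[l == c for l in labels]].mean(axis=0)
                      for c in self.classes}
        return self

    def posteriors(self, ivec):
        loglik = np.array([multivariate_normal.logpdf(ivec, self.means[c], self.cov)
                           for c in self.classes])
        loglik -= loglik.max()      # numerical stability
        p = np.exp(loglik)
        return p / p.sum()          # flat prior over classes
```

One backend would be trained per attribute (language, channel, SNR, gender), and concatenating their posterior vectors gives the UAC vector.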
Let's take a look at the definition of the classes. What we want to do here is take the oracle experiment, the FBI data via cross-validation, and see what we can get from universal audio characterization. So we picked out three different classes: SNR, language, and channel, and we asked what calibration loss improvement we're going to get over global calibration; that's what's listed here. At the bottom of the table is what happens if you take each of those fourteen conditions, calibrate each one independently, and simply pool the results into the scores the table shows. That's essentially the best we could do.
So what we've done here is shown the potential of metadata-based calibration on our conditions; again, this depends on the source of the training data. We've chosen language and channel from here on for this evaluation.
Let's look at the sensitivity to the universal audio characterization and the training set used for the calibration model. The top two lines are what happens when we use the oracle experiment again, as detailed before, and we're comparing global and metadata-based calibration. Basically, what you can see here is that metadata-based calibration improves the Cllr, we're getting a slight reduction in equal error rate, and the calibration loss is improving a little bit as well.
If we then look at what happens when we bring in the matched dataset (remember, this is SRE and Fisher data that's meant to be similar to the FBI data conditions), we see something interesting. With global calibration, if we train the model on the matched data, we're actually degrading calibration quite severely compared to the oracle; I guess that's expected, in the sense that we don't always have the data that we're evaluating on. But once we look at metadata-based calibration: if we use the matched data to train the universal audio characterization and then use the actual FBI data to train the calibration model, we're not doing too badly.
We're getting a subtle improvement in Cllr. The problem occurs once we start using the matched data for the calibration model itself, the discriminative PLDA part: we start to really degrade calibration performance, and then the average equal error rate starts to blow up. So we've got a high sensitivity to the calibration training set here. One hypothesis, which we put in the paper, is that this may be due to the lack of cross-language and cross-channel conditions in the linear discriminant space.
So how do we handle unseen trial conditions? We asked how forensic experts do it, and whether we can implement that: they select a representative calibration training set for each individual trial. Now, with the 2.8 million trials in this database, that's not something that can be done by hand, and this brings us to trial-based calibration. It's modeled on the approach of forensic experts, and it isn't meant to replace them by any means, but that was the motivation.
The idea is that the system delays the choice of calibration training data until it knows the conditions of the trial. Given a trial, we select a representative dataset for the enrollment sample; then we construct trials against data that's representative of the test sample as well. The challenge here is: how do we define representative?
I'm going to work through the boxes here, showing the process we use for selecting, for each individual trial, a small subset of a thousand target trials, plus however many impostor trials come with them.
The first thing we do is extract the UAC vectors from the enrollment side and the test side; this is essentially predicting the conditions on both sides of the trial. Then we rank-normalize those UAC vectors against the calibration UAC vectors. We've got this candidate calibration dataset, which could be any of the three sets I explained earlier, and we've extracted UACs for each of those, so we already know the conditions of the calibration data as the system predicts them. Then we do rank normalization.
For those who don't know, rank normalization is a very simple process where you replace the actual value in a given dimension of the vector with its rank against everything in the calibration set; you need a set to rank against. There's more detail in the paper.
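As a concrete picture of that step, here's a minimal sketch of rank normalization (hypothetical arrays; each dimension's value is replaced by its rank among the calibration set's values in that dimension, scaled to [0, 1]):

```python
import numpy as np

def rank_normalize(vectors, calib_vectors):
    """Replace each value with its rank against the calibration set,
    per dimension, scaled to [0, 1]."""
    vectors = np.asarray(vectors, dtype=float)
    calib_sorted = np.sort(np.asarray(calib_vectors, dtype=float), axis=0)
    n = calib_sorted.shape[0]
    ranked = np.empty_like(vectors)
    for d in range(vectors.shape[1]):
        ranked[:, d] = np.searchsorted(calib_sorted[:, d], vectors[:, d]) / n
    return ranked
```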
The similarity measure is a very simple Euclidean distance from the rank-normalized calibration vectors; the trial's vectors here have also been rank-normalized against the same set. This allows us to find the most representative calibration segments for both the enrollment and the test.
Then there's the sorting process. What we've done before we get to this point is take the candidate calibration segments and do an exhaustive score comparison using the same system, so we get a calibration score matrix. What we do now is sort the rows by similarity to the enrollment, and then the columns by similarity to the test. What we end up with is the upper-left corner of the matrix being most representative of the trial we've been given.
Selection then involves trying to get a thousand target trials. We simply add to the selected calibration set by finding the next most representative segment from the enrollment side or the test side, whichever scores closer based on the similarity measure.
One thing to note here is that segments without target trials are excluded as you go through this process; otherwise you might have cross-database impostor trials, which are actually quite easy, and that could bias the calibration model. That's something to note about this approach.
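Tying the last few slides together, here is a condensed sketch of the per-trial selection under the assumptions above. All names are hypothetical: the UAC vectors are assumed already rank-normalized, score_matrix holds the precomputed exhaustive scores among calibration segments, and the exclusion of segments lacking target trials is omitted for brevity.

```python
import numpy as np

def select_calibration_trials(enr_uac, tst_uac, calib_uac,
                              score_matrix, is_target, n_target=1000):
    """Grow a per-trial calibration set from the most representative
    candidate segments until it contains n_target target trials."""
    # Euclidean distance in rank-normalized UAC space, per trial side
    d_enr = np.linalg.norm(calib_uac - enr_uac, axis=1)
    d_tst = np.linalg.norm(calib_uac - tst_uac, axis=1)
    enr_order, tst_order = np.argsort(d_enr), np.argsort(d_tst)

    rows, cols = [enr_order[0]], [tst_order[0]]
    e = t = 1
    while is_target[np.ix_(rows, cols)].sum() < n_target:
        # expand whichever side offers the closer next candidate
        if e < len(enr_order) and (t >= len(tst_order)
                                   or d_enr[enr_order[e]] <= d_tst[tst_order[t]]):
            rows.append(enr_order[e]); e += 1
        elif t < len(tst_order):
            cols.append(tst_order[t]); t += 1
        else:
            break  # candidate pool exhausted; fall back to what we have

    block = score_matrix[np.ix_(rows, cols)]
    mask = is_target[np.ix_(rows, cols)]
    return block[mask], block[~mask]   # target scores, impostor scores

# usage: tar, imp = select_calibration_trials(...), then train a per-trial
# calibration (e.g. the scale-and-shift sketch earlier) on tar and imp.
```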
To recap, the intention is to overcome the shortcomings of metadata-based calibration by selecting the most representative trials and then learning the calibration model from those. Representativeness is not guaranteed: it's not saying that within this pool of data we can always find something representative; finding a matched, representative set is not always possible. In the case that there is nothing like the trial coming across, it would probably revert to something more like a general, randomly selected calibration model, and we suppose that's better than overfitting to a poor match.
This is suitable for evaluation scenarios where you've got to have a decision for everything: you've got speech from both sides of the trial, and you need to produce a score for the evaluation. That does not represent what forensic experts would do; if, for instance, they lacked data for a given trial, they might simply say this is impossible and not make the call. Those are just a few things to keep in mind.
Let's look at the results, on the matched data here, compared first against the global calibration technique. Across all fourteen conditions we're getting a nice improvement: an average thirty-five percent reduction in calibration loss. Not shown on the slide, but in the paper, is a twenty percent reduction in Cllr, the more stringent metric.
If we compare the three approaches now on the large variability data, the set pooled from many different sources just to throw at the system, we see that metadata-based calibration actually reduces the average calibration loss at the given operating point, but unfortunately increases Cllr and the equal error rate as well. Again, this probably comes down to the overfitting issue, or the lack of trials in certain conditions: if, for instance, a condition came into the metadata-based calibration technique for which it had only a few trials, or few errors, it might be quite confident that this is the way we should calibrate, when in fact it's quite mismatched to the data that's coming in.
Trial-based calibration, however, improved the calibration metrics in both cases and also improved the discrimination power of the system. This, again, is probably something that should be expected, given that otherwise you're trying to apply a single threshold to hit the equal error rate point across fourteen conditions.
Pictorially, and I found this kind of interesting, here's the threshold varying between the different conditions when trained on the large variability data. You can see that metadata-based calibration and global calibration both had a wide spread across the thresholds, whereas trial-based calibration, shown on the same scale down the bottom, is starting to cluster them close to zero. Obviously it's not perfect; we haven't succeeded in getting to where we need to be, but it's a step in the right direction, I suppose.
In conclusion, we can say that it's difficult to calibrate over a wide range of conditions. We showed that metadata-based calibration was struggling when we hadn't seen one of the trial conditions in training, or had seen very few, so we proposed trial-based calibration to address that shortcoming. What it does is select the calibration training set at test time. It avoids overfitting to limited trials by requiring a minimum number of target trials, the one thousand target trials we worked with, and it reverts to a more general calibration model if the conditions aren't seen.
There's a lot of future work here. First, remove the computational bottleneck of calibrating 2.8 million trials independently; one option might be a closed-form solution like the one presented previously, which I believe was actually mentioned earlier. Second, some indication of how representative the selected calibration set is for that trial: if, for instance, we failed to select a matched, representative set, and that set is in fact something forensic experts wouldn't have chosen, the user would want to know. Third, can we incorporate phonetic information, which is relevant to an earlier talk; is that suitable in an i-vector framework? And finally, can we actually learn a way of approximating the calibration shift and scale using just the UAC vectors?
And that concludes my talk; I'll just leave you with the flow diagram in case there are questions.
Question: In your trial-based calibration there are two components you have to train, the universal audio characterization and the trial-based calibration model, correct? In the results you showed, did both of those use the matched dataset?
Answer: Yes, that's correct.
Question: So with both matched it was the worst case, quite bad actually. But in one of your tables, I think, there's a line where you mixed the two. So do you think the real issue is just the UAC? Because when you used the matched dataset for the UAC but the evaluation data for the calibration set, there was still a mismatch the system needed to see. Would you say it's just the UAC that's the issue?
Answer: When we look at this, the UAC is obviously playing a part in the Cllr and also the average equal error rate. But if you use the matched data for training the UAC and the actual evaluation data for the calibration set, we perform comparably to global in the same condition, so we're actually not losing too much from having that sort of mismatch.
Question: I also saw in your future work that you're still thinking about measuring how representative the selected set really is. Do you have some ideas there? Because, to my limited mathematical mind, I would think some sort of outlier detection for that case.
Answer: We haven't really thought about it at this point, to be honest, but we know it would be something definitely of interest. Equally, this is a tool to go alongside a forensic expert; we know that automatic tools can be used to aid in certain decisions, and to have a system that can dynamically calibrate and provide a better decision to the expert is already a benefit. But having confidence in the system's calibration is also important.