well everyone, today i'm going to talk about the effects of the new testing paradigm
in nist sre twelve
this work was done in collaboration with many colleagues
including alvin, vince, john, george, and jack
so before talking about what changed in sre twelve, let's just remind ourselves of some things that
stayed the same
the task in sre twelve was text independent speaker detection
by speaker detection i mean
given some speech from a target speaker and some speech from a test speaker
determine whether the target speaker and the test speaker are the same person
the evaluation consisted of a long series of trials, where a target trial is one in which the target
speaker and the test speaker were the same, and a non-target trial is one in which they were
different, so that much was the same
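to make the trial structure concrete, here is a minimal sketch in python of how a list of detection trials could be scored and tallied; the embedding-based scorer, the threshold, and the trial layout are hypothetical illustrations and not the actual sre twelve systems or data format

    # minimal illustrative sketch (hypothetical names, threshold, and data
    # layout; not the actual sre twelve systems) of scoring a list of
    # speaker detection trials and tallying miss and false alarm rates

    def similarity_score(train_embedding, test_embedding):
        # placeholder scorer: cosine similarity between fixed-length embeddings
        num = sum(a * b for a, b in zip(train_embedding, test_embedding))
        den = (sum(a * a for a in train_embedding) ** 0.5
               * sum(b * b for b in test_embedding) ** 0.5)
        return num / den

    def run_trials(trials, threshold=0.5):
        # trials: list of (train_embedding, test_embedding, is_target) tuples
        misses = false_alarms = targets = nontargets = 0
        for train_emb, test_emb, is_target in trials:
            accept = similarity_score(train_emb, test_emb) >= threshold
            if is_target:
                targets += 1
                misses += int(not accept)
            else:
                nontargets += 1
                false_alarms += int(accept)
        return misses / targets, false_alarms / nontargets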
something that changed in sre twelve was that joint knowledge of target speakers was allowed
in the past each trial had to be processed independently from the others
but in sre twelve
it was permissible to use knowledge of other target speakers for a trial, and this gave
rise to a distinction among the non-target speakers
namely whether they were among the target speakers, in which case they were considered
known non-target speakers
or whether they were not among the target speakers, in which case they were considered
unknown non-target speakers
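as a hedged illustration, assuming a hypothetical trial list in which each non-target trial records the identity of its test speaker, the known versus unknown distinction amounts to a simple set membership check against the released target speaker list

    # hypothetical illustration of the known versus unknown non-target split:
    # a non-target speaker is "known" if he or she is also one of the
    # released target speakers, and "unknown" otherwise

    def split_nontarget_trials(nontarget_trials, target_speaker_ids):
        # nontarget_trials: list of (trial_id, test_speaker_id) pairs
        target_set = set(target_speaker_ids)
        known, unknown = [], []
        for trial_id, test_speaker_id in nontarget_trials:
            (known if test_speaker_id in target_set else unknown).append(trial_id)
        return known, unknown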
also there was more, and more varied, training data for each target speaker, and the majority of
the target speakers in the evaluation
had more than one segment for training
and in those cases often the training data itself was varied
consisting of, for example, interviews recorded over a microphone, phone calls recorded over a microphone
or phone calls recorded over a telephone channel
most of the target speakers in sre twelve were used
in prior evaluations, which is something that was very different; they were identified in
advance, and all their speech from these prior evaluations was made available
beyond those roughly eighteen hundred, new data was collected from roughly three hundred and twenty
speakers, and roughly seventy of those were not present in the prior evaluations
those speakers had a single phone conversation released at the time of the evaluation
so given these changes, one question that may come up is why
there were several reasons; among them were to explore methods utilising large quantities of
training data
to allow participants an extended period of time to work on modeling techniques
to determine the benefit of allowing joint knowledge of target speakers, in particular the benefit
for performance, and also to increase the efficiency of the data collection
turning to the data, the target speaker training data broke down into two cases: if
released in advance of the evaluation, the target speaker training data consisted of prior evaluation
data collected as part of the ldc mixer corpora one through six, and if released
at the start of the evaluation
the training data was a single phone conversation recorded as part of mixer seven
for the test segments, most of them came from a newly collected corpus
which consisted of phone calls recorded over a telephone channel
from prior mixer speakers
and there were also a smaller number of phone conversations from the mixer seven corpus
and these are phone conversations recorded either over a telephone channel or a microphone channel
so there were many different types of trials included in the evaluation; for example, there
were trials where the speech had
noise added to it, or was recorded in a naturally noisy environment
but among the trials we wanted to emphasise some subsets of particular interest to us, so
these are called common conditions; there were five common conditions in the evaluation, and for today's presentation
we're just going to focus on two
and those are interview speech in test without added noise, and telephone channel speech in
test without added noise
so here we see in very round numbers the number of trials for each of
these common conditions
for common condition one, again interview test with no added noise, there were roughly three thousand
target trials, forty six thousand non-target trials from known non-target speakers, and roughly two thousand trials
from unknown non-target speakers
that was in the core test, which was required of all participants; the optional extended test was essentially
the same but with a larger number of trials
and you see the numbers there likewise for target trials and for non-target trials
from known and unknown non-target speakers
so let's look at some results
so here we see common condition two
which is telephone channel speech in test without added noise
and these are the results from one leading system; the others are similar
and as might be expected, better performance was observed for known speakers, that's the
red line
compared to the unknown speakers, that's the black line
one thing to note is that known speakers had multiple telephone conversations and sometimes even
interviews
as their training data
so here we see the same system
but on common condition one, which is interview speech in test without added noise
and unlike the last slide we saw
there's not a lot of difference between the two curves
and that was initially puzzling; we wanted to know why, and as it turns
out
the known speakers for this common condition were only known from a single
telephone channel recording
so whereas in the previous slide the known speakers had a large amount of training
data by which to know them, here the speakers were only known by a
single telephone channel recording
so in addition to having a small amount of data, the trials were cross channel
so in addition to this concept of known and unknown non-target speakers
there are also known and unknown systems
and what we mean here is that known systems presume that all of the non-target
trials came from known non-target speakers
and unknown systems presume that all the non-target trials were spoken by unknown non-target speakers
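one hedged way to picture this difference, assuming hypothetical scores against every enrolled target are available, is that an unknown system scores each trial against the claimed target only, while a known system can additionally compare against the other enrolled targets and penalise a trial whose test segment matches a different target better; the names below are illustrative only, not the actual submitted systems

    # illustrative sketch only (not the actual sre twelve submissions): scoring
    # one trial under the "unknown system" assumption versus a simple version
    # of the "known system" assumption; scores maps each enrolled target
    # speaker id to the raw score of the test segment against that target

    def unknown_system_score(scores, claimed_target):
        # open-set view: only the score against the claimed target matters
        return scores[claimed_target]

    def known_system_score(scores, claimed_target):
        # closed-set-flavoured view: if some other enrolled target matches the
        # test segment better, push the score for the claimed target down
        # (assumes at least one other enrolled target exists)
        best_other = max(s for spk, s in scores.items() if spk != claimed_target)
        return scores[claimed_target] - max(0.0, best_other - scores[claimed_target])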
so first let's look at
the accuracy of just a regular system on the extended trials for common condition two
and we see the thin dotted lines
i'm not sure if we can actually see those, especially in the back
but those are ninety percent confidence bounds, which suggest that there was a significant difference
in performance
between the known non-targets and the unknown non-targets; again red is the colour for
the known non-targets, black for the unknown non-targets
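as a rough sketch of where such bounds can come from, here is one common way, a normal approximation to the binomial, to put a ninety percent confidence interval around an observed error rate; this is an assumed illustration and may not match how the bounds on the slide were actually computed

    import math

    def error_rate_confidence_interval(errors, trials, z=1.645):
        # normal-approximation interval for a proportion; z = 1.645 gives
        # roughly a two-sided ninety percent interval; this is only a sketch
        # of the idea, and the plotted bounds may have been computed differently
        p = errors / trials
        half_width = z * math.sqrt(p * (1 - p) / trials)
        return max(0.0, p - half_width), min(1.0, p + half_width)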
so here we see an unknown system, again that's where the system always presumes that
the non-target speaker is unknown, and as might be expected
there is little difference observed between the two curves
and here is the accuracy of a known system, again that's where the system presumes that all of
the non-target speakers are known, which is to say they are among the target speakers
all of these are from the same site
and here, actually compared to two slides back
the performance difference is enhanced
so
in summary, sre twelve was an experiment with a new protocol in how speakers were made
known to the systems
for conversational telephone speech test segments, performance was improved when speakers were known to the system
for interview test segments such improvement was not observed, but that was just due to the setup
of the evaluation
it was not observable there, which is not to say that it would not be observed if the evaluation allowed for it
there's actually a lot more information than that covered in the paper, and other papers
covering things that we learned from the evaluation, so let me encourage you to
look at those or to contact us at that address
in addition
considering future evaluations, there is a question of whether allowing joint knowledge of the target
speakers is a good idea going forward
one thing to note is that
joint knowledge of target speakers makes results increasingly dependent on the target speakers selected, and introduces
a dependence across trials
so this makes estimating
error rates more difficult
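one way to see why this matters, sketched below in python with made-up names and a hypothetical data layout, is that once results depend on which target speakers were selected, uncertainty is better estimated by resampling whole speakers rather than individual trials, for example with a simple bootstrap over speakers

    import random

    def bootstrap_error_rate_bounds(trials_by_speaker, n_resamples=1000, seed=0):
        # trials_by_speaker: dict mapping speaker id -> (n_errors, n_trials);
        # resampling whole speakers rather than individual trials respects the
        # dependence introduced when results hinge on which speakers were selected
        rng = random.Random(seed)
        speakers = list(trials_by_speaker)
        rates = []
        for _ in range(n_resamples):
            sample = [trials_by_speaker[rng.choice(speakers)] for _ in speakers]
            errors = sum(e for e, _ in sample)
            total = sum(t for _, t in sample)
            rates.append(errors / total if total else 0.0)
        rates.sort()
        # roughly ninety percent equal-tailed bounds from the resampled rates
        return rates[int(0.05 * n_resamples)], rates[int(0.95 * n_resamples)]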
also something to consider is whether to continue having multi-session and multi-channel
training for the target speakers
so nist will resume the series, beginning with the i-vector challenge, in the near
future
some interest has
been expressed within the community regarding performing testing in acoustic environments different
from those of prior evaluations; joe made mention of
some utility in that
also one thing to note is that
in order to be able to conduct these types of evaluations it is necessary to
collect realistic and challenging speech data
which is both expensive and time-consuming
but in order to do that and have an even better evaluation, lessons learned from sre
twelve
will be taken into account and considered in the next evaluation; so i probably have
lots of time for questions
so thank you
so looking at your
c one and c two, the common condition one and two, can you talk to
the number of actual speakers that were involved in c one versus c two
not trials but speakers, right
the short answer is
yes, but not now 'cause i don't have that information handy, but we did look
at that; i can't recall precisely
but one of the things i do recall is that c
one had on the order of about forty three speakers involved only
so
i think we have
comparing those two for the effects of the known speakers, there's a couple of things changing
simultaneously, the microphone and the telephone, yes, and the pool is much smaller 'cause
i think it was only drawn from
mixer seven
right
so that's actually a really excellent point that we tried to emphasise during the evaluation workshop
but neglected to mention here
which is that the common conditions really were not comparable at all
in this evaluation; the speakers were different and
basically all the conditions changed, so thank you for noting that it's
inappropriate to make those comparisons
across common conditions; within a common condition
it was interesting to look at some of the sub-factor performance
could you just comment on whether you're going to be following up on this as
part of the nist analysis process
so this is actually something we've been looking into
pretty extensively; the short answer is it remains to be determined, but the long answer is
this is something we're seeking to do
okay
i have a quick question
you said at the end of the presentation that there'll be a focus
on multichannel enrollment and training conditions
and the question
the question is, like in the last sre twelve that you presented at the workshop
i think one thing is that
with microphone enrollment or telephone enrollment it seems like the focus wasn't there, maybe that
just wasn't the case this time around, but it still is a big challenge
so i just wonder if that was still going to be a focus in continuing evaluations
well that's a good question, and one of the things that we're very eager for is
to get feedback
from the community; one thing that is
time consuming and
if not expensive, the difficulty is setting up the evaluation even with the data
and so
we're much more likely to include that again if people will actually participate
i've also got a second question if i can; i'm not sure if you're
aware of the dnn i-vector paradigm that's come out as a framework for sre twelve
very impressive performance, particularly on telephone conditions; as you might know, with the dnn you
need a lot of data for training and it's very difficult to get to that level
one thing i'm afraid of is
teams that might not have the infrastructure to do such a thing
how will they compete with the other teams that do have the infrastructure in future
evaluations; is there something that can be done about that, such as the i-vector
challenge where the i-vectors are provided
i just wonder if you've got thoughts on that
in short, no, but that's a good question and
something that
we're
perfectly willing to explore
i just want to comment on one of your conclusion points; i'm
happy to hear about it
you have in mind with this point, of course, to extend the nist
databases with new challenging conditions
but i think it's also interesting to
increase the current actual conditions; we have a lot more to do on
speaker recognition by increasing the number of speakers
maybe by one order of magnitude
and by adding
more data per speaker; of course it will
force us to review the evaluation protocol and look at the results per
speaker, like
was done in the past; let's also look at the differences: if you
just
select randomly one thousand tests
from a larger database, you have some performance differences if you
choose one set compared to other sets, and a lot of things like
that
i think