So this work was done in collaboration with a very large number of colleagues, and everyone did a large part of the work, so I'd like to thank Désiré, George, Daniel, Jack, Tomi, Alvin, Alan, Mark, and Doug.

So the goal of the challenge was to support and encourage the development of new methods for speaker detection utilizing i-vectors. The intent was to explore new ideas in machine learning for use in speaker recognition, to try to make the field more accessible to people outside of the audio processing community, and to improve the performance of the technology.

The challenge format, for people who don't know, was to use i-vectors; so rather than distributing the audio, we distributed the i-vectors themselves. And it was all hosted on a web platform, so it was entirely online: the registration, the system submission, and receiving results were all online.

The reason for using i-vectors and the web platform was to attempt to expand the number and types of participants, including ones from the machine learning community, and to allow iterative submissions with fast turnaround in order to support research progress during the actual evaluation.

Another thing that was different from what people may be accustomed to with the regular SRE was that a large development set of unlabeled i-vectors was distributed to be used as dev data. The intent there was to encourage new, creative approaches to modeling, and in particular the use of clustering to improve performance.

In addition to these things, one thing we were hoping to do was to set a precedent, or at least have a proof of concept, for future evaluations where there can be web-based registration, data distribution, and potentially results submission, trying to make this more efficient and more user friendly for the community.

So the objectives driving the data selection were to include multiple training sessions for each target speaker in the main evaluation test. In recent SREs an optional test has involved multiple training sessions, but in this challenge we wanted to include that for everyone as the main focus. Also included were same-handset target trials and cross-sex non-target trials, both of which are unusual for the regular SRE.

Also something different was drawing i-vector durations from a log-normal distribution as opposed to some discrete, uniform durations. The reason for this was that it feels more realistic, and it's a challenge that people seemed eager to address. And also, simply varying the duration allows us to do post-evaluation analysis.

So the task is speaker detection, which hopefully everybody here knows by the third day of the conference. The systems were evaluated over a set of trials, where each trial compared a target speaker model, which in this case was a set of five i-vectors, and a test speech segment comprised of a single i-vector. The system determines whether or not the speaker in the test segment is the target speaker by outputting a single real number; no hard decision was necessary.

The trial outputs are then compared to ground truth to compute a performance measure, which for the i-vector challenge was a DCF. Hopefully people know what target trials, non-target trials, misses, and false alarms are. Does anyone not know that? Okay, if not, come see me afterwards. The measure was a DCF, which is essentially just the miss rate plus one hundred times the false alarm rate, and the official overall measure was the minDCF, seen here.
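
Written out the way it was just described, with the hundred-to-one weighting on false alarms and the minimum taken over the decision threshold, the measure is:

```latex
\mathrm{DCF}(\theta) = P_{\mathrm{miss}}(\theta) + 100 \times P_{\mathrm{FA}}(\theta),
\qquad
\mathrm{minDCF} = \min_{\theta}\, \mathrm{DCF}(\theta)
```

And as a minimal sketch of how one might compute that from a list of trial scores and a key; this is an illustrative helper with made-up names, not the official scoring tool:

```python
import numpy as np

def min_dcf(scores, is_target, fa_weight=100.0):
    """Minimum DCF over all decision thresholds.

    scores: higher means more target-like; is_target: True for target trials.
    """
    order = np.argsort(scores)[::-1]               # sort trials by descending score
    is_target = np.asarray(is_target, bool)[order]

    n_tgt = is_target.sum()
    n_non = is_target.size - n_tgt

    # Accepting the top-k trials: misses are targets below the threshold,
    # false alarms are non-targets at or above it.
    tgt_above = np.concatenate(([0], np.cumsum(is_target)))
    non_above = np.concatenate(([0], np.cumsum(~is_target)))

    p_miss = (n_tgt - tgt_above) / n_tgt
    p_fa = non_above / n_non
    return float(np.min(p_miss + fa_weight * p_fa))
```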

So the challenge i-vectors were produced with a system developed jointly between Johns Hopkins and MIT Lincoln Labs, and it uses standard MFCCs as the acoustic features and a GMM-UBM. The source data were the LDC Mixer corpora, in particular Mixers 1 through 7 and related collections, and included around sixty thousand telephone call sides from about six thousand speakers. The durations of the speech segments, from calls of up to five minutes, were drawn from a log-normal distribution with a mean of nearly forty seconds.

For each selected segment, participants were provided with the 600-dimensional i-vector as well as the duration of the speech from which the i-vector was extracted.
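
Just to illustrate what such a duration distribution looks like, here is a small sketch; the shape parameter is an assumption chosen only so the mean lands near forty seconds, not the value actually used to build the challenge data:

```python
import numpy as np

# Illustrative log-normal duration sampler: the mean of a log-normal is exp(mu + sigma^2/2),
# so with an assumed sigma = 1.0 we set mu = ln(40) - 0.5 to target a ~40 s average.
sigma = 1.0
mu = np.log(40.0) - 0.5 * sigma**2

rng = np.random.default_rng(0)
durations = rng.lognormal(mean=mu, sigma=sigma, size=10_000)
durations = np.minimum(durations, 300.0)  # calls were at most five minutes long

print(round(durations.mean(), 1))  # close to 40 seconds (the cap pulls it down slightly)
```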

So this is the data, and then the data was partitioned into a development set and an enrollment/test set. For the development partition, the calls were from speakers without test data, and it consisted of around thirty-six thousand telephone call sides from around five thousand speakers. And as I said earlier, it was unlabeled, so no speaker labels were given with the development partition.

For the enrollment and test partition, calls were from speakers with at least five calls from different phone numbers and at least eight calls from a single phone number. It consisted of about thirteen hundred target speakers, and the same number of target models, and almost ten thousand test i-vectors. The target trials were limited to ten same and ten different phone number calls per speaker, and non-target trials came from other target speakers as well as five hundred speakers who were not target speakers, two hundred fifty males and two hundred fifty females.

The trials consisted of all possible pairs of a target speaker and a test i-vector, about twelve and a half million trials (roughly thirteen hundred models times ten thousand test i-vectors), and included cross-sex non-target trials as well as same-number target trials.

The trials were divided into two randomly selected subsets; someone asked about this, and the speakers did overlap between the progress subset and the evaluation subset. Forty percent was used for the progress subset, which was what was used to monitor progress. For people familiar, or maybe not, I should say, with the i-vector challenge: there was a progress board where people could see how they were doing and how other people were doing, and that was updated using the progress set. Sixty percent of the data was held out until the end of the evaluation period, and then the system submissions were scored for the official results using this remaining sixty percent.

So, some structure to the evaluation: system output for each trial could be based only on the trial's model and test i-vectors, as well as the durations provided and the provided development data. Normalization over multiple test segments or target speakers was not allowed. Use of evaluation data for non-target speaker modeling was not allowed. And training system parameters using data not provided as part of the challenge was also ruled out. Rules one, two, and three are pretty typical for the NIST evals; rule four is actually new. The intent was to remove data engineering and also to encourage participation from sites that don't have a lot of their own speech data.

So, in terms of participation, there were about three hundred registrants from about fifty countries. One hundred forty of the registrants, from one hundred five unique sites, made at least one valid submission; so there were some number of people who registered but weren't able to submit a system. The number of submissions actually exceeded eight thousand. If we compare these numbers to the SREs, we do see a really large increase in participation, which we were excited to see.

In addition to receiving data, a baseline system was distributed with the evaluation. It used a variant of cosine scoring. Here you see the five steps: estimate a global mean and covariance on the unlabeled data; use that mean and covariance to center and whiten the i-vectors; project them onto the unit sphere; then, for each model, average its five i-vectors and project that onto the unit sphere; and then compute the inner product. One thing to note is that because the dev data was unlabeled, WCCN and LDA were not possible to use.
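
Here is a rough sketch of those five steps in numpy terms; it's only an illustration of the cosine-scoring recipe just described (the function and variable names are mine, and it assumes the i-vectors arrive as plain arrays), not the baseline code that was actually distributed:

```python
import numpy as np

def fit_whitener(dev_ivectors):
    """Step 1: estimate a global mean and covariance from the unlabeled dev i-vectors."""
    mean = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    # The inverse square root of the covariance gives the whitening transform.
    vals, vecs = np.linalg.eigh(cov)
    whiten = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return mean, whiten

def unit_normalize(x, mean, whiten):
    """Steps 2-3: center, whiten, and project onto the unit sphere."""
    x = (x - mean) @ whiten
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def score_trial(model_ivectors, test_ivector, mean, whiten):
    """Steps 4-5: average the model's five i-vectors, re-project, take the inner product."""
    model = unit_normalize(model_ivectors, mean, whiten).mean(axis=0)
    model /= np.linalg.norm(model)
    test = unit_normalize(test_ivector, mean, whiten)
    return float(model @ test)
```

The step that labeled dev data would normally enable, something like WCCN or LDA before the cosine, is exactly what's missing here, which is the gap the unlabeled clustering approaches were meant to close.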

In addition to that, there was an oracle system that was not provided but kept at JHU, which had access to the development data speaker labels. The system was gender dependent with a 400-dimensional speaker space; all of the i-vectors for each model were length normalized and then averaged, and it discarded i-vectors with durations of less than thirty seconds, which actually reduced the development set quite a bit.

And here we see our first result. The red line is the oracle system and the blue line is the baseline system; the solid lines are on the evaluation set of trials, the sixty percent that were held out, and the dotted lines are on the progress set. So basically the gap between these lines indicates the potential value of having speaker labels, and the hope was to be able to use clustering techniques on the development set to close this gap.

Here we see more results. So here is the minDCF of the oracle system and of the baseline system, where the blue line is the progress set and the red line is the eval set. And here we see the top ten performing systems and how they did on the progress set and on the eval set. Performance on the eval set was consistently better than on the progress set; I'm not exactly sure why, other than some random variation. And seventy-five percent of participants submitted a system that outperformed the baseline, which really pleased us as well. Are we doing okay on time?

Okay, great. Oops, let me skip this. Here we see progress over time. The green line is on the eval set, the blue line is on the progress set, and the red line is also on the progress set. So basically the green line is the very best score observed to date, same with the blue line, and then the red line is for the system that ended up with the top performance at the end. So we see the history of the performance over time.

A couple of things that we noted: the performance leveled off after about six weeks. We ran this from December through April, and basically after six weeks not much further progress was observed. Also interesting to note was that the leading system did not lead from December until February, but after that period, having taken the lead, it stayed there.

Here we see performance by gender. On the left of each of these is the leading system, and on the right is the baseline system.

One thing kind of interesting to note is that the leading system did worse when the sexes were pooled than on male-only and female-only trials, which might be unexpected, but I think an explanation for this is that there were calibration issues.

Here we see performance by same and different phone number. The blue is the baseline; on the left is the same number, on the right is the different number. And here, I guess like with gender, we see limited degradation in performance due to the change in phone number for the leading system. So this was very close, even compared to the baseline, which was fairly close.

So there's some additional information available. You can see the Odyssey paper for more results, for example more information about the progress over time and gender effects, as well as same and different phone numbers. We also have an Interspeech paper that does some analysis of participation; it gives some of these same results, but on the progress set, whereas the Odyssey paper focuses entirely on the eval set. And there's lots of work to do, so we hope to have future papers on duration, aging, and other results. You can see those for additional information, and you can also please feel free to contact us.

So, some conclusions. We thought that the process worked, which was very exciting for us. The website was brought up and stayed up, which was good. Participation exceeded that of prior SREs, which was one of the goals, and many sites significantly improved on the baseline system. Further investigation and feedback will be needed in order to determine the extent to which the new participation was from outside of the audio processing community. For people who signed up, we eventually asked if they were from the audio processing community, but we didn't think to do that during the initial sign-up, so in all other cases we don't know whether the additional participation came from outside the audio processing community or not.

The thousands of submissions provide data for further analysis, which we look forward to doing. These include things like clustering of unlabeled data, gender differences across and within trials, effects of handsets, and the role of duration.

And speaking of future work, we plan to enhance the online platform; for example, we'd like to put analysis tools on the platform for participants to use. We expect to offer further online challenges, in part because they're more readily organized and also because it's possible to efficiently reuse test data. But we expect that these results will affect full-fledged evaluations as well, the typical SREs. For example, we'd like to have increasingly web-based and user-friendly procedures for registration and for data distribution. And it's possible that we'll use separate evaluation datasets, one for iterated progress measurement and another held out with limited exposure. We've seen this used in past NIST evaluations, and it may see renewed use in the SREs.

Thank you very much.

Craig, can you go back to slide twenty-one? Okay. I'm wondering, the leading system there, is the leading system in those two conditions the same system?

Sure, that is the same system in both.

The i-vector dimension used in the oracle was different from what you distributed; which was it, six hundred or four hundred? So why didn't you keep the same i-vector dimension for the two? Maybe Lincoln can address that.

Craig, in your final slide you mentioned, the last point, a dataset for iterated use. Are you thinking of something similar to what you have now? The point I'm getting at is, if you want to train, for example, calibration or fusion, then it's very nice to have feedback, for example the derivatives of your system parameters with respect to those scores. So do you think it would be possible to...

I'm not sure what the question is; is this an issue of not having speaker labels for development, or...

Well, we want to be able to train a fusion on that type of data, so can you see that happening? Because if you would just give us the data we could do that, but if the data stays on the other side, on your side, then that's more difficult and surely more complex.

Right.

Yes, and one thing that maybe I should clarify: this was really meant in the context of SRE. In other NIST evaluations they sometimes reuse a dataset from one year to another, or also have what I guess is called a progress set, but they use it in a different sense than we are using it here: people won't get the key for that, but they will have the key for the rest of the evaluation set. Does that answer your question, or...

Okay.

One quick question, I just wondered, it's not relevant according to the rules, but those thirteen hundred models, are they all from different speakers, or were there some speakers with more than one model? Because it would make a difference whether they were or not.