So, this work was done in collaboration with a very large number of colleagues, and they did the larger part of the work, so I'd like to thank Désiré, George, Daniel, Jack, Tomi, Alvin, Alan, Mark, and Doug.
So the goal of the challenge was to support and encourage the development of new methods for speaker detection utilizing i-vectors. The intent was to explore new ideas in machine learning for use in speaker recognition, to try to make the field more accessible to people outside of the audio processing community, and to improve the performance of the technology.
The challenge format, for people who don't know, was to use i-vectors rather than audio, that is, to distribute the i-vectors themselves, and it was all hosted on a web platform, so it was entirely online: the registration, the system submission, and receiving results were all online. The reason for using i-vectors and the web platform was to attempt to expand the number and types of participants, including ones from the machine learning community, and to allow iterative submissions with fast turnaround in order to support research progress during the actual evaluation.
Another thing that was different from what people may be accustomed to with the regular SRE was that a large development set of unlabeled i-vectors was distributed to be used as dev data. The intent there was to encourage new, creative approaches to modeling, and in particular the use of clustering to improve performance. In addition to these things, one thing we were hoping to do was to set a precedent, or at least have a proof of concept, for future evaluations where there can be web-based registration, data distribution, and potentially results submission, trying to make this more efficient and more user friendly for the community.
so
The objectives driving the data selection were to include multiple training sessions for each target speaker in the main evaluation test. In recent SREs an optional test has involved multiple training sessions, but in this challenge we wanted to include that for everyone as the main focus. Also included were same-handset target trials and cross-sex non-target trials, both of which are unusual for the regular SRE.
Also something different was drawing i-vector durations from a log-normal distribution, as opposed to some discrete set of uniform durations. The reason for this was that it feels more realistic, and it's a challenge that people seemed eager to address; also, varying the duration allows us to do post-evaluation analysis.
So the task is speaker detection, which hopefully everybody here knows by the third day of the conference. The systems were evaluated over a set of trials, where each trial compared a target speaker model, in this case a set of five i-vectors, against a test speech segment comprised of a single i-vector. The system determines whether or not the speaker in the test segment is the target speaker and outputs a single real number; no hard decision was necessary. The trial outputs are then compared to ground truth to compute a performance measure, which for the i-vector challenge was a DCF.
Hopefully people know what target trials and non-target trials and misses and false alarms are. Does anyone not know that? Okay, if not, come see me afterwards. The measure was DCF, which is essentially just the miss rate at a given threshold plus one hundred times the false alarm rate, and the official overall measure was the minDCF, seen here.
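As a rough illustration of that metric, here is a minimal sketch of computing minDCF from a list of trial scores and ground-truth labels; the function name and the vectorized threshold sweep are my own choices, not the official NIST scoring tool.

```python
import numpy as np

def min_dcf(scores, labels, fa_cost=100.0):
    """min DCF = min over thresholds of (P_miss + 100 * P_fa).

    scores: real-valued system outputs, one per trial (higher = more target-like)
    labels: 1 for target trials, 0 for non-target trials
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)          # sweep thresholds from low to high
    labels = labels[order]

    n_tar = labels.sum()
    n_non = len(labels) - n_tar

    # Placing the threshold after the k lowest scores rejects those k trials:
    # targets among them are misses, non-targets above it are false alarms.
    p_miss = np.concatenate(([0], np.cumsum(labels))) / n_tar
    p_fa = np.concatenate(([n_non], n_non - np.cumsum(1 - labels))) / n_non

    return float(np.min(p_miss + fa_cost * p_fa))
```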
So the challenge i-vectors were produced with a system developed jointly between Johns Hopkins and MIT Lincoln Labs. It uses standard MFCCs as the acoustic features and a GMM universal background model. The source data were the LDC Mixer corpora, in particular Mixer 1, 3, and 7, as well as remakes, and included around sixty thousand telephone call sides from about six thousand speakers. The durations of the speech segments were up to five minutes, drawn from a log-normal distribution with a mean of nearly forty seconds. For each selected segment, participants were provided with a 600-dimensional i-vector as well as the duration of the speech from which the i-vector was extracted.
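Just to make that duration statement concrete, here is a small sketch of sampling durations from a log-normal with a mean near forty seconds, capped at five minutes; the sigma value (and hence mu) is an arbitrary assumption, since the talk only gives the mean and the cap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: sigma chosen arbitrarily, mu set so that the
# log-normal mean exp(mu + sigma**2 / 2) comes out to about 40 seconds.
sigma = 1.0
mu = np.log(40.0) - sigma**2 / 2

durations = rng.lognormal(mean=mu, sigma=sigma, size=10_000)
durations = np.clip(durations, None, 300.0)   # calls were capped at five minutes

print(durations.mean())   # close to 40 seconds
```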
So that's the data, and the data was then partitioned into a development set and an enrollment/test set. For the development partition, the calls were from speakers without test data, and it consisted of around thirty-six thousand telephone call sides from around five thousand speakers. And as I said earlier, it was unlabeled, so no speaker labels were given with the development partition.
For the enrollment and test partition, calls were from speakers with at least five calls from different phone numbers and at least eight calls from a single phone number. It consisted of about thirteen hundred target speakers, and the same number of target models, and almost ten thousand test i-vectors. The target trials were limited to ten same-number and ten different-phone-number calls per speaker, and the non-target trials came from other target speakers as well as five hundred speakers who were not target speakers, two hundred fifty males and two hundred fifty females. The trials consisted of all possible pairs of a target speaker model and a test i-vector, about twelve and a half million trials, and included cross-sex non-target trials as well as same-number target trials.
The trials were divided into two randomly selected subsets; since someone asked about this, the speakers did overlap between the progress subset and the evaluation subset. Forty percent was used for the progress subset, which was what was used to monitor progress. For people familiar, or maybe not, with the challenge, there was a progress board where people could see how they were doing and how other people were doing, and that was updated using the progress set. Sixty percent of the data was held out until the end of the evaluation period, and then the system submissions were scored for the official results using this remaining sixty percent.
So, some structure to the evaluation: the system output for each trial could be based only on the trial's model and test i-vectors, as well as the durations provided and the provided development data. Normalization over multiple test segments or target speakers was not allowed. Use of evaluation data for non-target speaker modeling was not allowed. And training system parameters using data not provided as part of the challenge was also ruled out. Rules one, two, and three are pretty typical for the NIST evaluations; four is actually new, and the intent was to remove data engineering and also to encourage participation from sites that don't have a lot of their own speech data.
So in terms of participation, there were about three hundred registrants from about fifty countries. A hundred and forty of the registrants, from a hundred and five unique sites, made at least one valid submission, so there were some number of people who registered but weren't able to submit a system. The number of submissions actually exceeded eight thousand. If we compare these numbers to the SREs, we do see a really large increase in participation, which we were excited to see.
In addition to receiving data, a baseline system was distributed with the evaluation. It used a variant of cosine scoring; basically the steps are: estimate a global mean and covariance on the unlabeled data; using that mean and covariance, center and whiten the i-vectors and project them onto the unit sphere; then, for each model, average its five i-vectors and project the result onto the unit sphere; and then compute the inner product.
One thing to note is that because the dev data was unlabeled, techniques such as WCCN and LDA were not possible to use.
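Here is a minimal sketch of that style of cosine-scoring baseline, assuming the dev, model, and test i-vectors are already loaded as NumPy arrays; it illustrates the steps described above and is not the distributed baseline code itself.

```python
import numpy as np

def whitening_transform(dev_ivectors):
    """Estimate a global mean and whitening matrix from the unlabeled dev i-vectors."""
    mean = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    # Inverse square root of the covariance via its eigendecomposition.
    vals, vecs = np.linalg.eigh(cov)
    whiten = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return mean, whiten

def normalize(ivectors, mean, whiten):
    """Center, whiten, and project i-vectors onto the unit sphere."""
    x = (ivectors - mean) @ whiten
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def score_trial(model_ivectors, test_ivector, mean, whiten):
    """Cosine score: normalize and average the model's five i-vectors,
    re-normalize, and take the inner product with the normalized test i-vector."""
    model = normalize(model_ivectors, mean, whiten).mean(axis=0)
    model /= np.linalg.norm(model)
    test = normalize(test_ivector, mean, whiten)
    return float(model @ test)
```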
In addition to that, there was an oracle system that was not provided but was kept at JHU, which had access to the development data speaker labels. That system was gender dependent, with a 400-dimensional speaker space. All of the i-vectors for each model were length-normalized and then averaged, and it discarded i-vectors with duration less than thirty seconds, which actually reduced the development set quite a bit.
And here we see our first result. The red line is the oracle system and the blue line is the baseline system; the solid line is on the evaluation set of trials, that is, the sixty percent that was held out, and the dotted line is on the progress set. So basically the gap between these lines indicates the potential value of having speaker labels, and the hope was to be able to use clustering techniques on the development set to close this gap.
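As one possible illustration of what "using clustering to close the gap" could look like, here is a hedged sketch that clusters the unlabeled dev i-vectors into pseudo-speakers and trains an LDA projection on those pseudo-labels; the cluster count, the LDA dimension, and the choice of k-means are assumptions for illustration, not any participant's actual method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pseudo_label_lda(dev_ivectors, n_clusters=1000, n_components=200):
    """Cluster unlabeled dev i-vectors into pseudo-speakers, then train LDA on
    those pseudo-labels to get a channel-compensating projection."""
    pseudo_labels = KMeans(n_clusters=n_clusters, n_init=3,
                           random_state=0).fit_predict(dev_ivectors)
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    lda.fit(dev_ivectors, pseudo_labels)
    # lda.transform(x) can then feed the cosine-scoring backend sketched above.
    return lda
```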
Here we see results. Here is the minDCF of the oracle system and of the baseline system; the blue line is the progress set and the red line is the eval set. And here we see the top ten performing systems and how they did on the progress set and on the eval set. Performance on the eval set was consistently better than on the progress set; I'm not exactly sure why, other than some random variation. And seventy-five percent of participants submitted a system that outperformed the baseline, which we were really pleased to see as well. Are we doing okay on time?
So, okay, great. Oops, let's skip this. Here we see progress over time. The green line is on the eval set, the blue line is on the progress set, and the red line is also on the progress set. Basically, the green line is the very best score observed to date, same with the blue line, and the red line is for the system that ended up with the top performance at the end, so we see the history of the performance over time.
A couple of things that we noted: the performance levelled off after about six weeks. We ran this from December through April, and basically after six weeks not much further progress was observed. Also interesting to note was that the leading system did not lead from December till about February, but by the end of that period it had taken the lead and stayed there.
Here we see performance by gender. On the left of each of these is the leading system and on the right is the baseline system. One thing that is kind of interesting to note is that the leading system did worse on the pooled same-sex trials than on male-only and female-only trials separately, which might be unexpected, but I think an explanation for this is that there were calibration issues.
Here we see performance by same and different phone number. Here the blue is the baseline; on the left is the same number, on the right is a different number. And here, as with gender, we see limited degradation in performance due to the change in phone number for the leading system, so this was very close, even compared to the baseline, which was fairly close.
So, there's some additional information available. You can see the Odyssey paper for more results, for example more information about the progress over time and gender effects, as well as same and different phone numbers. We also have an Interspeech paper that does some analysis of participation; it gives some of these same results but on the progress set, whereas the Odyssey paper focuses entirely on the eval set. And there's lots of work left to do, so we may have a future paper on duration, age, and other results. You can see those for more, and for additional information please feel free to contact us.
So, some conclusions. We thought that the process worked, which was very exciting for us; the website was brought up and stayed up, which was good. Participation exceeded that of prior SREs, which was one of the goals, and many sites significantly improved on the baseline system. Further investigation and feedback will be needed in order to determine the extent to which the new participation was from outside of the audio processing community. For people who signed up, we eventually asked if they were from the audio processing community, but we didn't think to do that during the initial sign-up, so in all other cases we don't know whether the additional participation came from outside the audio processing community or not. The thousands of submissions provide data for further analysis, which we look forward to doing, and these include things like clustering of unlabeled data, gender differences across and within trials, effects of handsets, and the role of duration.
And speaking of future work, we plan to enhance the online platform; for example, we would like to put analysis tools on the platform for participants to use. We expect to offer further online challenges, in part because they're more readily organized and also because it's possible to efficiently reuse test data. But we expect these results will affect full-fledged evaluations as well, the typical SREs. For example, we'd like to have increasingly web-based and user-friendly procedures for registration and for data distribution, and it's possible that we'll use separate evaluation datasets, one for iterative progress tracking and another held out with limited exposure. We've seen this used in past NIST evaluations, and it may see renewed use in the SREs.
thank you very much
Craig, can you go back to slide twenty-one? Okay.
I'm wondering, with those, is the leading system in those two conditions the same system?
I'm sure that is the same system in both.
The i-vector dimension used in the oracle, was it different from what you distributed, six hundred or twenty-four hundred? And why did you keep the same i-vectors for the two distributions? Someone from Lincoln may be able to address that.
Craig, in your final slide you mentioned, the last point, a dataset for iterated use. Are you thinking of something similar to what you have now? The point I'm getting at is, if you want to train, for example, calibration or fusion, then it's very nice to have as feedback, for example, the derivatives of your system parameters with respect to those scores. Do you think it would be possible to...?
I'm not sure what the question is; is this an issue of not having speaker labels for development, or...?
Well, we want to be able to train a fusion on that type of data, so can you see that happening? Because if you would just give us the data we could do that, but if the data stays on the other side, on your side, then that's more difficult and more complex.
right
Yes, and one thing that maybe I should clarify is that this was really meant in the context of the SRE. In other NIST evaluations they sometimes reuse a dataset from one year to another; they also have something, I guess, called a progress set, but they use it in a different sense than we are using it here, where people won't get the key for that, but they will have the key for the reused set. Does that answer your question, or...?
okay
One more question, I just wondered, it's not really relevant according to the rules, but those thirteen hundred models, are all the models from different speakers, or were there some speakers with more than one model? Because there was a question of whether that would be the case or not.