well everyone, today i'm going to talk about the effects of the new testing paradigm
in nist sre twelve
this work was done in collaboration with many colleagues
including alvin, vince, john, george, and jack
so before talking about what changed in sre twelve, let's just remind ourselves of some things that
stayed the same
the task in sre twelve was text independent speaker detection
by speaker detection i mean
given some speech from a target speaker and some speech from a test speaker
determine whether the target speaker and the test speaker are the same person
the evaluation consisted of a long series of trials, where a target trial is one in which the target
speaker and the test speaker were the same, and a non-target trial is one in which they were
different, so that much was the same
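to make the trial structure concrete, here is a minimal sketch in python of how a list of detection trials could be scored and tallied; the embedding-based scorer, the threshold, and the trial layout are hypothetical illustrations and not the actual sre twelve systems or data format

    # minimal illustrative sketch (hypothetical names, threshold, and data
    # layout; not the actual sre twelve systems) of scoring a list of
    # speaker detection trials and tallying miss and false alarm rates

    def similarity_score(train_embedding, test_embedding):
        # placeholder scorer: cosine similarity between fixed-length embeddings
        num = sum(a * b for a, b in zip(train_embedding, test_embedding))
        den = (sum(a * a for a in train_embedding) ** 0.5
               * sum(b * b for b in test_embedding) ** 0.5)
        return num / den

    def run_trials(trials, threshold=0.5):
        # trials: list of (train_embedding, test_embedding, is_target) tuples
        misses = false_alarms = targets = nontargets = 0
        for train_emb, test_emb, is_target in trials:
            accept = similarity_score(train_emb, test_emb) >= threshold
            if is_target:
                targets += 1
                misses += int(not accept)
            else:
                nontargets += 1
                false_alarms += int(accept)
        return misses / targets, false_alarms / nontargets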
something that changed in sre twelve was that joint knowledge of target speakers was allowed
in the past each trial had to be processed independently from the others
but in sre twelve
it was permissible to use knowledge of other target speakers for a trial, and this gave
rise to a distinction among the non-target speakers
namely whether they were among the target speakers, in which case they were considered
known non-target speakers
or whether they were not among the target speakers, in which case they were considered
unknown non-target speakers
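as a hedged illustration, assuming a hypothetical trial list in which each non-target trial records the identity of its test speaker, the known versus unknown distinction amounts to a simple set membership check against the released target speaker list

    # hypothetical illustration of the known versus unknown non-target split:
    # a non-target speaker is "known" if he or she is also one of the
    # released target speakers, and "unknown" otherwise

    def split_nontarget_trials(nontarget_trials, target_speaker_ids):
        # nontarget_trials: list of (trial_id, test_speaker_id) pairs
        target_set = set(target_speaker_ids)
        known, unknown = [], []
        for trial_id, test_speaker_id in nontarget_trials:
            (known if test_speaker_id in target_set else unknown).append(trial_id)
        return known, unknown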
also there was more, and more varied, training data for each target speaker, and the majority of
the target speakers in the evaluation
had more than one segment for training
and in those cases often the training data itself was varied
consisting of, for example, interviews recorded over a microphone, phone calls recorded over a microphone
or phone calls recorded over a telephone channel
most of the target speakers in sre twelve were used
in prior evaluations, which is something that was very different; they were identified in
advance, and all their speech from these prior evaluations was made available
beyond those roughly eighteen hundred, new data was collected from roughly three hundred and twenty
speakers, and roughly seventy of those were not present in the prior evaluations
those speakers had a single phone conversation released at the time of the evaluation
so given these changes, one question that may come up is why
there were several reasons; among them were to explore methods utilising large quantities of
training data
to allow participants an extended period of time to work on modeling techniques
to determine the benefit of allowing joint knowledge of target speakers, in particular the benefit
for performance, and also to increase the efficiency of the data collection
turning to the data, the target speaker training data broke down into two cases: if
released in advance of the evaluation, the target speaker training data consisted of prior evaluation
data collected as part of the ldc mixer corpora one through six, and if released
at the start of the evaluation
the training data was a single phone conversation recorded as part of mixer seven
for the test segments, most of them came from a newly collected corpus
which consisted of phone calls recorded over a telephone channel
from prior mixer speakers
and there were also a smaller number of phone conversations from the mixer seven corpus
and these are phone conversations recorded either over a telephone channel or a microphone channel
so there were many different types of trials included in the evaluation; for example, there
were trials where the speech had
noise added to it, or was recorded in a naturally noisy environment
but among the trials we wanted to emphasise some subsets of particular interest to us, so
these are called common conditions; there were five common conditions in the evaluation, and for today's presentation
we're just going to focus on two
and those are interview speech in test without added noise, and telephone channel speech in
test without added noise
so here we see in very round numbers the number of trials for each of
these common conditions
for common condition one, again interview test with no added noise, there were roughly three thousand
target trials, forty six thousand non-target trials from known non-target speakers, and roughly two thousand trials
from unknown non-target speakers
that was in the core test, which was required of all participants; the optional extended test was essentially
the same but with a larger number of trials
and you see the numbers there likewise for target trials and for non-target trials
from known and unknown non-target speakers
so let's look at some results
so here we see common condition two
which is telephone channel speech in test without added noise
and these are the results from one leading system; the others are similar
and as might be expected, better performance was observed for known speakers, that's the
red line
compared to the unknown speakers, that's the black line
one thing to note is that known speakers had multiple telephone conversations and sometimes even
interviews
as their training data
so here we see the same system
but on common condition one, which is interview speech in test without added noise
and unlike the last slide we saw
there's not a lot of difference between the two curves
and that was initially puzzling; we wanted to know why, and as it turns
out
the known speakers for this common condition were only known from a single
telephone channel recording
so whereas in the previous slide the known speakers had a large amount of training
data by which to know them, here the speakers were only known by a
single telephone channel recording
so in addition to having a small amount of data, the trials were cross channel
so in addition to this concept of known and unknown non-target speakers
there are also known and unknown systems
and what we mean here is that known systems presume that all of the non-target
trials came from known non-target speakers
and unknown systems presume that all the non-target trials were spoken by unknown non-target speakers
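one hedged way to picture this difference, assuming hypothetical scores against every enrolled target are available, is that an unknown system scores each trial against the claimed target only, while a known system can additionally compare against the other enrolled targets and penalise a trial whose test segment matches a different target better; the names below are illustrative only, not the actual submitted systems

    # illustrative sketch only (not the actual sre twelve submissions): scoring
    # one trial under the "unknown system" assumption versus a simple version
    # of the "known system" assumption; scores maps each enrolled target
    # speaker id to the raw score of the test segment against that target

    def unknown_system_score(scores, claimed_target):
        # open-set view: only the score against the claimed target matters
        return scores[claimed_target]

    def known_system_score(scores, claimed_target):
        # closed-set-flavoured view: if some other enrolled target matches the
        # test segment better, push the score for the claimed target down
        # (assumes at least one other enrolled target exists)
        best_other = max(s for spk, s in scores.items() if spk != claimed_target)
        return scores[claimed_target] - max(0.0, best_other - scores[claimed_target])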
so first let's look at
the accuracy of just a regular system on the extended trials for common condition two
and we see the thin dotted lines
i'm not sure if we can actually see those, especially in the back
but those are ninety percent confidence bounds, which suggest that there was a significant difference
in performance
between the known non-targets and the unknown non-targets; again red is the colour for
the known non-targets, black for the unknown non-targets
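as a rough sketch of where such bounds can come from, here is one common way, a normal approximation to the binomial, to put a ninety percent confidence interval around an observed error rate; this is an assumed illustration and may not match how the bounds on the slide were actually computed

    import math

    def error_rate_confidence_interval(errors, trials, z=1.645):
        # normal-approximation interval for a proportion; z = 1.645 gives
        # roughly a two-sided ninety percent interval; this is only a sketch
        # of the idea, and the plotted bounds may have been computed differently
        p = errors / trials
        half_width = z * math.sqrt(p * (1 - p) / trials)
        return max(0.0, p - half_width), min(1.0, p + half_width)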
so here we see an unknown system, again that's where the system always presumes that
the non-target speaker is unknown, and as might be expected
there is little difference observed between the two curves
and here is the accuracy of a known system, again that's where the system presumes that all of
the non-target speakers are known, which is to say they are among the target speakers
all of these are from the same site
and here, actually compared to two slides back
the performance difference is enhanced
so
in summary, sre twelve was an experiment with a new protocol in how speakers were made
known to the systems
for conversational telephone speech test segments, performance was improved when speakers were known to the system
for interview test segments such improvement was not observed, but that was just due to the setup
of the evaluation
it was not observable there, which is not to say that it would not be observed if the evaluation allowed for it
there's actually a lot more information than that covered in the paper, and other papers
covering things that we learned from the evaluation, so let me encourage you to
look at those or to contact us at that address
in addition
considering future evaluations, there is a question of whether allowing joint knowledge of the target
speakers is a good idea going forward
one thing to note is that
joint knowledge of target speakers makes results increasingly dependent on the target speakers selected, and introduces
a dependence across trials
so this makes estimating
error rates more difficult
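one way to see why this matters, sketched below in python with made-up names and a hypothetical data layout, is that once results depend on which target speakers were selected, uncertainty is better estimated by resampling whole speakers rather than individual trials, for example with a simple bootstrap over speakers

    import random

    def bootstrap_error_rate_bounds(trials_by_speaker, n_resamples=1000, seed=0):
        # trials_by_speaker: dict mapping speaker id -> (n_errors, n_trials);
        # resampling whole speakers rather than individual trials respects the
        # dependence introduced when results hinge on which speakers were selected
        rng = random.Random(seed)
        speakers = list(trials_by_speaker)
        rates = []
        for _ in range(n_resamples):
            sample = [trials_by_speaker[rng.choice(speakers)] for _ in speakers]
            errors = sum(e for e, _ in sample)
            total = sum(t for _, t in sample)
            rates.append(errors / total if total else 0.0)
        rates.sort()
        # roughly ninety percent equal-tailed bounds from the resampled rates
        return rates[int(0.05 * n_resamples)], rates[int(0.95 * n_resamples)]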
also something to consider is whether to continue having multi-session and multi-channel
training for the target speakers
so nist will resume the series, beginning with the i-vector challenge, in the near
future
some interest has
been expressed within the community regarding performing testing in acoustic environments different
from those of prior evaluations; joe made mention of
some utility in that
also one thing to note is that
in order to be able to conduct these types of evaluations it is necessary to
collect realistic and challenging speech data
which is both expensive and time-consuming
but in order to do that and have an even better evaluation, lessons learned from sre
twelve
will be taken into account and considered in the next evaluation; so i probably have
lots of time for questions
so thank you
so looking at your
c one and c two, the common condition one and two, can you talk to
the number of actual speakers that were involved in c one versus c two
not trials but speakers, right
the short answer is
yes, but not now 'cause i don't have that information handy, but we did look
at that; i can't recall precisely
but one of the things i do recall is that c
one had on the order of about forty three speakers involved only
so
i think we have
comparing those two for the effects of the known speakers, there's a couple of things changing
simultaneously, the microphone and the telephone, yes, and the pool is much smaller 'cause
i think it was only drawn from
mixer seven
right
so that's actually a really excellent point that we tried to emphasise during the evaluation workshop
but neglected to mention here
which is that the common conditions really were not comparable at all
in this evaluation; the speakers were different and
basically all the conditions changed, so thank you for noting that it's
inappropriate to make those comparisons
across common conditions; within a common condition
it was interesting to look at some of the sub-factor performance
could you just comment on whether you're going to be following up on this as
part of the nist analysis process
so this is actually something we've been looking into
pretty extensively; the short answer is it remains to be determined, but the long answer is
this is something we're seeking to do
okay
i have a quick question
you said at the end of the presentation that there'll be a focus
on multichannel enrollment and training conditions
and the question
the question is, like in the last sre twelve that you presented at the workshop
i think one thing is that
with microphone enrollment or telephone enrollment it seems like the focus wasn't there, maybe that
just wasn't the case this time around, but it still is a big challenge
so i just wonder if that was still going to be a focus in continuing evaluations
well that's a good question, and one of the things that we're very eager for is
to get feedback
from the community; one thing that is
time consuming and
if not expensive, the difficulty is setting up the evaluation even with the data
and so
we're much more likely to include that again if people will actually participate
i've also got a second question if i can; i'm not sure if you're
aware of the dnn i-vector paradigm that's come out as a framework for sre twelve
very impressive performance, particularly on telephone conditions; as you might know, with the dnn you
need a lot of data for training and it's very difficult to get to that level
one thing i'm afraid of is
teams that might not have the infrastructure to do such a thing
how will they compete with the other teams that do have the infrastructure in future
evaluations; is there something that can be done about that, such as the i-vector
challenge where the i-vectors are provided
i just wonder if you've got thoughts on that
in short, no, but that's a good question and
something that
we're
perfectly willing to explore
i just want to comment on one of your conclusion points; i'm
happy to hear about it
you have in mind with this point, of course, to extend the nist
databases with new challenging conditions
but i think it's also interesting to
increase the current actual conditions; we have a lot more to do on
speaker recognition by increasing the number of speakers
maybe by one order of magnitude
and by adding
more data per speaker; of course it will
force us to review the evaluation protocol and look at the results per
speaker, like
was done in the past; let's also look at the differences: if you
just
select randomly one thousand tests
from a larger database, you have some performance differences if you
choose one set compared to other sets, and a lot of things like
that
i think