Hello, and thanks for tuning in to this presentation. My name is Omid Sadjadi, and together with my colleagues listed here at NIST, I will be presenting an overview of the 2019 NIST CTS Speaker Recognition Challenge, which was organized in the summer through fall of 2019.
Here is the outline of my presentation.
I will start by describing the highlights of the 2019 CTS Challenge. Then I will define the task, give a summary of the data sets and the performance metric for this challenge, and share some participation statistics, followed by results and system performance analyses. I will then conclude with a summary of the CTS Challenge and the main observations.
This slide presents the main highlights of the CTS Challenge, which included a new leaderboard-style evaluation with a progress set, an open training condition, as well as a newly designed and more flexible evaluation web platform. Other highlights, some recently introduced, included the use of CTS data collected outside of North America, variable test segment durations in the 10 to 60 second range, as well as factors for computing the metric, and, finally, labeled and unlabeled development sets provided by NIST.
The basic task in the CTS Challenge was speaker detection; that is, given enrollment data from a target speaker and test data from an unknown speaker, determine whether the two speakers are the same. This speaker detection task can be posed as a two-class hypothesis testing problem, where under the null hypothesis the test segment S is spoken by the target speaker, and under the alternative hypothesis the test segment is not spoken by the target speaker. The system output for this task is a statistic computed on the test segment, known as the log-likelihood ratio, defined on this slide.
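In its standard form, matching the hypothesis test just described, the log-likelihood ratio for a test segment S is

\[
\mathrm{LLR}(S) = \log \frac{p(S \mid H_0)}{p(S \mid H_1)},
\]

where H_0 is the hypothesis that S is spoken by the target speaker and H_1 is the hypothesis that it is not.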
In terms of evaluation conditions, the CTS Challenge offered an open training condition that allowed the use of unlimited data for system training, to demonstrate possible performance gains. For speaker enrollment, two conditions were offered, namely one-conversation and three-conversation. In the one-conversation condition, the system was given only one segment with approximately 60 seconds of speech content, while in the three-conversation condition the system was given three 60-second segments. Lastly, the test involved segments of variable speech duration in the 10 to 60 second range.
The development and evaluation data for the CTS Challenge were extracted from the Call My Net 2 (CMN2) corpus, which was collected by the LDC. CMN2 contains PSTN and VoIP conversations collected outside of North America, spoken in Tunisian Arabic.
We extracted a labeled set from the caller call sides of the conversations, and an unlabeled set from the callee call sides.
The development data for the CTS Challenge simply combined the SRE18 development and test sets, while the evaluation data was derived from the unexposed portions of CMN2. We used a 30/70 split to create the progress and test subsets.
So how did we select the segments? For enrollment segment selection, we extracted segments with a fixed speech duration of approximately 60 seconds. For each speaker, we selected three conversations, and from each conversation we extracted two 60-second segments. We used a random offset for selecting the two segments from each conversation; therefore, the segments might potentially overlap to some extent.
As for test segment selection, we extracted segments with variable speech durations in the 10 to 60 second range. The nominal speech durations for the segments were sampled from a uniform distribution. We sequentially extracted as many segments as possible from each conversation, without any offset, until we exhausted the speech duration of that conversation.
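As a rough illustration of this cutting procedure, here is a minimal Python sketch; the bookkeeping details, in particular how the final partial segment is handled, are assumptions rather than a description of the actual extraction tooling:

```python
import random

def cut_test_segments(total_speech, min_dur=10.0, max_dur=60.0):
    """Sequentially cut test segments from one conversation side.

    total_speech: total speech duration (seconds) detected in the conversation.
    Each nominal segment duration is drawn uniformly from [min_dur, max_dur];
    segments are cut back to back (no offset) until the speech is exhausted.
    """
    durations = []
    remaining = total_speech
    while remaining >= min_dur:
        dur = random.uniform(min_dur, max_dur)
        dur = min(dur, remaining)  # assumption: the last cut is truncated to what is left
        durations.append(dur)
        remaining -= dur
    return durations

# Example: a conversation side with roughly 200 seconds of detected speech
print(cut_test_segments(200.0))
```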
On this slide you see the statistics for the 2019 CTS Challenge development and evaluation sets. The first three rows of the table summarize the statistics for the development set, which, as I mentioned, simply combines the SRE18 CTS development and test sets into one package. The last two rows of the table, on the other hand, show the statistics for the 2019 CTS Challenge progress and test subsets. Note that the combined size of the progress and test subsets roughly equals the size of the SRE18 test set.
The table on this slide summarizes the various partitions in the 2019 CTS Challenge progress subset, which include gender, number of enrollment cuts, enrollment-test phone number match, as well as the CTS type. Some notable data characteristics of the progress subset include many more female trials than male trials, as well as many more PSTN trials than VoIP trials. We observe similar data characteristics for the various partitions in the test subset.
As for performance measurement, we used a metric known as the detection cost, or C_det, which is basically a weighted average of the false rejection and false alarm probabilities, with the weights defined in the table on this slide. To improve the interpretability of C_det, it is commonly normalized by a default cost, also defined on this slide. This results in a simplified notation for C_det, which is parameterized by the detection threshold, which for this evaluation is set to the log of beta, also defined on this slide. Finally, a primary cost, or C_primary, was calculated for each partition presented on the previous slide, and the final result was the average of the per-partition C_primary values.
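For reference, this cost has the standard form used in recent NIST SREs; the sketch below is generic, with the specific cost weights and target prior being those defined in the table on this slide:

\[
C_{\mathrm{det}}(\theta) = C_{\mathrm{miss}}\, P_{\mathrm{target}}\, P_{\mathrm{miss}}(\theta) + C_{\mathrm{FA}}\,(1 - P_{\mathrm{target}})\, P_{\mathrm{FA}}(\theta)
\]
\[
C_{\mathrm{norm}}(\theta) = \frac{C_{\mathrm{det}}(\theta)}{C_{\mathrm{default}}} = P_{\mathrm{miss}}(\theta) + \beta\, P_{\mathrm{FA}}(\theta), \qquad
\beta = \frac{C_{\mathrm{FA}}\,(1 - P_{\mathrm{target}})}{C_{\mathrm{miss}}\, P_{\mathrm{target}}}, \qquad
\theta = \log \beta ,
\]

and the primary cost is then the average of the normalized cost over the partitions listed on the previous slide.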
On this slide we see the participation statistics for the CTS Challenge. We received submissions from 51 teams, representing 67 sites, 43 of which were from academia, 23 from industry, and 1 from government. Also shown on this slide is a heat map of the world map, which shows where the participants were coming from.
On this slide we see the submission statistics for the 2019 CTS Challenge. The blue bars show the number of submissions per team. In total, we received 1,347 submissions from the 51 teams, with a minimum and a maximum of 1 and 78 submissions per team, respectively.
This slide shows a block diagram of the baseline speaker recognition system developed for the CTS Challenge using the NIST speaker and language recognition evaluation toolkit, as well as Kaldi. Two different training configurations, without and with prior SRE data, were used to accommodate both first-time and returning participants. No hyperparameter tuning or score calibration was used in the development of the baseline system.
Here on this slide we see the overall speaker recognition results on the progress and test subsets, shown on the top and bottom, respectively. The blue and red bars respectively represent the minimum and actual C_primary values. The yellow and orange lines represent the performance of the baseline system trained without and with prior SRE data.
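To make the distinction between the two bars concrete: the actual cost applies the fixed threshold theta = log beta to the submitted LLR scores, while the minimum cost sweeps the threshold, and the gap between them reflects the calibration error. Here is a minimal sketch of how both can be computed from a set of trial scores; this is illustrative code assuming arrays of scores and binary target labels as inputs, not our actual scoring software:

```python
import numpy as np

def normalized_cost(scores, labels, beta, threshold):
    """Normalized detection cost P_miss + beta * P_fa at a given threshold.

    scores: LLR scores for all trials; labels: 1 for target, 0 for non-target trials.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    p_miss = np.mean(scores[labels == 1] < threshold)   # missed target trials
    p_fa = np.mean(scores[labels == 0] >= threshold)    # false-alarmed non-target trials
    return p_miss + beta * p_fa

def actual_and_min_cost(scores, labels, beta):
    """Actual cost at the fixed threshold log(beta), and minimum cost over all thresholds."""
    actual = normalized_cost(scores, labels, beta, np.log(beta))
    candidates = np.concatenate(([-np.inf], np.unique(scores), [np.inf]))
    minimum = min(normalized_cost(scores, labels, beta, t) for t in candidates)
    return actual, minimum
```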
Several observations can be made from these two plots. First, performance trends on the two subsets are generally similar, although slightly better results are observed on the progress subset compared to the test subset, which is primarily attributed to overfitting, or over-tuning, of the submission systems on the progress subset.
Second, nearly half of the submissions outperform the baseline system trained on VoxCeleb alone, while the number is much smaller when compared to the baseline that utilizes the prior SRE data. Further, the majority of the systems achieve relatively small calibration errors, in particular on the progress subset, and this is in line with their calibration performance on the SRE18 CTS data. Finally, we can see from these figures that, except for the top performing team, the performance gap among the next five top teams is not remarkable.
The plots on this slide therefore present a statistical analysis of performance, to gain further insight into the actual performance differences among the top performing systems. These plots show the confidence intervals around the actual detection cost for each system, for both the progress and the test subsets. In general, the progress subset exhibits wider confidence margins than the test subset, which is expected because it has a relatively smaller number of trials. Also, it is interesting to note that most of the top systems may perform comparably under different samplings of the trial space. Another interesting observation is that systems with large error bars may be less robust than systems with roughly comparable performance but smaller error bars. For instance, although T18 achieves the lowest detection cost, it exhibits a much wider confidence margin compared to the second top system. These observations further highlight the importance of statistical significance testing when reporting performance results, as well as in the model selection stage during system development, particularly when the number of trials is relatively small.
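The presentation does not spell out how the confidence intervals were obtained; a common approach, and the assumption behind this sketch, is bootstrap resampling over the trial set, reusing the illustrative normalized_cost helper from the earlier sketch:

```python
import numpy as np

def bootstrap_cost_ci(scores, labels, beta, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the actual detection cost.

    Resamples trials with replacement; a more careful analysis might resample
    at the speaker level instead, to respect dependencies among trials.
    """
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores), np.asarray(labels)
    n = len(scores)
    costs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample trial indices
        costs.append(normalized_cost(scores[idx], labels[idx], beta, np.log(beta)))
    lower, upper = np.quantile(costs, [alpha / 2, 1 - alpha / 2])
    return lower, upper
```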
On this slide we see a performance comparison of SRE18 versus SRE19 CTS submissions for several top performing systems, in terms of actual and minimum detection cost. We saw notable improvements in speaker recognition performance, as large as 70% relative, for some leading systems, while for others more moderate but consistent improvements were observed. These performance improvements are attributed to, one, the large amounts of in-domain development data available from a large number of labeled speakers, and, two, the use of extended and more complex end-to-end neural network frameworks for speaker embedding extraction that can effectively exploit the vast amounts of data available through data augmentation.
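As a toy illustration of what data augmentation means in this context, here is a minimal sketch; real systems typically mix in recorded noises, music, babble, and room reverberation rather than the white noise used here for brevity:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise signal into a speech waveform at a target signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)                 # tile or trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: augment a 1-second stand-in waveform with white noise at 10 dB SNR
speech = np.random.randn(16000)
augmented = add_noise(speech, np.random.randn(16000), snr_db=10)
```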
This slide shows speaker recognition performance for a top performing submission in terms of detection error tradeoff, or DET, curves, as a function of evaluation subset. The solid black curves represent iso-cost contours, meaning that all points on any given contour correspond to the same detection cost value. The circles and crosses denote the minimum and actual detection costs, respectively.
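For readers less familiar with DET curves: the axes are the miss and false-alarm probabilities plotted on a normal-deviate (probit) scale, and, using the normalized cost form given earlier, each iso-cost contour is simply the set of operating points satisfying

\[
P_{\mathrm{miss}} + \beta\, P_{\mathrm{FA}} = c
\]

for a fixed cost value c.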
First, consistent with our observations from the overall results, which I presented on the previous slides, the detection errors, that is, the false alarm and false reject errors, across the operating points of interest, which is the low false alarm region, are greater for the test subset than for the progress subset. In addition, the calibration error for the test subset is relatively larger, and, as noted previously, we think this is primarily a result of over-tuning, or overfitting, of the submission systems on the progress subset.
This slide shows speaker recognition performance for a top performing submission as a function of the CTS data type. Contrary to the results on the 2018 CTS domain data, where performance on the PSTN data was better than that on the VoIP data across all operating points, it seems from this figure that, for the operating points of interest, which is the low false alarm region, performance on the PSTN data is comparable to that on the VoIP data. We believe this is due to the large amounts of VoIP data available for system development in SRE19 compared to SRE18, where only a small amount of VoIP development data was supplied.
This slide shows DET curve performance of the top performing submission as a function of the enrollment-test phone number match. As one would expect, better performance is observed when speech segments from the same phone number are used in trials. Nevertheless, the errors still remain relatively high even for the same phone number condition. This indicates that there are factors other than the channel, which is the phone microphone in this case, that may adversely impact speaker recognition performance, and these include both intrinsic and extrinsic variabilities.
This figure shows DET curves for a top performing submission as a function of test segment duration. There is a limited performance difference for durations larger than 40 seconds; however, there is a rapid degradation in performance when the speech duration decreases from 30 seconds to 20 seconds, and similarly from 20 seconds to 10 seconds. This indicates that additional speech in the test recording helps improve performance when the test segment speech duration is relatively short, namely shorter than 30 seconds, but it does not make a noticeable difference when there is at least 30 seconds of speech in the test segment. Also, we can see that the calibration error increases as the test segment duration decreases, which is not unexpected.
To summarize: for the 2019 CTS Challenge, a new and improved web platform for automated submission validation and scoring was introduced, which supported the leaderboard-style evaluation. Through this platform, we released a software package for system output scoring and validation, as well as a baseline speaker recognition system description and results. In terms of data and metric, we used the unexposed portions of CMN2 to create the progress and test subsets using a 30/70 split, and we also released large amounts of labeled and unlabeled data for system development that roughly matched the evaluation data.
In terms of results, we saw remarkable improvements due to the availability of large amounts of in-domain data from a large number of speakers. We also saw improvements from the extensive use of data augmentation, as well as from the use of extended and more complex neural network architectures. Finally, effective use of the development set, as well as the choice of calibration set, were key to performing well in this evaluation.
With this, I would like to conclude at this time, and I appreciate your attention. Be well and stay safe. Thank you.