
Thanks for tuning in to this presentation.

My name is Omid Sadjadi, and together with my colleagues listed here on this slide, I'll be presenting an overview of the 2019 NIST CTS Speaker Recognition Challenge, which was organized in the summer to fall of 2019.

Here is the outline of my presentation.

I'll start by describing the highlights of the 2019 CTS Challenge. Then, I'll define the task, give a summary of the data sets and performance metric for this challenge, and share some participation statistics, followed by results and system performance analyses. I'll then conclude with a summary of the CTS Challenge and share the main observations.

This slide presents the main highlights of the CTS Challenge, which included a new leaderboard-style evaluation with a progress set, an open training condition, as well as a newly designed and more flexible evaluation web platform.

Some recently introduced highlights also included the use of CTS data collected outside of North America, variable test segment durations in the 10 to 60 second range, as well as the use of factors, or partitions, for computing the metric, and finally, labeled and unlabeled development sets, which were provided by NIST.

The basic task in the CTS Challenge was speaker detection, that is, given enrollment data from a target speaker and test data from an unknown speaker, determine whether the two speakers are the same.

This speaker detection task can be posed as a two-class hypothesis testing problem, where under the null hypothesis the test segment s is spoken by the target speaker, and under the alternative hypothesis the test segment is not spoken by the target speaker. The system output for this task is a statistic computed on the test segment, known as the log-likelihood ratio, defined on this slide.
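The statistic on this slide has the standard log-likelihood ratio form; reconstructing it here from the two hypotheses just described:

```latex
\mathrm{LLR}(s) \;=\; \log \frac{p(s \mid H_0)}{p(s \mid H_1)}
```

where H0 denotes the hypothesis that the test segment s is spoken by the target speaker, and H1 the hypothesis that it is not.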

In terms of evaluation conditions, the CTS Challenge offered an open training condition that allowed the use of unlimited data for system training, to demonstrate possible performance gains.

For speaker enrollment, two conditions were also offered, namely one-conversation and three-conversation. For the one-conversation condition, the system was only given one segment with approximately 60 seconds of speech content, while for the three-conversation condition, the system was given three 60-second segments.

Lastly, the test involved segments of variable speech duration in the 10 to 60 second range.

The development and evaluation data for the CTS Challenge were extracted from the Call My Net 2 (CMN2) corpus, which was collected by the LDC. CMN2 contains PSTN and VOIP conversations collected outside of North America, spoken in Tunisian Arabic.

We extracted a labeled set from the callee sides of the conversations, and an unlabeled set from the caller sides.

The development data for the CTS Challenge simply combined the SRE18 development and test sets, while the evaluation data was derived from the unexposed portions of the CMN2 corpus. We used a 30/70 split to create the progress and test subsets.

So, how did we select the segments?

For enrollment segment selection, we extracted segments with a fixed speech duration of approximately 60 seconds. For each speaker, we selected three conversations, and from each conversation we extracted two 60-second segments. We used a random offset for selecting the two segments from each conversation; therefore, the segments might potentially overlap to some extent.

As for test segment selection, we extracted segments with variable speech duration in the 10 to 60 second range. The nominal speech durations for the segments were drawn from a uniform distribution. We sequentially extracted as many segments as possible from each conversation, without any offset, until we exhausted the duration of that conversation.
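To make the procedure concrete, here is a minimal Python sketch of the test segment selection just described; the function name and the per-conversation speech budget are illustrative, not the actual NIST tooling:

```python
import random

def sketch_test_segment_durations(total_speech_sec, min_dur=10, max_dur=60, seed=0):
    """Sequentially carve out segments whose nominal durations are drawn
    uniformly from [min_dur, max_dur] seconds, without any offset, until
    the conversation's speech budget is exhausted."""
    rng = random.Random(seed)
    durations = []
    remaining = total_speech_sec
    while remaining >= min_dur:
        nominal = rng.uniform(min_dur, max_dur)
        d = min(nominal, remaining)  # the last segment may be shorter than nominal
        durations.append(d)
        remaining -= d
    return durations

# e.g., durations carved from a conversation with 5 minutes of speech
print(sketch_test_segment_durations(300.0))
```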

On this slide, we see the statistics for the 2019 CTS Challenge development and evaluation sets. The first three rows in the table summarize the statistics for the development set, which, as I mentioned, simply combined the SRE18 CTS development and test sets into one package. The last two rows of the table, on the other hand, show the statistics for the 2019 CTS Challenge progress and test subsets. Note that the combined size of the progress and test subsets roughly equals the size of the SRE18 test set.

The table on this slide summarizes the various partitions in the 2019 CTS Challenge progress subset, which include gender, number of enrollment cuts, enrollment-test phone number match, as well as CTS type. Some notable data characteristics of the progress subset include a larger number of female trials than male trials, as well as a much larger number of PSTN trials than VOIP trials. We observe similar data characteristics for the various partitions in the test subset.

For performance measurement, we used a metric known as the detection cost, or C_det, which is basically a weighted average of the false rejection and false alarm probabilities, with the weights defined in the table on this slide. To improve the interpretability of the C_det, it is commonly normalized by a default cost, also defined on this slide. This results in a simplified notation for the C_det, which is parameterized by the detection threshold, which for this evaluation was set to the log of beta, also defined on this slide.

Finally, a primary cost, or C_primary, was calculated for each partition presented on the previous slide, and the final result was the average of the C_primary values over all partitions.
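For reference, the quantities described on these slides follow the standard NIST SRE formulation; a reconstruction (the specific weight and beta values are those given in the slide's table, which is not reproduced here):

```latex
C_{\mathrm{Det}}(\theta)
  = C_{\mathrm{FR}}\, P_{\mathrm{Target}}\, P_{\mathrm{FR}}(\theta)
  + C_{\mathrm{FA}}\,\bigl(1 - P_{\mathrm{Target}}\bigr)\, P_{\mathrm{FA}}(\theta)

C_{\mathrm{Norm}}(\theta)
  = \frac{C_{\mathrm{Det}}(\theta)}{C_{\mathrm{Default}}}
  = P_{\mathrm{FR}}(\theta) + \beta\, P_{\mathrm{FA}}(\theta),
  \quad
  \beta = \frac{C_{\mathrm{FA}}\,(1 - P_{\mathrm{Target}})}{C_{\mathrm{FR}}\, P_{\mathrm{Target}}},
  \quad
  \theta = \log\beta

C_{\mathrm{Primary}} = \frac{1}{N} \sum_{i=1}^{N} C_{\mathrm{Norm}}^{(i)}(\log\beta)
```

where C_Norm^(i) is the normalized cost computed on the i-th of the N partitions.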

On this slide, we see the participation statistics for the CTS Challenge. We received submissions from 51 teams formed from 67 sites, 43 of which were from academia, 23 from industry, and 1 from government.

Also shown on this slide is a heat map of the world, which shows where the participants were coming from.

On this slide, we see the submission statistics for the 2019 CTS Challenge. The blue bars show the number of submissions per team. In total, we received 1347 submissions from the 51 teams, with a minimum and maximum of 1 and 78 submissions per team, respectively.

This slide shows a block diagram of the baseline speaker recognition system developed for the CTS Challenge using the NIST Speaker and Language Recognition Evaluation toolkit, as well as Kaldi. Two different training configurations, without and with prior SRE data, were used to accommodate both first-time and returning participants. No hyperparameter tuning or score calibration was used in the development of the baseline system.

Here on this slide, we see the overall speaker recognition results on the progress and test subsets, respectively on top and bottom. The blue and red bars respectively represent the minimum and actual C_primary values. The yellow and orange horizontal lines represent the performance of the baseline system trained without and with prior SRE data.

Several observations can be made from these two plots. First, performance trends on the two subsets are generally similar, although slightly better results are observed on the progress subset compared to the test subset, which is primarily attributed to the over-tuning of the submission systems on the progress subset.

Second, nearly half of the submissions outperform the baseline system trained on VoxCeleb alone, while the number is much smaller relative to the baseline that utilizes the prior SRE data.

Furthermore, the majority of the systems achieved relatively small calibration errors, in particular on the progress subset, and this is in line with the calibration performance observed on the SRE18 CTS data.

Finally, we can see from these figures that, except for the top performing team, the performance difference among the next top five teams is not remarkable.

Therefore, on this slide we present a statistical analysis of performance to gain further insight into actual performance differences among the top performing systems. These plots show the performance confidence intervals around the actual detection cost for each system, for both the progress and the test subsets. In general, the progress subset exhibits a wider confidence margin than the test subset, which is expected because it has a relatively smaller number of trials.
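The slide does not spell out how the confidence intervals were computed; below is a minimal bootstrap sketch under the assumption of trial-level resampling (names and the resampling unit are illustrative; the actual analysis may, for example, resample at the speaker level):

```python
import numpy as np

def bootstrap_cost_ci(is_target, decisions, beta=99.0, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) confidence interval for the normalized
    detection cost C = P_fr + beta * P_fa by resampling trials with
    replacement. beta=99 corresponds to an assumed P_target of 0.01."""
    rng = np.random.default_rng(seed)
    is_target = np.asarray(is_target, dtype=bool)
    decisions = np.asarray(decisions, dtype=bool)  # True = system accepts the trial
    n = len(is_target)
    costs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        t, d = is_target[idx], decisions[idx]
        p_fr = np.mean(~d[t]) if t.any() else 0.0      # false rejects among target trials
        p_fa = np.mean(d[~t]) if (~t).any() else 0.0   # false alarms among nontarget trials
        costs.append(p_fr + beta * p_fa)
    lo, hi = np.quantile(costs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```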

Also, it is interesting to note that most of the top systems may perform comparably under different samplings of the trial space. Another interesting observation is that systems with large error bars may be less robust than systems with roughly comparable performance but smaller error bars. For instance, although T18 achieves the lowest detection cost, it exhibits a much wider confidence margin compared to the second top system. These observations further highlight the importance of statistical significance tests when reporting performance results, as well as in the model selection stage during system development, particularly when the number of trials is relatively small.

On this slide, we see a performance comparison of SRE18 versus SRE19 CTS submissions for several top performing systems, in terms of actual and minimum detection cost. We saw notable improvements in speaker recognition performance, as large as 70% relative, by some leading systems, while for others, more moderate but consistent improvements were observed. These performance improvements are attributed to, one, large amounts of in-domain development data available from a large number of labeled speakers, and two, the use of extended and more complex end-to-end neural network frameworks for speaker embedding extraction that can effectively exploit the vast amounts of data available through data augmentation.

This slide shows speaker recognition performance for a top performing submission in terms of detection error tradeoff, or DET, curves as a function of evaluation subset. The solid black curves represent iso-cost contours, meaning that all points on any given contour correspond to the same detection cost value. The circles and crosses denote the minimum and actual detection costs, respectively.

Firstly, consistent with our observations from the overall results, which I presented on the previous slides, the detection errors, that is, the false alarm and false reject errors, at the operating points of interest, which is the low false alarm region, are greater for the test subset than for the progress subset. In addition, the calibration error for the test subset is relatively larger, and as noted previously, we think this is primarily a result of over-tuning, or overfitting, of the submission systems on the progress subset.

This slide shows speaker recognition performance for a top performing submission as a function of the CTS data type. Contrary to the results observed on the 2018 CTS domain data, where performance on the PSTN data was better than that on the VOIP data across all operating points, it seems from this figure that, for the operating points of interest, which is the low false alarm region, performance on the PSTN data is comparable to that on the VOIP data. We believe this is due to the large amounts of VOIP data available for system development in SRE19 compared to SRE18, where only a small amount of VOIP development data was supplied.

This slide shows DET curve performance of the top performing submission as a function of enrollment-test phone number match. As one would expect, better performance is observed when speech segments from the same phone number are used in trials. Nevertheless, the errors still remain relatively high even for the same phone number condition. This indicates that there are factors other than the channel, which is the phone microphone in this case, that may adversely impact speaker recognition performance, and these include both intrinsic and extrinsic variabilities.

This figure shows DET curves for the top performing submission as a function of test segment duration. There is a limited performance difference for durations larger than 40 seconds; however, there is a rapid degradation in performance when the speech duration decreases from 30 seconds to 20 seconds, and similarly from 20 seconds to 10 seconds. This indicates that additional speech in the test recording helps improve performance when the test segment speech duration is relatively short, meaning shorter than 30 seconds, but it doesn't make a noticeable difference when there is at least 30 seconds of speech in the test segment. Also, we can see that the calibration error increases as the test segment duration decreases, which is not unexpected.

To summarize, for the 2019 CTS Challenge, an improved web platform for automated submission validation and scoring was introduced, which supported a leaderboard-style evaluation. Through this platform, we released a software package for system output scoring and validation, as well as a baseline speaker recognition system description and results.

In terms of data and metric, we used the unexposed portions of CMN2 to create the progress and test subsets using a 30/70 split, and we also released large amounts of labeled and unlabeled data for system development that roughly matched the evaluation data.

In terms of results, we saw remarkable improvements due to the availability of large amounts of in-domain data from a large number of speakers. We also saw improvements from the extensive use of data augmentation, and also from the use of extended and more complex neural network architectures.

Finally, effective use of the development data, as well as the choice of calibration set, were key to performing well in this evaluation.

With this, I'd like to conclude at this time. I appreciate your attention. Be well and stay safe.