0:00:12 Thanks for tuning in to this presentation. My name is Omid Sadjadi, and together with my colleagues listed here, I will be presenting an overview of the 2019 NIST CTS Speaker Recognition Challenge, which was organized in the summer to fall of 2019.
0:00:32 Here is the outline of my presentation. I will start by describing the highlights of the 2019 CTS Challenge. Then I will define the task, give a summary of the datasets and the performance metric for this challenge, and share some participation statistics, followed by results and system performance analyses. I will then conclude with a summary of the CTS Challenge, sharing the main observations.
0:01:00 This slide presents the main highlights of the CTS Challenge, which included a new leaderboard-style evaluation with a progress set, an open training condition, as well as a newly designed and more flexible evaluation web platform. Some recently introduced highlights also included the use of CTS data collected outside of North America, variable test segment durations in the 10 to 60 second range, as well as multiple factors for computing the performance metric, and finally, labeled and unlabeled development sets, which were provided by NIST.
0:01:48 The basic task in the CTS Challenge was speaker detection; that is, given enrollment data from a target speaker and test data from an unknown speaker, determine whether the two speakers are the same. This speaker detection task can be posed as a two-class hypothesis testing problem, where under the null hypothesis the test segment s is spoken by the target speaker, and under the alternative hypothesis the test segment is not spoken by the target speaker. The system output for this task is a statistic computed on the test segment, known as the log-likelihood ratio, defined on this slide.
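The formula itself is not captured in the transcript; the standard definition, consistent with the two hypotheses above, is

$$\mathrm{LLR}(s) = \log \frac{p(s \mid H_0)}{p(s \mid H_1)},$$

where H_0 is the target (same-speaker) hypothesis and H_1 is the non-target hypothesis.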
0:02:34 In terms of evaluation conditions, the CTS Challenge offered an open training condition that allowed the use of unlimited data for system training, to demonstrate possible performance gains. For speaker enrollment, two conditions were also offered, namely one-conversation and three-conversation. In the one-conversation condition, the system was given only one segment with approximately 60 seconds of speech content, while in the three-conversation condition the system was given three 60-second segments. Lastly, the test involved segments of variable speech duration in the 10 to 60 second range.
0:03:22 The development and evaluation data for the CTS Challenge were extracted from the Call My Net 2 corpus, which was collected by the LDC. CMN2 contains PSTN and VOIP conversations collected outside of North America, spoken in Tunisian Arabic. We extracted a labeled set from the callee sides of the conversations, and an unlabeled set from the caller sides. The development data for the CTS Challenge simply combined the SRE18 development and test sets, while the evaluation data was derived from the unexposed portions of the CMN2 corpus. We used a 30/70 split to create the progress and test subsets.
0:04:13 So how did we select the segments? For enrollment segment selection, we extracted segments with a fixed speech duration of approximately 60 seconds. For each speaker, we selected three conversations, and from each conversation we extracted two 60-second segments. We used a random offset for selecting the two segments from each conversation; therefore, the segments might potentially overlap to some extent. As for test segment selection, we extracted segments with variable speech durations in the 10 to 60 second range. The nominal speech durations for the segments were sampled from a uniform distribution. We sequentially extracted as many segments as possible from each conversation, without any offset, until we exhausted the duration of that conversation.
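As an illustration, here is a minimal Python sketch of the segment selection procedures just described, assuming a conversation is represented simply by its total speech duration in seconds; all names are illustrative and not from the actual NIST tooling.

```python
import random

def extract_enroll_segments(total_speech_sec, seg_dur=60.0, n_seg=2, seed=None):
    """Pick n_seg fixed-duration (~60 s) enrollment segments at random
    offsets; as noted in the talk, the segments may overlap."""
    rng = random.Random(seed)
    max_off = max(0.0, total_speech_sec - seg_dur)
    return [(off, off + seg_dur)
            for off in (rng.uniform(0.0, max_off) for _ in range(n_seg))]

def extract_test_segments(total_speech_sec, min_dur=10.0, max_dur=60.0, seed=None):
    """Sequentially carve out test segments with nominal durations drawn
    uniformly from [min_dur, max_dur], with no offset, until the
    conversation's speech is exhausted. Clipping the final segment to the
    remaining duration is an assumption, not stated in the talk."""
    rng = random.Random(seed)
    segments, start = [], 0.0
    while total_speech_sec - start >= min_dur:
        dur = min(rng.uniform(min_dur, max_dur), total_speech_sec - start)
        segments.append((start, start + dur))
        start += dur
    return segments
```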
0:05:13 On this slide we see the statistics for the 2019 CTS Challenge development and evaluation sets. The first three rows of the table summarize the statistics for the development set, which, as I mentioned, simply combines the SRE18 CTS development and test sets into one package. The last two rows of the table, on the other hand, show the statistics for the 2019 CTS Challenge progress and test subsets. Note that the combined size of the progress and test subsets roughly equals the size of the SRE18 test set.
0:05:57 The table on this slide summarizes the various partitions in the 2019 CTS Challenge progress subset, which include gender, number of enrollment cuts, and enrollment-test phone number match, as well as the CTS type. Some notable data characteristics of the progress subset include a larger number of female trials than male trials, as well as a much larger number of PSTN trials than VOIP trials. We observe similar data characteristics for the various partitions in the test subset.
0:06:41 For performance measurement, we used a metric known as the detection cost, or C_det, which is basically a weighted average of the false rejection and false alarm probabilities, with the weights defined in the table on this slide. To improve the interpretability of C_det, it is commonly normalized by a default cost, also defined on this slide. This results in a simplified notation for C_det, which is parameterized by the detection threshold, which for this evaluation was set to the log of beta, also defined on this slide.
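The slide's definitions are not reproduced in the transcript; written out in the standard NIST form (the specific cost and prior values from the slide's table are not shown here), they are

$$C_{det}(\theta) = C_{Miss}\, P_{Target}\, P_{Miss}(\theta) + C_{FA}\,(1 - P_{Target})\, P_{FA}(\theta),$$

$$C_{norm}(\theta) = \frac{C_{det}(\theta)}{C_{Miss}\, P_{Target}} = P_{Miss}(\theta) + \beta\, P_{FA}(\theta), \qquad \beta = \frac{C_{FA}\,(1 - P_{Target})}{C_{Miss}\, P_{Target}},$$

with the detection threshold fixed at θ = log β for the actual cost.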
0:07:26 Finally, a primary cost, or C_primary, was calculated for each partition presented on the previous slide, and the final result was the average of the C_primary values over all partitions.
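A minimal sketch of computing the actual normalized cost from LLR scores at the fixed threshold log β, then averaging it over partitions as described; a single β value is assumed for simplicity, and all helper names are illustrative.

```python
import math

def actual_c_norm(scores, labels, beta):
    """Actual normalized detection cost: threshold LLR scores at log(beta),
    then return P_miss + beta * P_fa. labels: True for target trials."""
    thr = math.log(beta)
    tgt = [s for s, is_tgt in zip(scores, labels) if is_tgt]
    non = [s for s, is_tgt in zip(scores, labels) if not is_tgt]
    p_miss = sum(s < thr for s in tgt) / len(tgt)
    p_fa = sum(s >= thr for s in non) / len(non)
    return p_miss + beta * p_fa

def c_primary(partitions, beta):
    """Average the per-partition costs, as described in the talk.
    partitions: iterable of (scores, labels) pairs, one per partition."""
    costs = [actual_c_norm(s, y, beta) for s, y in partitions]
    return sum(costs) / len(costs)
```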
0:07:43 On this slide we see the participation statistics for the CTS Challenge. We received submissions from 51 teams, representing 67 sites, 43 of which were from academia, 23 from industry, and 1 from government. Also shown on this slide is a heat map over the world map, which shows where the participants were coming from.
0:08:18 On this slide we see the submission statistics for the 2019 CTS Challenge. The blue bars show the number of submissions per team. In total, we received 1,347 submissions from the 51 teams, with a minimum and maximum of 1 and 78 submissions per team, respectively.
0:08:42 This slide shows a block diagram of the baseline speaker recognition system developed for the CTS Challenge using the NIST speaker and language recognition evaluation toolkit as well as Kaldi. Two different training configurations, without and with prior SRE data, were used to accommodate both first-time and returning participants. No hyperparameter tuning or score calibration was used in the development of the baseline system.
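The transcript does not detail the baseline's internals, so as a generic illustration of the trial-scoring step that such systems share, here is a cosine-scoring sketch over precomputed speaker embeddings; this is a stand-in under stated assumptions, not the actual baseline recipe.

```python
import numpy as np

def score_trial(enroll_embs, test_emb):
    """Cosine score between an averaged enrollment model (built from one
    or three 60-second cuts) and a test-segment embedding."""
    model = np.mean(enroll_embs, axis=0)
    model = model / np.linalg.norm(model)
    test = test_emb / np.linalg.norm(test_emb)
    return float(model @ test)
```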
0:09:16 Here on this slide we see the overall speaker recognition results on the progress and test subsets, on the top and bottom respectively. The blue and red bars respectively represent the minimum and actual C_primary values. The yellow and orange horizontal lines represent the performance of the baseline system trained without and with prior SRE data. Several observations can be made from these two plots. First, performance trends on the two subsets are generally similar, although slightly better results are observed on the progress subset compared to the test subset, which is primarily attributed to over-tuning of the submission systems on the progress subset. Second, nearly half of the submissions outperformed the baseline system trained on VoxCeleb alone, while the number was much smaller relative to the baseline that utilized the prior SRE data. Third, the majority of the systems achieved relatively small calibration errors, in particular on the progress subset, and this is in line with the calibration performance observed on the SRE18 CTS data. Finally, we can see from these figures that, except for the top-performing team, the performance differences among the next top five teams are not remarkable.
0:10:56 Therefore, on this slide we present a statistical analysis of performance to gain further insight into the actual performance differences among the top-performing systems. These plots show the confidence intervals around the actual detection cost for each system, for both the progress and the test subsets. In general, the progress subset exhibits a wider confidence margin than the test subset, which is expected because it has a relatively smaller number of trials. Also, it is interesting to note that most of the top systems may perform comparably under different samplings of the trial space. Another interesting observation is that systems with large error bars may be less robust than systems with roughly comparable performance but smaller error bars. For instance, although T18 achieves the lowest detection cost, it exhibits a much wider confidence margin compared to the second top system. These observations further highlight the importance of statistical significance testing when reporting performance results, as well as in the model selection stage during system development, particularly when the number of trials is relatively small.
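The transcript does not specify how the confidence intervals were computed; one common approach consistent with "different samplings of the trial space" is a percentile bootstrap over trials, sketched below under that assumption, reusing the hypothetical actual_c_norm helper from earlier.

```python
import random

def bootstrap_cost_ci(scores, labels, beta, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the actual detection
    cost, resampling whole trials with replacement. Assumes enough trials
    that every resample contains both target and non-target trials."""
    rng = random.Random(seed)
    n = len(scores)
    costs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        costs.append(actual_c_norm([scores[i] for i in idx],
                                   [labels[i] for i in idx], beta))
    costs.sort()
    return (costs[int(alpha / 2 * n_boot)],
            costs[int((1 - alpha / 2) * n_boot) - 1])
```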
0:12:21 On this slide we see a performance comparison of SRE18 versus SRE19 CTS submissions for several top-performing systems, in terms of actual and minimum detection cost. We saw notable improvements in speaker recognition performance, as large as 70 percent relative, by some leading systems, while for others more moderate but consistent improvements were observed. These performance improvements are attributed to, one, large amounts of in-domain development data available from a large number of labeled speakers, and, two, the use of extended and more complex end-to-end neural network frameworks for speaker embedding extraction that can effectively exploit the vast amounts of data made available through data augmentation.
0:13:19 This slide shows speaker recognition performance for a top-performing submission in terms of detection error tradeoff, or DET, curves, as a function of evaluation subset. The solid black curves represent iso-cost contours, meaning that all points on any given contour correspond to the same detection cost value. The circles and crosses denote the minimum and actual detection costs, respectively.
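For reference, a DET curve plots the miss rate against the false alarm rate on normal-deviate (probit) axes; below is a minimal sketch of computing the curve from trial scores, assuming numpy and scipy are available, with illustrative names throughout.

```python
import numpy as np
from scipy.stats import norm

def det_curve(scores, labels):
    """Sweep the threshold upward through the sorted scores and return the
    (P_miss, P_fa) pairs plus their probit transforms for DET plotting.
    labels: True for target trials. The curve endpoints map to +/- inf."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    y = labels[np.argsort(scores)]
    n_tgt, n_non = y.sum(), (~y).sum()
    # after the threshold passes trial i, a target becomes a miss and a
    # non-target stops being a false alarm
    p_miss = np.concatenate(([0], np.cumsum(y))) / n_tgt
    p_fa = np.concatenate(([n_non], n_non - np.cumsum(~y))) / n_non
    return p_miss, p_fa, norm.ppf(p_miss), norm.ppf(p_fa)
```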
0:13:50 First, consistent with our observations from the overall results, which I presented in the previous slides, the detection errors, that is, the false alarm and false reject errors, across the operating points of interest, which is the low false alarm region, are greater for the test subset than for the progress subset. In addition, the calibration error for the test subset is relatively larger, and, as noted previously, we think this is primarily a result of over-tuning, or overfitting, of the submission systems on the progress subset.
0:14:39 This slide shows speaker recognition performance for a top-performing submission as a function of the CTS data source. Contrary to the results on the 2018 CTS domain data, where performance on the PSTN data was better than that on the VOIP data across all operating points, it seems from this figure that, for the operating points of interest, which is the low false alarm region, performance on the PSTN data is comparable to that on the VOIP data. We believe this is due to the large amounts of VOIP data available for system development in SRE19, compared to SRE18, where only a small amount of VOIP development data was supplied.
0:15:30 This slide shows the DET curve performance of the top-performing submission as a function of enrollment-test phone number match. As one would expect, better performance is observed when speech segments from the same phone number are used in trials. Nevertheless, the errors still remain relatively high even for the same phone number condition. This indicates that there are factors other than the channel, which is the phone microphone in this case, that may adversely impact speaker recognition performance, and these include both intrinsic and extrinsic variabilities.
0:16:14 This figure shows the DET curves for a top-performing submission as a function of test segment duration. We see a limited performance difference for durations larger than 40 seconds; however, there is a rapid degradation in performance when the speech duration decreases from 30 seconds to 20 seconds, and similarly from 20 seconds to 10 seconds. This indicates that additional speech in the test recording helps improve performance when the test segment speech duration is relatively short, meaning shorter than 30 seconds, but it does not make a noticeable difference when there is at least 30 seconds of speech in the test segment. Also, we can see that the calibration error increases as the test segment duration decreases, which is not unexpected.
0:17:13 To summarize, for the 2019 CTS Challenge, a new and improved web platform for automated submission validation and scoring was introduced, which supported a leaderboard-style evaluation. Through this platform, we released a software package for system output scoring and validation, as well as a baseline speaker recognition system description and results. In terms of data and metric, we used the unexposed portions of the CMN2 corpus to create the progress and test subsets using a 30/70 split, and we also released large amounts of labeled and unlabeled data for system development that roughly matched the evaluation data. In terms of results, we saw remarkable improvements due to the availability of large amounts of in-domain data from a large number of speakers. We also saw improvements from the extensive use of data augmentation, and also from the use of extended and more complex neural network architectures. Finally, effective use of the development data, as well as the choice of calibration set, were key to performing well in this evaluation.
0:18:44 With this, I would like to conclude at this time, and I appreciate your attention. Be well and stay safe. Thanks.