0:00:12 | hello |
---|
0:00:13 | thanks for tuning in to this presentation |
---|
0:00:16 | my name is omid sadjadi and together with my colleagues listed here on this slide |
---|
0:00:21 | i'll be presenting an overview of the twenty nineteen nist cts speaker recognition challenge |
---|
0:00:27 | which was organized in the summer to fall of twenty nineteen |
---|
0:00:32 | here is the outline of my presentation |
---|
0:00:34 | i'll start by describing the highlights of the twenty nineteen cts challenge |
---|
0:00:39 | then |
---|
0:00:40 | define the task |
---|
0:00:41 | give a summary of the data sets and performance metric for this challenge share some |
---|
0:00:47 | participation statistics followed by results |
---|
0:00:50 | and system performance analyses |
---|
0:00:52 | i'll then conclude with a summary of the cts challenge |
---|
0:00:56 | and sharing the main observations |
---|
0:01:00 | this slide presents the main highlights |
---|
0:01:02 | for the cts challenge |
---|
0:01:04 | which included |
---|
0:01:06 | a new leaderboard style |
---|
0:01:09 | evaluation with a progress set |
---|
0:01:11 | an open training condition |
---|
0:01:14 | as well as a redesigned and more flexible evaluation web platform |
---|
0:01:20 | some recently introduced |
---|
0:01:22 | highlights also included |
---|
0:01:25 | the use of cts data collected outside of north america |
---|
0:01:30 | variable test segment durations in the ten to sixty second range |
---|
0:01:35 | well as factors |
---|
0:01:37 | for computing the metric |
---|
0:01:39 | and finally |
---|
0:01:41 | labeled and unlabeled development sets |
---|
0:01:44 | which were provided by nist |
---|
0:01:48 | the basic task in the cts challenge |
---|
0:01:51 | was speaker detection that is given enrollment data from a target speaker and test data |
---|
0:01:57 | from an unknown speaker determine whether the two speakers are the same |
---|
0:02:03 | this speaker |
---|
0:02:04 | detection task can be posed as a two class hypothesis testing problem where the null |
---|
0:02:10 | hypothesis is that |
---|
0:02:12 | the test segment is spoken by the target speaker |
---|
0:02:16 | and the alternative hypothesis is that |
---|
0:02:19 | the test segment is not spoken by the target speaker |
---|
0:02:23 | the system output for this task is a statistic computed on the test segment |
---|
0:02:28 | known as the log-likelihood ratio defined in this slide |
---|
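A minimal sketch of the log-likelihood ratio described above, assuming purely for illustration that each hypothesis is modeled by a one-dimensional Gaussian over a single feature value. The two models, their parameters, and the function name are hypothetical and do not come from the talk; real systems compute this statistic from much richer speaker representations.

```python
import math
from statistics import NormalDist


def log_likelihood_ratio(x, target_model, nontarget_model):
    """Return log p(x | same speaker) - log p(x | different speaker).

    Positive values favor the null hypothesis (target speaker),
    negative values favor the alternative hypothesis.
    """
    return math.log(target_model.pdf(x)) - math.log(nontarget_model.pdf(x))


# Hypothetical models: unit-variance Gaussians centered at +1 and -1.
same_speaker = NormalDist(mu=1.0, sigma=1.0)
different_speaker = NormalDist(mu=-1.0, sigma=1.0)

llr = log_likelihood_ratio(0.5, same_speaker, different_speaker)
```

With these two toy models the ratio reduces analytically to 2x, so an observation of 0.5 gives a value close to 1.0, which mildly favors the target speaker hypothesis.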
0:02:34 | in terms of evaluation conditions |
---|
0:02:37 | the cts challenge offered an open training condition that allowed the use of |
---|
0:02:42 | unlimited data for system training |
---|
0:02:44 | to demonstrate possible performance gains |
---|
0:02:47 | for speaker enrollment |
---|
0:02:49 | two conditions were also offered |
---|
0:02:51 | namely one conversation |
---|
0:02:54 | and three conversations |
---|
0:02:55 | where in the one conversation condition |
---|
0:02:58 | the system was only given one segment |
---|
0:03:01 | with approximately sixty seconds of speech content |
---|
0:03:05 | while for the three conversation condition the system was given |
---|
0:03:09 | three sixty second segments |
---|
0:03:11 | lastly |
---|
0:03:13 | the test involves segments of variable speech durations |
---|
0:03:17 | in the ten to sixty second range |
---|
0:03:22 | the development and evaluation data for the cts challenge were extracted from the call my net |
---|
0:03:27 | two corpus |
---|
0:03:29 | which was collected by the ldc |
---|
0:03:31 | cmn two contains |
---|
0:03:33 | pstn and voip conversations collected outside of north america |
---|
0:03:39 | spoken in tunisian arabic |
---|
0:03:41 | we extracted |
---|
0:03:43 | a labeled set |
---|
0:03:44 | from the caller or claque sides of the conversations |
---|
0:03:48 | and an unlabeled set |
---|
0:03:50 | from the callee sides |
---|
0:03:52 | the dev data for the cts challenge simply combined the sre eighteen development and |
---|
0:03:57 | test sets |
---|
0:03:58 | while the evaluation data was derived from the unexposed portions of the |
---|
0:04:04 | cmn two |
---|
0:04:05 | we used a thirty seventy split |
---|
0:04:08 | to create the progress and test subsets |
---|
0:04:13 | so how did we select the segments |
---|
0:04:17 | for enrollment segment selection |
---|
0:04:19 | we extracted segments with a fixed speech duration of approximately sixty seconds |
---|
0:04:26 | for each claque |
---|
0:04:28 | we selected three conversations and from each conversation we extracted two sixty second segments |
---|
0:04:36 | we used a random offset for selecting the two segments from each conversation |
---|
0:04:40 | therefore |
---|
0:04:41 | potentially segments might be overlapping to some extent |
---|
0:04:45 | as for test segment selection |
---|
0:04:48 | we extracted segments with variable speech duration in the ten to sixty second range |
---|
0:04:54 | the nominal speech durations for the segments were sampled from a uniform distribution |
---|
0:05:00 | we sequentially extracted as many segments as possible from each conversation |
---|
0:05:06 | without any offset until we exhausted the duration of |
---|
0:05:10 | that conversation |
---|
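The test segment selection just described can be sketched as follows. This is an illustration under stated assumptions, not the organizers' actual tooling: the function name is hypothetical, durations are treated as seconds of speech on a single continuous timeline, and any leftover speech shorter than the last sampled duration is simply dropped, which the talk does not specify.

```python
import random


def carve_test_segments(total_speech_sec, rng, min_dur=10.0, max_dur=60.0):
    """Cut back-to-back segments until the conversation is exhausted.

    Each nominal duration is drawn from a uniform distribution over
    [min_dur, max_dur]; segments follow one another without any offset.
    """
    segments = []
    start = 0.0
    while True:
        duration = rng.uniform(min_dur, max_dur)
        if start + duration > total_speech_sec:
            break  # assumption: the remainder is discarded
        segments.append((start, start + duration))
        start += duration
    return segments


# Hypothetical conversation side with 300 seconds of speech.
segments = carve_test_segments(300.0, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the selection reproducible, which matters when the same trial list has to be regenerated later.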
0:05:13 | in this slide we see the statistics for the twenty nineteen cts challenge development and |
---|
0:05:19 | evaluation |
---|
0:05:22 | sets |
---|
0:05:24 | the first three rows |
---|
0:05:26 | in the table summarize the statistics for the development set which as i mentioned simply |
---|
0:05:32 | combines the sre eighteen |
---|
0:05:35 | cts development and test sets into one package |
---|
0:05:38 | the last two rows of the table on the other hand |
---|
0:05:41 | show the statistics for the twenty nineteen cts challenge progress and test subsets |
---|
0:05:46 | note that the combined size of the progress and test subsets |
---|
0:05:51 | roughly equals the size of the sre eighteen test set |
---|
0:05:57 | this table on the slide summarizes the various partitions in the twenty nineteen cts challenge |
---|
0:06:04 | progress subset |
---|
0:06:05 | which include |
---|
0:06:07 | gender number of enrollment cuts |
---|
0:06:10 | and enrollment test phone number match as well as the cts type |
---|
0:06:15 | some notable characteristics of the progress subset include |
---|
0:06:19 | a larger number of |
---|
0:06:20 | female trials |
---|
0:06:22 | than male trials as well as a much larger number of pstn |
---|
0:06:26 | trials than voip trials |
---|
0:06:30 | and similarly |
---|
0:06:32 | we observe similar data characteristics for the various partitions |
---|
0:06:36 | in the test subset |
---|
0:06:41 | and for performance measurement |
---|
0:06:45 | we use a |
---|
0:06:47 | metric which is known as the detection cost function or dcf |
---|
0:06:51 | which is basically a weighted average of false rejection and false alarm probabilities |
---|
0:06:57 | with the weights defined in the table in this slide |
---|
0:07:01 | to improve the interpretability of the dcf |
---|
0:07:04 | it is commonly normalized |
---|
0:07:06 | by a default cost |
---|
0:07:08 | also defined in this slide |
---|
0:07:10 | this results in a simplified notation for the dcf |
---|
0:07:15 | which is parameterized by the detection threshold |
---|
0:07:19 | which for this evaluation was set |
---|
0:07:22 | to the log of beta also defined in this slide |
---|
0:07:26 | finally a primary cost or c primary was calculated for |
---|
0:07:31 | each |
---|
0:07:32 | partition presented in the previous slide |
---|
0:07:35 | and the final result |
---|
0:07:37 | was the average of all partitions |
---|
0:07:41 | c primary |
---|
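The metric described above can be sketched in a few lines. The slide's table of weights is not part of this transcript, so the cost values and target prior below (`c_miss`, `c_fa`, `p_target`) are placeholder assumptions, and the function names are hypothetical. The sketch uses the common normalized form, miss rate plus beta times false alarm rate, with the decision threshold fixed at log(beta), and then averages the per-partition costs as stated in the talk.

```python
import math


def beta(c_miss, c_fa, p_target):
    """Weight on the false alarm rate after normalizing by the default cost."""
    return (c_fa / c_miss) * (1.0 - p_target) / p_target


def actual_norm_dcf(target_llrs, nontarget_llrs,
                    c_miss=1.0, c_fa=1.0, p_target=0.01):
    """Normalized detection cost at the fixed threshold log(beta)."""
    b = beta(c_miss, c_fa, p_target)
    threshold = math.log(b)
    # A miss is a target trial scored below the threshold.
    p_miss = sum(s < threshold for s in target_llrs) / len(target_llrs)
    # A false alarm is a non-target trial scored at or above the threshold.
    p_fa = sum(s >= threshold for s in nontarget_llrs) / len(nontarget_llrs)
    return p_miss + b * p_fa


def primary_cost(partitions, **cost_params):
    """Average the per-partition costs; each partition is (targets, nontargets)."""
    costs = [actual_norm_dcf(t, n, **cost_params) for t, n in partitions]
    return sum(costs) / len(costs)
```

Averaging over partitions rather than pooling all trials keeps a large partition, such as one with many more trials of one gender or channel type, from dominating the final number.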
0:07:43 | in this slide |
---|
0:07:44 | we see |
---|
0:07:46 | the participation statistics for the cts challenge |
---|
0:07:49 | we received submissions from fifty one teams which were formed by sixty seven sites |
---|
0:07:57 | forty three of which |
---|
0:07:58 | were from academia twenty three from industry |
---|
0:08:02 | and one from government |
---|
0:08:06 | also shown in this slide is a heat map |
---|
0:08:09 | of |
---|
0:08:10 | the world countries which shows |
---|
0:08:13 | where |
---|
0:08:14 | the participants were coming from |
---|
0:08:18 | in this slide we see the submission statistics for the twenty nineteen cts challenge |
---|
0:08:23 | the blue bars show the number of submissions per team in total we received |
---|
0:08:29 | thirteen forty seven submissions |
---|
0:08:31 | from the |
---|
0:08:33 | fifty one teams with the minimum and maximum of one |
---|
0:08:37 | and seventy eight submissions respectively |
---|
0:08:42 | this slide shows a block diagram of the baseline speaker recognition system developed for the |
---|
0:08:47 | cts challenge |
---|
0:08:49 | using the nist speaker |
---|
0:08:51 | and language recognition evaluation toolkit |
---|
0:08:54 | as well as kaldi two different |
---|
0:08:56 | training configurations |
---|
0:08:58 | without and with prior sre data |
---|
0:09:01 | were used |
---|
0:09:02 | to accommodate |
---|
0:09:03 | both the first time and returning participants and |
---|
0:09:07 | no hyper |
---|
0:09:08 | parameter tuning or score calibration was used in the development of the |
---|
0:09:13 | baseline system |
---|
0:09:16 | here on this slide we see the overall |
---|
0:09:19 | speaker recognition results on the progress |
---|
0:09:21 | and test subsets |
---|
0:09:24 | respectively on top and bottom |
---|
0:09:28 | the blue and red bars respectively represent |
---|
0:09:31 | the minimum and actual c primaries |
---|
0:09:34 | the yellow and orange horizontal lines represent the performance of the baseline system trained |
---|
0:09:39 | without and with |
---|
0:09:41 | prior sre data |
---|
0:09:42 | several observations can be made from these two plots first |
---|
0:09:47 | performance trends on the two subsets are generally similar |
---|
0:09:51 | although slightly better results are observed on the progress subset |
---|
0:09:56 | compared to the test subset |
---|
0:09:59 | which is |
---|
0:10:01 | primarily attributed |
---|
0:10:03 | to the overfitting or over tuning of the submission systems |
---|
0:10:07 | on the progress subset |
---|
0:10:09 | second |
---|
0:10:10 | nearly half of the submissions outperform the baseline system trained on voxceleb alone |
---|
0:10:16 | while |
---|
0:10:17 | the number is |
---|
0:10:18 | much smaller |
---|
0:10:19 | when compared to the baseline |
---|
0:10:23 | that utilizes the prior |
---|
0:10:25 | sre data |
---|
0:10:26 | third |
---|
0:10:27 | a majority of the systems |
---|
0:10:30 | achieve relatively small calibration errors |
---|
0:10:33 | in particular on the progress subset |
---|
0:10:36 | and this is in line |
---|
0:10:37 | with the calibration performance observed |
---|
0:10:40 | on the sre eighteen cts data |
---|
0:10:43 | finally |
---|
0:10:44 | we can see from this figure |
---|
0:10:46 | from these figures that |
---|
0:10:48 | except for the top performing team the performance gap among the next top five teams |
---|
0:10:55 | is not remarkable |
---|
0:10:56 | therefore in this slide we present a statistical analysis of performance to gain further insight on actual |
---|
0:11:03 | performance differences among the top performing systems |
---|
0:11:07 | these plots show the performance confidence intervals around the actual detection cost for each system |
---|
0:11:15 | for both the progress |
---|
0:11:17 | and the test subsets |
---|
0:11:18 | in general the progress subset exhibits a wider confidence margin than the test |
---|
0:11:24 | subset which is expected because |
---|
0:11:26 | it has a relatively smaller number of trials |
---|
0:11:29 | also |
---|
0:11:31 | it is interesting to note that most of the top systems may perform comparably under |
---|
0:11:38 | different sampling of the trials space |
---|
0:11:40 | another interesting observation is that the systems with large error bars |
---|
0:11:45 | may be less robust than systems with roughly comparable performance but |
---|
0:11:50 | smaller error bars |
---|
0:11:52 | for instance although t eighteen achieves the lowest detection cost it exhibits a much |
---|
0:11:58 | wider confidence margin |
---|
0:12:00 | compared to the second top system |
---|
0:12:03 | these observations further highlight the importance of statistical significance tests |
---|
0:12:09 | when reporting performance results |
---|
0:12:12 | or in the model selection stage |
---|
0:12:14 | during system development particularly when the number of trials |
---|
0:12:18 | is relatively small |
---|
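One common way to obtain confidence intervals like the ones discussed above is a percentile bootstrap over the trials. The talk does not say which procedure was used, so the sketch below is an assumption: it resamples target and non-target scores independently with replacement, and the function names, cost parameter, and resample count are all hypothetical.

```python
import math
import random


def norm_dcf(target_llrs, nontarget_llrs, p_target=0.01):
    """Normalized detection cost at threshold log(beta), with unit costs assumed."""
    b = (1.0 - p_target) / p_target
    threshold = math.log(b)
    p_miss = sum(s < threshold for s in target_llrs) / len(target_llrs)
    p_fa = sum(s >= threshold for s in nontarget_llrs) / len(nontarget_llrs)
    return p_miss + b * p_fa


def bootstrap_ci(target_llrs, nontarget_llrs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the detection cost."""
    rng = random.Random(seed)
    costs = []
    for _ in range(n_boot):
        # Resample each trial type with replacement, keeping its original size.
        tgt = rng.choices(target_llrs, k=len(target_llrs))
        non = rng.choices(nontarget_llrs, k=len(nontarget_llrs))
        costs.append(norm_dcf(tgt, non))
    costs.sort()
    lower = costs[int((alpha / 2) * n_boot)]
    upper = costs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Two systems whose intervals overlap heavily cannot be confidently ranked from the point estimates alone, which is the caution raised in the talk for small trial counts.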
0:12:21 | in this slide we see
---|
0:12:22 | a performance comparison of sre eighteen versus sre nineteen cts submissions for several top performing systems |
---|
0:12:30 | in terms of |
---|
0:12:31 | actual and minimum detection cost |
---|
0:12:34 | we saw notable improvement in speaker recognition performance |
---|
0:12:38 | as large as seventy percent |
---|
0:12:41 | relative |
---|
0:12:43 | by |
---|
0:12:44 | some leading systems while for others more moderate but consistent improvements were observed |
---|
0:12:50 | these performance improvements are attributed to |
---|
0:12:54 | one |
---|
0:12:56 | large amounts of in domain development data available from a large |
---|
0:13:00 | number of labeled speakers |
---|
0:13:02 | and two the use of extended and more complex end to end neural network |
---|
0:13:07 | frameworks |
---|
0:13:08 | for speaker embedding extraction |
---|
0:13:10 | that |
---|
0:13:11 | can effectively exploit the vast amounts of data that's available through data augmentation |
---|
0:13:19 | this slide shows speaker recognition performance for a top performing submission in terms of detection |
---|
0:13:25 | error tradeoff |
---|
0:13:27 | or in short det |
---|
0:13:29 | curves |
---|
0:13:30 | as a function of evaluation subset |
---|
0:13:34 | the solid |
---|
0:13:34 | black curves represent equi cost contours meaning that all points on any given contour correspond |
---|
0:13:41 | to the same detection cost value |
---|
0:13:44 | the circles and crosses denote the minimum and actual |
---|
0:13:48 | detection cost respectively |
---|
0:13:50 | firstly consistent with our observations from the overall |
---|
0:13:54 | results which i presented |
---|
0:13:57 | in the previous slide |
---|
0:13:59 | the detection errors |
---|
0:14:01 | that means the false alarm and false reject errors |
---|
0:14:05 | across |
---|
0:14:06 | the operating points of interest |
---|
0:14:08 | which is the low false alarm region |
---|
0:14:12 | for the test |
---|
0:14:14 | subset are greater than those |
---|
0:14:16 | for the progress subset in addition the calibration error for the test subset is relatively |
---|
0:14:22 | larger |
---|
0:14:23 | and as noted previously we think that |
---|
0:14:28 | this is primarily |
---|
0:14:32 | a result of over tuning or overfitting |
---|
0:14:35 | of the submission systems on the progress subset |
---|
0:14:39 | this slide |
---|
0:14:40 | shows speaker recognition performance for top performing submission |
---|
0:14:44 | as a function of the cts data type |
---|
0:14:47 | contrary to the results observed |
---|
0:14:50 | on the twenty eighteen cts domain data |
---|
0:14:53 | where performance on the |
---|
0:14:55 | pstn data was better than that |
---|
0:14:59 | on the voip data across all operating points it seems from this figure |
---|
0:15:04 | that |
---|
0:15:05 | for the operating points of interest |
---|
0:15:08 | which is the low false alarm region performance on the pstn |
---|
0:15:12 | data is comparable to that on the voip data |
---|
0:15:15 | we think |
---|
0:15:16 | this is due to the large amounts of voip data available for system development in |
---|
0:15:21 | sre nineteen compared to sre eighteen where only a small amount of |
---|
0:15:27 | voip development data was supplied |
---|
0:15:30 | this slide shows det curve |
---|
0:15:33 | performance of the top performing submission as a function of enrollment test phone number match |
---|
0:15:39 | as one would expect better performance is observed when speech segments |
---|
0:15:46 | from the same phone number are used in trials |
---|
0:15:48 | nevertheless |
---|
0:15:50 | the error rates still remain relatively high even for the same phone number condition |
---|
0:15:55 | this indicates that there are factors other than the channel which is the phone microphone in |
---|
0:16:01 | this case that may adversely impact speaker recognition performance and these include |
---|
0:16:07 | both intrinsic and extrinsic variabilities |
---|
0:16:14 | this figure shows det curves for a top performing submission as a function of |
---|
0:16:19 | test segment duration |
---|
0:16:22 | we observe limited performance difference |
---|
0:16:27 | for durations larger than forty seconds however there is a rapid drop in performance when |
---|
0:16:33 | the speech duration decreases from |
---|
0:16:35 | thirty seconds to twenty seconds and similarly from twenty seconds to ten seconds |
---|
0:16:40 | this indicates that additional speech in the test recording helps improve the performance |
---|
0:16:46 | when the test |
---|
0:16:48 | segment speech duration is relatively short |
---|
0:16:50 | that means shorter than thirty seconds |
---|
0:16:52 | but it doesn't make a noticeable difference when there is at least thirty seconds of |
---|
0:16:58 | speech in the test segment |
---|
0:17:00 | also |
---|
0:17:01 | we can see that the calibration error increases as the test |
---|
0:17:06 | segment |
---|
0:17:08 | duration |
---|
0:17:09 | decreases |
---|
0:17:10 | which is not unexpected |
---|
0:17:13 | to summarize |
---|
0:17:16 | for the twenty nineteen cts challenge a new and improved web |
---|
0:17:20 | platform |
---|
0:17:21 | for automated submission validation and scoring |
---|
0:17:25 | was introduced |
---|
0:17:26 | which supported a leaderboard style evaluation |
---|
0:17:30 | through this platform we released a software package for system output scoring and validation as well |
---|
0:17:36 | as a baseline speaker recognition system description |
---|
0:17:40 | and results |
---|
0:17:42 | in terms of data and metric |
---|
0:17:44 | we used the unexposed |
---|
0:17:46 | portions of the cmn two |
---|
0:17:49 | to create the |
---|
0:17:52 | progress and test subsets using a thirty seventy split |
---|
0:17:56 | and we also released |
---|
0:17:57 | large amounts of labeled and unlabeled |
---|
0:18:01 | data for system development that roughly matched |
---|
0:18:05 | the evaluation data |
---|
0:18:06 | in terms of results we saw remarkable improvements |
---|
0:18:11 | due to the availability of large amounts of in domain data from a large number |
---|
0:18:17 | of speakers |
---|
0:18:19 | we also saw improvements from the extensive use of data augmentation |
---|
0:18:27 | and also from the use of |
---|
0:18:29 | extended and more complex |
---|
0:18:31 | neural network architectures |
---|
0:18:33 | finally |
---|
0:18:35 | effective use of the dev set as well as the choice of |
---|
0:18:38 | calibration set |
---|
0:18:40 | were key |
---|
0:18:41 | to performing well in this evaluation |
---|
0:18:44 | with this |
---|
0:18:45 | i would like to thank you for your time |
---|
0:18:47 | and i appreciate |
---|
0:18:49 | your attention |
---|
0:18:51 | be well and stay safe |
---|
0:18:53 | then |
---|