0:00:13 | Greetings, and thanks for tuning in to my second presentation in this session. Together with my colleagues listed here on this slide, I will be presenting an overview of the 2019 NIST Audio-Visual Speaker Recognition Evaluation, which was organized in the fall of 2019.
0:00:35 | Before I start my presentation, if you have not already, I would like to invite you to see my first presentation in this session, which was an overview of the 2019 NIST SRE CTS challenge. In addition, I would like to invite you to sign up and participate in the 2020 NIST CTS challenge, which is currently ongoing.
0:01:02 | So, here is the outline of my presentation. I'll start by describing the highlights of the 2019 audio-visual SRE, then define the task and give a summary of the data sets and the performance metric for this evaluation. I'll then share some participation statistics, followed by results and system performance analyses. Finally, I'll provide a quick summary of the audio-visual SRE19 and share the main observations.
0:01:31 | This slide presents the main highlights of the 2019 audio-visual SRE, which included video data for audio-visual person recognition and an open training condition, as well as a redesigned and more flexible evaluation web platform.
0:01:49 | Recently introduced highlights also included audio from video, meaning audio recordings that were extracted from amateur online videos.
0:02:03 | So, the primary task for the 2019 audio-visual SRE was person detection, meaning that, given enrollment video data from the target person and test video data from an unknown person, the system must automatically determine whether the target person is present in the test video.
0:02:23 | This person detection problem can be posed as a two-class hypothesis testing problem, where the null hypothesis is that the test video belongs to the target person, and the alternative hypothesis is that the test video does not belong to the target person.
0:02:44 | The system output for this task is then a statistic computed on the test video, known as the log-likelihood ratio, defined in this slide.
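The formula on the slide is not reproduced in the transcript; for a test video $s$ and the two hypotheses above, the standard log-likelihood ratio statistic is

$$
\mathrm{LLR}(s) = \log \frac{p(s \mid H_0)}{p(s \mid H_1)},
$$

where $H_0$ is the hypothesis that $s$ belongs to the target person and $H_1$ is the alternative.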
0:02:54 | In terms of evaluation conditions, the audio-visual SRE19 offered an open training condition that allowed the use of unlimited data for system training, to demonstrate possible performance gains.
0:03:08 | For enrollment, the systems were given video segments with variable speech content, ranging from ten seconds to six hundred seconds. In addition, the systems were provided with diarization marks as well as face bounding boxes for the frames containing the target individual.
0:03:31 | Lastly, the test involved video segments of variable durations in the ten to six hundred seconds range.
0:03:42 | The development and evaluation data for the audio-visual SRE19 were extracted from the JANUS Multimedia and the VAST corpora. The JANUS Multimedia dataset was extracted from the IARPA JANUS Benchmark, and it consists of two subsets, namely Core and Full, each of which comes with its own dev and test splits. For this evaluation we only used the Core subset, because it better reflects the data conditions in SRE19.
0:04:16 | The VAST corpus, on the other hand, was collected by the LDC and contains amateur online videos, such as video blogs, or vlogs, spoken in English. The videos have extremely diverse audio and visual conditions: background environments, different codecs, different illuminations, and poses. In addition, there tend to be multiple individuals appearing in each video.
0:04:45 | This slide shows speech duration histograms for the enrollment and test segments in the audio-visual SRE19 dev and test sets, which are shown in the left and right plots, respectively. The enrollment segment speech durations were calculated after applying diarization, while no diarization was applied to the test segments. Nevertheless, the enrollment and test histograms closely follow log-normal distributions, and overall they are consistent across the dev and test sets.
0:05:24 | This table shows the data statistics for the Core subset of the JANUS Multimedia dataset, as well as for the audio-visual SRE19 dev and test sets, which were extracted from the VAST corpus. Notice that, overall, the size of the JANUS data is larger than the size of the SRE19 audio-visual dev and test sets, which makes it a good candidate for system training and development purposes.
0:05:58 | For performance measurement, we used the metric known as the detection cost, or C_det for short, which is a weighted average of the false-reject and false-alarm probabilities, with the weights defined in the table on this slide. To improve the interpretability of the C_det, it is commonly normalized by the default cost, defined in this slide. This results in a simplified notation for the C_det, which is parameterized by the detection threshold; for this evaluation, the detection threshold is log beta, and beta is also defined in this slide.
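The table and the definitions on the slide are not reproduced in the transcript; the standard NIST formulation is

$$
C_{\mathrm{det}} = C_{\mathrm{Miss}}\, P_{\mathrm{Miss \mid Target}}\, P_{\mathrm{Target}} + C_{\mathrm{FA}}\, P_{\mathrm{FA \mid NonTarget}}\, (1 - P_{\mathrm{Target}}),
$$

which, after normalization by the default cost, simplifies to

$$
C_{\mathrm{norm}} = P_{\mathrm{Miss \mid Target}} + \beta\, P_{\mathrm{FA \mid NonTarget}}, \qquad \beta = \frac{C_{\mathrm{FA}}\,(1 - P_{\mathrm{Target}})}{C_{\mathrm{Miss}}\, P_{\mathrm{Target}}},
$$

with the detection threshold set to $\log \beta$. Assuming the parameter values published in the SRE19 evaluation plan ($C_{\mathrm{Miss}} = C_{\mathrm{FA}} = 1$, $P_{\mathrm{Target}} = 0.05$), $\beta = 19$ and the threshold is $\log 19 \approx 2.94$.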
0:06:43 | This slide presents the participation statistics for the SRE19 audio-visual evaluation. Overall, we received submissions from fourteen teams, which were formed by twenty-six sites, eight of which were from industry while the remaining eighteen were from academia. Also shown on this slide is a map of the world, which shows us where the participating teams were coming from.
0:07:20 | This slide shows the number of submissions received per team and per track, for the audio-only, visual-only, and audio-visual tracks of the 2019 audio-visual speaker recognition evaluation. We can see that the majority of the teams participated in all three tracks, one or two teams only participated in the audio and audio-visual tracks, and one team participated in the audio-only track. In total, we received one hundred and two submissions, which were made by fourteen teams, as I mentioned.
0:08:04 | This slide shows the block diagram of the baseline speaker recognition system developed for the audio-visual SRE, using the NIST speaker and language recognition evaluation toolkit as well as Kaldi. The embedding extractor was trained using the Kaldi VoxCeleb version 2 recipe, and to develop this system we did not use any hyperparameter tuning or score calibration.
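The block diagram itself is not reproduced in the transcript. Below is a minimal sketch of the trial-scoring stage, assuming x-vector embeddings have already been extracted with the Kaldi VoxCeleb recipe; cosine scoring is used purely for illustration, since the actual baseline back end is not detailed in the talk.

```python
import numpy as np

def cosine_score(enroll_embs, test_emb):
    """Score one trial: average the enrollment embeddings into a model,
    then take the cosine similarity with the test embedding."""
    model = enroll_embs.mean(axis=0)
    model /= np.linalg.norm(model)
    test = test_emb / np.linalg.norm(test_emb)
    return float(model @ test)

# Hypothetical usage with precomputed 512-dimensional x-vectors:
enroll = np.random.randn(3, 512)  # embeddings from the enrollment video
test = np.random.randn(512)       # embedding from the test video
print(cosine_score(enroll, test))
```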
0:08:40 | This slide shows a block diagram of the baseline face recognition system developed for the audio-visual SRE. To develop this, we used FaceNet as well as the NIST ancillary toolkit. We used the pre-trained multi-task cascaded convolutional neural network (MTCNN) model for face detection, and for embedding extraction we used a ResNet model that was trained on the VGGFace2 dataset. In order to tune the hyperparameters, we used the JANUS Multimedia dataset, and, similar to the baseline speaker recognition system, no score calibration was used for the face recognition system.
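The talk names MTCNN for face detection and a VGGFace2-trained ResNet for embeddings. Below is a sketch of that pipeline using the open-source facenet-pytorch package, which bundles pre-trained versions of both models; this package is an assumed stand-in, not necessarily the toolkit used for the baseline.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

# Pre-trained MTCNN face detector and a VGGFace2-trained embedding model.
mtcnn = MTCNN(image_size=160)
embedder = InceptionResnetV1(pretrained='vggface2').eval()

def face_embedding(path):
    """Detect the most prominent face in an image and embed it.
    Returns None if no face is found."""
    face = mtcnn(Image.open(path))         # aligned face crop, or None
    if face is None:
        return None
    with torch.no_grad():
        emb = embedder(face.unsqueeze(0))  # 512-d embedding
    return torch.nn.functional.normalize(emb, dim=1).squeeze(0)
```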
0:09:30 | This slide shows the performance of the primary submissions, per team and per track, as well as the performance of the baseline systems, in terms of the actual and minimum costs on the test set. The blue bars and red bars show the minimum and actual costs, respectively. The y-axis denotes the C_primary, and it is limited to 2.5 to facilitate cross-system comparisons in the lower-cost regions.
0:10:09 | We can make some important observations from this figure. First, compared to the most recent SRE, which was SRE18 at the time, there seems to be a notable improvement in audio-only speaker recognition performance. These improvements are largely attributed to the use of extended and more complex end-to-end neural network architectures, such as ResNet architectures, along with soft-margin loss functions, such as the angular softmax, for speaker embedding extraction. Given the size of these models, they can effectively exploit the vast amounts of training data that are available through data augmentation.
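The talk names the angular softmax family of losses without giving details; below is a compact PyTorch sketch of one common variant, the additive angular margin (ArcFace-style) loss, as an illustration rather than any particular team's recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginSoftmax(nn.Module):
    """Additive angular margin softmax loss (ArcFace-style), illustrative."""
    def __init__(self, emb_dim, n_classes, scale=30.0, margin=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.scale, self.margin = scale, margin

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target-class logit.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)
```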
0:11:02 | The second observation is that the performance trends for the top four teams are generally similar. We can see that the actual costs for the audio-only submissions are larger than those for the visual-only submissions, and that audio-visual fusion, meaning the combination of the speaker and face recognition systems, results in substantial gains in person recognition performance. For example, we can see a greater than eighty-five percent relative improvement, in terms of the minimum detection cost, for the leading system compared to either the speaker or the face recognition system alone.
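The fusion method is not specified in the talk. One common approach, sketched here under that assumption, is score-level fusion with weights learned by logistic regression on a development set of trials:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(audio_dev, face_dev, labels_dev):
    """Learn score-level fusion weights on development trials
    (labels: 1 = target, 0 = non-target)."""
    X = np.column_stack([audio_dev, face_dev])
    return LogisticRegression().fit(X, labels_dev)

def fuse(model, audio_scores, face_scores):
    """Fused scores for evaluation trials; the decision function acts
    like a calibrated log-likelihood ratio up to a prior offset."""
    X = np.column_stack([audio_scores, face_scores])
    return model.decision_function(X)
```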
0:11:46 | Thirdly, more than half of the submissions outperformed the baseline audio-visual system, with the leading system achieving a larger than ninety percent improvement over the baseline.
0:12:01 | The fourth observation is that, in terms of calibration performance, we can see mixed results for some teams. For example, for the top two teams, the calibration errors for the speaker recognition systems are larger than those for the face recognition systems, while for some others the opposite is true.
0:12:23 | Finally, in terms of the minimum detection cost, the top performing speaker and face recognition systems achieve comparable results, which is a very promising outcome of this evaluation for the speaker recognition community, given the results we had seen in prior studies, where face recognition systems were shown to outperform speaker recognition systems by a large margin.
0:12:51 | It's also worth emphasizing here that the top performing speaker and face recognition systems, which were both from Team 5, are both single systems; that means no system combination or fusion was applied to these systems.
0:13:18 | So, to gain further insight into the actual performance differences among the top performing systems, we also computed bootstrap-based ninety-five percent confidence intervals for these point estimates of the performance.
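The exact bootstrap procedure is not described in the talk; a minimal percentile-bootstrap sketch that resamples whole trials (NIST's procedure may instead resample at the speaker or partition level) could look like this:

```python
import numpy as np

def bootstrap_ci(scores, labels, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a trial-level metric.

    `metric(scores, labels)` could compute e.g. the actual detection cost.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    rng = np.random.default_rng(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)        # resample trials with replacement
        stats.append(metric(scores[idx], labels[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```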
0:13:36 | The plots on this slide show the confidence intervals around the actual detection costs, per team, for the audio track, which is shown at the top, the visual track, which is shown in the middle, and the audio-visual track, which is shown at the bottom. In general, the audio systems exhibit wider confidence margins than their visual counterparts. This could be partly because most of the participants were from the speaker recognition community, using off-the-shelf face recognition systems along with pre-trained models, which were not necessarily optimized for the task at hand in the SRE19 audio-visual evaluation.
0:14:24 | Also, on a side note, notice that several leading systems perform almost comparably under different samplings of the trial space. Another interesting observation is that audio-visual fusion seems to boost the decision-making confidence of the systems by a significant margin, to the point where the two leading systems outperform the other systems in a statistically significant way. These observations further highlight the importance of statistical significance tests when reporting performance results, or in the model selection stage during system development, particularly when the number of trials is relatively small.
0:15:20 | This slide shows the DET performance curves, where DET stands for detection error tradeoff, for the top performing system in the audio, visual, and audio-visual tracks. The solid black curves in the figure represent equal-cost contours, meaning that all the points on a given contour correspond to the same detection cost value.
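As background for reading these plots: a DET curve traces the miss and false-alarm probabilities over all detection thresholds, drawn on probit-warped axes. A small illustrative sketch, not the official scoring tool, of computing such points from trial scores:

```python
import numpy as np
from scipy.stats import norm

def det_points(scores, labels):
    """Miss/false-alarm rates swept over all thresholds, returned on the
    probit scale used for DET axes. labels: 1 = target, 0 = non-target."""
    order = np.argsort(scores)
    lab = np.asarray(labels, dtype=float)[order]
    n_tar, n_non = lab.sum(), len(lab) - lab.sum()
    p_miss = np.cumsum(lab) / n_tar             # targets below threshold
    p_fa = 1.0 - np.cumsum(1.0 - lab) / n_non   # non-targets above it
    clip = lambda p: np.clip(p, 1e-6, 1 - 1e-6) # avoid infinite probits
    return norm.ppf(clip(p_miss)), norm.ppf(clip(p_fa))
```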
0:15:47 | Here we can see that, consistent with our previous observations from the overall results a few slides back, audio-visual fusion provides remarkable improvements in performance across all operating points, not just a single operating point on the DET curve, which is expected given how complementary the two modalities, audio and visual, are. In addition, for a wide range of operating points, the speaker and face recognition systems provide comparable performance, which is very promising for the speaker recognition community and shows how far the technology has come.
0:16:29 | This slide shows normalized target and non-target score distributions for the top performing system in all tracks, meaning the audio, visual, and audio-visual tracks. The red dashed line represents the detection threshold, which is related to the value of beta that we discussed when we were talking about the performance measurement.
0:16:55 | Here we can see that the score distributions from the audio-only and face-only systems are roughly aligned, with the target and non-target distributions showing some overlap around the threshold point. However, after the audio-visual fusion, the target and non-target classes are well separated, with minimal overlap at the threshold point. We speculate that this is actually the reason why we see such low errors, specifically such low false rejects, for systems that use audio-visual fusion.
0:17:46 | So, in summary, we used the new and improved evaluation web platform for automated submission validation and scoring for the audio-visual SRE19. Through this web platform, we released a software package for system output validation and scoring. We also released the baseline person recognition system description and results. In terms of data, for the first time we introduced video data for audio-visual person recognition. We released large labeled data sets, which were extracted from the JANUS Multimedia dataset as well as the VAST corpus, and these data sets properly matched the evaluation set.
0:18:26 | In terms of results, we saw substantial gains from the audio-visual fusion. We also saw that the top performing speaker and face recognition systems performed comparably. We saw major improvements that were attributed to the use of more extended and more complex neural network models, such as the ResNet model, along with angular margin losses. In addition, the improvements were attributed to the extensive use of data augmentation, and to the clustering of embeddings, which was done primarily for diarization purposes.
0:19:13 | Effective use of the dev set, as well as the choice of the calibration set, were also very important, and they were key to performing well in this evaluation. And finally, although fusion still seems to play a role, we saw that strong single systems can be as good as fusion systems.
0:19:34 | With that, I would like to conclude this talk. Thank you very much for your attention, and stay safe.