0:00:17 | Hi everyone. I'll be presenting the work by myself and my co-authors on the VOiCES from a Distance Challenge 2019: analysis of speaker verification results and main challenges. |
0:00:34 | When we look at evaluations and challenges in the community, they tend to provide common data, benchmarks, and performance metrics for the advancement of research in the speaker recognition community. |
0:00:44 | Some examples you might be familiar with are the NIST SRE series, the Speakers in the Wild (SITW) challenge, the VoxCeleb Speaker Recognition Challenge, and the SdSV Challenge. |
0:00:54 | Previous evaluations focused on speaker verification in domains covering telephone and microphone data, different speaking styles, noisy data, vocal effort, audio from video, short durations, and more. |
0:01:08 | However, there haven't been many that focus on the far-field, distant-speaker domain. |
0:01:16 | Nowadays we've got commercial personal assistants that are really prominent in this area, so trying to get a bit more understanding in this context is important, especially when we look at the single-microphone scenario. |
0:01:31 | The VOiCES from a Distance Challenge 2019 was hosted by SRI International and Lab41 as a special session at Interspeech 2019. |
0:01:41 | What this challenge focused on was both speaker recognition and speech recognition, using distant, far-field speech acquired with a single microphone in noisy, realistic, reverberant environments. |
0:01:52 | There were several objectives that we had for this challenge. |
0:01:55 | One was to benchmark state-of-the-art technology for far-field speech. |
0:02:00 | We wanted to support the development of new ideas and technology, and to bring that technology forward. |
0:02:06 | We wanted to support new research groups entering the field of distant speech processing. |
0:02:11 | And we wanted to do all of this with a publicly available dataset that is realistic in its reverberation characteristics. |
0:02:21 | What we've noticed since the release of the public database in 2019 is an increased use of the VOiCES dataset. |
0:02:28 | So we thought this called for the current special session that we're hosting here at Odyssey 2020, albeit virtual. |
0:02:36 | Now the session, we're hoping, will focus on broad areas such as single- versus multi-channel speaker recognition, |
0:02:43 | single- versus multi-channel speech enhancement for speaker recognition, |
0:02:47 | domain adaptation for far-field speaker recognition, |
0:02:51 | calibration in far-field conditions, |
0:02:53 | and advancing the state of the art over what we saw in the VOiCES from a Distance Challenge 2019. |
0:03:01 | Let's have a look at what the VOiCES corpus actually contains. |
0:03:05 | VOiCES stands for Voices Obscured in Complex Environmental Settings, and it is a large, now publicly available corpus collected in real reverberant environments. |
0:03:16 | What we have inside the dataset is 3,900 or more hours of audio from about a million segments, multiple rooms (four in total), different distractors such as TV and babble noise, and different microphones at different distances. |
0:03:32 | We even have a loudspeaker that rotates to mimic human head movement. |
0:03:37 | The idea for this dataset was that it would be useful for speaker recognition, automatic speech recognition, speech enhancement, and speech activity detection. |
0:03:49 | Here are a couple of different statistics for the VOiCES dataset. |
0:03:52 | It is released under the Creative Commons BY 4.0 license, and that makes it accessible for commercial, academic, and government use. |
0:04:00 | We have a large number of speakers, three hundred, over four different rooms, with up to twenty different microphones and different microphone types. |
0:04:09 | The source dataset that we used was a read-speech dataset, LibriSpeech. |
0:04:15 | And we've got a number of different background noises, including babble, music, and TV sounds. |
0:04:20 | The loudspeaker orientation for mimicking human head movement ranges between zero and one hundred eighty degrees. |
0:04:30 | Now let's walk through what we saw in the challenge of 2019. |
0:04:35 | We had two different tasks: speaker recognition and ASR. |
0:04:39 | And there were two different task conditions. One was a fixed condition, and the idea here was that the data was constrained: everyone got to use the same constrained dataset. |
0:04:48 | The purpose behind this was to benchmark systems trained with that same dataset, to see if there's a dramatic difference between individual technologies versus what was commonly applied. |
0:04:58 | In the open condition, teams were left to use any available dataset, private or public. |
0:05:04 | Now the idea here was to quantify the gains that could be achieved when we have an unconstrained amount of data, relative to the fixed condition. |
0:05:14 | In terms of the goal here, we're looking at: can we determine whether a target speaker spoke in a segment of speech, given the enrollment of that target speaker? |
0:05:25 | The performance metric matches the NIST SRE cost function, with the parameters on screen. |
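The exact cost parameters appeared on the slide and aren't recoverable from the audio; as a minimal sketch of an SRE-style detection cost function, here is a Python version assuming the commonly used illustrative values P_target = 0.01 and unit miss/false-alarm costs (the helper names are hypothetical):

```python
import numpy as np

def detection_cost(scores, labels, threshold,
                   p_target=0.01, c_miss=1.0, c_fa=1.0):
    """SRE-style detection cost at a fixed decision threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    p_miss = np.mean(scores[labels == 1] < threshold)   # missed target trials
    p_fa = np.mean(scores[labels == 0] >= threshold)    # false alarms on impostors
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

def min_dcf(scores, labels, **cost_params):
    """Minimum DCF: the cost at the best threshold chosen in hindsight."""
    return min(detection_cost(scores, labels, t, **cost_params)
               for t in np.unique(scores))
```

For well-calibrated log-likelihood-ratio scores, the actual DCF is evaluated at the fixed Bayes threshold log(c_fa * (1 - p_target) / (c_miss * p_target)), a distinction that matters for the calibration results discussed later.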
0:05:32 | As part of the challenge we also provided a scoring tool, so users could measure performance during development and confirm the validity of their scores before submitting them to us for evaluation. |
0:05:46 | The training set in the fixed condition was limited to the Speakers in the Wild (SITW) database collection and the VoxCeleb1 and VoxCeleb2 datasets. |
0:05:57 | In terms of development and evaluation data for the challenge, participants were allowed to develop on the development data, and then there was held-out evaluation data that we benchmarked the systems on. |
0:06:08 | There are a couple of different things to point out here about how we divided these conditions. |
0:06:14 | We made sure that we actually had some room mismatch between enrollment and test, as well as between the rooms used in development and evaluation. |
0:06:22 | And this is to help mimic what would happen with a system developed on laboratory-collected data and then sent out for real-world use. |
0:06:34 | Similarly, we had mismatch between enrollment and test over the microphone type, comparing the studio mic to the lapel, or to the MEMS and boundary mics in the room. |
0:06:45 | We also had mismatch in the microphone used between the enrollment and verification sides of those two different tasks. |
0:06:54 | Finally, for the loudspeaker orientation, we have quite a range, and we list those ranges so that we're able to analyze the impact of head movement on speaker recognition. |
0:07:05 | In terms of the results, we had twenty-one teams successfully submit scores, and a number of teams also submitted scores for the open condition, so we can get that comparison point. |
0:07:17 | In total we had more than fifty system submissions across the fixed and open conditions. |
0:07:21 | On the slide here we've shown the top scores for each team; I'll dig into these a little bit on the next slide. |
0:07:29 | Let's start analysing some of these results. |
0:07:33 | The first thing we did was to look at the confidence intervals, the ninety-five percent confidence intervals. |
0:07:38 | And we did this by using a modified version of a joint bootstrapping technique; the reference can be found in the paper. |
0:07:45 | Now, the reason we modified this was to account for the correlation of trials due to multiple models being available per speaker. |
0:07:54 | That is, different recordings from a speaker could each represent a different enrollment, and so there's correlation in the trial scores. |
0:08:03 | What we're calling the interval here is the range between the 5th and 95th percentiles of the resulting empirical distribution. |
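As a rough illustration of the idea (a simplified, single-axis sketch; the joint technique referenced in the paper resamples along more than one axis), resampling whole speakers rather than individual trials preserves the correlation that multiple enrollment models per speaker induce:

```python
import numpy as np

def speaker_bootstrap_ci(trials, metric, n_boot=1000, seed=0):
    """Speaker-level bootstrap interval for a verification metric.

    trials: list of dicts with keys 'spk' (enrolled speaker id),
            'score', and 'label' (1 = target trial, 0 = impostor).
    metric: function(scores, labels) -> float, e.g. minDCF or EER.
    """
    rng = np.random.default_rng(seed)
    by_spk = {}
    for t in trials:                       # group trials by enrolled speaker
        by_spk.setdefault(t['spk'], []).append(t)
    speakers = list(by_spk)
    stats = []
    for _ in range(n_boot):
        drawn = rng.choice(speakers, size=len(speakers), replace=True)
        boot = [t for s in drawn for t in by_spk[s]]   # keep all of a speaker's trials
        scores = np.array([t['score'] for t in boot])
        labels = np.array([t['label'] for t in boot])
        stats.append(metric(scores, labels))
    return np.percentile(stats, [5, 95])   # the reported percentile interval
```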
0:08:10 | Now if we look at those top four scores, shown zoomed in a little, we can see what the confidence intervals look like when you don't take into account the speaker sampling or the multiple models per speaker. |
0:08:20 | They can easily mislead us if we don't take that into account. |
0:08:24 | What we should be looking at is the red bars; they give us a truer picture of what the confidence intervals are. |
0:08:33 | And if we look at those four systems with respect to the other submissions, we see that they are significantly different compared to the rest of the submissions; however, they also perform relatively similarly to one another. |
0:08:47 | Some of the observations we found when looking at what the different groups submitted: |
0:08:52 | one was that the top teams applied weighted prediction error (WPE) for dereverberation. Remember, the VOiCES corpus has a lot of reverb and the rooms are quite noisy, and so that was an important preprocessing step. |
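For context, weighted prediction error estimates the late reverberation tail by long-term linear prediction in the STFT domain and subtracts it. A minimal single-channel sketch using the open-source nara_wpe package follows (API usage assumed from that package's documentation; file names and settings are hypothetical, and the actual submissions used their own implementations):

```python
import numpy as np
import soundfile as sf
from nara_wpe.utils import stft, istft
from nara_wpe.wpe import wpe

audio, sr = sf.read('distant_mic.wav')               # mono far-field recording

# STFT -> (channels, frames, bins); wpe expects (bins, channels, frames).
Y = stft(audio[np.newaxis, :], size=512, shift=128).transpose(2, 0, 1)

# Long-term linear prediction removes late reverberation while leaving
# the direct path and early reflections largely intact.
Z = wpe(Y, taps=10, delay=3, iterations=5)

z = istft(Z.transpose(1, 2, 0), size=512, shift=128)[0]
sf.write('dereverberated.wav', z, sr)
```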
0:09:06 | Every team also used an x-vector system with data augmentation, and this was sometimes complemented with ResNet and DenseNet based architectures. |
0:09:16 | PLDA was the most popular choice for the backend. |
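For readers new to the area, here is a bare-bones PyTorch sketch of an x-vector style network: a TDNN over frame-level features with statistics pooling, trained as a speaker classifier, whose bottleneck embedding is then scored by a backend such as PLDA. Layer sizes follow the commonly published recipe and are illustrative only, not any team's actual system:

```python
import torch
import torch.nn as nn

class XVector(nn.Module):
    """Minimal x-vector style TDNN with statistics pooling."""
    def __init__(self, feat_dim=30, emb_dim=512, n_speakers=7000):
        super().__init__()
        # Frame-level layers: 1-D convolutions with growing temporal context.
        self.frame = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
        )
        self.segment = nn.Linear(2 * 1500, emb_dim)    # mean+std doubles the dim
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, feats):              # feats: (batch, feat_dim, frames)
        h = self.frame(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # stats pooling
        emb = self.segment(stats)          # the embedding used for verification
        return self.classifier(emb), emb
```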
0:09:19 | And system calibration actually proved crucial here, with all of the bottom sixteen teams failing to achieve good system calibration. |
0:09:27 | What that means is there was a significant difference between the minimum and actual DCF values at which the systems should have been tuned. |
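To make the minimum-versus-actual distinction concrete, here is a small sketch of standard linear score calibration (logistic regression from raw scores to log-likelihood ratios, the usual approach in the field, though not any specific team's recipe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_calibration(dev_scores, dev_labels):
    """Learn an affine map s -> a*s + b from raw scores to LLRs.

    Plain logistic regression is shown for brevity; field practice uses
    prior-weighted logistic regression (e.g. the BOSARIS toolkit).
    """
    lr = LogisticRegression()
    lr.fit(np.asarray(dev_scores).reshape(-1, 1), dev_labels)
    a, b = lr.coef_[0, 0], lr.intercept_[0]
    return lambda s: a * np.asarray(s) + b

# Actual DCF is measured at the fixed Bayes threshold implied by the cost
# parameters (illustrative values); minimum DCF uses the best threshold in
# hindsight. A large gap between the two is exactly a calibration failure.
p_target, c_miss, c_fa = 0.01, 1.0, 1.0
bayes_threshold = np.log(c_fa * (1.0 - p_target) / (c_miss * p_target))
```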
0:09:36 | Let's look now at what happens when you change the enrollment condition. |
0:09:41 | In particular, we're looking at what happens in a reverberant environment when we use source data, that is, no reverberation, a close-talking microphone, versus data from a different room, with reverberation, to enrol. |
0:09:56 | What we actually see here: the blue bars are the results of source enrollment tested against room-four data, whereas the red is enrolling on reverberant room-three data against the same test set. |
0:10:12 | We see the red bars are higher than the blue. |
0:10:15 | This reverberant enrollment caused up to a forty-two percent relative degradation versus source enrollment (that is, (DCF_reverb - DCF_source) / DCF_source of up to roughly 0.42), and that depends on the system being benchmarked, of course. |
0:10:26 | But it does suggest that speakers should be enrolled using close-talking segments of clean speech. |
0:10:33 | Basically, when you have this mismatch in reverberation between enrollment and test, it's best that reverberation doesn't play a role on the enrollment side. |
0:10:45 | Next, the several different background distractors. |
0:10:48 | We call them distractors because they're intended to distract the system from the true speech of the speaker: we had TV in the background, or babble noise in the background. |
0:10:57 | When enrolling, we enrolled on clean speech with no distractors; but for verification we had three different types: no distractor, TV noise (which sometimes includes speech), and babble noise. |
0:11:11 | And what we found is that the systems that were submitted were reasonably robust to the effect of TV noise in the background. |
0:11:16 | However, with babble, including other speech in the environment of the true speaker resulted in a forty-five to fifty percent relative degradation, so it's quite a significant drop. |
0:11:30 | Okay, now microphone type. |
0:11:33 | We had a studio mic placed close to the source for enrollment, and then three different mic classes, lapel, MEMS, and boundary, for verification at different positions. |
0:11:42 | We'll look at different distances in the next slide; here we just want to look at how the different microphones compare. |
0:11:51 | Quite consistently across systems, there's a step down going from boundary to MEMS to lapel microphones. |
0:12:00 | For the different distances, we're just looking at the top five systems here, to constrain the results we look at, with the lapel mics placed at seven distances. |
0:12:12 | Note that, due to the setup, non-overlapping masking effects and more distant microphone placements pose a greater challenge. |
0:12:20 | What was interesting is that the bars that really stand out there, the red, teal, and blue, correspond to mics that tended to be partially obscured: some of them were actually hidden, or very far from the speaker. |
0:12:34 | So those mics really drop performance as well. |
0:12:39 | This also tends to explain the poor performance of the lapel mics in general, and of the MEMS mics we saw on the previous slide. |
0:12:48 | With that as a summary, let's now look at the remaining challenges, based on what we've seen so far from VOiCES publications and system submissions. |
0:12:58 | The range of the RT60 reverberation-time characteristic was two to three times worse in the evaluation set than in the development set. |
0:13:06 | Now, this meant the level of reverberation in the evaluation rooms was quite a bit greater than in development, and it was quite clear, we found, that this severe amount of reverberation contributed to degraded results compared to development. |
0:13:24 | Current speaker recognition technology doesn't tend to address the impact of reverberation sufficiently. |
0:13:31 | The error rates are a lot higher for reverberant conditions than for the source signal; reverberation in the presence of noise further degrades the performance; and increasing distance amplifies the impact of reverberation and degrades performance. |
0:13:46 | So we need to explore novel speaker modeling techniques in this context: capable of handling long temporal information in utterances, resilient to the reverberation that can happen in this space, and robust to multiple noise conditions. |
0:14:02 | System calibration, as we've seen, was critical for systems deployed in the real world; the bottom sixteen teams failed to successfully calibrate their systems. |
0:14:10 | And previous work has shown that there is actually a large degradation in calibration performance when the distance to the microphone is significantly different between the calibration training conditions and the ones to which it is applied. |
0:14:23 | So one way that we might be able to mitigate this kind of effect is to have calibration methods that dynamically consider conditions of the trial, the predicted distance, for instance. |
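One simple way such trial-conditioned calibration could be realized (a hypothetical illustration, not a method presented in the talk) is to feed side information like the predicted microphone distance into the calibration model alongside the raw score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_conditioned_calibration(dev_scores, dev_dist, dev_labels):
    """Calibration whose score-to-LLR map shifts with predicted distance."""
    X = np.column_stack([dev_scores, dev_dist,
                         np.asarray(dev_scores) * np.asarray(dev_dist)])
    lr = LogisticRegression().fit(X, dev_labels)

    def to_llr(scores, dist):
        x = np.column_stack([scores, dist,
                             np.asarray(scores) * np.asarray(dist)])
        return lr.decision_function(x)     # log-odds, shifted per condition
    return to_llr
```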
0:14:36 | Note that the challenge was based on single-channel audio, but VOiCES was actually collected with many more microphones in the room, and we haven't looked into the effect of, for instance, beamforming. |
0:14:49 | And there are a number of front-end processing approaches that we'd like to look at, including speech enhancement and dereverberation, tailored specifically for the task of speaker recognition. |
0:15:01 | So we hope you enjoy this special session at Odyssey this year, and that you continue to drive technology forward in these areas. |
0:15:09 | And we look forward to seeing what comes out of it. Thank you. |