0:00:17 Hi everyone. I'll be presenting work by myself and my colleagues
0:00:24 on the VOiCES from a Distance Challenge 2019: analysis of speaker verification results
0:00:30 and main challenges.
0:00:34 When we look at evaluations and challenges in the community, they tend to provide common data,
0:00:39 benchmarks, and performance metrics for the advancement of research in the speaker recognition community.
0:00:44 Some examples you might be familiar with are the NIST SRE series,
0:00:49 the Speakers in the Wild (SITW) challenge,
0:00:50 the VoxCeleb Speaker Recognition Challenge,
0:00:52 and the SdSV challenge.
0:00:54 Previous evaluations focused on speaker verification in the wild, considering telephone and microphone data,
0:01:01 different speaking styles,
0:01:03 noisy data, vocal effort, audio from video, short durations, and so on.
0:01:08 However, there haven't been that many that focus on,
0:01:11 or even look into, the far-field, distant-speaker domain.
0:01:16 Nowadays we've got commercial personal assistants that are really
0:01:22 active in this area, so trying to get a bit more of an understanding in
0:01:25 this context is important, especially when we're looking at the single-microphone
0:01:29 scenario.
0:01:31 The VOiCES from a Distance Challenge 2019 was hosted by SRI
0:01:35 International and Lab41
0:01:38 at Interspeech 2019.
0:01:41 What this challenge focused on was both speaker recognition and speech recognition
0:01:45 using distant, far-field speech acquired with a single microphone
0:01:49 in noisy and realistic reverberant environments.
0:01:52 There were several objectives that we had for this challenge.
0:01:55 One was the benchmarking of state-of-the-art technology for far-field speech.
0:02:00 We wanted to support the development of new ideas and technology to bring that technology
0:02:05 forward.
0:02:06 We wanted to support new research groups entering the field of distant speech processing.
0:02:11 And that was largely enabled by a publicly available dataset
0:02:15 that is realistic in its reverberation characteristics.
0:02:21 What we noticed since the release of the public database in 2019 is
0:02:25 an increased use of the VOiCES dataset.
0:02:28 So we thought this actually called for the current special session that we're hosting
0:02:32 here at Odyssey 2020, even if now virtual.
0:02:36 Now, the session, we're hoping, will focus on broad areas such as single- versus multi-
0:02:41 channel speaker recognition,
0:02:43 single- versus multi-channel speech enhancement for speaker recognition,
0:02:47 domain adaptation for far-field speaker recognition,
0:02:51 calibration in far-field conditions,
0:02:53 and advancing the state of the art
0:02:55 over what we saw in the VOiCES from a Distance Challenge 2019.
0:03:01 Let's have a look at what the VOiCES corpus actually contains.
0:03:05 VOiCES stands for Voices Obscured in Complex Environmental Settings,
0:03:09 and it is a large, now publicly available corpus collected in real reverberant
0:03:15 environments.
0:03:16 What we have inside the dataset is
0:03:19 3,900 or more hours of audio
0:03:22 from about a million segments,
0:03:24 multiple rooms, four in total,
0:03:26 different distracters such as TV and babble noise,
0:03:29 and different microphones at different distances.
0:03:32 We even have a loudspeaker that rotates to mimic human head movement.
0:03:37 The idea for this dataset was that it would be useful for speaker recognition,
0:03:41 automatic speech recognition,
0:03:43 speech enhancement,
0:03:45 and speech activity detection.
0:03:49 Here are a couple of different statistics from the VOiCES dataset.
0:03:52 It is released under the Creative Commons BY 4.0 license, and that makes it accessible for commercial,
0:03:57 academic, and government use.
0:04:00 We have a large number of speakers, three hundred, over four different rooms,
0:04:04 up to twenty different microphones, and different microphone types.
0:04:09 The source data that we used was a read speech dataset, LibriSpeech, recorded
0:04:13 in clean conditions.
0:04:15 And we've got a number of different background noises, including babble,
0:04:18 music, and TV sounds.
0:04:20 The loudspeaker orientation, for mimicking human head movement,
0:04:25 ranges between zero and one hundred and eighty degrees.
0:04:30 Let's do a quick recap of what we saw in the challenge in
0:04:33 2019.
0:04:35 We had two different tasks, speaker recognition and ASR,
0:04:39 and there were two different task conditions. One was a fixed condition,
0:04:42 and the idea here was that the data was constrained:
0:04:45 everyone got to use the same constrained dataset.
0:04:48 The purpose behind this was to benchmark systems trained with that same dataset, to
0:04:53 see if there's a dramatic difference between individual technologies over what was commonly applied.
0:04:58 In the open condition,
0:05:00 teams were allowed to use any available dataset, private or public.
0:05:04 Now, the idea here was to quantify the gains that could be achieved when we
0:05:07 have an unconstrained amount of data,
0:05:09 relative to the fixed condition.
0:05:14 In terms of the goal here,
0:05:16 we're looking at: can we determine whether a target speaker speaks
0:05:20 in a segment of speech, given the enrollment of that target speaker?
0:05:25 The performance metric is similar to the NIST SRE
0:05:28 cost function,
0:05:29 with the parameters shown on screen.
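For reference, the NIST-style detection cost function has the general form below; the specific cost and prior values used for the challenge were the ones shown on the slide, so the symbols here are left generic rather than being the official settings:

$$ C_{\mathrm{Det}} = C_{\mathrm{Miss}} \, P_{\mathrm{Target}} \, P_{\mathrm{Miss}}(\theta) \; + \; C_{\mathrm{FA}} \, (1 - P_{\mathrm{Target}}) \, P_{\mathrm{FA}}(\theta) $$

where $P_{\mathrm{Miss}}(\theta)$ and $P_{\mathrm{FA}}(\theta)$ are the miss and false-alarm rates at decision threshold $\theta$.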
0:05:32 As part of the challenge we also provided a scoring tool so users could measure
0:05:36 performance
0:05:37 during development and confirm the validity of their scores before submitting them to us for evaluation.
0:05:46 The training set in the fixed condition was limited to all speakers in the
0:05:51 Speakers in the Wild (SITW) data collection
0:05:52 and the VoxCeleb1 and VoxCeleb2 datasets.
0:05:57 In terms of development and evaluation data for the challenge, participants were allowed to develop
0:06:02 on the development data,
0:06:04 and then there was held-out evaluation data that we then used to benchmark the systems on.
0:06:08 There are a couple of different things to point out here about how we divided these conditions.
0:06:14 We made sure that we actually had some room mismatch between enrollment and test,
0:06:18 as well as between the rooms used for development and evaluation.
0:06:22 And this is to help mimic
0:06:25 what would happen with a system developed on laboratory-quality,
0:06:29 clean data
0:06:30 and then sent out for real-world use.
0:06:34 Similarly, we had mismatch between enrollment and test on the microphone type,
0:06:39 comparing the studio mic to the lapel,
0:06:41 or to the MEMS and boundary mics,
0:06:45 as well as mismatch between enrollment and verification in the specific microphone used
0:06:50 between those two different tasks.
0:06:54 Finally, the loudspeaker orientation:
0:06:56 we had quite a range there, and we list those ranges so that we were able
0:07:00 to analyze the impact of head movement on speaker recognition.
0:07:05 In terms of the results, we had twenty-one teams successfully submit scores,
0:07:09 and four of those teams also submitted scores for the open submission, so we can
0:07:14 get that comparison point.
0:07:17 In total we had fifty-odd system submissions across the fixed and open conditions.
0:07:21 On the slide here we've shown the top scores for each
0:07:24 team.
0:07:26 We'll dig into these a little bit on the next slide;
0:07:29 let's start analysing some of these results.
0:07:33 The first thing we did was compute the confidence intervals, the ninety-five
0:07:36 percent confidence intervals,
0:07:38 and we did this by using a modified version of a joint bootstrapping technique;
0:07:42 the reference can be found in the paper.
0:07:45 Now, the reason we modified this was to account for the correlation of trials due
0:07:49 to multiple models being available per speaker.
0:07:54 That is, different recordings from a speaker could each represent a different enrollment,
0:07:59 and so there is correlation
0:08:01 in the trial scores.
0:08:03 What we're calling the interval here is the range between the 5th and 95th percentiles
0:08:07 of the resulting empirical distribution.
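As a rough sketch of the idea (not the exact modified procedure from the referenced paper), a joint bootstrap resamples at the speaker level, so that all trials involving a resampled speaker, including its multiple enrollment models, move together; the function and variable names below are illustrative assumptions.

```python
import numpy as np

def joint_bootstrap_ci(trials, metric_fn, n_boot=1000, pct=(5, 95), seed=0):
    """Speaker-level bootstrap confidence interval for a verification metric.

    trials    : list of (enroll_speaker, test_speaker, score, is_target) tuples
    metric_fn : maps a list of trials to a scalar metric (e.g. EER or DCF)
    pct       : percentiles of the empirical distribution to report
                (the talk quotes the 5th and 95th percentiles)
    """
    rng = np.random.default_rng(seed)
    speakers = sorted({t[0] for t in trials} | {t[1] for t in trials})
    # Index trials by speaker, so resampling a speaker pulls in every trial
    # (and every enrollment model) tied to that speaker together.
    by_spk = {s: [t for t in trials if s in (t[0], t[1])] for s in speakers}

    stats = []
    for _ in range(n_boot):
        sampled = rng.choice(speakers, size=len(speakers), replace=True)
        resampled = [t for s in sampled for t in by_spk[s]]
        stats.append(metric_fn(resampled))

    lo, hi = np.percentile(stats, pct)
    return lo, hi
```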
0:08:10 Now, if we look at those top four scores in a little more detail,
0:08:13 we can see that the confidence intervals are narrower
0:08:16 when you don't take into account the speaker sampling or the multiple models per speaker.
0:08:20 So we can easily overstate significance if we don't take that into account.
0:08:24 What we should be looking at are the red bars:
0:08:27 these give us a truer picture of what the confidence intervals are.
0:08:33 And looking at those four systems with respect to the other submissions,
0:08:37 we see that they are significantly different compared to the rest of the submissions;
0:08:41 however, they also perform relatively similarly to one another.
0:08:47 Some of the observations we found when looking at what the different groups submitted:
0:08:52 the top teams applied weighted prediction error (WPE) for dereverberation; remember, the VOiCES corpus
0:08:59 has a lot of reverb and the rooms are quite noisy,
0:09:02 and that was found to be an important preprocessing step.
0:09:06 Every team also used an x-vector system with data augmentation,
0:09:10 and this was sometimes complemented with ResNet- and DenseNet-based architectures,
0:09:16 but PLDA was the most popular choice for the backend.
0:09:19 And system calibration was actually found to be crucial here,
0:09:23 with all of the bottom six teams failing to achieve good system calibration.
0:09:27 What that means is there was a significant difference between the minimum
0:09:30 and actual DCF values, for which those systems were penalised.
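To make the calibration point concrete, here is a minimal sketch of minimum versus actual DCF computed from target and non-target scores treated as log-likelihood ratios; the cost and prior values are example assumptions, not the official challenge parameters.

```python
import numpy as np

def dcf(p_miss, p_fa, p_target=0.01, c_miss=1.0, c_fa=1.0):
    # Unnormalised detection cost; parameter values here are examples only.
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa

def error_rates(tar, non, threshold):
    # Miss and false-alarm rates when accepting scores >= threshold.
    return np.mean(tar < threshold), np.mean(non >= threshold)

def min_and_actual_dcf(tar, non, p_target=0.01):
    # Actual DCF: decisions made at the Bayes threshold implied by the prior
    # (equal costs assumed), i.e. the scores are trusted as calibrated LLRs.
    bayes_thr = np.log((1 - p_target) / p_target)
    act = dcf(*error_rates(tar, non, bayes_thr), p_target)
    # Minimum DCF: oracle threshold chosen with knowledge of the labels.
    mins = min(dcf(*error_rates(tar, non, t), p_target)
               for t in np.concatenate([tar, non]))
    return mins, act
```

A large gap between the actual and minimum values is exactly the calibration failure described for those bottom teams.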
0:09:36 Let's look now at what happens when you change the enrollment condition.
0:09:41 In particular, we're looking at what happens in a reverberant environment: should we use source data,
0:09:46 that is, no reverberation, a close-talking microphone,
0:09:49 or use data from a different room,
0:09:52 with reverberation,
0:09:54 to enrol?
0:09:56 What we actually show here are the blue results, the blue bars:
0:10:00 source enrollment against testing with room-4 data, whereas the red bars are enrolling on
0:10:08 reverberant room-3 data
0:10:10 against the same test data.
0:10:12 We see the red bars are higher than the blue:
0:10:15 this reverberant enrollment
0:10:18 caused up to a forty-two percent relative degradation compared to source enrollment.
0:10:23 That depends on the system being benchmarked, of course,
0:10:26 but it does suggest that speakers should be enrolled using close-talking segments,
0:10:31 a clean source.
0:10:33 Basically, when you have this mismatch of different reverberation between enrollment and test,
0:10:38 the reverberation doesn't help
0:10:41 when enrolling on it.
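For clarity, the relative degradation percentages quoted here and later in the talk compare the degraded condition against the reference condition for whichever error metric is being reported, i.e.

$$ \text{relative degradation} = \frac{E_{\text{degraded}} - E_{\text{reference}}}{E_{\text{reference}}} \times 100\% $$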
0:10:45 Next, the different background distracters.
0:10:48 We call them distracters because they're intended to distract the system from the true speech of the
0:10:52 speaker.
0:10:53 We had TV in the background,
0:10:55 or babble noise in the background.
0:10:57 When enrolling, we enrolled clean speech, no distracter;
0:11:01 but for verification we had three different types: no distracter,
0:11:04 TV, that is, TV noise, which sometimes includes speech,
0:11:08 and babble noise.
0:11:11 What we found was that the systems that were submitted were reasonably robust to the
0:11:14 effect of TV noise in the background.
0:11:16 However, with babble,
0:11:18 which includes speech around the true speaker,
0:11:21 we saw a forty-five to fifty percent relative degradation, so it's quite a
0:11:26 significant drop.
0:11:30 OK, now microphone type.
0:11:33 We had a studio mic placed close to the source for enrollment,
0:11:36 and then three different mic classes, lapel, MEMS, and boundary, for verification at different
0:11:41 positions.
0:11:42 We'll look at different distances on the next slide;
0:11:45 here we just want to look at how the different microphones behave.
0:11:49 Quite
0:11:51 consistently across systems,
0:11:53 there is a step down in performance going from boundary, to MEMS, to lapel microphones.
0:12:00 Moving on to different distances, we're just looking at the top five systems here,
0:12:04 to constrain the results we look at,
0:12:07 with the lapel mics placed at seven distances, for those top five teams.
0:12:12 Note that, due to obstructions and masking effects, the mics are not just at increasing distance
0:12:17 from the source, which makes these results a greater challenge to parse.
0:12:20 What was interesting is that the bars that really stand out, the
0:12:23 red, teal, and blue,
0:12:27 correspond to mics that tended to be partially obscured,
0:12:28 so some of them are actually hidden,
0:12:30 or very far from the speaker,
0:12:34 and those mics show really degraded performance as well.
0:12:39 This also tends to explain the poor performance of the lapel mics in general,
0:12:43 relative to the boundary and MEMS mics, that we saw on the previous slide.
0:12:48 And as a summary, we're now looking at the remaining challenges,
0:12:51 based on what we've seen so far from VOiCES publications and system submissions.
0:12:58 First, the reverberation characteristics:
0:13:01 reverberation was two to three times worse in the evaluation set than in the development set.
0:13:06 Now, this was quite a
0:13:09 gap between the level of reverberation in the evaluation rooms
0:13:12 and in development,
0:13:13 and it was quite clear, we found, that this
0:13:17 more severe amount of reverberation tended to degrade results compared to
0:13:22 development.
0:13:24 Current speaker recognition technology doesn't tend to address
0:13:27 the impact of reverberation sufficiently.
0:13:31 The error rates are a lot higher for the reverberant conditions than for the source signal,
0:13:35 reverberation in the presence of noise further degrades the performance,
0:13:39 and
0:13:40 increasing distance
0:13:42 amplifies the impact of reverberation and degrades performance.
0:13:46 So we need to explore novel speaker modeling techniques and architectures that are capable of
0:13:50 handling the long-term information
0:13:53 in utterances, dealing with the reverberation that can happen in this setting,
0:13:57 and try to make them robust to multiple noise conditions.
0:14:02 System calibration, as we've seen, is critical for systems deployed in the real world.
0:14:07 The bottom six teams failed to successfully calibrate their systems,
0:14:10 and previous work has shown that there is actually a large degradation in calibration performance when
0:14:15 the distance to the microphone
0:14:17 is significantly different between the calibration training conditions and the conditions it is applied to.
0:14:23 So one way that we might be able to mitigate this type of effect
0:14:27 is to have calibration methods that dynamically consider the conditions of the trial,
0:14:32 the predicted distance, for instance.
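As a hypothetical sketch of what such condition-aware calibration could look like (not a method used in the challenge), a side-information feature such as a predicted speaker-to-microphone distance can be added to a simple linear calibration backend; all data and names below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: a raw verification score and an estimated
# speaker-to-microphone distance per trial, plus target/non-target labels.
rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, n)                  # 1 = target trial
distance = rng.uniform(0.5, 6.0, n)             # assumed predicted metres
scores = 3.0 * labels - 1.5 - 0.3 * distance + rng.normal(0.0, 1.0, n)

# Linear calibration with distance as side information.
X = np.column_stack([scores, distance])
cal = LogisticRegression().fit(X, labels)

def calibrate(score, dist):
    # Calibrated log-odds (LLR-like) output for a single trial: the raw score
    # is shifted and scaled differently depending on the predicted distance.
    w_score, w_dist = cal.coef_[0]
    return w_score * score + w_dist * dist + cal.intercept_[0]
```

In deployment, the same trained mapping would be applied per trial using whatever distance (or other condition) estimate is available, which is one way of letting calibration react to the conditions of the trial.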
0:14:36 Note that the challenge was based on single-channel microphone data, but VOiCES was actually
0:14:40 collected with more than one microphone,
0:14:42 multiple microphones in the room,
0:14:44 and we haven't looked into the effect of, for instance, beamforming.
0:14:49 And there are a number of different front-end processing techniques
0:14:52 that we would like to look at,
0:14:53 including speech enhancement
0:14:55 and dereverberation tailored specifically for the task of speaker recognition.
0:15:01 So we hope you enjoy this special session at Odyssey this year, and that you
0:15:06 continue to drive technology forward in these areas,
0:15:09 and we look forward to seeing what comes out of it.
0:15:12 Thank you.