0:00:15okay the third talk we have is on the two thousand BEST speaker recognition interim
0:00:23assessment and Craig is going to present
0:00:42okay so we have
0:00:43heard a lot of results from BEST and how important it is to have calibrated test data
0:00:49so now i'm going to give you the background information to help you interpret what
0:00:54you heard previously
0:00:59a quick word on the BEST program it stands for biometrics exploitation science and technology
0:01:06it was an IARPA program that launched in two thousand and nine with the objective of
0:01:10advancing the state-of-the-art in biometrics including speaker recognition i won't say any more about it but if you
0:01:15want more information here's a link
0:01:20NIST managed what we call the BEST
0:01:25which was the BEST evaluation speaker track for the BEST interim assessment the objective here
0:01:32was to measure progress in speaker recognition relative to performance prior to the start of the program
0:01:38and also to measure performance on data new to speaker recognition evaluations
0:01:45i'll ask those of you here to close your eyes
0:01:51please
0:01:53no one wants to close their eyes
0:01:56well the yeah the thing i'd like to know is
0:01:58i should warn people not everyone has their eyes closed so you won't have anonymity here
0:02:03who does not know what speaker detection is raise your hand
0:02:08okay
0:02:10how about a target speaker
0:02:12anyone not know what this is or non-target
0:02:14or test segment
0:02:16anyone not know what this is
0:02:18okay
0:02:20it's just a segment of speech data containing one or more unknown speakers
0:02:25and the last anyone not know what Mixer is
0:02:30okay this is a set of speech corpora collected by the LDC to support speaker recognition
0:02:38okay
0:02:39so the data used for the evaluation was very big
0:02:44very complex bigger than anything NIST has done in speaker recognition prior to that
0:02:50and it had over a thousand speakers at two thousand audio segments and forty one million
0:02:55trials
0:02:57it made use of previous SRE collections as well as newly collected Mixer data which
0:03:02we just went over
0:03:04including Mixer one and two
0:03:06which had five different languages Arabic English Mandarin Russian and Spanish
0:03:11Mixer five
0:03:13so Mixer five was used in SRE eight for those who are familiar with the
0:03:17collection these are the Mixer five speakers that were not used in SRE eight
0:03:22Mixer six likewise was used in SRE ten so we used the speakers
0:03:27that were not used in SRE ten
0:03:29also used Greybeard
0:03:31so people may recall we used Greybeard in SRE ten also but did not release the key and it
0:03:37was
0:03:38to be able to use it in this evaluation that we were encouraged not
0:03:42to release the key
0:03:43as well as Mixer seven which was newly collected with the objective of addressing new
0:03:50sources of variability
0:03:52to analyze as part of the BEST evaluation
0:03:56and
0:03:57so i'll run over the conditions there were nine core conditions meant
0:04:01to focus the research
0:04:04first telephone where we train and test on telephone phone calls this is like the common
0:04:09condition
0:04:10that NIST has used for maybe since the beginning
0:04:14microphone train and test on far-field microphone this is sort of the interview condition
0:04:22a channel condition where we train on microphone and test on telephone but limit our
0:04:27consideration to phone calls
0:04:29another microphone condition where we train on far or near field mics and test on
0:04:34telephone also limited to phone calls
0:04:37and a speaking style condition where we train on interview and test on phone calls this is restricted
0:04:40to the microphones
0:04:42a language condition where we train on multiple languages and test on those languages
0:04:48a second one where we train and test on two languages with both microphones
0:04:52and phone calls
0:04:54oh
0:04:55and telephone
0:04:56a multisession train
0:04:59a second multisession train the difference between these two is one was tested on phone call
0:05:03the other tested on interview
0:05:04and that's really it
0:05:07so to look at
0:05:09factors affecting performance we really wanted to focus the evaluation on achieving and
0:05:15measuring system robustness
0:05:17and as we saw yesterday the
0:05:21factors were partitioned into three categories
0:05:24intrinsic extrinsic and parametric
0:05:28and these were all actually tested in the BEST evaluation
0:05:32speech style whether it is an interview or phone call
0:05:35and vocal effort
0:05:36whether it is normal vocal effort these recorded over a cell phone low vocal effort
0:05:41or high vocal effort
0:05:43in terms of the vocal effort
0:05:45high vocal effort and low vocal effort were induced via headsets
0:05:49and collected for internal phone calls
0:05:52for high vocal effort the headset had noise that was the
0:05:57intentional artifact for low vocal effort there was no noise but a high sidetone encouraging
0:06:03the speaker to lower his or her voice
0:06:06another question that remains is whether this is a realistic approach whether that's what
0:06:12actually is produced
0:06:14regardless it seems the effects are interesting
0:06:18so we talked about intrinsic in terms of extrinsic factors we had channel which is
0:06:23microphone versus telephone and different microphone types
0:06:26and for telephone we had different transmission types and handsets
0:06:33and
0:06:34something new and
0:06:37a relatively interesting was changing the distance between interviewer and subject
0:06:44to see if there was some additional vocal effort that could be elicited in that
0:06:48way
0:06:49reverberation here we have that was artificially added but in reality there were two different
0:06:54rooms one that was meant to be reverberant one that was meant to be not
0:06:58reverberant
0:06:59so there was both
0:07:01natural reverberation
0:07:03as well as added
0:07:05reverberation and additive noise
0:07:11in terms of additive noise how we did this
0:07:14oh sorry in terms of reverb how we did this was using a procedure that
0:07:17was proposed by MITRE
0:07:18but actually implemented by MIT Lincoln Labs
0:07:21and was i don't of the participants at transfer
0:07:27the method was to transform collected signals to have
0:07:32the reverberation qualities of particular rooms
0:07:36given a range of dimensions and surface conditions
0:07:39as George showed there were seven different reverberation conditions
0:07:43from point one six or so to one point three or so RT sixty
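The idea of giving a clean recording the reverberation qualities of a room can be sketched as convolving it with a room impulse response. The block below is only an illustration under stated assumptions: it is not the MITRE / MIT Lincoln Labs procedure, the impulse response is a synthetic exponentially decaying noise tail rather than a simulated room, and the function names and the 8 kHz sample rate are hypothetical.

```python
import math
import random


def synthetic_rir(rt60_s, sample_rate=8000, seed=0):
    """Toy room impulse response: Gaussian noise under an exponential envelope.

    The envelope is chosen so the amplitude falls by 60 dB (a factor of 1000)
    after rt60_s seconds, which is the definition of RT60.
    """
    rng = random.Random(seed)
    n = max(1, round(rt60_s * sample_rate))
    decay = math.log(1000.0) / rt60_s  # amplitude decay constant, 1/seconds
    rir = [rng.gauss(0.0, 1.0) * math.exp(-decay * i / sample_rate)
           for i in range(n)]
    rir[0] = 1.0  # direct path (assumption: unit gain, zero delay)
    return rir


def apply_reverb(signal, rir):
    """Direct-form linear convolution of the signal with the impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


# The least and most reverberant ends of the range mentioned in the talk.
short_tail = synthetic_rir(0.16)
long_tail = synthetic_rir(1.3)
```

A longer RT60 simply produces a longer tail, so `long_tail` smears each sample of the input over roughly eight times as many output samples as `short_tail`.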
0:07:50additive noise was also implemented by MIT Lincoln Labs
0:07:54there were two noise types one of which is HVAC which is heating ventilation and
0:08:02air conditioning
0:08:03so this is a sort of standard office room background noise
0:08:07as well as speech spectrum noise so this was
0:08:10gaussian noise filtered to the spectrum of speech
0:08:16and there were two different noise levels one fifteen dB the other six dB
0:08:22and these were
0:08:23C-message weighted
0:08:25if i recall correctly
0:08:28so let's hear a few
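Mixing noise into speech at a fixed level, such as the fifteen dB and six dB conditions, can be sketched as scaling the noise so that the speech-to-noise power ratio hits the target. This is a minimal sketch with several assumptions: the SNR here is plain broadband power with no C-message weighting, the "speech-shaped" noise is approximated by a one-pole low-pass filter on Gaussian noise, and all function names are hypothetical rather than taken from the evaluation tooling.

```python
import math
import random


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def shaped_gaussian_noise(n, pole=0.9, seed=0):
    """Gaussian noise through a one-pole low-pass filter.

    A crude stand-in for noise filtered to the long-term spectrum of speech.
    """
    rng = random.Random(seed)
    out, prev = [], 0.0
    for _ in range(n):
        prev = pole * prev + rng.gauss(0.0, 1.0)
        out.append(prev)
    return out


def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power equals snr_db, then add."""
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy "speech": a 200 Hz tone sampled at 8 kHz (assumption, for illustration).
speech = [math.sin(2.0 * math.pi * 200.0 * t / 8000.0) for t in range(800)]
noise = shaped_gaussian_noise(len(speech))
noisy_15db = add_noise_at_snr(speech, noise, 15.0)
noisy_6db = add_noise_at_snr(speech, noise, 6.0)
```

A weighted SNR would apply the weighting filter to both signals before measuring power; that step is left out here.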
0:08:34speech spectrum fifteen dB
0:08:53and the six dB
0:09:08and now extracted sixteen
0:09:29okay
0:09:30yeah and in terms of parametric factors
0:09:33there were five different languages
0:09:36and there's also aging data from Greybeard
0:09:39the same data as in SRE ten so this is the reason why
0:09:42we were encouraged not to release the key
0:09:45and there were multiple training sessions something that was new to BEST in the
0:09:50past we've had multiple training sessions that were phone calls but in this case
0:09:55we had multiple training sessions over interviews sometimes the same speech over different microphones sometimes
0:10:00maybe one or two microphones but different speech
0:10:12so something we've seen a few times
0:10:15but maybe i'll explain some is the primary metric which was different for BEST it was originally conceived
0:10:21in
0:10:21nineteen ninety six
0:10:24and took
0:10:26something like sixteen years to implement
0:10:29it is the false alarm rate at the corresponding miss rate of ten percent
0:10:35one of the advantages is that it is simple and clearly defined
0:10:40and the false alarm rate may be viewed as representing the cost of the wasted
0:10:44listening effort incurred by using the system at a specified miss rate
0:10:49in contrast to equal error rate it does focus on the low false alarm region
0:10:53which is likely to be of interest to a number of applications
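The metric described here, the false alarm rate at a fixed ten percent miss rate, can be sketched from two lists of trial scores. This is a minimal sketch, assuming higher scores mean "more likely the target speaker" and that the threshold is placed on an observed target score with no interpolation; the official scoring may differ, and the function name is hypothetical.

```python
def false_alarm_rate_at_miss(target_scores, nontarget_scores, miss_rate=0.10):
    """Return (false alarm rate, threshold) at the given miss rate.

    The threshold is the lowest value that rejects at most miss_rate of the
    target trials; a trial is accepted when its score is >= threshold.
    """
    targets = sorted(target_scores)
    # Number of lowest-scoring target trials we are allowed to reject.
    n_miss = min(int(miss_rate * len(targets) + 1e-9), len(targets) - 1)
    threshold = targets[n_miss]
    false_alarms = sum(1 for s in nontarget_scores if s >= threshold)
    return false_alarms / len(nontarget_scores), threshold


# Toy scores (made up for illustration, not evaluation data).
toy_targets = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_nontargets = [0, 0.5, 1.5, 2.5]
fa_rate, threshold = false_alarm_rate_at_miss(toy_targets, toy_nontargets)
```

With ten target trials, a ten percent miss rate allows exactly one miss, so the threshold lands on the second-lowest target score and only non-target scores at or above it count as false alarms.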
0:10:57and we can use the rule of thirty to determine
0:11:02that we'd only need three hundred target trials or so to get the required miss
0:11:07rate
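The rule of thirty is usually stated as: observe at least thirty errors to have reasonable confidence (roughly ninety percent) that the true error rate is within about thirty percent of the measured one. Under that reading, the trial count follows from a single division. The helper below is a hypothetical sketch of that arithmetic, not part of any evaluation toolkit.

```python
import math
from fractions import Fraction


def min_trials_rule_of_30(error_rate, min_errors=30):
    """Smallest number of trials expected to yield min_errors at error_rate.

    error_rate is passed as a string or number; Fraction(str(...)) keeps the
    division exact so that 30 / 0.1 does not pick up floating point error.
    """
    rate = Fraction(str(error_rate))
    return math.ceil(Fraction(min_errors) / rate)


# At a ten percent miss rate, thirty misses need about this many target trials.
target_trials_needed = min_trials_rule_of_30("0.10")
```

The same calculation at a one percent miss rate gives ten times as many target trials, which is one reason a ten percent operating point is cheaper to measure reliably.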
0:11:13so let's look at just a few general performance trends i should note here that
0:11:16we are sharing general trends there were some things that were system specific but what
0:11:23we're trying to share here
0:11:24are things that were common across the systems that were submitted for this
0:11:28evaluation
0:11:32so in terms of language
0:11:35what we see is the green lines are the baseline system so this is a system
0:11:42that was
0:11:43meant to be the state-of-the-art prior to the start of the evaluation
0:11:49and the blue is a system submitted
0:11:52for the BEST evaluation
0:11:54the solid lines
0:11:57are English
0:11:58dashed lines are Spanish
0:12:01two things to see the system submitted for the
0:12:05BEST evaluation shows better performance than the baseline across the range of operating points
0:12:11also performance on English and Spanish data was comparable
0:12:16so not a big language effect
0:12:19so this is meant to help
0:12:21calibrate people's eyes
0:12:24to the new metric
0:12:27so here we see
0:12:30the false alarm rate at ten percent miss rate
0:12:33somewhere around twenty point eight
0:12:37percent
0:12:39the second one is around point one percent
0:12:43and the third and fourth
0:12:49are both about point five percent
0:12:54here's another look at language this is for Mixer one and two
0:12:58a single system
0:13:00and the systems performed mostly better on English and Spanish than on the other languages
0:13:07in fact some systems performed better on Spanish than English why this would be is sort
0:13:12of a mystery right now
0:13:14and another chance to
0:13:17calibrate your eyes to the new metric
0:13:19so we go across at ten percent
0:13:21the intersection is the primary metric for BEST
0:13:27so for speaking style
0:13:29we're looking at one system's performance again one line is train and test on interview the
0:13:35green line is train on interview test on phone call the red is train on phone call test
0:13:39on phone call
0:13:41and interview train and test gives best performance but there's a confounding factor here interview
0:13:49segments were actually longer
0:13:51the test segments so this could explain why they perform better
0:13:58so we did a test in SRE
0:14:01ten on vocal effort and found something somewhat surprising that low vocal effort performed better
0:14:07than normal and high vocal effort we expected high vocal effort to perform worse but we also
0:14:11expected low vocal effort to perform worse than normal vocal effort
0:14:16similarly in the BEST evaluation high vocal effort stands out across systems as hard
0:14:21and low vocal effort stands out across systems as easy
0:14:27and this is consistent with similar to what we saw in SRE ten
0:14:33so there were i think seven reverb conditions plus no reverb
0:14:40so these vary widely in terms of how much reverb was applied for those who
0:14:48are not
0:14:50able to immediately imagine what something would sound like based on an RT sixty let
0:14:55me play a couple examples
0:15:00so this was the least reverberant
0:15:03condition
0:15:05very well only
0:15:17and the most reverberant condition
0:15:33and the thing to notice on this plot is the degradation corresponds to RT sixty time and
0:15:39that despite the mismatch that is that the train is on non-noisy
0:15:45non-reverb speech and the test is on reverb speech despite this mismatch performance
0:15:50seemed to be better with a small amount of reverb
0:15:54than with no reverb at all
0:15:56which was also somewhat surprising
0:16:00here is additive noise again no noise in train but noise in test
0:16:06and relative to testing without noise fifteen dB noise had little effect and six dB
0:16:13a bit more
0:16:14one thing that was kind of interesting to us
0:16:17was that if you look at the
0:16:21for example the red and the dark blue line or the cyan and the
0:16:26green line there's not a lot of difference
0:16:29and this was a surprise we expected speech spectrum noise to be more difficult to
0:16:32deal with than HVAC noise
0:16:35but as you can see there really was not a great deal of difference between the
0:16:39two spectra
0:16:43one other point i think to make there has been a suggestion that maybe one
0:16:48percent is more appropriate under certain circumstances than ten percent and in this condition performance
0:16:53was
0:16:55good enough that we were not able to measure
0:16:58the primary metric because it never reached a ten percent miss rate
0:17:05so in summary this was the largest and most complex by a fair amount NIST speaker recognition
0:17:12evaluation to date we examined several factors affecting performance and observed several surprising results some of which were
0:17:22a recreation of the results in SRE ten but there were new conditions for example additive
0:17:26noise and reverberation that also provided some surprises which i mentioned earlier
0:17:31as to the use of synthetic data this is very exciting to us because it's
0:17:35oftentimes difficult and expensive to collect data and
0:17:39difficult to control
0:17:41so if synthetic data turns out to be a reasonable way to evaluate systems in
0:17:47the future this would be a very exciting and useful discovery
0:17:53and finally there was improvement observed over the baseline on all conditions
0:17:58so participants should be pleased with that
0:18:02thank you
0:18:06i
0:18:12i
0:18:17okay
0:18:19i
0:18:20oh
0:18:22oh
0:18:24i
0:18:32yeah i have to do not sure
0:18:37oh
0:18:45i
0:18:47oh
0:18:54i
0:19:00i
0:19:03i
0:19:18yeah
0:20:14yes
0:20:17yes so that was not explored in this evaluation but this is a fruitful area
0:20:21for future
0:20:22exploration
0:20:31sure
0:20:38i
0:20:39yes cost
0:20:43i
0:20:48yes
0:20:56yes
0:20:58so i should probably admit at this point that we have not done deeper analysis
0:21:05on basically any of this
0:21:09it was a large and complex evaluation so in order to be able to handle
0:21:13the main points
0:21:15we had to do what are tied to read from the best
0:21:18i actually would like to explore this more
0:21:24i
0:21:26i
0:21:28yes
0:21:33yes and i guess i should point that out not just in the no noise
0:21:36but across all the reverbs we have this active speech just with different
0:21:43yes
0:21:46i
0:21:49i think it was the same trend
0:21:52which was surprising but
0:21:55maybe people can offer some
0:21:59explanation or some intuition as to why this might be
0:22:54interested look
0:22:57i
0:22:58yes
0:23:00i
0:23:01i see so just to make sure i understand what you're saying
0:23:04that the target scores were too widely distributed you don't are simply sums the thought
0:23:18oh
0:23:21yes
0:23:23i
0:23:31okay