0:00:15 | okay the third talk we have is on the two thousand best speaker recognition interim |
---|
0:00:23 | assessment and crickets can present |
---|
0:00:42 | okay so we have |
---|
0:00:43 | you have all heard results from best and how important calibrated test data is |
---|
0:00:49 | so now i'm going to give you the background information to help you interpret what |
---|
0:00:54 | you heard previously |
---|
0:00:59 | a quick word on the best program, it stands for biometrics exploitation science and technology |
---|
0:01:06 | it was an iarpa program that launched in two thousand and nine with the objective of |
---|
0:01:10 | advancing the state-of-the-art in biometrics including speaker recognition, i won't say any more about it but if you |
---|
0:01:15 | want more information here's a link |
---|
0:01:20 | no best managed what we call the best the best |
---|
0:01:25 | which was the best evaluation speaker track for the best interim assessment, the objective here |
---|
0:01:32 | was to measure progress in speaker recognition relative to performance prior to the start of the program |
---|
0:01:38 | and also to measure performance on data new to speaker recognition evaluations |
---|
0:01:45 | a cell for those already which clusterise |
---|
0:01:51 | please |
---|
0:01:53 | no one's one to close there is |
---|
0:01:56 | well the yeah the thing i like to know is |
---|
0:01:58 | i should work people have not everyone is closer as you want have anonymity here |
---|
0:02:03 | who does not know what speaker detection is the users right |
---|
0:02:08 | okay |
---|
0:02:10 | how about a target speaker |
---|
0:02:12 | anyone not know what this is or non-target |
---|
0:02:14 | or test segment |
---|
0:02:16 | anyone not know what this is |
---|
0:02:18 | okay |
---|
0:02:20 | it's just a segment of speech data containing one or more unknown speakers |
---|
0:02:25 | and the last, anyone not know what mixer is |
---|
0:02:30 | okay this is a set of speech corpora collected by the ldc to support speaker recognition |
---|
0:02:38 | okay |
---|
0:02:39 | so the data used for the evaluation was very big |
---|
0:02:44 | very complex and bigger than anything nist has done in speaker recognition prior to that |
---|
0:02:50 | and that could over thousand speakers at two thousand audio segments and forty one million |
---|
0:02:55 | trials |
---|
0:02:57 | it made use of previous collections as well as newly collected mixer data which |
---|
0:03:02 | we just defined, that was |
---|
0:03:04 | including mixer one and two |
---|
0:03:06 | which had five different languages arabic english mandarin russian and spanish |
---|
0:03:11 | mixer five |
---|
0:03:13 | so mixer five was used in sre eight, for those who are familiar with the |
---|
0:03:17 | collection these are the mixer five speakers that were not used in sre eight |
---|
0:03:22 | mixer six likewise was used in sre ten, here we used the seventy or so speakers |
---|
0:03:27 | that were not used in sre ten |
---|
0:03:29 | we also used greybeard |
---|
0:03:31 | so people may recall we used greybeard in sre ten also but did not release the key and it |
---|
0:03:37 | was |
---|
0:03:38 | to be able to use it in this evaluation that we were encouraged not |
---|
0:03:42 | to release the key |
---|
0:03:43 | as well as mixer seven which was newly collected with the objective of addressing new |
---|
0:03:50 | sources of variability |
---|
0:03:52 | to analyze as part of the best evaluation |
---|
0:03:56 | and |
---|
0:03:57 | so there were nine core conditions meant |
---|
0:04:01 | to focus the research |
---|
0:04:04 | first telephone, where we train and test on telephone phone calls, this is like the common |
---|
0:04:09 | condition |
---|
0:04:10 | that sres have used maybe since the beginning |
---|
0:04:14 | microphone, train and test on far-field microphone, this is sort of the interview condition |
---|
0:04:22 | a channel condition where we train on microphone and test on telephone but limit our |
---|
0:04:27 | consideration to phone calls |
---|
0:04:29 | another microphone condition where we train on far or near field mics and test on |
---|
0:04:34 | telephone also limited to phone calls |
---|
0:04:37 | and a speaking style condition where we train on interview and test on phone calls, these restricted to |
---|
0:04:40 | the microphones |
---|
0:04:42 | a language condition where we train on multiple languages, train and test in those languages |
---|
0:04:48 | second one where we train and test on two languages of the single both microphones |
---|
0:04:52 | and phone calls |
---|
0:04:54 | oh |
---|
0:04:55 | and telephone |
---|
0:04:56 | a multisession train |
---|
0:04:59 | a second multisession train, the difference between these two is one was tested on phone calls |
---|
0:05:03 | the other tested on interviews |
---|
0:05:04 | and that's really |
---|
0:05:07 | so to look at |
---|
0:05:09 | factors affecting performance, we really wanted to focus the evaluation on |
---|
0:05:15 | measuring system robustness |
---|
0:05:17 | and also as we saw yesterday the factors |
---|
0:05:21 | were partitioned into three categories |
---|
0:05:24 | intrinsic extrinsic and parametric |
---|
0:05:28 | and these were all actually tested in the best evaluation |
---|
0:05:32 | speech style, whether there is an interview or phone call |
---|
0:05:35 | and vocal effort |
---|
0:05:36 | whether there's normal vocal effort, these recorded over a cell phone, low vocal effort |
---|
0:05:41 | or high vocal effort |
---|
0:05:43 | in terms of the vocal effort |
---|
0:05:45 | high vocal effort and low vocal effort were induced via headsets |
---|
0:05:49 | worn and collected for internal phone calls |
---|
0:05:52 | and high vocal effort the headset and the noise butlers i don't those of the |
---|
0:05:57 | intentional artifact, for low vocal effort there was not noise but a high sidetone encouraging |
---|
0:06:03 | the speaker to lower his or her voice |
---|
0:06:06 | an open question that remains is whether this is a realistic approach as well as what |
---|
0:06:12 | actually is produced |
---|
0:06:14 | regardless it seems the effects are interesting |
---|
0:06:18 | so we talked about intrinsic, in terms of extrinsic factors we had channel which is |
---|
0:06:23 | microphone versus telephone and different microphone types |
---|
0:06:26 | and in telephone we had different transmission types and handsets |
---|
0:06:33 | and |
---|
0:06:34 | something new and |
---|
0:06:37 | a relatively interesting was changing the distance between interviewer and subject |
---|
0:06:44 | to see if there was some additional vocal effort that could be elicited in that |
---|
0:06:48 | way |
---|
0:06:49 | reverberation, here we have reverberation that was artificially added but in reality there were also two different |
---|
0:06:54 | rooms one that was meant to be reverberant one that was meant to be not |
---|
0:06:58 | reverberant |
---|
0:06:59 | so there was both |
---|
0:07:01 | natural reverberation |
---|
0:07:03 | as well as added |
---|
0:07:05 | reverberation and additive noise |
---|
0:07:11 | in terms of additive noise, how we did this |
---|
0:07:14 | oh sorry in terms of reverb, how we did this was using a procedure that |
---|
0:07:17 | was proposed by mitre |
---|
0:07:18 | but actually implemented by mit lincoln labs |
---|
0:07:21 | and was i don't of the participants at transfer |
---|
0:07:27 | the method was to transform collected signals to have |
---|
0:07:32 | the reverberation qualities of particular rooms |
---|
0:07:36 | given a range of dimensions and surface conditions |
---|
0:07:39 | as george showed there were seven different reverberation conditions |
---|
0:07:43 | from point one six or so to one point three or so rt sixty |
---|
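The talk does not spell out the mitre procedure, so what follows is only a rough sketch of the general idea of imposing a chosen rt sixty on a dry signal. It convolves the signal with a synthetic impulse response: gaussian noise under an exponential envelope that falls by sixty db after rt60 seconds. The function names, the 8 kHz sample rate and the noise-based impulse response are all assumptions made for illustration, not the method used in the evaluation.

```python
import math
import random


def decay_envelope(rt60_s, sample_rate, n_samples):
    """Amplitude envelope that drops by 60 dB (a factor of 1000) after rt60_s seconds."""
    return [10.0 ** (-3.0 * (i / sample_rate) / rt60_s) for i in range(n_samples)]


def synthetic_impulse_response(rt60_s, sample_rate, seed=0):
    """Hypothetical room response: Gaussian noise shaped by the decay envelope."""
    rng = random.Random(seed)
    n_samples = int(rt60_s * sample_rate) + 1  # long enough to cover the 60 dB decay
    envelope = decay_envelope(rt60_s, sample_rate, n_samples)
    response = [rng.gauss(0.0, 1.0) * e for e in envelope]
    response[0] = 1.0  # direct path arrives first, at full amplitude
    return response


def convolve(signal, response):
    """Direct-form convolution; fine for short examples, slow for long audio."""
    out = [0.0] * (len(signal) + len(response) - 1)
    for i, s in enumerate(signal):
        for j, r in enumerate(response):
            out[i + j] += s * r
    return out


# Example with the smallest and largest rt sixty values mentioned in the talk.
SAMPLE_RATE = 8000  # assumed telephone-band sample rate
dry = [1.0] + [0.0] * 99  # a click standing in for a speech signal
wet_short = convolve(dry, synthetic_impulse_response(0.16, SAMPLE_RATE))
wet_long = convolve(dry, synthetic_impulse_response(1.3, SAMPLE_RATE))
```

A real room simulation would derive the response from room dimensions and surface absorption, as the talk describes; the exponential-noise model here only captures the decay time.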
0:07:50 | additive noise was also implemented by mit lincoln labs |
---|
0:07:54 | there were two noise types, one which is hvac which is heating ventilation and |
---|
0:08:02 | air conditioning |
---|
0:08:03 | so this is a sort of standard office room background noise |
---|
0:08:07 | as well as speech spectrum noise so this was |
---|
0:08:10 | a gaussian noise filtered to the spectrum of speech |
---|
0:08:16 | and there were two different noise levels, one fifteen db the other six db |
---|
0:08:22 | and these were |
---|
0:08:23 | c-message weighted |
---|
0:08:25 | five percent correctly |
---|
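As a rough illustration of mixing noise into speech at a chosen level, the sketch below scales a noise signal so that the speech-to-noise power ratio hits a target such as fifteen or six db. It is a simplification under stated assumptions: it uses a plain unweighted power ratio rather than c-message weighting, white gaussian noise rather than hvac or speech-shaped noise, and `add_noise_at_snr` is a made-up name, not the mit lincoln labs implementation.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in samples) / len(samples)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as the speech")
    noise = noise[: len(speech)]
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR from the clean and noisy signals, as a sanity check."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10.0 * math.log10(mean_power(speech) / mean_power(residual))


# Example: a 200 Hz tone standing in for speech, mixed at the two levels from the talk.
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 200.0 * t / 8000.0) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
mixed_15 = add_noise_at_snr(speech, noise, 15.0)
mixed_6 = add_noise_at_snr(speech, noise, 6.0)
```

Applying a weighting filter to both signals before measuring power would bring this closer to what the talk describes; that step is left out here.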
0:08:28 | so let's hear a few |
---|
0:08:34 | speech spectrum at fifteen db |
---|
0:08:53 | and at six db |
---|
0:09:08 | and now extracted sixteen |
---|
0:09:29 | okay |
---|
0:09:30 | yeah and in terms of parametric factors |
---|
0:09:33 | there were five different languages |
---|
0:09:36 | and there's also aging data from greybeard |
---|
0:09:39 | the same data as in sre ten so this is the reason why |
---|
0:09:42 | we were encouraged not to release the key |
---|
0:09:45 | and there were multiple training sessions, something that was new to best, in the |
---|
0:09:50 | past we had multiple training sessions that were phone calls but in this case |
---|
0:09:55 | we had multiple training sessions over interviews, some where it's the same speech over a microphone, some with |
---|
0:10:00 | maybe one or two microphones but different speech |
---|
0:10:12 | so something we've seen a few times |
---|
0:10:15 | but maybe i'll explain some, the primary metric was different for best, originally conceived |
---|
0:10:21 | in |
---|
0:10:21 | nineteen ninety six |
---|
0:10:24 | it took us |
---|
0:10:26 | something like sixteen years to implement |
---|
0:10:29 | the false alarm rate at the corresponding miss rate of ten percent |
---|
0:10:35 | one of the advantages is that it is simple and clearly defined |
---|
0:10:40 | and the false alarm rate may be viewed as representing the cost of the wasted |
---|
0:10:44 | listening effort incurred by using the system at a specified miss rate |
---|
0:10:49 | in contrast to equal error rate it does focus on the low false alarm region |
---|
0:10:53 | which is likely to be of interest to a number of applications |
---|
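To make the primary metric concrete, here is a minimal sketch of reading off the false alarm rate at a fixed ten percent miss rate from two lists of scores. The function name and the toy scores are invented for illustration, and the tie handling is one possible choice; this is not the official scoring software.

```python
import math


def false_alarm_rate_at_miss_rate(target_scores, nontarget_scores, miss_rate=0.10):
    """False alarm rate at the threshold that rejects miss_rate of the target trials.

    A trial is accepted when its score is at or above the threshold.
    """
    targets = sorted(target_scores)
    # Number of target trials allowed to fall below the threshold.
    n_missed = math.floor(miss_rate * len(targets) + 1e-9)
    if n_missed >= len(targets):
        raise ValueError("miss_rate must leave at least one accepted target trial")
    # Lowest accepted target score; with tied scores the realised miss rate
    # can come out slightly below the requested one.
    threshold = targets[n_missed]
    false_alarms = sum(1 for s in nontarget_scores if s >= threshold)
    return false_alarms / len(nontarget_scores)


# Toy example: ten target trials, so one miss corresponds to ten percent.
target_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
nontarget_scores = [-3.0, -2.0, -1.0, 0.0, 0.5, 1.5, 2.5, 3.5]
rate = false_alarm_rate_at_miss_rate(target_scores, nontarget_scores)
# Threshold is 2.0, so two of the eight non-target trials are false alarms.
```

When no non-target trial scores above the threshold the function returns zero, which is the situation where the metric cannot distinguish between systems any further.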
0:10:57 | and we can use the rule of thirty to determine that with a ten percent miss rate |
---|
0:11:02 | we'd only need three hundred target trials or so to get the required miss |
---|
0:11:07 | rate |
---|
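The rule of thirty is usually stated as: to be about ninety percent confident that the true error rate is within thirty percent of the observed one, you need to observe at least thirty errors. The arithmetic behind the trial count can be sketched as below; `trials_needed` is a made-up helper name.

```python
import math


def trials_needed(error_rate, min_errors=30):
    """Smallest number of trials expected to yield min_errors errors at error_rate."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    # Round first so floating-point noise does not push the result up by one.
    return math.ceil(round(min_errors / error_rate, 9))


# At a ten percent miss rate, thirty misses are expected from 300 target trials.
target_trials_at_10_percent = trials_needed(0.10)
# At a one percent miss rate the requirement grows tenfold.
target_trials_at_1_percent = trials_needed(0.01)
```

This is why fixing the miss rate at ten percent keeps the required number of target trials modest, while most of the trial budget can go to non-target trials in the low false alarm region.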
0:11:13 | so let's look at just a few general performance trends i should note here that |
---|
0:11:16 | we are sharing general trends there were some things that were system specific but what |
---|
0:11:23 | we're trying to share here |
---|
0:11:24 | are things that were common across the systems that were submitted for this |
---|
0:11:28 | evaluation |
---|
0:11:32 | so in terms of language |
---|
0:11:35 | what we see is the green lines are a baseline system so this is a system |
---|
0:11:42 | that was |
---|
0:11:43 | meant to be the state-of-the-art prior to the start of the evaluation |
---|
0:11:49 | and the blue is a system submitted |
---|
0:11:52 | for the best evaluation |
---|
0:11:54 | the solid lines |
---|
0:11:57 | are english |
---|
0:11:58 | dashed lines are spanish |
---|
0:12:01 | two things to see, the system submitted for the |
---|
0:12:05 | best evaluation shows better performance than the baseline across the range of operating points |
---|
0:12:11 | also performance on english and spanish data was comparable |
---|
0:12:16 | so not a big language effect |
---|
0:12:19 | so this is meant to help |
---|
0:12:21 | calibrate people's eyes |
---|
0:12:24 | to the new metric |
---|
0:12:27 | so here we see |
---|
0:12:30 | the false alarm rate at ten percent miss rate |
---|
0:12:33 | somewhere around twenty point eight |
---|
0:12:37 | percent |
---|
0:12:39 | the second one at around point one percent |
---|
0:12:43 | and the third and fourth |
---|
0:12:49 | are both at about point five percent |
---|
0:12:54 | here's another look at language, this is for mixer one and two |
---|
0:12:58 | a single system |
---|
0:13:00 | and the system performed mostly better on english and spanish than on the other languages |
---|
0:13:07 | in fact some systems performed better on spanish than english, why this would be is sort |
---|
0:13:12 | of a mystery right now |
---|
0:13:14 | and another chance to |
---|
0:13:17 | the chance to calibrate your eyes to the new metric |
---|
0:13:19 | so we can go across at ten percent |
---|
0:13:21 | the intersection is the primary metric for best |
---|
0:13:27 | so for speaking style |
---|
0:13:29 | we're looking at one system's performance again, one line is train and test on interview the |
---|
0:13:35 | green line is train on interview test on phone call the red is train on phone call test |
---|
0:13:39 | on phone call |
---|
0:13:41 | and interview train and test gives best performance but there's a confounding factor here interview |
---|
0:13:49 | segments were actually longer |
---|
0:13:51 | the test segments, so this could explain why they performed better |
---|
0:13:58 | so we did a test in sre |
---|
0:14:01 | ten on vocal effort and found something somewhat surprising, that low vocal effort performed better |
---|
0:14:07 | than normal or high vocal effort, we expected high vocal effort to perform worse but we also |
---|
0:14:11 | expected low vocal effort to perform worse than normal vocal effort |
---|
0:14:16 | similarly in the best evaluation high vocal effort stands out across systems as hard |
---|
0:14:21 | and low vocal effort stands out across systems as easy |
---|
0:14:27 | and this is consistent with, similar to what we saw in sre ten |
---|
0:14:33 | so there were i think seven reverb conditions plus no reverb |
---|
0:14:40 | so these vary widely in terms of how much reverb was applied for those who |
---|
0:14:48 | are not |
---|
0:14:50 | able to immediately imagine what something would sound like based on an rt sixty let |
---|
0:14:55 | me play a couple examples |
---|
0:15:00 | so this was the least reverberant |
---|
0:15:03 | condition |
---|
0:15:05 | very well only |
---|
0:15:17 | and the most reverberant condition |
---|
0:15:33 | and the thing to notice on this plot is the degradation corresponds to rt sixty time and |
---|
0:15:39 | that despite the mismatch, that is that the train is on non-noisy |
---|
0:15:45 | non-reverb speech and the test is on reverb speech, despite this mismatch performance |
---|
0:15:50 | seemed to be better with a small amount of reverb |
---|
0:15:54 | than with no reverb at all |
---|
0:15:56 | which was also somewhat surprising |
---|
0:16:00 | here is additive noise, again no noise in train but noise in test |
---|
0:16:06 | and relative to testing without noise fifteen db noise had little effect and six db |
---|
0:16:13 | more |
---|
0:16:14 | one thing that was kind of interesting to us |
---|
0:16:17 | was that if you look at the |
---|
0:16:21 | for example the red and the dark blue line or the cyan and the |
---|
0:16:26 | green line, there's not a lot of difference |
---|
0:16:29 | and this is a surprise, we expected speech spectrum noise to be more difficult to |
---|
0:16:32 | deal with than hvac noise |
---|
0:16:35 | but as you can see there really was not a great difference between the |
---|
0:16:39 | two spectra |
---|
0:16:43 | one other point i'd like to make, there has been a suggestion that maybe one |
---|
0:16:48 | percent is more appropriate under certain circumstances than ten percent, in this condition performance |
---|
0:16:53 | was |
---|
0:16:55 | good enough that we were not able to measure |
---|
0:16:58 | the primary metric because it never reached a ten percent miss rate |
---|
0:17:05 | so in summary, this was the largest and most complex by a fair amount nist speaker recognition |
---|
0:17:12 | evaluation to date, we examined several factors affecting performance, observed several surprising results some of which were |
---|
0:17:22 | re-creations of the results in sre ten, and there were new conditions for example additive |
---|
0:17:26 | noise and reverberation that also provided some surprises, which brings us to the |
---|
0:17:31 | use of synthetic data, this is very exciting to us because it's |
---|
0:17:35 | oftentimes difficult and expensive to collect data and |
---|
0:17:39 | difficult to control |
---|
0:17:41 | so if synthetic data turns out to be a reasonable way to evaluate systems in |
---|
0:17:47 | the future this would be a very exciting and useful discovery |
---|
0:17:53 | and finally there was improvement observed over the baseline in all conditions |
---|
0:17:58 | so participants should be pleased with that |
---|
0:18:02 | thank you |
---|
0:18:17 | okay |
---|
0:18:32 | yeah i have to do not sure |
---|
0:19:18 | yeah |
---|
0:20:14 | yes |
---|
0:20:17 | yes so that was not explored in this evaluation but this is a fruitful area |
---|
0:20:21 | for future |
---|
0:20:22 | exploration |
---|
0:20:31 | sure |
---|
0:20:39 | yes cost |
---|
0:20:48 | yes |
---|
0:20:56 | yes |
---|
0:20:58 | so i should probably admit at this point that we have not done deeper analysis |
---|
0:21:05 | on basically any of this |
---|
0:21:09 | it was a large and complex evaluation so in order to be able to handle |
---|
0:21:13 | the main points |
---|
0:21:15 | we had to do what are tied to read from the best |
---|
0:21:18 | i actually would like to explore this more |
---|
0:21:28 | yes |
---|
0:21:33 | yes and i guess i should point that out, not just in the no noise |
---|
0:21:36 | but across all the reverbs we have this active speech just with different |
---|
0:21:43 | yes |
---|
0:21:46 | i |
---|
0:21:49 | i think it was the same trend |
---|
0:21:52 | which was surprising but |
---|
0:21:55 | maybe people can offer some |
---|
0:21:59 | explanation or some intuition as to why this might be |
---|
0:22:54 | interested look |
---|
0:22:58 | yes |
---|
0:23:01 | i see, so just to make sure i understand what you're saying |
---|
0:23:04 | that the target scores were too widely distributed you don't are simply sums the thought |
---|
0:23:21 | yes |
---|
0:23:31 | okay |
---|