0:00:15 | well everyone today i'm going to talk about the effects of the new testing paradigm |
---|
0:00:19 | on the nist sre twelve |
---|
0:00:21 | this work was done in collaboration with many colleagues |
---|
0:00:24 | including alvin vince john george and jack |
---|
0:00:30 | so before talking about what changed in sre twelve let's just remind ourselves of some things that |
---|
0:00:34 | stayed the same |
---|
0:00:36 | the task in sre twelve was text independent speaker detection |
---|
0:00:40 | by speaker detection i mean |
---|
0:00:43 | given some speech from a target speaker and some speech from a test speaker |
---|
0:00:47 | determine whether the target speaker and the test speaker are the same person |
---|
0:00:51 | the evaluation consisted of a long series of trials where a target trial is when the target |
---|
0:00:57 | speaker and the test speaker were the same and a non-target trial is where the target speaker and the test |
---|
0:01:02 | speaker were different so that much was the same |
---|
0:01:05 | something that changed in sre twelve was that joint knowledge of target speakers was allowed |
---|
0:01:09 | so in the past each trial had to be processed independently from one another |
---|
0:01:14 | but in sre twelve |
---|
0:01:16 | it was permissible to use knowledge of other target speakers for a trial and this gave |
---|
0:01:23 | rise to a distinction in the non-target speakers |
---|
0:01:26 | and namely whether they were among the target speakers in which case they were considered |
---|
0:01:30 | a known non-target speaker |
---|
0:01:32 | or if they were not among the target speakers they were considered an |
---|
0:01:37 | unknown non-target speaker |
---|
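editor's illustration, not part of the talk: a minimal python sketch of the three trial labels just described, assuming the true speaker of each test segment is available in an answer key. the names `Trial`, `test_speaker`, and `label_trial` and the speaker ids are hypothetical and are not taken from the evaluation plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    target_speaker: str  # speaker whose training data defines the trial
    test_speaker: str    # true speaker of the test segment (hypothetical answer-key field)


def label_trial(trial: Trial, target_speakers: set[str]) -> str:
    """Label a trial as target, known non-target, or unknown non-target."""
    if trial.test_speaker == trial.target_speaker:
        return "target"
    # A non-target speaker who is also one of the evaluation's target speakers is "known".
    if trial.test_speaker in target_speakers:
        return "known non-target"
    return "unknown non-target"


# Made-up speaker ids, for illustration only.
targets = {"spk_a", "spk_b"}
labels = [
    label_trial(Trial("spk_a", "spk_a"), targets),
    label_trial(Trial("spk_a", "spk_b"), targets),
    label_trial(Trial("spk_a", "spk_z"), targets),
]
print(labels)
```

the only design choice here is that "known" is decided by membership in the full set of target speakers, which is how the talk describes the distinction.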
0:01:41 | also there was more and more varied training data for each target speaker and |
---|
0:01:45 | the majority of the target speakers in the evaluation |
---|
0:01:47 | had more than one segment for training |
---|
0:01:51 | and in those cases often the training data itself was varied |
---|
0:01:57 | consisting of for example interviews recorded over a microphone phone calls recorded over a microphone |
---|
0:02:02 | or phone calls recorded over a telephone channel |
---|
0:02:09 | most of the target speakers in sre twelve were used |
---|
0:02:14 | in prior evaluations which is something that was very different and they were identified in |
---|
0:02:18 | advance and all their speech from these prior evaluations was made available |
---|
0:02:25 | of those eighteen hundred new data was collected from three hundred and twenty |
---|
0:02:30 | speakers roughly and roughly seventy of them were not present in the prior evaluations and |
---|
0:02:35 | those speakers had a single phone conversation released at the time of the evaluation |
---|
0:02:43 | so given these changes one question that may come up is why |
---|
0:02:47 | and there were several reasons among them were to explore methods utilizing large quantities of |
---|
0:02:53 | training data |
---|
0:02:55 | to allow participants an extended period of time to work on modeling techniques |
---|
0:03:00 | to determine the benefit of allowing joint knowledge of target speakers in particular the benefit |
---|
0:03:05 | for performance and also to increase the efficiency of the data collection |
---|
0:03:13 | in terms of the data the target speaker training data broke down into two cases if |
---|
0:03:17 | released in advance of the evaluation the target speaker training data consisted of prior evaluation |
---|
0:03:24 | data collected as part of the ldc mixer one three and six corpora and if released |
---|
0:03:29 | at the start of the evaluation |
---|
0:03:31 | the training data was a single phone conversation recorded as part of mixer seven |
---|
0:03:37 | for the test segments most of them came from a newly collected corpus called remix |
---|
0:03:41 | which were phone calls recorded over a telephone channel |
---|
0:03:45 | from prior mixer speakers |
---|
0:03:48 | and there were also a smaller number of phone conversations from the mixer seven corpus |
---|
0:03:54 | and these are phone conversations recorded either over a telephone channel or a microphone channel |
---|
0:04:02 | so there were many different types of trials included in the evaluation for example there |
---|
0:04:09 | were trials where the speech had |
---|
0:04:14 | noise added to it or it was recorded in a naturally noisy environment |
---|
0:04:20 | but among the trials we wanted to emphasise some subsets of particular interest to us so |
---|
0:04:26 | these we call common conditions there were five common conditions in the evaluation for today's presentation |
---|
0:04:32 | we're just going to focus on two |
---|
0:04:34 | and those are interview speech in test without added noise and telephone channel speech in |
---|
0:04:40 | test without added noise |
---|
0:04:45 | so here we see in very round numbers the number of trials for each of |
---|
0:04:48 | these common conditions |
---|
0:04:50 | for common condition one again interview test with no added noise there were roughly three thousand |
---|
0:04:57 | target trials forty six thousand non-target trials from known non-target speakers sdc two thousand trials |
---|
0:05:04 | from unknown non-target speakers |
---|
0:05:08 | in the core test which was required of all participants an optional test was essentially |
---|
0:05:14 | the same but just with a larger number of trials |
---|
0:05:21 | and you see the numbers there likewise |
---|
0:05:27 | target |
---|
0:05:28 | non-target |
---|
0:05:29 | speakers |
---|
0:05:30 | non-targets |
---|
0:05:34 | so let's look at some results |
---|
0:05:38 | so here we see common condition two |
---|
0:05:41 | which is telephone channel speech in test without added noise |
---|
0:05:45 | and these are the results from one leading system the others are similar |
---|
0:05:51 | and as might be expected better performance was observed for known speakers that's the |
---|
0:05:56 | red line |
---|
0:05:58 | compared to the unknown speakers that's the black line |
---|
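editor's illustration, not part of the talk: one way the known versus unknown comparison could be computed at a single operating point is to measure the false alarm rate separately over the two non-target subsets. this is a minimal python sketch with invented scores and an arbitrary threshold; the curves on the slide sweep over all thresholds, and nothing here reproduces the evaluation's actual scoring software.

```python
def false_alarm_rate(scores, threshold):
    """Fraction of non-target trial scores at or above the decision threshold."""
    if not scores:
        raise ValueError("need at least one non-target trial score")
    return sum(s >= threshold for s in scores) / len(scores)


# Invented scores; higher means "more likely the target speaker".
known_scores = [-2.1, -1.7, -0.3, 0.4]    # trials with known non-target speakers
unknown_scores = [-1.0, 0.2, 0.6, 1.1]    # trials with unknown non-target speakers
threshold = 0.0                           # arbitrary operating point

fa_known = false_alarm_rate(known_scores, threshold)
fa_unknown = false_alarm_rate(unknown_scores, threshold)
print(fa_known, fa_unknown)
```

with these made-up numbers the known subset produces fewer false alarms than the unknown subset, which is the direction of the difference described for this common condition.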
0:06:04 | one thing to note is that known speakers had multiple telephone conversations and sometimes even |
---|
0:06:10 | interviews |
---|
0:06:11 | as their training data |
---|
0:06:14 | so here we see the same system |
---|
0:06:16 | but on common condition one which is interview speech in test without added noise |
---|
0:06:22 | but unlike the last slide we saw |
---|
0:06:25 | there's not a lot of difference between the two curves |
---|
0:06:29 | and that was initially puzzling we wanted to know why and as it turns |
---|
0:06:35 | out |
---|
0:06:36 | the known speakers for this common condition were only known from a single |
---|
0:06:41 | telephone channel recording |
---|
0:06:43 | so whereas in the previous slide the known speakers had a large amount of training |
---|
0:06:48 | data by which to know them here the speakers were only known by a |
---|
0:06:55 | single telephone channel |
---|
0:06:57 | so in addition to having a small amount of data the trials were cross channel |
---|
0:07:05 | so in addition to this concept of known and unknown non-target speakers |
---|
0:07:10 | there are also known and unknown systems |
---|
0:07:12 | and what we mean here is that known systems presume that all of the non-target |
---|
0:07:17 | trials came from known non-target speakers |
---|
0:07:21 | and unknown systems presume that all the non-target trials were spoken by unknown non-target speakers |
---|
0:07:26 | so let's see some results of that |
---|
0:07:28 | here we see just a regular system on the extended trials for common condition two |
---|
0:07:36 | and we see the thin dotted lines |
---|
0:07:40 | i'm not sure if we can actually see that especially in the back |
---|
0:07:43 | but those are ninety percent confidence bounds which suggest that there was a significant difference |
---|
0:07:49 | in performance |
---|
0:07:51 | between the known non-targets and the unknown non-targets again red is the colour for |
---|
0:07:57 | the known non-targets black for the unknown non-targets |
---|
0:08:04 | so here we see a known system again that's where the system always presumes that |
---|
0:08:09 | the non-target speaker was known and as might be expected |
---|
0:08:14 | there is little difference observed between the two curves |
---|
0:08:21 | here we see an unknown system again that's where the system presumes that all of |
---|
0:08:26 | the non-target speakers are unknown which is to say they are not among the target speakers |
---|
0:08:32 | all of these are from the same site |
---|
0:08:36 | and here actually compared to two slides back |
---|
0:08:40 | the performance difference is enhanced |
---|
0:08:47 | so |
---|
0:08:48 | in summary sre twelve was an experiment with a new protocol in how speakers were made |
---|
0:08:53 | known to the systems |
---|
0:08:54 | for conversational telephone speech segments performance was improved when speakers were known to the system |
---|
0:09:01 | for interview test segments such improvement was not observed that was just due to the setup |
---|
0:09:05 | of the evaluation |
---|
0:09:07 | it was not observable which is not to say that it would not be observed if the evaluation allowed |
---|
0:09:13 | there's actually a lot more information that was covered in the paper and other papers |
---|
0:09:22 | covering things that we learned from the evaluation so let me encourage you to |
---|
0:09:27 | look at those or to contact us at that address |
---|
0:09:34 | in addition |
---|
0:09:36 | considering future evaluations there is a question of whether allowing joint knowledge of target |
---|
0:09:43 | speakers is a good idea going forward |
---|
0:09:46 | one thing to note is that |
---|
0:09:48 | joint knowledge of target speakers makes results increasingly dependent on the target speakers selected and introduces |
---|
0:09:54 | trial dependence |
---|
0:09:56 | so this makes estimating |
---|
0:10:01 | error rates more difficult |
---|
0:10:03 | also something to consider is whether to continue having multi session and multichannel |
---|
0:10:10 | training for the target speakers |
---|
0:10:15 | so nist will resume the sre series beyond the i-vector challenge in the near |
---|
0:10:22 | future |
---|
0:10:24 | some interest has |
---|
0:10:27 | been expressed within the community regarding performing testing in acoustic environments different |
---|
0:10:35 | from those of prior evaluations joe made mention that |
---|
0:10:40 | there is some utility in that |
---|
0:10:43 | also one thing to note is that |
---|
0:10:48 | in order to be able to conduct these types of evaluations it is necessary to |
---|
0:10:52 | collect realistic and challenging speech data |
---|
0:10:56 | which is both expensive and time-consuming |
---|
0:10:59 | but in order to do that and have an even better evaluation lessons learned from sre |
---|
0:11:05 | twelve |
---|
0:11:07 | will be taken into account and considered in the next evaluation so i probably have |
---|
0:11:11 | lots of time for questions |
---|
0:11:15 | so thank you |
---|
0:11:30 | so looking at your the |
---|
0:11:32 | c one and c two the common condition one and two yes can you talk to |
---|
0:11:38 | the number of actual speakers that were involved in c one versus c two |
---|
0:11:43 | not trials but speakers right |
---|
0:11:46 | the short answer is |
---|
0:11:49 | yes but not now 'cause i don't have that information handy but we did look |
---|
0:11:52 | at that i can't recall precisely |
---|
0:11:55 | well one of the things i did not yet another but i recall the c |
---|
0:11:59 | one had on the order of about fifty forty three speakers involved only |
---|
0:12:03 | so |
---|
0:12:04 | i think we have |
---|
0:12:06 | comparing those two about the effects of the known and unknown there's a couple things changing |
---|
0:12:11 | simultaneously the microphone and the telephone yes and the pool is much smaller 'cause |
---|
0:12:16 | i think it was only drawn from |
---|
0:12:20 | mixer seven |
---|
0:12:21 | right |
---|
0:12:22 | so that's actually a really excellent point that we tried to emphasise during the evaluation workshop |
---|
0:12:26 | but i neglected to mention here |
---|
0:12:28 | is that the common conditions really were not comparable at all |
---|
0:12:34 | in this evaluation so the speakers were different and the |
---|
0:12:43 | basically all the conditions changed so thank you for noting that it's |
---|
0:12:47 | inappropriate to make those comparisons |
---|
0:12:49 | across common conditions within a common condition |
---|
0:12:52 | it was interesting to look at some of the sub |
---|
0:12:57 | some factor performance |
---|
0:13:08 | could you just comment on whether you're going to be following up on hasr |
---|
0:13:11 | as part of the nist sre process |
---|
0:13:15 | so this is actually something we've been looking into |
---|
0:13:19 | pretty extensively the short answer is it remains to be determined but the long answer is |
---|
0:13:24 | this is something we're seeking to do |
---|
0:13:31 | okay |
---|
0:13:32 | make it i've a practise it |
---|
0:13:36 | are criteria you said at the end of the presentation that there'll be a focus |
---|
0:13:39 | on multichannel enrollment or training conditions |
---|
0:13:43 | once the question |
---|
0:13:44 | whether question is like cyanide in the last sre twelve you present at the workshop |
---|
0:13:50 | i think those any one thing that the |
---|
0:13:52 | my kind enrollment or telephone and enrollment it seems like focus wasn't neto maybe that |
---|
0:13:58 | just wasn't nothing just this time ramp up to still is a big challenge awfully |
---|
0:14:02 | so i just wonder if that was still going to be a factor in some continuing evaluations |
---|
0:14:08 | well that's a question and one of the things that we're very eager for is |
---|
0:14:12 | to get feedback from one |
---|
0:14:15 | from the community one thing that is |
---|
0:14:19 | time consuming and |
---|
0:14:21 | if not expensive the difficulty is setting up the evaluation even with the data |
---|
0:14:25 | and so |
---|
0:14:28 | we're much more likely to include that again if people will actually participate |
---|
0:14:34 | i've also got a second question i'm not sure if you're |
---|
0:14:37 | aware of the nn i-vector paradigm that's come out the framework for sre twelve |
---|
0:14:44 | very impressive performance particularly on telephone conditions as you might know with the nn you |
---|
0:14:49 | need a lot of data for training and it's very difficult to get that level |
---|
0:14:53 | one thing i'm afraid of is |
---|
0:14:56 | teams that might not have the infrastructure to do such a thing |
---|
0:15:00 | how would they compare with the other teams that do have the infrastructure in future |
---|
0:15:04 | evaluations is there are something that can be done about that such as the i-vector |
---|
0:15:04 | challenge where the i-vectors are presented |
---|
0:15:08 | just wondering if you've got thoughts on that |
---|
0:15:15 | in short no but that's a good question and |
---|
0:15:15 | something that |
---|
0:15:20 | we're |
---|
0:15:21 | perfectly willing to explore |
---|
0:15:23 | i just want to comment on one of your conclusion points i'm |
---|
0:15:35 | be happy to know but |
---|
0:15:35 | you have in mind with this point of course to extend the nist |
---|
0:15:41 | databases with new challenging conditions |
---|
0:15:45 | but i think it's also interesting to us |
---|
0:15:45 | increase the query actual conditions we have a lot of effort to do on the |
---|
0:15:48 | speaker recognition by increasing seriously the number of speakers |
---|
0:15:53 | maybe by one order of magnitude |
---|
0:16:00 | and by adding |
---|
0:16:00 | more data per speaker of course it will |
---|
0:16:03 | force us to review the evaluation protocol and look at the results per |
---|
0:16:14 | speaker like |
---|
0:16:14 | joe did in the past and also look at the differences if you |
---|
0:16:23 | just |
---|
0:16:24 | select randomly one thousand test |
---|
0:16:24 | in a large database do you have some performance differences if you |
---|
0:16:28 | choose one set compared to the other sets and a lot of things like |
---|
0:16:38 | that |
---|
0:16:39 | i think |
---|
0:16:47 | i |
---|