0:00:28 | Good morning, ladies and gentlemen, |
---|
0:00:30 | welcome to the third day of your Odyssey Workshop. |
---|
0:00:35 | Out of fifty one papers, twenty seven have been presented over the last two days |
---|
0:00:44 | and we have another twenty four to go, if I'm doing the calculation right. Twenty four to go, and |
---|
0:00:48 | yesterday all the papers were mainly on i-vectors, so we can say yesterday |
---|
0:00:53 | was the i-vector day. Today, except for one paper, there are two major |
---|
0:01:00 | sessions: one is language recognition evaluation and the other is features for speaker recognition. |
---|
0:01:08 | My name is Ambikairajah, I'm from the University of New South Wales in Sydney, Australia. |
---|
0:01:12 | I have the pleasure of introducing to you our plenary speaker for today, doctor Alvin |
---|
0:01:18 | Martin. |
---|
0:01:19 | Alvin will speak about the NIST speaker recognition evaluation plan for two thousand twelve and |
---|
0:01:26 | beyond. |
---|
0:01:28 | He has coordinated the NIST series of evaluations since nineteen ninety six in the areas |
---|
0:01:35 | of speaker recognition and language and dialect recognition. His evaluation work has involved |
---|
0:01:41 | the collection, selection and preprocessing of data, writing the evaluation plans, evaluating |
---|
0:01:50 | the results, coordinating the workshops, and many more tasks. |
---|
0:01:57 | He served as a mathematician in the Multimodal Information Group at NIST from nineteen |
---|
0:02:04 | ninety one to two thousand eleven. |
---|
0:02:08 | Alvin holds a Ph.D. in mathematics from Yale University. Please join me in |
---|
0:02:14 | welcoming doctor Alvin Martin. |
---|
0:02:25 | Okay! Thank you! Thank you for that introduction and thank you for the invitation |
---|
0:02:32 | to do this talk. I'm here to talk about the speaker evaluations and, as you |
---|
0:02:39 | know, I have |
---|
0:02:42 | been at NIST |
---|
0:02:43 | and I remain |
---|
0:02:46 | associated with NIST for this workshop; however, |
---|
0:02:51 | I am here |
---|
0:02:53 | independently, so |
---|
0:02:58 | I'm responsible for everything I say and no one else is; the opinions are all my own. |
---|
0:03:11 | I guess I might... I don't think I'm subject to any restrictions, but |
---|
0:03:15 | I'm on the clock. |
---|
0:03:21 | Okay, |
---|
0:03:25 | I'll stay closer to this. Here's an outline of the |
---|
0:03:29 | topics I hope to cover. I'm gonna talk about some early history, things that preceded the |
---|
0:03:35 | evaluations; the current series of evaluations, the things that happened during the early times of |
---|
0:03:42 | the evaluations, |
---|
0:03:44 | giving kind of a history of the evaluations, in part a history of past Odysseys. |
---|
0:03:52 | Here I should note my debt to Doug Reynolds, who gave a |
---|
0:03:59 | similar |
---|
0:04:01 | talk on these matters four years ago in Stellenbosch and I will update one of |
---|
0:04:08 | the slides that |
---|
0:04:11 | he presented there. I'm gonna say some things from the point of view of an evaluation |
---|
0:04:18 | organiser, about evaluation organisation; say something about performance factors to look at; something |
---|
0:04:26 | about metrics, which we've already talked about earlier at this workshop; say something about |
---|
0:04:34 | measuring progress over time; |
---|
0:04:36 | and then talk about the future, including the SRE twelve evaluation process currently going on, |
---|
0:04:44 | which will take place at the end of this year, and then |
---|
0:04:47 | about what might happen after this year. |
---|
0:04:53 | The early history: |
---|
0:04:57 | the things I would mention. |
---|
0:04:59 | One thing behind the speaker recognition evaluations was the success of the speech recognition evaluations |
---|
0:05:08 | back |
---|
0:05:10 | in the eighties and the early nineties. NIST was |
---|
0:05:13 | very much |
---|
0:05:15 | involved in those, and they showed the benefits of independent evaluation on common data sets. |
---|
0:05:21 | I'll show a slide of that in a minute. |
---|
0:05:24 | I will mention the collection of various early corpora that were appropriate for speaker recognition: |
---|
0:05:30 | TIMIT, KING and YOHO, but most especially Switchboard. It was a multi-purpose corpus that was |
---|
0:05:37 | collected around nineteen ninety one, and one of the purposes that they had in mind |
---|
0:05:41 | was speaker recognition: it collected conversations from a large number of speakers so that you have |
---|
0:05:49 | multiple conversations for each speaker. Its success led to the later collection of Switchboard two and similar |
---|
0:06:00 | collections. And in fact, in the aftermath of Switchboard, the Linguistic Data Consortium was created |
---|
0:06:09 | in nineteen ninety two with the purpose of supporting further speech and also text |
---|
0:06:16 | collections in the United States. And on to the first Odyssey, although it wasn't called |
---|
0:06:23 | Odyssey: it was Martigny in nineteen ninety four, followed by several others. I will |
---|
0:06:30 | show pictures and make a few remarks on those. And there were early NIST evaluations. |
---|
0:06:36 | We date the current series of speaker evaluations from nineteen ninety six, but there were evaluations |
---|
0:06:41 | in ninety two and ninety five. There was a DARPA program evaluation at several sites |
---|
0:06:47 | involved in the DARPA program in ninety two. In ninety five there was a preliminary evaluation that |
---|
0:06:53 | used Switchboard one data at six sites. But in these earlier evaluations the emphasis |
---|
0:07:00 | was rather on speaker identification, |
---|
0:07:03 | on closed-set rather than on the open-set recognition that we've come to know in |
---|
0:07:11 | the series of evaluations. |
---|
0:07:17 | So here's this favourite slide on speech recognition, the benchmark test history. The |
---|
0:07:28 | word error rate is on the vertical scale, a logarithmic scale, |
---|
0:07:34 | starting from nineteen eighty eight, |
---|
0:07:39 | and this shows the best system performance of various evaluations, various conditions, in successive years, |
---|
0:07:47 | or the years when evaluations were held. Standing out, of course, is the big fall |
---|
0:07:52 | in error rates when multiple sites participated on common corpora and we looked at error |
---|
0:07:59 | rates, and |
---|
0:08:00 | with roughly fixed conditions we could see progress being evident, especially in the |
---|
0:08:06 | early series. This |
---|
0:08:10 | is where the evaluation cycle came into its own: research, collect data, evaluate, show |
---|
0:08:16 | progress. That gave inspiration to other evaluations and in particular, speaker. |
---|
0:08:28 | Okay, so now |
---|
0:08:31 | let's do a walk down memory lane. |
---|
0:08:35 | the first |
---|
0:08:36 | workshop of this series was Martigny in nineteen ninety four |
---|
0:08:43 | It was called the Workshop on Automatic Speaker Recognition, Identification and Verification, |
---|
0:08:48 | and that workshop, you know, was the very first of this |
---|
0:08:54 | series. It was reasonably well attended, but not as well as this one. There were various presentations, |
---|
0:08:59 | and there were many different corpora, many different performance measures, and it was very difficult |
---|
0:09:04 | to make meaningful comparisons. I present here the papers of |
---|
0:09:10 | interest from the NIST evaluation point of view. There was a paper on public databases |
---|
0:09:17 | for speaker recognition and verification that was given there. |
---|
0:09:26 | And the other of the early ones: Avignon, nineteen ninety eight. Speaker Recognition |
---|
0:09:32 | and its Commercial and Forensic Applications is what it was called; it's also known |
---|
0:09:38 | as RLA2C, from the French title. |
---|
0:09:43 | And one observation, in terms of the talks there: |
---|
0:09:47 | TIMIT was a preferred corpus, |
---|
0:09:52 | which for many was |
---|
0:09:53 | too clean, too easy a corpus. I remember Doug making comments that he didn't wanna listen |
---|
0:09:58 | anymore to papers that described results on TIMIT. It was also characterized by sometimes bitter debate over |
---|
0:10:08 | forensics and how good a job forensic experts could do at speaker recognition. |
---|
0:10:18 | there were |
---|
0:10:19 | several |
---|
0:10:23 | NIST speaker evaluation related papers... actually, three of them, that were combined into |
---|
0:10:33 | this paper in Speech Communication. |
---|
0:10:36 | Of the three presentations, perhaps most memorable was the presentation by George Doddington, who told us |
---|
0:10:43 | all how to do speaker recognition evaluation. |
---|
0:10:48 | So, this was a talk that laid out the various principles, and most of those |
---|
0:10:53 | principles have been kept and followed in our evaluation series. It includes a discussion of the |
---|
0:11:00 | one golden Rule of Thirty, sketched below. |
---|
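For reference, the Rule of Thirty as Doddington stated it: to be ninety percent confident that the true error rate is within plus or minus thirty percent of the observed rate, the test must produce at least thirty errors. A sketch of the arithmetic, assuming independent trials so that the relative standard error of an error count $N_{\text{err}}$ is roughly $1/\sqrt{N_{\text{err}}}$:

$$ 1.645 \cdot \frac{1}{\sqrt{N_{\text{err}}}} \le 0.30 \quad \Longrightarrow \quad N_{\text{err}} \ge \left( \frac{1.645}{0.30} \right)^{2} \approx 30 $$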
0:11:07 | Crete, two thousand one. |
---|
0:11:10 | In two thousand one it took the official name: A Speaker Odyssey, the Speaker Recognition Workshop. That was |
---|
0:11:15 | the first official Odyssey. |
---|
0:11:17 | It was characterized by more emphasis on evaluation. There was an evaluation track that was |
---|
0:11:22 | pursued, which NIST was |
---|
0:11:25 | involved with. |
---|
0:11:31 | So, one of the presentations, the NIST presentation, which I think I |
---|
0:11:37 | gave, covered |
---|
0:11:40 | the history of NIST evaluations up to that point, and I will actually |
---|
0:11:46 | show a slide from there later on. |
---|
0:11:50 | Another |
---|
0:11:51 | key presentation was one by several people from the Department of Defense: Phonetic, idiolectal |
---|
0:11:58 | and acoustic speaker recognition. These were ideas that were being pursued at |
---|
0:12:03 | the time and that were influencing the course of research at that point. I think |
---|
0:12:07 | George had a lot to do with that; he had the paper |
---|
0:12:13 | on idiolectal techniques as well. |
---|
0:12:21 | Toledo in two thousand and four |
---|
0:12:26 | I think was really where Odyssey came of age. |
---|
0:12:32 | It was well attended; I think it probably remains |
---|
0:12:37 | the most |
---|
0:12:39 | highly attended of the Odysseys. It was the first Odyssey in which we had the |
---|
0:12:44 | NIST |
---|
0:12:45 | SRE workshop held in conjunction at the same location. That was to be repeated in |
---|
0:12:51 | Puerto Rico in two thousand six and Brno in two thousand ten. It was also |
---|
0:12:57 | the first |
---|
0:12:58 | Odyssey to include language recognition. It had two notable keynotes on forensic recognition, |
---|
0:13:10 | a topic debated earlier in Avignon; these were two excellent, well received talks. And since then, Odyssey |
---|
0:13:17 | has been an established biennial event, held every two years. |
---|
0:13:26 | And there was a presentation, which I think Mark Przybocki and I gave, called The Speaker |
---|
0:13:32 | Recognition Evaluation Chronicles. And it was to be reprised about two years |
---|
0:13:39 | later in Puerto Rico. So, Odyssey has marched on. |
---|
0:13:49 | Two thousand six was in Puerto Rico; I found, incredibly, the picture of it. Two |
---|
0:13:55 | thousand and eight, Stellenbosch, hosted by Niko. Twenty ten, two years ago, we were in |
---|
0:14:02 | Brno; this is the logo designed by Honza's children. And now we're here in Singapore, |
---|
0:14:11 | and I think |
---|
0:14:12 | before we finish this workshop we will hear about plans for Odyssey in twenty fourteen. |
---|
0:14:22 | Okay! Let's move on to talk about organisation, |
---|
0:14:26 | to think about evaluation from the viewpoint of the |
---|
0:14:30 | part of the organisation responsible for organising evaluations. The questions are which tasks are we |
---|
0:14:36 | to do, the key principles, and some of the milestones |
---|
0:14:42 | of |
---|
0:14:44 | the different evaluations, and I'll talk about participation. |
---|
0:14:53 | So, which speaker recognition problem? These are research evaluations, but what is the application |
---|
0:15:02 | environment in mind? Well, we know what we have done, but it wasn't necessarily obvious |
---|
0:15:08 | before we started. It could have been access control, the important commercial application; that might |
---|
0:15:14 | have formed the model. It would raise the question of text independent or text dependent; |
---|
0:15:22 | for some problems, I think, we should do text dependent. And part of |
---|
0:15:26 | access control is that the |
---|
0:15:27 | prior probability of the target tends to be high. |
---|
0:15:32 | There are forensic applications that could theoretically be the model, or there's person spotting, |
---|
0:15:39 | which of course is the way we went. Inherently in person spotting |
---|
0:15:43 | the prior probability of the target is low, and it's text independent. |
---|
0:15:49 | Well, in ninety six, and we'll look at the ninety six evaluation plan, it was |
---|
0:15:53 | settled that the NIST evaluations would concentrate on speaker spotting, emphasising the low false alarm |
---|
0:16:01 | area of the |
---|
0:16:04 | performance curve. |
---|
0:16:08 | Some of the principles: speaker spotting has been |
---|
0:16:12 | our primary task; |
---|
0:16:17 | we were research system oriented, you know, application inspired but aimed at research. |
---|
0:16:25 | NIST traditionally, with some exceptions, doesn't do product testing; you do the |
---|
0:16:31 | evaluations to advance the technology. We set the principle that we're gonna pool across |
---|
0:16:36 | target speakers: |
---|
0:16:38 | people had to |
---|
0:16:41 | get scores that would work independent of the target speaker, rather than having a performance |
---|
0:16:49 | curve for every speaker and then just averaging the performance curves. |
---|
0:16:52 | And we emphasized the low false |
---|
0:16:58 | alarm rate region. Both scores and decisions were required, and in that context, as |
---|
0:17:06 | Niko |
---|
0:17:09 | suggested and George is gonna talk about tomorrow, calibration matters; it is |
---|
0:17:15 | part of the problem to address. |
---|
0:17:20 | Some basics... Our evaluations were open to all willing participants, to anyone that, |
---|
0:17:27 | you know, followed the rules, could get the data, run all the trials |
---|
0:17:33 | and come to the workshop. We're research oriented; we have tried to |
---|
0:17:40 | discourage commercialised competition: we don't want people saying in advertisements that they led the NIST eval. |
---|
0:17:51 | Our evaluations featured evaluation plans that specified all the rules and |
---|
0:17:58 | all the details of the evaluation; we'll look at one. |
---|
0:18:02 | Each evaluation is followed by a workshop. |
---|
0:18:05 | These workshops were limited to participants plus interested government organizations, and every site or team |
---|
0:18:12 | that participated was expected to be represented. So at them we could talk meaningfully about |
---|
0:18:18 | the evaluation systems. The evaluation datasets were subsequently published, made publicly available by the |
---|
0:18:30 | LDC. That remains the aim... it remains the case that the SRE |
---|
0:18:39 | o eight data is currently available. In particular, sites getting started in research |
---|
0:18:45 | are able to obtain it. Typically, we'd like to have not the |
---|
0:18:50 | most recent eval, but the next most recent eval, in this case that's o eight, |
---|
0:18:53 | available publicly. Probably next year SRE ten will be made available, and hopefully LRE |
---|
0:19:02 | o nine, to mention a language eval, will soon become available. |
---|
0:19:12 | Okay, |
---|
0:19:13 | one can land on this web page, |
---|
0:19:20 | the page for the speaker evals: a list of past speaker evals, and for each year you |
---|
0:19:24 | can click and get the information on the evaluation for that year, |
---|
0:19:28 | starting in nineteen ninety seven. For some reason, the nineteen ninety six evaluation plan |
---|
0:19:37 | had been lost, but I asked Craig to search for it and he found it, |
---|
0:19:42 | so I hope that will get put out. Now, here is |
---|
0:19:47 | what went into the evaluation plan, the first evaluation plan of the current series, in which |
---|
0:19:52 | we said the emphasis would be on issues of handset variation and test segment duration. |
---|
0:19:56 | The traditional goals, as ever, were to drive the technology forward, measure the state of the art, and find the |
---|
0:20:02 | most promising |
---|
0:20:04 | approaches. |
---|
0:20:06 | The task has been detection of the hypothesized speaker in a segment of conversational speech on the telephone. |
---|
0:20:11 | That's been expanded, of course, in recent years. Interestingly, are you surprised to see this? |
---|
0:20:18 | The research objective: given an overall ten percent miss rate for target speakers, |
---|
0:20:24 | minimize the overall false alarm rate. |
---|
0:20:29 | That is, actually, what we said in ninety six. It is not what we emphasized |
---|
0:20:33 | in the years since, |
---|
0:20:34 | until |
---|
0:20:38 | this past year, when, as you heard, in the BEST evaluation it was made the official metric. |
---|
0:20:43 | Craig is gonna talk about the BEST evaluation tomorrow. So in that sense we've come full |
---|
0:20:51 | circle. |
---|
0:20:54 | But this also mentions that performance is expressed in terms of the detection cost function, |
---|
0:21:00 | and the researchers then minimize DCF. They also specified a research objective that I would |
---|
0:21:06 | naturally emphasize, and I don't think we've achieved it: uniform performance across all target |
---|
0:21:11 | speakers. There have been some investigations about classes of speakers, |
---|
0:21:18 | the menagerie sometimes attributed to Doddington: different |
---|
0:21:22 | types of speakers with different levels of difficulty. |
---|
0:21:31 | So again, the task: given a |
---|
0:21:34 | target speaker and a test segment, |
---|
0:21:37 | decide whether the hypothesis that the speaker is in the segment is true or false. |
---|
0:21:43 | We measured performance in two related ways: detection performance from the decisions, and detection performance |
---|
0:21:50 | characterized by an ROC. |
---|
0:21:53 | The word there is ROC. |
---|
0:21:56 | Here is the DCF formula we're all familiar with, reproduced below. We have parameters: the cost of a miss, |
---|
0:22:06 | which was once expressed as ten; the cost of a false alarm, as one; and the prior probability |
---|
0:22:11 | of a target, |
---|
0:22:13 | expressed as point zero one. In those old days we also computed the DCF for a range |
---|
0:22:18 | of p target, and in a sense we're to return to that practice in the current |
---|
0:22:24 | evaluation. |
---|
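For reference, the detection cost function as the evaluation plans define it (a sketch from the standard NIST definition, with the parameter values used from ninety six through two thousand eight):

$$ C_{\text{Det}} = C_{\text{Miss}} \times P_{\text{Miss}|\text{Target}} \times P_{\text{Target}} + C_{\text{FA}} \times P_{\text{FA}|\text{NonTarget}} \times (1 - P_{\text{Target}}) $$

with $C_{\text{Miss}} = 10$, $C_{\text{FA}} = 1$, and $P_{\text{Target}} = 0.01$.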
0:22:29 | Here we say our ROC will be constructed by pooling decision scores |
---|
0:22:35 | these scores will then be sorted and plotted on PROC plots. |
---|
0:22:42 | PROCs are ROCs plotted on normal probability |
---|
0:22:47 | plots. So this was, in nineteen ninety six, the term for what we now |
---|
0:22:53 | all refer to |
---|
0:22:56 | as DET plots. |
---|
0:23:01 | We talked about various conditions: results by duration, by handset type. The task |
---|
0:23:12 | required explicit decisions, |
---|
0:23:14 | and the scores of multiple target speakers are pooled before plotting the PROCs. So that |
---|
0:23:21 | requires score normalization across speakers, and that was the key emphasis that was new in |
---|
0:23:27 | the ninety six evaluation. |
---|
0:23:30 | Now we honor the term DET curve, following the nineteen ninety seven Eurospeech paper |
---|
0:23:38 | that introduced the term, the detection error tradeoff. I think George |
---|
0:23:45 | had a role in choosing that name. |
---|
0:23:49 | George, to name one person involved; another, you may know, is Tom Crystal, encouraging |
---|
0:23:56 | the use of this kind of curve, which linearizes |
---|
0:24:01 | the performance curves, assuming normal distributions. |
---|
0:24:05 | and |
---|
0:24:08 | I was surprised to find that there's a Wikipedia page for DET plots. So, this |
---|
0:24:16 | is the page showing the linearizing effect. |
---|
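As an illustration of that transform, a minimal sketch in Python (hypothetical Gaussian scores; the probit warping via the inverse normal CDF is the standard DET construction):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Hypothetical scores: target trials shifted above non-target trials.
rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 5000)   # target-trial scores
non = rng.normal(0.0, 1.0, 5000)   # non-target-trial scores

# Sweep a threshold over all observed scores to trace the error tradeoff.
thresholds = np.sort(np.concatenate([tgt, non]))
p_miss = np.array([(tgt < t).mean() for t in thresholds])
p_fa = np.array([(non >= t).mean() for t in thresholds])

# DET plot: both axes warped by the inverse normal CDF (probit), which
# renders curves from normally distributed scores as straight lines.
eps = 1e-6  # keep the probit finite at rates of exactly 0 or 1
plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)),
         norm.ppf(np.clip(p_miss, eps, 1 - eps)))
plt.xlabel("False alarm probability (probit scale)")
plt.ylabel("Miss probability (probit scale)")
plt.show()
```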
0:24:23 | Okay, now let's talk about milestones. |
---|
0:24:27 | These I settled on; others may choose different ones. But, you know, we noted that |
---|
0:24:33 | we had earlier evaluations in ninety two and ninety five; the first in the series |
---|
0:24:37 | was in ninety six. |
---|
0:24:38 | Two thousand was the first time we had a language other than English; we used the |
---|
0:24:41 | AHUMADA |
---|
0:24:42 | Spanish data, along with other data. Two thousand one was, |
---|
0:24:48 | rather late for the United States, the first evaluation with cellular phone data. In two |
---|
0:24:55 | thousand one we also started providing ASR transcripts, errorful transcripts. We had a kind of limited |
---|
0:25:01 | forensic evaluation using a small FBI database in two thousand two. Also in two thousand two |
---|
0:25:08 | there was SuperSID, one of the projects at the Johns Hopkins workshop; it followed |
---|
0:25:14 | the SRE and helped to advance the technology. There have been other Baltimore workshops that followed up on |
---|
0:25:22 | speaker recognition; many people here participated. Two thousand five: |
---|
0:25:27 | the first with multiple languages, bilingual speakers, |
---|
0:25:34 | in the eval... also the first microphone recordings of telephone calls, and it therefore included some |
---|
0:25:43 | cross-channel trials. Interview data, as with the Mixer corpora, came in two thousand eight |
---|
0:25:48 | and was used again in two thousand ten. Two thousand ten involved the |
---|
0:25:53 | new DCF, the cost function stressing even lower false alarm rates; a little more |
---|
0:25:58 | about that later. Also, lots of things have been coming out |
---|
0:26:03 | in the recent years: we have been collecting high and low vocal effort data, also |
---|
0:26:09 | some data to look at aging. Two thousand ten also featured HASR, the human assisted |
---|
0:26:14 | speaker recognition evaluation, a small set that invited systems that involve humans as well as |
---|
0:26:23 | automatic systems. |
---|
0:26:25 | Twenty eleven was BEST. We had a broad range of test conditions, including added noise |
---|
0:26:30 | and reverb; Craig will be telling you about that tomorrow. |
---|
0:26:34 | Twenty twelve is gonna involve target speakers defined beforehand. |
---|
0:26:43 | Participation. |
---|
0:26:47 | Participation has |
---|
0:26:48 | grown. |
---|
0:26:52 | To begin with, the number fifty eight... these numbers are |
---|
0:26:57 | all a little fuzzy in terms of what's a site, what's a team, but I |
---|
0:27:03 | think of these numbers like... these are the ones that Doug used a few years |
---|
0:27:06 | ago, and I updated them. Fifty eight in twenty ten. |
---|
0:27:10 | Doug at MIT has provided... I think we're not doing physical notebooks anymore, but when |
---|
0:27:16 | we did, he provided the cover pictures for the notebooks. |
---|
0:27:22 | One thing to note, for understandable reasons, I guess, is the big |
---|
0:27:28 | increase in participation after two thousand one. |
---|
0:27:32 | And the point I should note is that handling |
---|
0:27:36 | the scores of participating sites becomes a management problem. It's a lot more work doing |
---|
0:27:41 | the evaluation with fifty eight participants than with one dozen participants. And, you know, |
---|
0:27:48 | it's not just handling scores of participants; it's |
---|
0:27:50 | handling the |
---|
0:27:55 | trial scores of all these participants... the |
---|
0:28:00 | scores of scores of participants. |
---|
0:28:05 | So this is one of Doug's cover slides, from two thousand four, showing the logos of |
---|
0:28:11 | all the sites, and in the centre is a DET curve for the |
---|
0:28:15 | condition of primary interest, the common condition, of all the |
---|
0:28:21 | systems. |
---|
0:28:23 | And this one is from two thousand six. |
---|
0:28:27 | Thanks to Doug for those efforts. |
---|
0:28:29 | So here it is, the graph. |
---|
0:28:32 | Ninety two and ninety five were |
---|
0:28:34 | outside the series and had a limited number of participants. Twenty eleven was the BEST evaluation; |
---|
0:28:40 | it also was limited to a very few participants. |
---|
0:28:45 | Otherwise, you can see the trend, particularly the growth after two thousand one, |
---|
0:28:50 | up to the |
---|
0:28:52 | fifty eight in twenty ten. For the twenty twelve evaluation the registration is open, has been |
---|
0:29:00 | open over the summer, and the last count I had is thirty eight, and I expect |
---|
0:29:04 | that's going to grow. |
---|
0:29:09 | So, this is a slide |
---|
0:29:12 | from |
---|
0:29:14 | the two thousand one presentation at Odyssey that described the evaluations up to that point. |
---|
0:29:23 | In the center is the number of target speakers and trials. The |
---|
0:29:28 | ninety six evaluation, on Switchboard one, had forty speakers that had really a lot of conversations, |
---|
0:29:36 | and one of the trends in the later evals was toward more speakers, up to eight |
---|
0:29:39 | hundred by two thousand. |
---|
0:29:44 | In each case we defined a primary condition, |
---|
0:29:50 | whether we were basing that on the number of handsets in training, |
---|
0:29:56 | or whether we emphasized same-number or different-phone-number trials. We were looking |
---|
0:30:01 | at the issues of electret versus |
---|
0:30:04 | carbon button, which was a big issue in the days of landline phones. So, |
---|
0:30:12 | this specifies the primary conditions and evaluation features for these early evaluations. |
---|
0:30:23 | Here is an attempt, without putting in numbers, to update some of that for the |
---|
0:30:31 | evaluations after two thousand one. |
---|
0:30:36 | We ended up calling the primary condition a common condition, one that everyone did and that |
---|
0:30:44 | was the basis for the official chart, with all other conditions optional. When |
---|
0:30:52 | we introduced different languages, the common condition involved English only, and particular kinds of |
---|
0:30:59 | handsets, so we could track over time how well we were doing. |
---|
0:31:06 | And on the right you see some of the other features that came in anew. |
---|
0:31:10 | Cellular data was added; multilingual data |
---|
0:31:15 | came in in two thousand five; |
---|
0:31:23 | in two thousand six we had some microphone tests; |
---|
0:31:27 | and then |
---|
0:31:28 | things only got more complicated in the most recent evaluations. |
---|
0:31:32 | In terms of common conditions: in two thousand eight we had eight common conditions, |
---|
0:31:37 | in two thousand ten we had nine common conditions, and two thousand twelve has five common conditions |
---|
0:31:46 | classified. |
---|
0:31:47 | So in o eight we contrasted English and bilingual, and contrasted interview |
---|
0:31:53 | and conversational telephone speech. |
---|
0:31:56 | In two thousand ten we were contrasting telephone channels, interview and conversational speech, and high, |
---|
0:32:02 | low and normal vocal effort. In two thousand twelve we get interview tests without noise or |
---|
0:32:08 | with added noise, phone tests with added noise, or conversational phone tests collected in |
---|
0:32:14 | a noisy environment. |
---|
0:32:19 | Two thousand eight and ten involved interviews collected over multiple microphone channels. |
---|
0:32:29 | Two thousand ten, of course, added high and low vocal |
---|
0:32:33 | effort, and aging with the Greybeard corpus; two thousand ten also introduced HASR. Two thousand |
---|
0:32:40 | twelve offers more data, with target speakers specified in advance. |
---|
0:32:49 | So, something about performance factors. |
---|
0:32:52 | I'll try not to say too much about this, but in terms of what we've |
---|
0:32:55 | looked at over the years, we've tried to look at demographic factors |
---|
0:33:00 | like sex: in general, though there have been exceptions, performance has been a bit |
---|
0:33:05 | better on male speakers than female. Quite early we would look at age, and George |
---|
0:33:11 | more recently has done a study of age on a recent evaluation; he may say something |
---|
0:33:16 | about that tomorrow. Education is a factor we haven't looked into too much. One very interesting thing |
---|
0:33:21 | in the early evaluations was to look at mean pitch |
---|
0:33:24 | in people's |
---|
0:33:27 | test segments and training. And |
---|
0:33:31 | if you split the non-target trials between pairs with |
---|
0:33:34 | similar pitch and pairs whose pitch is not close, there's a difference. And even |
---|
0:33:41 | more interesting is to look at target trials, where the mean pitch was or wasn't |
---|
0:33:46 | similar between sessions of the same person; that mattered seriously as well. |
---|
0:33:52 | Speaking style: |
---|
0:33:56 | conversational telephone versus interview, particularly... a lot of data has been collected on that. Vocal |
---|
0:34:04 | effort, more recently; there are questions about |
---|
0:34:06 | defining vocal effort and how to collect it. Aging, with the Greybeard corpus; with limited |
---|
0:34:14 | time spans, collecting it is difficult. These are the intrinsic factors, related to the speaker. |
---|
0:34:22 | The other category, extrinsic factors, relates to the collection, by microphone or telephone channel. Telephone |
---|
0:34:28 | channel: landline, cellular; VOIP is something to work on. In earlier times, as I said, carbon versus |
---|
0:34:36 | electret telephone handset types; various microphones in the recent evaluations, and matched |
---|
0:34:43 | versus mismatched microphones; placement of the microphone relative to the speaker; and |
---|
0:34:49 | background noise and room reverberation. |
---|
0:34:53 | Craig will talk about that tomorrow; that was key in BEST. |
---|
0:34:59 | And finally, parametric factors: duration of training and test, and also the number of training segments, |
---|
0:35:06 | the training sessions. |
---|
0:35:10 | Evaluations that have had eight sessions of training for telephone speech showed it could greatly improve performance. We |
---|
0:35:16 | carried along for many years ten seconds as the short duration, but there's |
---|
0:35:20 | also been an increase in duration. Especially in twenty twelve we're gonna have lots of sessions |
---|
0:35:26 | and durations |
---|
0:35:28 | in training, and I think perhaps the emphasis, more than ever, will be on |
---|
0:35:33 | seeing the effects of multiple sessions and more data in evaluation. English, of course, has |
---|
0:35:42 | been the predominant language, but several of the evaluations included a variety of other languages, |
---|
0:35:48 | and one of the hopes is that performance be as good in every language as in English. |
---|
0:35:53 | We at first suspected that the reason overall performance had been better in English is |
---|
0:35:58 | due to the regularity and greater quantity of the data available in English. Cross-language |
---|
0:36:04 | trials are a separate challenge. |
---|
0:36:08 | okay |
---|
0:36:09 | the metrics |
---|
0:36:13 | I mention equal error rate; it is with us, it's part of our lives, in a |
---|
0:36:17 | sense. I've tried to discourage it, but... It is easy to understand, |
---|
0:36:28 | in some ways, |
---|
0:36:30 | and needs the least amount of data. |
---|
0:36:33 | But, you know, it doesn't deal with calibration issues, and basically the operating point of equal |
---|
0:36:38 | error rate is not the operating point of applications: |
---|
0:36:44 | the target |
---|
0:36:49 | prior probabilities may be high or may be low, but not really equal. |
---|
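For reference, the equal error rate is the threshold setting at which the two error rates coincide, a single summary number:

$$ \text{EER} = P_{\text{Miss}}(\theta^{*}) = P_{\text{FA}}(\theta^{*}) $$

It needs only enough data to locate the crossing, which is part of its appeal, but it corresponds to no particular application prior.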
0:36:58 | Decision cost has been our mainstay, our bread and butter; we'll hear more about that. Cllr has been championed by |
---|
0:37:04 | Niko; we talked about it |
---|
0:37:07 | Monday. And we've talked about just looking at false alarm rate at a fixed miss rate, |
---|
0:37:12 | which we returned to in BEST. So, you all know about the decision cost function; |
---|
0:37:18 | it's a sum with the specified parameters. |
---|
0:37:22 | And we normalize it by the cost of a system that has no intelligence but |
---|
0:37:28 | simply always decides yes or always decides no, so that trivial system's score is one. |
---|
0:37:36 | So these are the parameters that were mentioned in ninety six; these were the parameters from ninety |
---|
0:37:40 | six to two thousand eight. |
---|
0:37:43 | In twenty ten, for the core and extended test conditions, |
---|
0:37:48 | we changed them: |
---|
0:37:49 | the cost of a miss is one, false alarm is one, and p target is point zero zero one. |
---|
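A sketch of that normalization, from the standard definition: the raw cost is divided by the cost of the better of the two trivial systems (always yes, always no),

$$ C_{\text{Norm}} = \frac{C_{\text{Det}}}{\min\left( C_{\text{Miss}} \, P_{\text{Target}},\; C_{\text{FA}} \, (1 - P_{\text{Target}}) \right)} $$

so a system with no discrimination scores one. With the ninety six parameters the denominator is $\min(10 \times 0.01,\, 1 \times 0.99) = 0.1$; with the twenty ten parameters it is $\min(1 \times 0.001,\, 1 \times 0.999) = 0.001$, which is what pushes the emphasis to the very low false alarm region.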
0:37:56 | That was the driving force, and a lot of people |
---|
0:38:00 | were upset; there was scepticism about whether one could |
---|
0:38:06 | create systems for that. I think the outcome has been relatively satisfactory; I think people |
---|
0:38:14 | feel that they developed good systems |
---|
0:38:17 | for this. |
---|
0:38:22 | Niko talked about |
---|
0:38:24 | Cllr; he noted that George suggested limiting Cllr |
---|
0:38:31 | by false alarm rate, so it covers a broad range of operating points. |
---|
0:38:37 | False alarm rate at a fixed miss rate, we said, has its roots in ninety six, and |
---|
0:38:40 | is used in twenty twelve. It's practical for applications: it may be viewed as the cost |
---|
0:38:45 | of listening to false alarms. For some conditions where systems are really good, a |
---|
0:38:53 | ten percent miss rate isn't challenging; maybe a one percent miss rate is appropriate. |
---|
0:38:59 | Recording progress. |
---|
0:39:01 | How do we do that? It's always difficult to assure test set comparability: if you're |
---|
0:39:07 | collecting data the same way as before, is it really an equal test? Well, we encourage |
---|
0:39:11 | participants in the evaluations to run their prior systems, both old systems |
---|
0:39:15 | and new, on the new data, which gives us some measure. |
---|
0:39:18 | But, even more, it's been a problem with changing technologies. You know, in ninety six landline |
---|
0:39:24 | phones predominated, and we dealt with carbon and electret; |
---|
0:39:28 | now the world is largely cellular, and we need to explore VOIP, presently the new |
---|
0:39:34 | channel. So the technology keeps changing, and with progress we make the tests harder. |
---|
0:39:41 | We always want to add new evaluation conditions, new bells and whistles: |
---|
0:39:44 | more channel types, more speaking styles, languages... and the size of the evaluation data grows. |
---|
0:39:52 | In two thousand eleven we explored externally added noise and reverb. The noise will continue |
---|
0:39:58 | this year. So, Doug attempted in two thousand |
---|
0:40:04 | eight |
---|
0:40:06 | to look at this, to follow existing conditions over the course of the years and look at |
---|
0:40:10 | the best system. |
---|
0:40:12 | And here is an updated version of his slide, showing, for more or less fixed |
---|
0:40:17 | conditions, |
---|
0:40:20 | the logarithm of the |
---|
0:40:24 | DCF, I believe, |
---|
0:40:26 | where things worked. These numbers go up to two thousand six; |
---|
0:40:32 | with the added data on the right, two thousand eight showed |
---|
0:40:37 | some continued progress on various test conditions. Then in twenty ten |
---|
0:40:43 | we threw in the new measure. That really messes things up: numbers went up, but |
---|
0:40:49 | they're not directly comparable. This is the current state |
---|
0:40:58 | of our history slide tracking progress. |
---|
0:41:02 | So, let's, you know, turn to the future. |
---|
0:41:08 | SRE twelve: |
---|
0:41:10 | target speakers |
---|
0:41:11 | for the most part are specified in advance. They are speakers from recent past evaluations; I think |
---|
0:41:17 | it's something in the order of two thousand |
---|
0:41:23 | potential target speakers. So, sites can know about these targets, they have all the data, they |
---|
0:41:28 | can |
---|
0:41:29 | develop their systems to take advantage of that. All prior speech is available for training. |
---|
0:41:34 | There will be some new target speakers with training data provided at evaluation time; that's |
---|
0:41:39 | one check on the effect of providing the targets in advance. We'll also have |
---|
0:41:46 | test segments that include non-target speakers. |
---|
0:41:53 | That is the big change for twenty twelve. Also, new interview speech will be provided, |
---|
0:41:59 | as was mentioned yesterday, in sixteen bit |
---|
0:42:01 | linear PCM. |
---|
0:42:04 | Some of the test phone calls are gonna be collected specifically in noisy environments. |
---|
0:42:11 | And moreover, we're gonna have artificial |
---|
0:42:14 | noise, you know, added noise, as was done in BEST, on some test segments: another challenge |
---|
0:42:23 | for this community. But will this be an effectively easier task, |
---|
0:42:29 | because we define the targets in advance? |
---|
0:42:33 | It makes it, in a sense, partially closed-set: in a trial, you know, you are |
---|
0:42:40 | allowed to know not only about the one target but about the two thousand other targets. |
---|
0:42:45 | Will that make a difference? We have, you know, open workshops, |
---|
0:42:49 | workshops where we and the participants debate these things. Last December it got debated: how much |
---|
0:42:57 | will this |
---|
0:42:58 | change the systems? Will it make the problem too easy? |
---|
0:43:04 | We could have conditions where people are asked to assume |
---|
0:43:10 | that every test segment speaker is a target, making things fully closed-set, or to assume no |
---|
0:43:15 | information about targets other than that of the actual trial. |
---|
0:43:20 | Clearly speaker identification was done in the past; if people do this, their results provide a basis |
---|
0:43:25 | for comparison. This is what's to be |
---|
0:43:31 | investigated, to be seen, in SRE twelve. In terms of metrics, log likelihood ratios now |
---|
0:43:40 | are required, and since we're doing that, no hard decisions are asked for. |
---|
0:43:48 | In terms of the primary metric, |
---|
0:43:53 | we could, you know, just use |
---|
0:43:56 | the DCF of |
---|
0:43:57 | twenty ten, but Niko pointed out that you're not really required to calibrate your |
---|
0:44:06 | log likelihood ratios if you're only using them at one operating point. |
---|
0:44:12 | So therefore, |
---|
0:44:17 | to require calibration and stability, we're gonna actually have two DCFs and take the average |
---|
0:44:24 | of them. Also, Cllr is an alternative; Cllr-M10, which Niko referred to, |
---|
0:44:32 | limits the Cllr to trials in the region of |
---|
0:44:41 | high miss rate. |
---|
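For reference, the standard Cllr definition over the log likelihood ratios $\ell$ (a sketch; the M10 variant Niko mentioned restricts which trials enter the sums):

$$ C_{\mathrm{llr}} = \frac{1}{2} \left[ \frac{1}{N_{\mathrm{tar}}} \sum_{i \in \mathrm{tar}} \log_2\!\left(1 + e^{-\ell_i}\right) + \frac{1}{N_{\mathrm{non}}} \sum_{j \in \mathrm{non}} \log_2\!\left(1 + e^{\ell_j}\right) \right] $$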
0:44:45 | So, |
---|
0:44:47 | the formula for the DCF: we have three parameters, but we're working right at this one |
---|
0:44:51 | parameter, beta. And so |
---|
0:44:55 | the cost function is the simple average of DCF one and DCF two, with costs of |
---|
0:45:01 | one, where the target priors are either point zero zero one, as in twenty ten, or point |
---|
0:45:07 | zero one, |
---|
0:45:09 | an order of magnitude apart. |
---|
0:45:11 | That will be |
---|
0:45:13 | the official metric, sketched below. |
---|
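A minimal sketch of that primary metric in Python, assuming the standard normalized DCF with the Bayes threshold $\log\beta$ applied to the submitted log likelihood ratios (function and variable names are illustrative, not from the evaluation plan):

```python
import numpy as np

def norm_dcf_from_llrs(tgt_llrs, non_llrs, p_target, c_miss=1.0, c_fa=1.0):
    # Bayes decision: accept iff llr >= log(beta), beta = C_FA*(1-P_t)/(C_Miss*P_t)
    beta = (c_fa * (1.0 - p_target)) / (c_miss * p_target)
    threshold = np.log(beta)
    p_miss = np.mean(np.asarray(tgt_llrs) < threshold)
    p_fa = np.mean(np.asarray(non_llrs) >= threshold)
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Normalize by the better of the two trivial systems (always yes / always no).
    return dcf / min(c_miss * p_target, c_fa * (1.0 - p_target))

def sre12_primary(tgt_llrs, non_llrs):
    # Simple average of the two operating points: P_target = 0.01 and 0.001.
    return 0.5 * (norm_dcf_from_llrs(tgt_llrs, non_llrs, 0.01) +
                  norm_dcf_from_llrs(tgt_llrs, non_llrs, 0.001))
```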
0:45:17 | And finally: |
---|
0:45:20 | what does the future hold? |
---|
0:45:22 | That, of course, |
---|
0:45:24 | none of us knows. But |
---|
0:45:27 | for twenty twelve, the outcome will determine whether this |
---|
0:45:34 | idea of prespecified targets is |
---|
0:45:37 | an effective one that doesn't make the problem too easy. Now we're gonna |
---|
0:45:41 | see. |
---|
0:45:43 | Artificially added noise will be included now, and added reverb may be part of |
---|
0:45:49 | the future. |
---|
0:45:50 | HASR will be repeated in twelve; HASR ten had two optional tests, of fifteen trials |
---|
0:45:56 | or a hundred and fifty. HASR twelve will have either twenty or two hundred, and |
---|
0:46:04 | anyone... you know, those with forensic interests, but anyone interested in evaluating human assisted |
---|
0:46:10 | systems, is invited to participate in HASR twelve. I would like to get more participation |
---|
0:46:15 | this year. |
---|
0:46:17 | And the SRE's direction? |
---|
0:46:21 | The answer is: |
---|
0:46:24 | it's just bigger. |
---|
0:46:27 | Fifty or more participating sites; data volume is now getting up to terabytes. |
---|
0:46:34 | The BEST evaluation wasn't so much; this year, twenty twelve, will be, because we're |
---|
0:46:41 | providing |
---|
0:46:41 | most of the prior data plus new test data. And, you know, the numbers of |
---|
0:46:48 | segments are in the hundreds of thousands, and the number of trials |
---|
0:46:53 | is going to be in the millions, tens of millions, even hundreds of millions for the |
---|
0:46:58 | optional full sets of trials. So |
---|
0:47:05 | likely you'll see the schedule moving to an every-three-years one, but the details really need |
---|
0:47:11 | to be |
---|
0:47:14 | worked out a lot more. |
---|
0:47:19 | I don't know it all, but I think that's where |
---|
0:47:23 | I'll finish. |
---|
0:47:25 | Discussion? |
---|
0:48:00 | segments for a speaker |
---|
0:48:03 | know about our speakers |
---|
0:56:12 | so |
---|
0:56:31 | I didn't say they are a normal curve. |
---|
0:58:08 | Right, so LDC has an agreement with the sponsors who support the LRE and SRE evaluations |
---|
0:58:14 | that we will keep the most recent evaluation set blind, hold it back from publication |
---|
0:58:20 | in the general LDC catalog until the new data set has been created. And so |
---|
0:58:25 | part of the timing of the publication of those eval sets is |
---|
0:58:30 | the requirement to have a new blind set, |
---|
0:58:35 | which is the current evaluation set. |
---|
0:58:38 | And |
---|
0:58:39 | we can raise that issue with them, and give them the feedback we |
---|
0:58:43 | are getting. |
---|
0:58:56 | Right, so the SRE twelve eval set is just being finished now; |
---|
0:59:02 | as soon as that's finalised, |
---|
0:59:05 | SRE ten will be put into the queue for publication. |
---|
0:59:13 | It is sort of a rolling cycle. |
---|
0:59:21 | You'll have to ask the sponsor about that; I can't speak to their motivation, only that |
---|
0:59:25 | we're contractually obligated to delay the publication, as discussed. |
---|
0:59:54 | Right, well, LDC is also balancing the needs of the consortium as a whole, and |
---|
0:59:59 | so we are staging publications in the catalog, balancing a number of factors; |
---|
1:00:05 | the speaker recognition and language |
---|
1:00:07 | communities are among the communities that we support. I hear your concern; we can |
---|
1:00:11 | certainly raise this issue with the sponsors and see if there's |
---|
1:00:16 | anything we can provide. But at this point, I think this is the |
---|
1:00:20 | strategy that's standing. |
---|
1:03:02 | this one |
---|