0:00:15 | and i'll talk about the nist language recognition evaluations past and future this is |
---|
0:00:20 | work done with colleagues |
---|
0:00:22 | of an john george and jack |
---|
0:00:27 | so there are two tasks |
---|
0:00:30 | in language recognition |
---|
0:00:32 | identification which is choose among n specified target languages and detection is the speech in |
---|
0:00:39 | the target language |
---|
0:00:42 | and the lre tasks that have been part of the nist evaluations have evolved over |
---|
0:00:49 | time |
---|
0:00:50 | the early lres in ninety six oh three and two thousand five focused on |
---|
0:00:55 | identification |
---|
0:00:57 | and the recent lres focused on detection |
---|
0:01:01 | the most recent lre and the next lre will focus on detection limited to language |
---|
0:01:08 | pairs |
---|
0:01:11 | and the rationale for the change is that we believe the two class problem |
---|
0:01:16 | is conceptually simpler |
---|
0:01:19 | and represents the fundamental challenge |
---|
0:01:22 | and the improved performance over time has required ever increasing data to reliably estimate error |
---|
0:01:28 | rates |
---|
0:01:32 | there are three category distinctions in lre |
---|
0:01:37 | dialect which might be thought of as speech patterns of a particular group |
---|
0:01:42 | language which is a dialect with an army and a navy |
---|
0:01:48 | and linguistic variety a way to dodge the issue |
---|
0:01:57 | like the task the category distinctions what we're actually trying to recognise changed over time |
---|
0:02:03 | in earlier lres there was a distinction between language and dialects |
---|
0:02:07 | and in fact there were separate dialect and language tests in those years except oh |
---|
0:02:11 | three |
---|
0:02:13 | in recent years and in the next lre we've made no distinction between languages and |
---|
0:02:19 | dialects |
---|
0:02:20 | and instead test confusable linguistic variety clusters |
---|
0:02:26 | and among the reasons for the change is that there is no accepted language dialect criteria |
---|
0:02:31 | and that dialect is used in inconsistent ways for example |
---|
0:02:36 | chinese dialects are i'm sorry chinese languages are mutually unintelligible |
---|
0:02:43 | but hindi or i'll start over chinese dialects are mutually unintelligible but hindi and urdu distinctions |
---|
0:02:49 | are primarily non-linguistic |
---|
0:02:57 | there are three data collection approaches that have been used in lre |
---|
0:03:01 | one we might refer to as caller where someone's paid to make a single phone |
---|
0:03:05 | call and his or her speech is used |
---|
0:03:08 | a claque based model |
---|
0:03:09 | where you pay someone to make many calls and the speech of the interlocutor is used |
---|
0:03:13 | and then broadcast where you find narrowband speech and radio broadcasts |
---|
0:03:23 | early lres took the caller approach and recent lres in two thousand nine eleven and |
---|
0:03:29 | the next lre |
---|
0:03:30 | will combine the claque |
---|
0:03:31 | and broadcast approaches |
---|
0:03:35 | and the reason for the change is that the large number of unique speakers of each |
---|
0:03:38 | i'm sorry there are a large number of unique speech or speakers needed for each |
---|
0:03:42 | language |
---|
0:03:43 | and single speaker phone calls will become increasingly expensive to collect and experiments showed |
---|
0:03:51 | that broadcast could be used in language recognition evaluation |
---|
0:03:57 | to produce comparable performance results |
---|
0:04:03 | so there are two broad classes of metrics that have been used cdet which |
---|
0:04:08 | we see here is a weighted linear combination of the miss and false alarms and |
---|
0:04:13 | cdet language pair which is a linear combination of miss and false alarms but for |
---|
0:04:18 | each language pair |
---|
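The weighted linear combination of miss and false alarm rates mentioned here can be sketched in Python. This is a minimal illustration, not the official scoring tool: the function names, the default costs (`c_miss`, `c_fa`), the target prior (`p_target`) and the toy trial data are all assumptions made for the example.

```python
def error_rates(decisions, is_target):
    """Return (p_miss, p_fa) from hard decisions and ground truth.

    decisions[i] is True when the system said "target language";
    is_target[i] is True when trial i really is the target language.
    """
    n_target = sum(is_target)
    n_nontarget = len(is_target) - n_target
    misses = sum(1 for d, t in zip(decisions, is_target) if t and not d)
    false_alarms = sum(1 for d, t in zip(decisions, is_target) if d and not t)
    return misses / n_target, false_alarms / n_nontarget


def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted linear combination of miss and false alarm rates.

    The default weights are assumed values for this sketch.
    """
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


# Toy trials (made up): two target trials, two non-target trials.
decisions = [True, False, True, False]
is_target = [True, True, False, False]

p_miss, p_fa = error_rates(decisions, is_target)
print(detection_cost(p_miss, p_fa))
```

For the language pair variant, the same computation would be run once per pair, with trials restricted to the two languages in that pair.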
0:04:21 | the earlier lres used cdet that's the very early lres used cdet |
---|
0:04:26 | the more recent lres used average cdet and the most recent lre |
---|
0:04:30 | used average cdet over language pairs |
---|
0:04:36 | and the primary reason the metric has changed has been to reflect |
---|
0:04:42 | new task focuses |
---|
0:04:46 | so here we see |
---|
0:04:49 | the average cdet for thirty seconds ten seconds and three seconds |
---|
0:04:55 | where the red line is thirty seconds |
---|
0:04:59 | that's thirty seconds of speech |
---|
0:05:00 | ten seconds of speech |
---|
0:05:03 | three seconds of speech |
---|
0:05:04 | and we see performance improvements over the years with some caveats |
---|
0:05:09 | in particular the ones we just discussed that the task change from identification to detection |
---|
0:05:15 | the languages changed from year to year |
---|
0:05:18 | and the data sources changed |
---|
0:05:20 | from |
---|
0:05:21 | calls |
---|
0:05:24 | solely calls in these years to calls and broadcasts |
---|
0:05:27 | in two thousand nine |
---|
0:05:30 | and we see in two thousand nine for example on the thirty second |
---|
0:05:34 | speech segments |
---|
0:05:35 | that there were few errors observed |
---|
0:05:38 | in leading systems |
---|
0:05:43 | so here we see leading systems for the language pair american english indian english |
---|
0:05:50 | this is the most studied pair in the sense that |
---|
0:05:53 | it started back in two thousand five |
---|
0:05:55 | and we see good performance improvement over time where the blue is |
---|
0:06:00 | the min cdet language pair for |
---|
0:06:03 | thirty sec sorry blue is for lre oh seven |
---|
0:06:08 | red is lre oh nine |
---|
0:06:09 | and green lre eleven and here we see thirty seconds ten seconds and three seconds |
---|
0:06:15 | a consistent improvement |
---|
0:06:19 | for hindi urdu the picture's less rosy |
---|
0:06:24 | the language pair remains challenging especially for the shorter durations |
---|
0:06:29 | and the improvement we've seen over time is limited again especially for the three |
---|
0:06:33 | seconds |
---|
0:06:36 | we suspect that it's really in large part due to the problematic language distinction although |
---|
0:06:43 | human tests showed some consistency |
---|
0:06:46 | with annotator judgements there were also some consistency issues that were observed |
---|
0:06:54 | here we see results for dari farsi |
---|
0:07:01 | and we see improvement from lre oh nine to lre eleven in the thirty seconds and the |
---|
0:07:07 | three seconds |
---|
0:07:12 | and here we see the russian ukrainian language pair |
---|
0:07:16 | and we're |
---|
0:07:18 | noticing |
---|
0:07:22 | a reverse trend |
---|
0:07:23 | where lre eleven actually saw worse performance |
---|
0:07:27 | and we suspect that this may have been due to a change in data source between |
---|
0:07:31 | the |
---|
0:07:32 | training and evaluation data |
---|
0:07:37 | so in summary nist has coordinated lres since nineteen ninety six |
---|
0:07:41 | and has emphasized detecting target language classes of interest in recent years |
---|
0:07:47 | but the nature of the language classes has evolved earlier evaluations achieved high |
---|
0:07:52 | performance on broad language classes with separate dialect tests and this led to the change |
---|
0:07:59 | and later |
---|
0:08:01 | the change was to move away from the language dialect distinction |
---|
0:08:04 | towards pairwise testing of closely related varieties |
---|
0:08:10 | so for future evaluations the next language recognition evaluation is planned for twenty |
---|
0:08:15 | fifteen with pairwise testing within six broad language clusters |
---|
0:08:22 | utilizing newly collected cts and broadcast news speech sorry broadcast narrowband speech |
---|
0:08:29 | the system output will be a vector of log likelihoods |
---|
0:08:33 | which is a change from the |
---|
0:08:35 | past evaluations |
---|
0:08:37 | for each cluster we'll average performance over all pairs in the cluster and the overall measure |
---|
0:08:43 | will be the mean of the six cluster actual decisions |
---|
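The two-level averaging described here (average over the pairs in each cluster, then take the mean over clusters) can be sketched as follows. This is a hedged illustration only: the cluster names, pair labels and cost values are invented, and only two clusters are shown rather than six.

```python
from statistics import mean


def cluster_cost(pair_costs):
    """Average the per-pair costs inside one cluster."""
    return mean(pair_costs.values())


def overall_cost(clusters):
    """Mean of the per-cluster averages, so every cluster counts equally."""
    return mean([cluster_cost(pairs) for pairs in clusters.values()])


# Hypothetical clusters and made-up per-pair costs.
clusters = {
    "cluster_a": {("hindi", "urdu"): 0.30, ("dari", "farsi"): 0.20},
    "cluster_b": {("russian", "ukrainian"): 0.10},
}

print(overall_cost(clusters))
```

Averaging within a cluster first means a cluster with many pairs does not outweigh a cluster with few pairs in the overall figure.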
0:08:48 | and it's open to all participants so for more information please join the lre mailing |
---|
0:08:54 | list by contacting us there |
---|
0:08:57 | thank you very much |
---|
0:09:16 | so |
---|
0:09:17 | what about the pairwise measure |
---|
0:09:22 | so the pairwise measure is actually going to be different in |
---|
0:09:26 | the next lre than in the last one but we will continue to emphasize language |
---|
0:09:30 | pairs as a research task |
---|
0:09:35 | we believe that this is |
---|
0:09:40 | a |
---|
0:09:44 | we believe this is focusing on the core problem |
---|
0:09:47 | in language recognition |
---|
0:09:49 | i want to say that |
---|
0:09:52 | solving chinese english |
---|
0:09:55 | distinction is no longer interesting |
---|
0:09:59 | but maybe two varieties of english is more interesting |
---|
0:10:05 | task |
---|
0:10:15 | i wasn't there in two thousand eleven and i would be interested do you |
---|
0:10:20 | still make the det plots because you were talking about |
---|
0:10:24 | cdet which is fine but do you still make the plots |
---|
0:10:27 | as well |
---|
0:10:29 | i'm trying to recall but i want to say twenty eleven was the first workshop |
---|
0:10:34 | presentation without any det plots okay that's cool |
---|
0:10:38 | but you could you could draw det plots for detection yes and then i would be interested to |
---|
0:10:43 | see what you put along the axes |
---|
0:10:49 | i think det plots show probabilities so are you going to say probability of false alarm |
---|
0:10:53 | or are you going to say probability of indian english given the fact that it's american |
---|
0:11:01 | i would i would go for the latter one |
---|
0:11:06 | thank you |
---|
0:11:08 | i still wanna go back one point with this is i and the pair maybe |
---|
0:11:13 | someone |
---|
0:11:14 | isn't getting what |
---|
0:11:15 | give me a system that operates that way i mean to where you by saying |
---|
0:11:19 | that you telling |
---|
0:11:20 | basically detection system years used |
---|
0:11:23 | i data much label by language |
---|
0:11:27 | where does the pairwise thing come into that at the system level i understand |
---|
0:11:31 | from |
---|
0:11:32 | maybe for research perspective so |
---|
0:11:36 | you get distinction is what's just operate it more than one which systems that way |
---|
0:11:40 | right |
---|
0:11:43 | that's a that's an interesting question it's difficult for me to answer i think there's |
---|
0:11:49 | a tradeoff between |
---|
0:11:50 | being application focused and being research focused |
---|
0:11:54 | not to say that they're entirely different but i think in this case it's a |
---|
0:11:57 | tradeoff and we're really more towards the research currently |
---|
0:12:18 | so you said you are gonna ask us to to give you a vector |
---|
0:12:23 | of language log-likelihoods yes and then you're going to subtract |
---|
0:12:28 | two of those to get the score that would differentiate between pairs of languages such |
---|
0:12:33 | as |
---|
0:12:35 | so that's very nice because |
---|
0:12:39 | the single vector of likelihoods is a lot smaller than all the possible pairs so that |
---|
0:12:47 | that's a nice compact score format yes i think the only request is that you |
---|
0:12:52 | submit all pairs |
---|
0:12:55 | so sorry just as i was making a joke sorry of |
---|
0:13:01 | so |
---|
0:13:03 | are you gonna concentrate again on hard decisions so you |
---|
0:13:07 | you gonna have a cdet set up at the threshold of zero so is |
---|
0:13:10 | that you gonna the that the criterion is then just gonna depend on whether the |
---|
0:13:14 | score is |
---|
0:13:15 | on this side or that side of the threshold |
---|
0:13:18 | so |
---|
0:13:19 | then it's not gonna matter what the scale of the log-likelihood |
---|
0:13:23 | vector is the has always comes are then you lose that one dimension of calibration |
---|
0:13:29 | then it's just |
---|
0:13:31 | the location of that vector in log-likelihood space matters but not the scale |
---|
0:13:36 | yes i understand you |
---|
0:13:38 | if you somehow |
---|
0:13:40 | do multiple operating points like you did in the sre |
---|
0:13:46 | then you would get a handle on the scale |
---|
0:13:49 | the scale factor as well |
---|
0:13:50 | okay thank you this is something to consider when planning |
---|
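The point raised in this exchange can be sketched in Python: a pair score is the difference of two entries of the log-likelihood vector, and a hard decision at a threshold of zero does not change when the whole vector is multiplied by a positive scale. The language labels, the numeric values and the helper names `pair_score` and `decide` are assumptions made for this illustration.

```python
from itertools import combinations


def pair_score(loglik, lang_a, lang_b):
    """Score for lang_a versus lang_b: difference of two log-likelihoods."""
    return loglik[lang_a] - loglik[lang_b]


def decide(loglik, lang_a, lang_b, threshold=0.0):
    """Hard decision: pick lang_a when the pair score exceeds the threshold."""
    return lang_a if pair_score(loglik, lang_a, lang_b) > threshold else lang_b


# One made-up log-likelihood vector for a single speech segment.
loglik = {"hindi": -1.2, "urdu": -2.0, "dari": -5.0, "farsi": -4.1}

# All unordered language pairs are recovered from the one vector.
pairs = list(combinations(sorted(loglik), 2))

# Multiplying the vector by a positive factor leaves every decision unchanged.
scaled = {lang: 10.0 * value for lang, value in loglik.items()}
same = all(decide(loglik, a, b) == decide(scaled, a, b) for a, b in pairs)

print(len(loglik), len(pairs), same)
```

With a single threshold at zero only the sign of each difference is scored, which is why the questioner suggests several operating points as a way to make the scale of the vector matter.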
0:13:56 | next |
---|
0:14:08 | well |
---|
0:14:13 | i |
---|
0:14:15 | in earlier years we had this out-of-language problem and now that the new evaluations |
---|
0:14:22 | came out you allowed people to work on this topic |
---|
0:14:28 | so with the detection task it still possible to have a out of we can |
---|
0:14:36 | not only above is an alternative so you can have |
---|
0:14:42 | french or whatever the map that you have some is we is not closed set |
---|
0:14:46 | up you have a unknown language you also rate we will i want to say |
---|
0:14:50 | we can double |
---|
0:14:53 | if there were say twenty languages you could have a twenty |
---|
0:14:57 | dimensional vector for the closed and a twenty one dimensional vector for the |
---|
0:15:02 | open |
---|
0:15:03 | do you have other information on the timeline on the schedule yes so |
---|
0:15:09 | right now we're deliberating between having a spring workshop and a summer workshop |
---|
0:15:16 | so that would be the first half of the year first half in the |
---|
0:15:24 | case of the spring workshop or the second half in the case of the summer |
---|
0:15:27 | workshop |
---|
0:15:36 | okay |
---|