0:00:15i'll talk about the nist language recognition evaluations past and future this is
0:00:20work done with colleagues
0:00:22of mine john george and jack
0:00:27so there are two tasks
0:00:30in language recognition
0:00:32identification which is choose among n specified target languages and detection is the speech in
0:00:39the target language
0:00:42and the lre tasks that have been part of the nist evaluations have evolved over
0:00:49time
0:00:50the early lres in ninety six oh three and two thousand five focused on
0:00:55identification
0:00:57and the recent lres focused on detection
0:01:01the most recent lre and the next lre will focus on detection limited to language
0:01:08pairs
0:01:11and the rationale for the change is that we believe the two class problem
0:01:16is conceptually simpler
0:01:19and represents the fundamental challenge
0:01:22and the improved performance over time has required ever increasing data to reliably estimate error
0:01:28rates
0:01:32there are three category distinctions in lre
0:01:37dialect which might be thought of as speech patterns of a particular group
0:01:42language which is a dialect with an army and a navy
0:01:48and linguistic variety a way to dodge the issue
0:01:57like the task the category distinctions what we're actually trying to recognise changed over time
0:02:03in earlier lres there was a distinction between languages and dialects
0:02:07and in fact there were separate dialect and language tests in those years except for oh
0:02:11three
0:02:13in recent years and in the next lre we've made no distinction between languages and
0:02:19dialects
0:02:20and instead test confusable linguistic variety clusters
0:02:26and among the reasons for the change is that there is no accepted language dialect criteria
0:02:31and that dialect is used in inconsistent ways for example
0:02:36chinese dialects are i'm sorry chinese languages are mutually unintelligible
0:02:43but hindi or i'll restart chinese dialects are mutually unintelligible but hindi and urdu distinctions
0:02:49are primarily non-linguistic
0:02:57there are three data collection approaches that have been used in lre
0:03:01one we might refer to as caller where someone's paid to make a single phone
0:03:05call and his or her speech is used
0:03:08a claque based model
0:03:09where we pay someone to make many calls and the speech of the interlocutor is used
0:03:13and then broadcast where you find narrowband speech in radio broadcasts
0:03:23early lres took the caller approach and recent lres in two thousand nine eleven and
0:03:29the next lre
0:03:30will combine the claque
0:03:31and broadcast approaches
0:03:35and the reason for the change is that the large number of unique speakers of each
0:03:38i'm sorry there are a large number of unique speech or speakers needed for each
0:03:42language
0:03:43and single speaker phone calls will become increasingly expensive to collect and experiments showed
0:03:51that broadcast could be used in language recognition evaluation
0:03:57to produce comparable performance results
0:04:03so there are two broad classes of metrics that have been used c det which
0:04:08we see here is a weighted linear combination of the miss and false alarms and
0:04:13c det language pair which is a linear combination of miss and false alarms but for
0:04:18each language pair
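The weighted combination of miss and false alarm rates described here can be sketched in a few lines of Python. This is a minimal illustration and not the official scoring code: the function names, the equal costs, the target prior of 0.5, and the scores are all assumptions made up for the example.

```python
def error_rates(target_scores, nontarget_scores, threshold=0.0):
    """Return (p_miss, p_fa) for hard decisions taken at `threshold`.

    A target trial is missed when its score falls below the threshold;
    a non-target trial is a false alarm when its score reaches it.
    """
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Weighted linear combination of miss and false alarm rates.

    The weights (costs and target prior) are assumed values for illustration.
    """
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


# Made-up scores for one language pair.
target_scores = [1.0, -0.5, 2.0, 0.3]
nontarget_scores = [-1.0, 0.2, -2.0, -0.1]

p_miss, p_fa = error_rates(target_scores, nontarget_scores)
pair_cost = detection_cost(p_miss, p_fa)
```

The per-pair variant applies the same formula to the trials of a single language pair, which is why the same two functions cover both classes of metric in this sketch.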
0:04:21the earlier lres used c det the very early lres used c det
0:04:26the more recent lres used an average c det and the most recent lre
0:04:30used average c det over language pairs
0:04:36and the primary reason the metric has changed has been to reflect
0:04:42the new task focuses
0:04:46so here we see
0:04:49the average c det for thirty seconds ten seconds and three seconds
0:04:55where the red line is thirty seconds
0:04:59that's thirty seconds of speech
0:05:00ten seconds of speech
0:05:03three seconds of speech
0:05:04and we see performance improvements over the years with some caveats
0:05:09in particular the ones we just discussed that the task changed from identification to detection
0:05:15the languages changed from year to year
0:05:18and the data sources changed
0:05:20from
0:05:21calls
0:05:24solely calls in these years to calls and broadcasts
0:05:27in two thousand nine
0:05:30and we see in two thousand nine for example on the thirty second
0:05:34speech segments
0:05:35that there were few errors observed
0:05:38in leading systems
0:05:43so here we see how leading systems performed for a language pair american english indian english
0:05:50this is the most studied pair in the sense that
0:05:53it started back in two thousand five
0:05:55and we see a good performance improvement over time where the blue is
0:06:00the min c det language pair for
0:06:03thirty sec sorry blue is for lre oh seven
0:06:08red lre oh nine
0:06:09and green lre eleven and here we see thirty seconds ten seconds and three seconds
0:06:15a consistent improvement
0:06:19for hindi urdu the picture's less rosy
0:06:24the language pair remains challenging especially for the shorter durations
0:06:29and the improvement we've seen over time is limited again especially for the three
0:06:33seconds
0:06:36we suspect that it's really in large part due to the problematic language distinction although
0:06:43human tests showed some consistency
0:06:46with annotator judgements there were also some consistency issues that were observed
0:06:54here we see results for dari farsi
0:07:01and we see improvement from lre oh nine to lre eleven in the thirty seconds and the
0:07:07three seconds
0:07:12and here we see the russian ukrainian language pair
0:07:16and we're
0:07:18noticing
0:07:22a reverse trend
0:07:23where lre eleven actually saw worse performance
0:07:27and we suspect that this may have been due to a change in data source between
0:07:31the
0:07:32training and evaluation data
0:07:37so in summary nist has coordinated lres since nineteen ninety six
0:07:41and has emphasized detecting target language classes of interest in recent years
0:07:47but the nature of the language classes has evolved earlier evaluations achieved high
0:07:52performance on broad language classes with separate dialect tests and this led to the change
0:07:59and later
0:08:01the change was to move away from the language dialect distinction
0:08:04towards pairwise testing of closely related varieties
0:08:10so for future evaluations the next language recognition evaluation is planned for twenty
0:08:15fifteen with pairwise testing within six broad language clusters
0:08:22utilizing newly collected cts and broadcast news speech i'm sorry broadcast narrowband speech
0:08:29the system output will be a vector of log likelihoods
0:08:33which is a change from the
0:08:35past evaluations
0:08:37for each cluster we'll average performance over all pairs in the cluster and the overall measure
0:08:43will be the mean of the six cluster actual decisions
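The two-level averaging described here (average over the pairs inside each cluster, then take the mean over clusters) might look like the sketch below. The cluster names, pair names, and cost values are hypothetical placeholders, and only two clusters are shown rather than six to keep the example short.

```python
from statistics import mean


def overall_measure(pair_costs_by_cluster):
    """Average pair costs within each cluster, then average the cluster means.

    `pair_costs_by_cluster` maps a cluster name to a dict of
    {(language_a, language_b): cost}. Returns (overall, per_cluster_means).
    """
    cluster_means = {
        cluster: mean(list(pair_costs.values()))
        for cluster, pair_costs in pair_costs_by_cluster.items()
    }
    return mean(list(cluster_means.values())), cluster_means


# Hypothetical clusters and costs, for illustration only.
pair_costs_by_cluster = {
    "cluster_a": {("x", "y"): 0.1, ("x", "z"): 0.3},
    "cluster_b": {("p", "q"): 0.4},
}

overall, cluster_means = overall_measure(pair_costs_by_cluster)
```

Averaging per cluster first means a cluster with many pairs does not dominate the overall figure, which is one plausible motivation for this design.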
0:08:48and it's open to all participants so for more information please join the lre email
0:08:54list by contacting us there
0:08:57thank you very much
0:09:16so
0:09:17what about the pairwise measure
0:09:22so the pairwise measure is actually going to be different in
0:09:26the next lre than in the last one but we will continue to emphasize language
0:09:30pairs as a research task
0:09:35we believe that this is
0:09:40a
0:09:44we believe this is focusing on the core problem
0:09:47in language recognition
0:09:49i want to say that
0:09:52solving chinese english
0:09:55distinction is no longer interesting
0:09:59but maybe two varieties of english is more interesting
0:10:05task
0:10:15i wasn't there in two thousand eleven and i would be interested do you
0:10:20still make the det plots because you were talking about
0:10:24c det which is fine do you still make the plots
0:10:27as well
0:10:29i'm trying to recall but i want to say twenty eleven was the first workshop
0:10:34presentation without any det plots ah that's cool
0:10:38but you could you could draw det plots for detection yes and then i would be interested to
0:10:43see what you put along the axes
0:10:49i think that point probabilities are what are you going to say probability of false alarm
0:10:53or are you going to say probability of indian english given the fact that it's american
0:11:01i would i would go for the latter one
0:11:06thank you
0:11:08i still wanna go back one point with this is i and the pair maybe
0:11:13someone
0:11:14isn't getting what
0:11:15give me a system that operates that way i mean to where you by saying
0:11:19that you telling
0:11:20basically detection system years used
0:11:23i data much label by language
0:11:27where does the pairwise thing come into that at the system level i understand
0:11:31from
0:11:32maybe a research perspective so
0:11:36you get distinction is what's just operate it more than one which systems that way
0:11:40right
0:11:43that's a that's an interesting question it's difficult for me to answer i think there's
0:11:49a tradeoff between
0:11:50being application focused and being research focused
0:11:54not to say that they're entirely different but i think in this case it's a
0:11:57tradeoff and so we're really more towards the research currently
0:12:18so you said you are gonna ask us to to give you a vector
0:12:23of language log-likelihoods yes and then you're going to subtract
0:12:28two of those to get the score that would differentiate between pairs of languages such
0:12:33as
0:12:35so that's very nice because
0:12:39the single vector of likelihoods is a lot smaller than all the possible pairs so that
0:12:47that's a nice compact score format yes i think the only request is that you
0:12:52submit all pairs
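The idea raised in this exchange, submitting one vector of per-language log-likelihoods and deriving each pair score by subtracting two entries, can be sketched as follows. The function name and the numbers are assumptions for illustration; the language names are borrowed from pairs mentioned in the talk, but their grouping into one vector here is hypothetical.

```python
from itertools import combinations


def pairwise_scores(loglik):
    """Derive a score for every language pair from one log-likelihood vector.

    `loglik` maps language name -> log-likelihood for a single segment.
    The score for pair (a, b) is loglik[a] - loglik[b]; a positive value
    favours language a over language b.
    """
    return {
        (a, b): loglik[a] - loglik[b]
        for a, b in combinations(sorted(loglik), 2)
    }


# Made-up log-likelihoods for one test segment.
loglik = {"hindi": -1.0, "urdu": -3.0, "dari": -2.0}
scores = pairwise_scores(loglik)
```

With N languages the vector has N entries while the number of pairs is N(N-1)/2, which is the compactness the questioner points out.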
0:12:55so sorry just as i was making a joke sorry of
0:13:01so
0:13:03are you gonna concentrate again on hard decisions so you
0:13:07you gonna have a c det set up at the threshold of zero so is
0:13:10that you gonna the the criterion is then just gonna depend on whether the
0:13:14score is
0:13:15on that side or this side of the threshold
0:13:18so
0:13:19that then you gonna then it's not gonna matter what the scale of the log-likelihood
0:13:23vector is so then you lose that one dimension of calibration
0:13:29then it's just
0:13:31the location of that vector in log-likelihood space matters but not the scale
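The point made here, that hard decisions at a threshold of zero do not depend on the scale of the log-likelihood vector, can be checked with a small sketch. The helper names and the numbers are assumptions for illustration only.

```python
from itertools import combinations


def hard_decisions(loglik, threshold=0.0):
    """Decide each pair (a, b) in favour of a when loglik[a] - loglik[b] > threshold."""
    return {
        (a, b): (loglik[a] - loglik[b]) > threshold
        for a, b in combinations(sorted(loglik), 2)
    }


def rescale(loglik, scale):
    """Multiply every log-likelihood by a positive constant."""
    return {lang: scale * value for lang, value in loglik.items()}


# Made-up log-likelihoods for one test segment.
loglik = {"dari": -3.0, "hindi": -4.2, "urdu": -3.7}

# At threshold zero, any positive rescaling leaves every decision unchanged.
at_zero = hard_decisions(loglik)
same_at_zero = all(
    hard_decisions(rescale(loglik, s)) == at_zero for s in (0.1, 2.0, 50.0)
)

# At a non-zero threshold (a second operating point) the scale does matter.
differs_off_zero = hard_decisions(loglik, 1.0) != hard_decisions(rescale(loglik, 2.0), 1.0)
```

Multiplying by a positive constant never changes the sign of a difference, so a single zero threshold cannot reveal the scale, while a second operating point can.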
0:13:36yes i understand you
0:13:38if you somehow
0:13:40do multiple operating points like you did in the sre
0:13:46then you would get a handle on the scale
0:13:49the scale factor as well
0:13:50okay thank you this is something to consider when planning
0:13:56next
0:14:08well
0:14:13i
0:14:15in earlier years we had this out-of-language problem and now other than the new evaluations
0:14:22came out to you allowed people to the wall on this topic
0:14:28so with the detection task is it still possible to have an out of we can
0:14:36not only above is an alternative so you can have
0:14:42french or whatever the map that you have some is we is not closed set
0:14:46up you have a unknown language you also rate we will i want to say
0:14:50we can double
0:14:53if there were say twenty languages you could have a twenty
0:14:57dimensional vector for the closed and twenty one dimensional vector for the
0:15:02open
0:15:03do you have other information on the timeline on the schedule yes so
0:15:09right now we're deliberating between having a spring workshop and a summer workshop
0:15:16so that would be the first half of the year first half in the
0:15:24case of the spring workshop or the second half in the case of the summer
0:15:27workshop
0:15:36okay