i'll talk about the nist language recognition evaluations past and future this is

work done with colleagues

alvin john george and jack

so there are two tasks

in language recognition

identification which is choose among n specified target languages and detection is the speech in

the target language
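
A minimal Python sketch of the two tasks, assuming a system that produces one score per language for each speech segment. The score dictionary, the function names, and the zero threshold are illustrative assumptions, not part of any evaluation plan.

```python
def identify(scores):
    """Identification: choose one language among the N specified targets."""
    return max(scores, key=scores.get)


def detect(scores, target, threshold=0.0):
    """Detection: answer yes/no to 'is the speech in the target language?'."""
    return scores[target] > threshold


# Hypothetical per-language scores for a single speech segment.
segment_scores = {"hindi": 1.2, "urdu": 0.9, "dari": -2.0}

print(identify(segment_scores))        # hindi
print(detect(segment_scores, "urdu"))  # True
print(detect(segment_scores, "dari"))  # False
```

Identification always returns exactly one language, while detection is asked independently for each target, so several targets (or none) can be accepted for the same segment.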

and the lre tasks that have been part of the nist evaluations have evolved over

time

the early lres in ninety six oh three and two thousand five focused on

identification

and the recent lres focused on detection

the most recent lre and the next lre will focus on detection limited to language

pairs

and the rationale for the change is that we believe the two class problem

is conceptually simpler

and represents the fundamental challenge

and the improved performance over time has required ever increasing data to reliably estimate error

rates

there are three category distinctions in lre

dialect which might be thought of as speech patterns of a particular group

language which is a dialect with an army and a navy

and linguistic variety a way to dodge the issue

like the task the category distinctions what we're actually trying to recognise changed over time

in earlier lres there was a distinction between language and dialects

and in fact there were separate dialect and language tests in those years except for oh

three

in recent years and in the next lre we've made no distinction between languages and

dialects

and instead test confusable linguistic variety clusters

and among the reasons for the change is that there is no accepted language dialect criterion

and that dialect is used in inconsistent ways for example

chinese dialects are mutually unintelligible but hindi and urdu distinctions

are primarily non-linguistic

there are three data collection approaches that have been used in lre

one we might refer to as caller where someone's paid to make a single phone

call and his or her speech is used

a claque based model

where we pay someone to make many calls and the speech of the interlocutor is used

and then broadcast where you find narrowband speech in radio broadcasts

early lres took the caller approach and recent lres in two thousand nine eleven and

the next lre

will combine the claque

and broadcast approaches

and the reason for the change is that

there are a large number of unique speakers needed for each

language

and single speaker phone calls will become increasingly expensive to collect and experiments showed

that broadcast could be used in language recognition evaluation

to produce comparable performance results

so there are two broad classes of metrics that have been used c det which

we see here is a weighted linear combination of the miss and false alarms and

c det language pair which is a linear combination of miss and false alarms but for

each language pair
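
The weighted linear combination of miss and false alarm rates described above can be sketched in Python as follows. The equal costs, the 0.5 target prior, and the function names are assumptions chosen for illustration; the actual weights are set by each evaluation plan.

```python
def error_rates(target_decisions, nontarget_decisions):
    """Return (p_miss, p_fa) from lists of boolean 'yes' decisions.

    target_decisions:    decisions on segments that ARE in the target language
    nontarget_decisions: decisions on segments that are NOT in the target language
    """
    p_miss = sum(1 for d in target_decisions if not d) / len(target_decisions)
    p_fa = sum(1 for d in nontarget_decisions if d) / len(nontarget_decisions)
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Weighted linear combination of miss and false alarm rates.

    p_target, c_miss and c_fa are assumed values, not official parameters.
    """
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


# Hypothetical decisions for one language pair: one miss in four target
# trials and one false alarm in five non-target trials.
target_decisions = [True, True, True, False]
nontarget_decisions = [False, True, False, False, False]

p_miss, p_fa = error_rates(target_decisions, nontarget_decisions)
cost = detection_cost(p_miss, p_fa)
print(p_miss, p_fa, round(cost, 3))  # 0.25 0.2 0.225
```

For the language pair version, the same cost is computed separately for each pair, with the non-target trials restricted to the other language in the pair.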

the very early lres used c det

the more recent lres used average c det and the most recent lre

used average c det over language pairs

and the primary reason the metric has changed has been to reflect

new task focuses

so here we see

the average c det for thirty seconds ten seconds and three seconds

where the red line is thirty seconds

that's thirty seconds of speech

ten seconds of speech

three seconds of speech

and we see performance improvements over the years with some caveats

in particular the ones we just discussed that the task changed from identification to detection

that the languages changed from year to year

and the data sources changed

from

calls

solely calls in these years to calls and broadcasts

in two thousand nine

and we see in two thousand nine for example on the thirty second

speech segments

that there were few errors observed

in leading systems

so here we see how leading systems did for a language pair american english indian english

this is the most studied pair in the sense that

it started back in two thousand five

and we see a good performance improvement over time where the blue is

the min c det language pair for

thirty sec sorry blue is for lre oh seven

red lre oh nine

and green lre eleven and here we see thirty seconds ten seconds and three seconds

a consistent improvement

for hindi urdu the picture is less rosy

the language pair remains challenging especially for the shorter durations

and the improvement we've seen over time is limited again especially for the three

seconds

we suspect that it's really in large part due to the problematic language distinction although

human tests showed some consistency

with annotator judgements there were also some consistency issues that were observed

here we see results for dari farsi

and we see improvement from lre oh nine to lre eleven in the thirty seconds and the

three seconds

and here we see the russian ukrainian language pair

and we're

noticing

a reverse trend

where lre eleven actually saw worse performance

and we suspect that this may have been due to a change in data source between

the

training and evaluation data

so in summary nist has coordinated lres since nineteen ninety six

and has emphasized detecting target language classes of interest in recent years

but the nature of the language classes has evolved earlier evaluations achieved high

performance on broad language classes with separate dialect tests and this led to the change

and later

the change was to move away from the language dialect distinction

towards pairwise testing of closely related varieties

so for future evaluations the next language recognition evaluation is planned for twenty

fifteen with pairwise testing within six broad language clusters

utilizing newly collected cts and broadcast narrowband speech

the system output will be a vector of log likelihoods

which is a change from the

past evaluations

for each cluster we'll average performance over all pairs in the cluster and the overall measure

will be the mean of the six cluster averages on actual decisions
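
The two-level averaging just described (average within each cluster, then take the mean across clusters) can be sketched in Python as below. The cluster names, pair names, and cost values are invented for illustration and are not evaluation results.

```python
from statistics import mean


def overall_measure(pair_costs_by_cluster):
    """Average pair costs within each cluster, then average the cluster means."""
    cluster_averages = {
        cluster: mean(list(pair_costs.values()))
        for cluster, pair_costs in pair_costs_by_cluster.items()
    }
    return mean(list(cluster_averages.values())), cluster_averages


# Hypothetical per-pair costs; only two clusters are shown to keep it short.
pair_costs_by_cluster = {
    "cluster_a": {("lang1", "lang2"): 0.1, ("lang1", "lang3"): 0.3},
    "cluster_b": {("lang4", "lang5"): 0.4},
}

overall, per_cluster = overall_measure(pair_costs_by_cluster)
print(per_cluster)
print(overall)
```

Averaging per cluster first gives every cluster the same weight in the overall measure, regardless of how many pairs it contains. Pooling all pairs directly would instead favor the larger clusters.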

and it's open to all participants so for more information please join the lre email

list by contacting us there

thank you very much

so

what's the pairwise measure

so the pairwise measure is actually going to be different in

the next lre than in the last one but we will continue to emphasize language

pairs as a research task

we believe that this is

a

we believe this is focusing on the core problem

in language recognition

i want to say that

solving chinese english

distinction is no longer interesting

but maybe two varieties of english is more interesting

task

i wasn't there in two thousand eleven and i would be interested do you

still make the det plots because you were talking about

c det which is fine but do you make the plots

as well

i'm trying to recall but i want to say twenty eleven was the first workshop

presentation without any det plots that's cool

but you could you can draw det plots for detection yes and then i would be interested to

see what you put along the axes

i think that point probabilities are what are you going to say probability of false alarm

or are you going to say probability of indian english given the fact that it's american

i would i would go for the latter one

thank you

i still wanna go back one point with this is i and the pair maybe

someone

isn't getting what

give me a system that operates that way i mean to where you by saying

that you telling

basically detection system years used

i data much label by language

where does the pairwise thing come into that at the system level i understand

from

maybe a research perspective so

you get distinction is what's just operate it more than one which systems that way

right

that's an interesting question it's difficult for me to answer i think there's

a tradeoff between

being application focused and being research focused

not to say that they're entirely different but i think in this case it's a

tradeoff and so we're leaning more towards the research currently

so you said you are gonna ask us to give you a vector

of language log-likelihoods yes and then you're going to subtract

two of those to get the score that would differentiate between pairs of languages such

as

so that's very nice because

the single vector of likelihoods is a lot smaller than all the possible pairs so

that's a nice compact score format yes i think the only request is that you

submit all pairs
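
The exchange above (one vector of log-likelihoods per segment, with pair scores obtained by subtracting two entries) can be sketched in Python as follows. The language names and values are made up, and the function names are hypothetical.

```python
from itertools import permutations


def pair_score(loglikes, target, nontarget):
    """Score for 'target versus nontarget': difference of two log-likelihoods."""
    return loglikes[target] - loglikes[nontarget]


def all_pair_scores(loglikes):
    """Derive every ordered pair score from the single log-likelihood vector."""
    return {
        (a, b): pair_score(loglikes, a, b)
        for a, b in permutations(loglikes, 2)
    }


# Hypothetical log-likelihood vector for one segment.
loglikes = {"hindi": -10.0, "urdu": -10.5, "dari": -14.0}

scores = all_pair_scores(loglikes)
print(len(loglikes), "vector entries give", len(scores), "ordered pair scores")
print(scores[("hindi", "urdu")])  # 0.5
```

With N languages the vector has N entries while the number of ordered pairs is N(N-1), which is why the vector is the more compact submission format: twenty languages would give 380 ordered pair scores.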

sorry i was just making a joke sorry

so

are you gonna concentrate again on hard decisions so you

you gonna have a c det set up at the threshold of zero so

the criterion is then just gonna depend on whether the

score is

on this side or that side of the threshold

so

then it's not gonna matter what the scale of the log-likelihood

vector is so then you lose that one dimension of calibration

then it's just

the location of that vector in log-likelihood space matters but not the scale

yes i understand you

if you somehow

do multiple operating points like you did in the sre

then you would get a handle on the scale

the scale factor as well
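
The point raised in this question can be checked with a small Python sketch: multiplying the whole log-likelihood vector by a positive factor never changes a hard decision made at a threshold of zero, but it can change a decision made at a nonzero threshold. The values, the scale factors, and the threshold of 2.0 are arbitrary assumptions for illustration.

```python
def decide(loglikes, target, nontarget, threshold=0.0):
    """Hard decision: accept the target if the pair score exceeds the threshold."""
    return loglikes[target] - loglikes[nontarget] > threshold


def scaled(loglikes, factor):
    """Multiply every entry of the log-likelihood vector by a positive factor."""
    return {lang: factor * value for lang, value in loglikes.items()}


# Hypothetical log-likelihoods for one segment.
loglikes = {"russian": -3.0, "ukrainian": -4.0}

for factor in (0.1, 1.0, 10.0):
    vec = scaled(loglikes, factor)
    at_zero = decide(vec, "russian", "ukrainian")
    at_two = decide(vec, "russian", "ukrainian", threshold=2.0)
    print(factor, at_zero, at_two)
```

At threshold zero the decision is the same for every factor, so scoring only that operating point cannot reward or penalize the scale of the vector. Adding a second operating point with a nonzero threshold makes the scale observable.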

okay thank you this is something to consider when planning

next

well

i

in earlier years we had this out-of-language problem and now that the new evaluation

came out do you allow people to work on this topic

so with the detection task it still possible to have a out of we can

not only above is an alternative so you can have

french or whatever the map that you have some is we is not closed set

up you have a unknown language you also rate we will i want to say

we can double

if there were say twenty languages you could have a twenty

dimensional vector for the closed and a twenty one dimensional vector for the

open

do you have other information on the timeline on the schedule yes so

right now we're deliberating between having a spring workshop and a summer workshop

so that would be the first half of the year in the

case of the spring workshop or the second half in the case of the summer

workshop

okay