i'll talk about the nist language recognition evaluations past and future this is
work done with colleagues
alvin john george and jack
so there are two tasks
in language recognition
identification which is choose among n specified target languages and detection is the speech in
the target language
and the lre tasks that have been part of the nist evaluations have evolved over
time
the early lres in ninety six two thousand three and two thousand five focused on
identification
and the recent lres focused on detection
the most recent lre and the next lre will focus on detection limited to language
pairs
and the rationale for the change is that we believe the two class problem
is conceptually simpler
and represents the fundamental challenge
and the improved performance over time has required ever increasing data to reliably estimate error
rates
there are three category distinctions in lre
dialect which might be thought of as speech patterns of a particular group
language which is a dialect with an army and a navy
and linguistic variety a way to dodge the issue
like the task the category distinctions what we're actually trying to recognise changed over time
in earlier lres there was a distinction between language and dialects
and in fact there were separate dialect and language tests in those years except two thousand
three
in recent years and in the next lre we've made no distinction between languages and
dialects
and instead test confusable linguistic variety clusters
and among the reasons for the change is that there is no accepted language dialect criterion
and that dialect is used in inconsistent ways for example
chinese dialects are mutually unintelligible
but hindi and urdu distinctions
are primarily non-linguistic
there are three data collection approaches that have been used in lre
one we might refer to as caller where someone's paid to make a single phone
call and his or her speech is used
a claque based model
where we pay someone to make many calls and the speech of the interlocutor is used
and then broadcast where you find narrowband speech in radio broadcasts
early lres took the caller approach and recent lres in two thousand nine eleven and
the next lre
will combine the claque
and broadcast approaches
and the reason for the change is that
there are a large number of unique speakers needed for each
language
and single speaker phone calls have become increasingly expensive to collect and experiments showed
that broadcast could be used in language recognition evaluation
to produce comparable performance results
so there are two broad classes of metrics that have been used cdet which
we see here is a weighted linear combination of the miss and false alarm rates and
cdet language pair which is a linear combination of miss and false alarm rates but for
each language pair
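As a minimal sketch of the "weighted linear combination of miss and false alarm rates" described above, the function below computes a detection cost for one language pair. The cost weights, the target prior of 0.5, the threshold of zero, and all score values are illustrative assumptions, not parameters stated in the talk.

```python
def c_det(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted linear combination of miss and false-alarm rates.

    c_miss, c_fa and p_target are assumed defaults for illustration only.
    """
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


def error_rates(target_scores, nontarget_scores, threshold=0.0):
    """Miss and false-alarm rates from hard decisions at a fixed threshold."""
    misses = sum(1 for s in target_scores if s < threshold)
    false_alarms = sum(1 for s in nontarget_scores if s >= threshold)
    return misses / len(target_scores), false_alarms / len(nontarget_scores)


# Hypothetical scores for one language pair (made-up numbers).
target = [2.1, 0.4, -0.3, 1.7]       # trials where the segment is the target language
nontarget = [-1.5, 0.2, -0.8, -2.0]  # trials where it is the other language of the pair

p_miss, p_fa = error_rates(target, nontarget)
cost = c_det(p_miss, p_fa)
print(p_miss, p_fa, cost)
```

With these made-up scores one of four target trials is missed and one of four non-target trials is a false alarm, so the cost is the equally weighted average of the two rates.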
the earlier lres used cdet the very early lres used cdet
the more recent lres used an average cdet and the most recent lre
used average cdet over language pairs
and the primary reason the metric has changed has been to reflect
new task focuses
so here we see
the average cdet for thirty seconds ten seconds and three seconds
where the red line is thirty seconds
that's thirty seconds of speech
ten seconds of speech
three seconds of speech
and we see performance improvements over the years with some caveats
in particular the ones we just discussed that the task changed from identification to detection
also the languages changed from year to year
and the data sources changed
from
calls
solely calls in these years to calls and broadcasts
in two thousand nine
and we see in two thousand nine for example on the thirty second
speech segments
that there were few errors observed
in leading systems
so here we see the leading systems for the language pair american english indian english
this is the most studied pair in the sense that
it started back in two thousand five
and we see a good performance improvement over time where the blue is
the min cdet language pair for
thirty sec sorry blue is for lre oh seven
red lre oh nine
and green lre eleven and here we see thirty seconds ten seconds and three seconds
a consistent improvement
for hindi urdu the picture is less rosy
the language pair remains challenging especially for the shorter durations
and the improvement we've seen over time is limited again especially for the three
seconds
we suspect that it's really in large part due to the problematic language distinction although
human tests showed some consistency
with annotator judgements there were also some consistency issues that were observed
here we see results for dari farsi
and we see improvement from lre oh nine to lre eleven in the thirty seconds and the
three seconds
and here we see the russian ukrainian language pair
and we're
noticing
a reverse trend
where lre eleven actually saw worse performance
and we suspect that this may have been due to a change in data source between
the
training and evaluation data
so in summary nist has coordinated lres since nineteen ninety six
and has emphasized detecting target language classes of interest in recent years
but the nature of the language classes has evolved earlier evaluations achieved high
performance on broad language classes with separate dialect tests and this led to the change
later
the change was to move away from the language dialect distinction
towards pairwise testing of closely related varieties
so for future evaluations the next language recognition evaluation is planned for twenty
fifteen with pairwise testing within six broad language clusters
utilizing newly collected cts and broadcast news speech sorry broadcast narrowband speech
the system output will be a vector of log likelihoods
which is a change from the
past evaluations
for each cluster we'll average performance over all pairs in the cluster and the overall measure
will be the mean of the six cluster actual decision costs
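The two-level averaging just described (average over the pairs within a cluster, then take the mean over clusters) can be sketched as follows. The cluster names, language labels, and cost values are hypothetical placeholders, and the per-pair cost is assumed to have been computed already.

```python
from statistics import mean


def cluster_and_overall_cost(pair_costs_by_cluster):
    """Average per-pair costs within each cluster, then average the cluster means.

    pair_costs_by_cluster maps a cluster name to a dict of
    {(language_a, language_b): cost}; all names here are placeholders.
    """
    cluster_costs = {
        cluster: mean(pair_costs.values())
        for cluster, pair_costs in pair_costs_by_cluster.items()
    }
    overall = mean(cluster_costs.values())
    return cluster_costs, overall


# Hypothetical example with two clusters instead of six (made-up costs).
example = {
    "cluster_a": {("lang1", "lang2"): 0.10, ("lang1", "lang3"): 0.20},
    "cluster_b": {("lang4", "lang5"): 0.05},
}
cluster_costs, overall = cluster_and_overall_cost(example)
print(cluster_costs, overall)
```

One consequence of this design is that every cluster counts equally in the overall measure, regardless of how many language pairs it contains.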
and it's open to all participants so for more information please join the lre email
list by contacting us there
thank you very much
so
what about the pairwise measure
so the pairwise measure is actually going to be different in
the next lre than in the last one but we will continue to emphasize language
pairs as a research task
we believe that this is
focusing on the core problem
in language recognition
i want to say that
solving the chinese english
distinction is no longer interesting
but maybe two varieties of english is a more interesting
task
i wasn't there in two thousand eleven and i would be interested do you
still make the det plots because you were talking about
cdet which is fine do you still make the plots
as well
i'm trying to recall but i want to say twenty eleven was the first workshop
presentation without any det plots okay that's cool
but you could you could draw det plots for detection yes and then i would be interested to
see what you put along the axes
i think that's the point probabilities of what are you going to say probability of false alarm
or are you going to say probability of indian english given the fact that it's american
i would go for the latter one
thank you
i still wanna go back one point with this is i and the pair maybe
someone
isn't getting what
give me a system that operates that way i mean to where you by saying
that you telling
basically detection systems use
data labelled by language
where does the pairwise thing come into that at the system level i understand
from
maybe a research perspective so
you get distinction is what's just operate it more than one which systems that way
right
that's an interesting question it's difficult for me to answer i think there's
a tradeoff between
being application focused and being research focused
not to say that they're entirely different but i think in this case it's a
tradeoff and so we're really more towards the research currently
so you said you are gonna ask us to give you a vector
of language log-likelihoods yes and then you're going to subtract
two of those to get the score that would differentiate between pairs of languages
yes
so that's very nice because
the single vector of likelihoods is a lot smaller than all the possible pairs so
that's a nice compact score format yes i think the only request is that you
submit all pairs
sorry i was just making a joke sorry
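The exchange above describes getting a pair score by subtracting two entries of a single log-likelihood vector. A minimal sketch of that idea follows; the language names are taken from pairs mentioned in the talk, but the log-likelihood values and the function name are made up for illustration.

```python
from itertools import permutations


def pairwise_scores(loglik):
    """Score for each ordered pair (a, b): log-likelihood of a minus that of b."""
    return {(a, b): loglik[a] - loglik[b] for a, b in permutations(loglik, 2)}


# Hypothetical log-likelihood vector for one speech segment (made-up values).
loglik = {"hindi": -1.0, "urdu": -1.5, "dari": -4.0}

scores = pairwise_scores(loglik)
for pair, score in scores.items():
    print(pair, score)
```

With N languages the submitted vector has N entries, while the number of ordered pairs derived from it is N * (N - 1), which is the compactness the questioner points out.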
so
are you gonna concentrate again on hard decisions so you
you gonna have a cdet set up at the threshold of zero so
the criterion is then just gonna depend on whether the
score is
on this side or that side of the threshold
so
then it's not gonna matter what the scale of the log-likelihood
vector is so then you lose that one dimension of calibration
then it's just
the location of that vector in log-likelihood space matters but not the scale
yes i understand you
if you somehow
do multiple operating points like you did in the sre
then you would get a handle on the scale
the scale factor as well
okay thank you this is something to consider when planning
next
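The point raised in this exchange, that hard decisions at a threshold of zero do not depend on the scale of the log-likelihood vector, can be checked with a small sketch. The language labels and values are hypothetical, and the decision rule (accept when the scaled difference is at least the threshold) is an assumption made for illustration.

```python
def pair_decisions(loglik, scale=1.0, threshold=0.0):
    """Hard decision for each ordered pair from a scaled log-likelihood difference."""
    names = list(loglik)
    return {
        (a, b): scale * (loglik[a] - loglik[b]) >= threshold
        for a in names
        for b in names
        if a != b
    }


# Hypothetical log-likelihood vector (made-up values).
vector = {"russian": -2.0, "ukrainian": -2.7, "farsi": -5.1}

base = pair_decisions(vector)
rescaled = pair_decisions(vector, scale=10.0)
shrunk = pair_decisions(vector, scale=0.1)
print(base == rescaled == shrunk)

# At a non-zero threshold the scale does matter.
print(pair_decisions(vector, threshold=1.0) == pair_decisions(vector, scale=10.0, threshold=1.0))
```

Multiplying by a positive factor never changes the sign of a difference, so the decisions at zero are identical; evaluating at other operating points, as the questioner suggests, is what makes the scale observable.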
well
i
in earlier years we had these out-of-language problems and now that the new evaluations
came out do you allow people to work on this topic
so with the detection task is it still possible to have an out of we can
not only above is an alternative so you can have
french or whatever the map that you have some is we is not closed set
up you have an unknown language you also rate we will i want to say
we can do both
if there were say twenty languages you could have a twenty
dimensional vector for the closed and a twenty one dimensional vector for the
open
do you have other information on the timeline on the schedule yes so
right now we're deliberating between having a spring workshop and a summer workshop
so that would be the first half of the year in the
case of the spring workshop or the second half in the case of the summer
workshop
okay