Okay, I'm going to describe LDC's efforts to create the LRE11 corpus for the NIST 2011 Language Recognition Evaluation.
So first I'll review the requirements for data for the corpus, the process by which LDC selected languages for inclusion in the corpus, and our data collection procedures for broadcast and telephone speech, and then how we selected the segments that would be subject to auditing. Then I'll spend some time talking about the auditing process, in particular reviewing the steps we took to assess inter-auditor agreement on language classification, and then finally conclude with a summary of the released corpus.
So the requirements for LRE 2011: first, we were to distribute previous LRE datasets to the evaluation participants. This included the previous test sets and also a very large training corpus that was prepared for LRE 2009, which primarily consists of a very large, only partially audited broadcast news corpus.
The bulk of our effort for LRE11, though, was new resource creation. Starting with LRE 2009 there was a departure from the traditional corpus development effort for LRE, in that in addition to telephone speech collection we also included data from broadcast sources, specifically narrowband segments from broadcast sources, in a number of languages, and LRE11 also included broadcast speech.
The target was to collect data from twenty-four languages in LRE11, targeting both genres for most of the languages, with one exception. We have four varieties of Arabic in the LRE11 corpus. For Modern Standard Arabic, because this is a formal variety that's not typically the native language of an individual speaker, we did not collect telephone speech. And for the three Arabic dialectal varieties, Iraqi, Levantine, and Maghrebi, we did not collect any broadcast segments, only targeted telephone speech. Otherwise, all languages have both genres.
So as I mentioned, the target was twenty-four languages, some of which might be called dialects; we just use the term varieties. And our goal was to have at least some of these varieties be known to be mutually intelligible, to some extent, by at least some humans.
We targeted four hundred segments for each of the twenty-four languages, with at least two unique sources per language. The way we define source, for the broadcast sources in particular, is that a source is a provider plus program: so CNN Larry King is a different source than CNN Headline News. The style is different, the speakers are different.
All right, so our goal is twenty-four languages. To select these languages we started with some background research, looking at the linguistics literature and other information sources, which we relied on heavily. We compiled a list of candidate languages and assigned a confusability index score to each of the candidate languages. There are three possible scores. A zero reflects a language that's not likely to be confusable with any of the other candidate languages on the list. A one is possible confusion with another candidate language on the list: the languages are genetically related, and some systems, if not humans, may confuse the languages to some extent. And then a two is for languages that are likely confusable with another candidate language: these are languages where the literature suggests that there is known mutual intelligibility, to some extent, between the language pairs.
So after a review process we ended up with a candidate set of thirty-eight languages, which was whittled down to the twenty-four final evaluation languages with input from NIST and the sponsor, also considering things like how feasible it would actually be for us to collect and find data.
Here's a table of the languages that we ended up selecting for LRE11. You can see that all of the Arabic varieties have a confusability score of two, because they are believed to be mutually intelligible with the other Arabic varieties. A language like American English received a confusability score of one, with the assumption that it has at least the potential to be confusable with Indian English. And then there are a few languages that received a confusability score of zero, for instance Mandarin: there are no known confusers for it in the selected list.
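To make the bookkeeping concrete, here is a minimal sketch of the confusability index as a lookup table. Only the scores and pairings mentioned above come from the talk; the helper and the exact dialect names listed are illustrative assumptions.

```python
# Minimal sketch of the confusability index. Scores:
# 0 = no likely confuser among the candidates,
# 1 = possible confusion (genetically related; systems, if not humans, may confuse them),
# 2 = likely confusion (literature reports some mutual intelligibility).
# Only the entries mentioned in the talk are shown; the pairings are assumptions.
CONFUSABILITY = {
    "Modern Standard Arabic": (2, ["Iraqi Arabic", "Levantine Arabic", "Maghrebi Arabic"]),
    "American English":       (1, ["Indian English"]),
    "Mandarin":               (0, []),
}

def confusers(language: str) -> list[str]:
    """Return the candidate languages the given language might be confused with."""
    _score, pairs = CONFUSABILITY.get(language, (0, []))
    return pairs
```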
All right, moving on to the collection strategy. For broadcast collection we targeted multiple data providers and multiple sources. We had a small amount of data that had been collected previously and partially used for earlier LRE evaluations but had not been exposed, so we used some data from the Voice of America broadcast collection; but most of the broadcast recordings used for LRE11 were newly collected.
So we have some archived audio from LDC's local satellite data collection, but also hundreds of hours of new collection at the three of our collection sites in Philadelphia, Tunis, and Hong Kong. We maintain these multiple collection sites in order to get access to programming that is simply not available given the satellite feeds that we can access in Philadelphia.
We began the broadcast collection believing that we would be able to collect sufficient data in all twenty-four languages to support LRE11; it became apparent very quickly that this wasn't the case, and so we quickly scrambled to put together an additional collection facility in New Delhi. We actually developed for this collection a portable broadcast collection platform, essentially a small suitcase that contains all of the components required for a partner facility to do essentially plug-and-record. So we partnered with a group in New Delhi, who ended up collecting a number of languages for us, and we were able to scale up to full collection within about thirty days.
We also found, as the collection went on, that we were falling short of our targets for some of the languages, and decided to pursue collection of streaming radio sources for a number of languages to supplement the collection. In this case we did some sample recordings, using native speakers to verify that a particular source contained sufficient content in the target language, and ended up collecting data for a week or so from each source.
One of the challenges of having all these different input streams for broadcast data is that we end up with a variety of audio formats that need to be reconciled downstream.
For the telephone collection we used what we call a claque-based collection model, where a claque is a native speaker informant. The reason we used this claque-based model was to ease the recruitment burden, which will become apparent in a moment. The claques that we hired for this study also ended up serving as auditors, so they also made language judgements.
For the collection, our target was to identify two claques for each of the LRE languages, and instruct each claque to make a single call to each of between fifteen and thirty individuals within their existing social network. So when we recruited people to be claques for the study, part of the job description was: you know a lot of other people who speak your language, and you can convince them to do a phone call with you and have it recorded for research purposes.
So prior to the call being recorded, the callee hears a message saying this call is going to be recorded for research; if you agree, push one. They push one, and then the recording begins.
Because we were recruiting these claques primarily in Philadelphia, in some cases the multiple claques for a language knew each other, and there was a chance that their social networks would overlap. We wanted the callees to be distinct, so we took some steps to ensure that the callees did not overlap within a language; where an overlap did occur, we excluded those call sides from the corpus.
We also required the claques to make at least some of their calls in the US. We permitted them to call overseas, and most of them did, but we also required them to make some of their calls within the US to avoid any bi-uniqueness of channel and language conditions: if all the Thai calls originated from Thailand, then there would be a particular channel characteristic that could be associated with Thai, and we wanted to avoid that. All of the telephone speech was collected via LDC's existing telephone collection platform: 8 kHz, 8-bit mu-law.
All right, so now we have the collected data, the recordings, and we need to process the material for human auditing. We first ran all of the collected files through a speech activity detection (SAD) system, in order to distinguish speech from silence, music, and other kinds of non-speech. Based on the SAD output, for the telephone speech data we extracted two segments, each being thirty to thirty-five seconds in duration.
But for the broadcast data we needed to do an additional bandwidth filtering step. Using the Brno bandwidth detector, we ran over the full broadcast recordings, and then from the intersection of the speech regions and the narrowband regions we identified continuous regions of thirty-three or more seconds in the broadcast data. For regions that are speech and narrowband and greater than thirty-three seconds, we identify a single thirty-three second segment within that region. We do not select multiple segments from a longer region, because we want to avoid having multiple segments of speech from a single speaker in the collection. Given the large number of languages and the large number of segments required for LRE11, in some cases it was necessary for us to reduce the segment duration down to as low as ten seconds, rather than the thirty-three seconds.
So this is just a graphical depiction of that selection process: given a speech file, we run the SAD system, distinguishing speech from non-speech; over the speech regions, the bandwidth detector identifies the narrowband segments; and our goal is to find regions with at least thirty-three seconds of speech that are narrowband.
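As a rough sketch of that intersect-and-pick step, assuming the SAD and bandwidth regions come in as sorted (start, end) lists in seconds; the helper names are mine, not LDC's actual tooling:

```python
MIN_DUR = 33.0  # target segment duration in seconds

def intersect(regions_a, regions_b):
    """Intersect two sorted lists of (start, end) intervals, in seconds."""
    out, i, j = [], 0, 0
    while i < len(regions_a) and j < len(regions_b):
        start = max(regions_a[i][0], regions_b[j][0])
        end = min(regions_a[i][1], regions_b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if regions_a[i][1] < regions_b[j][1]:
            i += 1
        else:
            j += 1
    return out

def select_segments(speech_regions, narrowband_regions, min_dur=MIN_DUR):
    """Pick a single min_dur-second segment from each long-enough region,
    so that no long region (likely one speaker) yields multiple segments."""
    return [(start, start + min_dur)
            for start, end in intersect(speech_regions, narrowband_regions)
            if end - start >= min_dur]

# Example: a 40 s narrowband speech run yields exactly one 33 s segment.
print(select_segments([(0.0, 50.0)], [(5.0, 45.0)]))  # [(5.0, 38.0)]
```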
All right, so the identified segments are then converted into an auditor-friendly format that works well with the web-based auditing tool that our auditors use: 16 kHz 16-bit for the broadcast data, 8 kHz single channel for the telephone speech; and again, we exclude the claque call sides from the auditing process. All of this processed data is then converted to PCM WAV files so that it can be easily rendered in a browser. Auditors are presented with entire segments for judgement, so typically they're listening to thirty-three seconds of speech for broadcast, and similar durations for the telephone segments.
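A minimal sketch of that conversion step, assuming the SoX command-line tool is available; the file names, and the choice of SoX itself, are illustrative assumptions rather than LDC's actual pipeline:

```python
import subprocess

def to_audit_wav(src: str, dst: str, rate: int, channels: int = 1) -> None:
    """Convert an audio file to 16-bit signed PCM WAV at the given sample rate."""
    subprocess.run(
        ["sox", src, "-r", str(rate), "-b", "16",
         "-e", "signed-integer", "-c", str(channels), dst],
        check=True,
    )

to_audit_wav("broadcast_seg.flac", "broadcast_seg.wav", rate=16000)  # broadcast: 16 kHz
to_audit_wav("phone_seg.sph", "phone_seg.wav", rate=8000)            # telephone: 8 kHz
```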
So we did some additional things with the LRE data prior to presenting it to the auditors for judgement, with the specific goal of being able to assess inter-auditor agreement on language judgements. The baseline consists of segments that are expected to be in the auditor's language: a Hindi auditor is being presented with a recording that's expected to be in Hindi, because somebody said they were a Hindi speaker and we collected their speech. For the telephone speech segments, claque auditors only listened to segments that were from the callees of another claque; this was just to minimize the chance that they would accept segments because they knew the person's voice.
So on top of this baseline, the auditors were also given additional distractor segments: up to ten percent additional segments were added to their auditing kits, drawn from a non-confusable language. So let's say I'm the Hindi auditor: I might have some English or some Mandarin segments thrown into my auditing kit. And really this was done to keep auditors on their toes, so that occasionally they would get a segment that was in a completely different language and couldn't just accept everything as plausible. We also added up to ten percent dual segments: these are segments that were also assigned to other auditors, so that we would get inter-annotator agreement numbers.
And then for all the varieties that have another confusable language in the collection, we call these the buddy languages, we added additional confusable segments to the auditor's kit. For possibly confusable varieties, like Polish and Slovak, the auditors judged ten percent additional segments over the baseline from the buddy language. For likely confusable varieties, like Lao and Thai, they judged twenty-five percent over the baseline. And for known confusable varieties, like Hindi and Urdu, they judged all the segments from the buddy language. An individual kit could vary from this, because the collection was happening in sort of a nonlinear fashion, so a given kit that an auditor was working on might be all telephone speech, for instance; but this was sort of our target for the auditing kit construction.
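Putting those targets together, here is a sketch of kit construction under the stated percentages. The function name, pools, and sampling are an illustrative simplification, not LDC's tooling; as noted above, real kits varied with what the collection had produced.

```python
import random

# Fraction of the baseline to add from the buddy language, by confusability tier.
BUDDY_FRACTION = {"possibly": 0.10, "likely": 0.25, "known": 1.00}

def plan_kit(baseline, distractor_pool, buddy_pool, tier, seed=0):
    """Assemble one auditor's kit: baseline + ~10% distractors + buddy segments,
    and flag ~10% of the baseline as dual (also assigned to a second auditor)."""
    rng = random.Random(seed)
    n = len(baseline)
    kit = list(baseline)
    # up to 10% distractors from a non-confusable language
    kit += rng.sample(distractor_pool, min(len(distractor_pool), n // 10))
    # buddy-language segments, scaled by the confusability tier
    k = min(len(buddy_pool), int(n * BUDDY_FRACTION[tier]))
    kit += rng.sample(buddy_pool, k)
    # ~10% of the baseline also goes into another auditor's kit
    dual = rng.sample(baseline, n // 10)
    rng.shuffle(kit)
    return kit, dual
```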
Briefly, the auditors were selected first via a preliminary online screening process. Candidates filled out a little survey asking them questions about their language background, and then were led to an online test, listening to a spectrum of segments, including segments in the target language but also some of these distractor segments and potentially confusable language segments. Some of the feedback that we got on the screening test helped us to point out areas where additional auditor training was needed, or where we needed to verify the language labels, and to make the auditing task clearer.
About a hundred and thirty people took the screening test; those that passed were hired and given additional training, and part of the training consisted of training their ears to distinguish narrowband from wideband speech via a signal quality perception test.
The goal of the auditing task is to ensure that segments contain speech, are in the target variety, are narrowband, contain only one speaker, and that the audio quality is acceptable. We had also asked a question about whether you had heard this person's voice before in segments that you previously judged, but the reliability was so low, given the thousands and thousands of segments people were judging, that we just abandoned that question.
So a few words about auditing consistency. I'll just get to the bottom point: the numbers reported here are from segments that were assigned during the normal auditing process, so all of this dual annotation we conducted was not done post hoc, it was done as part of the regular everyday auditing.
So let's look first at within-language agreement. This is comparing multiple judgements where the expected language of the segment was also the language of the auditors, and we're asking what the language label agreement is: this is, for instance, a case where two Hindi speakers are judging a segment that we expect to be Hindi. And naively, we want this number to be close to one hundred percent. Well, it's not always one hundred percent. For the Arabic varieties, which we know are highly confusable with one another, we see very poor agreement: for instance, the Modern Standard Arabic judges only agreed with one another forty-two percent of the time on whether a segment was actually Modern Standard Arabic. The dialectal rates are higher: for Levantine Arabic, almost everyone agreed on the segments that were presented to them.
Some other highlights here: for Hindi and Urdu we also see agreement fall to around ninety percent, which is not surprising given that these language pairs are closely related.
Now looking at the dual annotation results, this is looking at the exact same segments: what is the agreement just on the language question? We had nine hundred fifty-one cases where both auditors said no, that's not my target language; fifteen hundred cases where both said yes, that's my target language; and two hundred fourteen cases where one auditor said it's my language and the other auditor said no, it's not. If you break this number down, you'll see that the disagreement comes mostly from three languages: Modern Standard Arabic had very low dual annotation agreement, and agreement for Hindi and Urdu was also low, so it's not surprising that these are the languages causing trouble.
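Those counts form a simple two-by-two agreement table; as a quick worked check of the overall observed agreement (my computation from the counts above, not a number quoted in the talk):

```python
both_no, both_yes, split = 951, 1500, 214
total = both_no + both_yes + split                # 2665 dually judged segments
agreement = (both_yes + both_no) / total          # observed agreement
print(f"observed agreement: {agreement:.1%} of {total}")  # ~92.0% of 2665
```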
And finally, looking at cross-language agreement: this is looking at judgements where a segment was confirmed by one auditor to be in their language, and their language was the expected language, the one we believed the segment to be in; that segment was then judged by an auditor from another language, who also said the segment was in their language. So this is like: a Hindi speaker listens to a segment that we think is Hindi, and they say yes, that's Hindi; we play that same segment for an Urdu auditor, and they say yes, that's Urdu.
We see some interesting cross-language disagreement here. For the varieties where the expected language is Modern Standard Arabic, the Levantine auditors claimed something like ninety percent of those segments as being in their own language, and we see similar numbers for the other Arabic dialect auditors.
Then this one down here: we see some confusion between American English and Indian English, which might at first seem somewhat surprising, but this is actually an asymmetrical confusion. What's going on is that when the expected language is American English but the auditor is an Indian English auditor, they're likely to claim that segment as their own language; but the reverse doesn't happen: an American English auditor does not claim an Indian English segment to be American English. We see a similar kind of asymmetry for Hindi and Urdu.
So wrapping up with respect to data distribution: we distributed the data to NIST in six incremental releases. The packages contained the full audio recordings, the auditor version of the segments, and then the audit results for segments meeting particular criteria: is the segment in the target language, does it contain speech, is all the speech from one speaker; the answers to all of these needed to be yes. And then for the question of whether the entire segment sounds like a narrowband signal, we delivered both the yes and the no judgements, along with the full segment metadata tables, so that NIST could subsample the segments for the evaluation.
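A tiny sketch of that delivery rule, with illustrative field names: the three content questions must all be yes, while the narrowband judgement is passed through either way as metadata.

```python
def deliverable(judgement: dict) -> bool:
    """All three content questions must be answered yes for delivery."""
    return all(judgement[q] == "yes"
               for q in ("in_target_language", "contains_speech", "one_speaker"))

seg = {"in_target_language": "yes", "contains_speech": "yes",
       "one_speaker": "yes", "narrowband": "no"}
print(deliverable(seg), "| narrowband:", seg["narrowband"])  # True | narrowband: no
```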
So this is just a table that summarizes the total delivery: we hit the four-hundred-segment target for all but two languages, Lao and Ukrainian, where we had a real struggle to find enough sources.
So in conclusion, we prepared significant amounts of new telephone and broadcast data in twenty-four languages, which included several confusable varieties. We needed to adapt our collection strategies to support the corpus requirements. After a rigorous screening process for the auditors, they made over twenty-two thousand judgements in all, yielding about ten thousand usable LRE segments. The auditing kits were constructed to support consistency analysis. We found that the within-language agreement was typically over ninety-five percent, with a few exceptions as noted. We did see cross-language confusion, particularly for the Arabic varieties, and asymmetrical confusion, at high levels, between American English and Indian English, Hindi and Urdu, and Farsi and Dari. This corpus supported the LRE 2011 evaluation, and it will ultimately be published in the LDC catalog, pending decisions by the sponsors.
Okay, thank you.
That's right. So if we had only one auditor judgement for a segment, that segment was delivered. If we had multiple judgements and they were all in agreement, that was delivered. If we had discrepant judgements, those segments were withheld from what was delivered to NIST. Those discrepant segments will be included in the ultimate general publication for LRE11 when it appears, since that might be interesting data for research, along with the metadata.
Right, so it's somewhat asymmetrical: there are certain varieties that people are more accepting of if they are linguistically similar to their own. For the dialect auditors, they typically could tell: not only could they tell that a segment wasn't Moroccan, let's say, they could often tell specifically which dialect it was. The real confusion comes in with Modern Standard Arabic, which is really not spoken natively by anyone. And also, Modern Standard Arabic spoken in the broadcast sources that we were collecting may contain some dialectal elements: if you're doing an interview with someone from, say, Iraq, some Iraqi dialect may be present in what was reported to be Modern Standard Arabic. So that's sort of a confounding factor in the analysis.