0:00:31 | Okay, I'm going to describe LDC's efforts to create the LRE11 corpus for |
---|
0:00:37 | the NIST |
---|
0:00:38 | 2011 Language Recognition Evaluation. |
---|
0:00:59 | So first I'll review the requirements for the data for this |
---|
0:01:03 | corpus, |
---|
0:01:05 | the process by which LDC selected languages for inclusion in the corpus, |
---|
0:01:10 | a review of our data collection procedures for broadcast and telephone speech, |
---|
0:01:15 | and then how we selected the segments that would be subject to auditing. Then |
---|
0:01:21 | I'll spend some time talking about the auditing process in particular, reviewing |
---|
0:01:25 | the steps we took to assess inter-auditor agreement on language classification, |
---|
0:01:32 | and then finally conclude with a summary of the released corpus. |
---|
0:01:37 | So the requirements for 2011: the first was to distribute previous LRE datasets |
---|
0:01:44 | to the evaluation participants. This included the previous test sets and also a very large |
---|
0:01:50 | training corpus that was prepared for LRE 2009, |
---|
0:01:54 | which primarily consists of a very large, only partially audited broadcast news |
---|
0:02:00 | corpus. |
---|
0:02:01 | The bulk of our effort for LRE11 was new resource creation. Starting with LRE |
---|
0:02:08 | 2009, |
---|
0:02:09 | there was a departure from the traditional corpus development effort for LREs, in that in |
---|
0:02:17 | addition to the telephone speech collection we also included |
---|
0:02:22 | data from broadcast sources, specifically narrowband segments from broadcast sources, in a number of languages, |
---|
0:02:28 | and LRE11 also included broadcast speech. |
---|
0:02:33 | The target was to collect data from twenty-four languages |
---|
0:02:39 | in LRE11, targeting both genres for most of the languages, with one exception: |
---|
0:02:45 | we have four varieties of Arabic |
---|
0:02:48 | in the LRE11 corpus. |
---|
0:02:50 | For Modern Standard Arabic, because this is a formal variety that's not typically the native |
---|
0:02:57 | language of an individual speaker, |
---|
0:02:59 | we did not collect telephone speech for Modern Standard Arabic. And then for the three |
---|
0:03:04 | Arabic dialectal varieties (Iraqi, Levantine, and Maghrebi) we did not collect any broadcast segments, only |
---|
0:03:10 | targeted |
---|
0:03:11 | telephone speech. Otherwise, all languages have both genres, as shown. |
---|
0:03:14 | So as I mentioned, the target was twenty-four languages, |
---|
0:03:17 | some of which might be called dialects; |
---|
0:03:21 | we just use the term varieties. |
---|
0:03:23 | And our goal was that at least some of these varieties are known to |
---|
0:03:28 | be mutually intelligible, to some extent, by at least some humans. |
---|
0:03:33 | We targeted 400 segments for each of the twenty-four languages, with at least |
---|
0:03:38 | two unique sources per language. The way we define source, for the broadcast |
---|
0:03:43 | sources in particular, is that a source is a provider plus program. |
---|
0:03:48 | So CNN Larry King is a different source than CNN Headline News: |
---|
0:03:53 | the styles are different, the speakers are different. |
---|
0:03:58 | Alright, so our goal was twenty-four languages. |
---|
0:04:01 | To select these languages we started by doing some background research in the literature, looking at information |
---|
0:04:06 | sources |
---|
0:04:07 | that we relied on a lot, |
---|
0:04:09 | and we compiled a list of candidate languages and assigned a confusability index score to |
---|
0:04:15 | each of the candidate languages. There are three possible scores. |
---|
0:04:18 | A zero reflects a language that's not likely to be confusable with any of the other |
---|
0:04:24 | candidate languages on the list. |
---|
0:04:26 | A one is possible confusion with another candidate language on the list: the languages are |
---|
0:04:32 | genetically related, and at least some systems, if not humans, |
---|
0:04:39 | may confuse the languages to some extent. |
---|
0:04:42 | And a two is for languages that are likely confusable with another candidate language: these |
---|
0:04:47 | are languages where the literature suggests that there's known mutual intelligibility, to some extent, |
---|
0:04:54 | between language pairs. |
---|
0:04:56 | So after our review process we ended up with a candidate set of thirty-eight languages, |
---|
0:05:01 | which was whittled down to the twenty-four final evaluation languages |
---|
0:05:05 | with input from NIST and the sponsor, and also considering things like how feasible it |
---|
0:05:10 | would actually be for us to collect and find data. |
---|
0:05:14 | Here is a table of the languages that we ended up selecting for LRE11. You |
---|
0:05:21 | can see that all of the Arabic varieties have a confusability score of two, because |
---|
0:05:25 | they were believed to be |
---|
0:05:27 | mutually intelligible with the other Arabic varieties. |
---|
0:05:30 | A language like American English received a confusability score of one, |
---|
0:05:36 | with the assumption that it has at least the potential to be confusable with Indian English. |
---|
0:05:43 | And then there are a few languages that received a confusability score of zero, for instance Mandarin: |
---|
0:05:49 | there are no known confusers for it in the selected list, as far as I remember. |
---|
---|
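To make the confusability index concrete, here is a minimal sketch (the candidate entries and helper function are illustrative, not LDC's actual selection tooling) of how candidates might be tagged with the 0/1/2 scores just described and tallied during the whittling-down step.

```python
# Minimal sketch of the confusability-index bookkeeping described above.
# Candidate entries and helper are illustrative, not LDC's selection tooling.
# 0 = not likely confusable with any other candidate
# 1 = possibly confusable (genetically related; systems, if not humans, may confuse them)
# 2 = likely confusable (literature reports some mutual intelligibility with another candidate)
CONFUSABILITY = {
    "Modern Standard Arabic": 2,
    "Levantine Arabic": 2,
    "American English": 1,
    "Indian English": 1,
    "Mandarin": 0,
}

def tally(candidates: dict) -> dict:
    """Count how many candidates fall at each confusability level."""
    counts = {0: 0, 1: 0, 2: 0}
    for score in candidates.values():
        counts[score] += 1
    return counts

print(tally(CONFUSABILITY))  # {0: 1, 1: 2, 2: 2}
```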
0:05:57 | Alright, moving on to the collection strategy. |
---|
0:06:01 | For broadcast collection we targeted |
---|
0:06:04 | multiple data providers and multiple sources. We have a small amount of |
---|
0:06:10 | data that had been collected previously and partially used for earlier LRE evaluations |
---|
0:06:16 | but not yet exposed, and so we used some data from the Voice of |
---|
0:06:20 | America broadcast collection. |
---|
0:06:23 | But most of the broadcast recordings used for LRE11 were newly collected. |
---|
0:06:29 | We have some archived audio from LDC's local satellite data collection, |
---|
0:06:35 | but also hundreds of hours of new collection at the three of our collection sites in |
---|
0:06:40 | Philadelphia, Tunis, and Hong Kong. We maintain these multiple collection sites in order to |
---|
0:06:47 | get access to programming |
---|
0:06:49 | that is simply not available |
---|
0:06:52 | given the satellite feeds that we can access in Philadelphia. |
---|
0:06:58 | We began the broadcast collection believing that we would be able to target sufficient |
---|
0:07:04 | data in all twenty-four languages to support the LRE11 needs. It turned out |
---|
0:07:10 | very quickly that this wasn't the case, and so we quickly scrambled to put together an |
---|
0:07:16 | additional collection facility in New Delhi. |
---|
0:07:19 | We actually developed for this collection a portable broadcast collection platform, |
---|
0:07:23 | essentially a small suitcase |
---|
0:07:26 | that contains all of the components required for a partner facility to do essentially |
---|
0:07:33 | plug and record. So we partnered with a group in New Delhi, and they ended up |
---|
0:07:38 | collecting |
---|
0:07:40 | a number of languages for us, ramping up to |
---|
0:07:46 | full-scale collection within about thirty days. |
---|
0:07:49 | We also |
---|
0:07:51 | found, as the collection went on, |
---|
0:07:54 | that we were falling short of our targets for some of the languages, |
---|
0:07:58 | and decided to pursue collection of streaming radio sources for a number of languages to |
---|
0:08:05 | supplement the collection. |
---|
0:08:07 | In this case we did some sample recordings, using native speakers to verify that a |
---|
0:08:12 | particular source contained sufficient content in the target language, and ended up collecting data for |
---|
0:08:20 | a week or so. |
---|
0:08:22 | One of the challenges of having all these different input streams for broadcast data is |
---|
0:08:28 | that we end up with a variety of audio formats |
---|
0:08:31 | that need to be reconciled downstream. |
---|
---|
0:08:36 | For the telephone collection we used what we call a claque-based collection model, |
---|
0:08:42 | where a claque is a native speaker informant. |
---|
0:08:46 | The reason we use this claque-based model is to ease the recruitment burden, |
---|
0:08:51 | which will become apparent in a moment, |
---|
0:08:54 | and the claques that we hire for the study also end up serving as auditors, so that they |
---|
0:08:59 | do the language judgments. |
---|
0:09:01 | For this collection, our target was to identify two claques for each of the |
---|
0:09:06 | LRE languages, |
---|
0:09:07 | and to instruct each claque to make a single call to each of between fifteen and thirty |
---|
0:09:13 | individuals within their existing social network. |
---|
0:09:17 | So when we recruited people to be claques for the study, part |
---|
0:09:23 | of the job description was: you know a lot of other people who speak your language, |
---|
0:09:27 | and you can convince them to do a phone call with you |
---|
0:09:31 | and have it recorded for research purposes. |
---|
0:09:34 | Prior to the call being recorded, the callee hears a message saying |
---|
0:09:38 | this call is going to be recorded for research; |
---|
0:09:41 | if you agree, push one. They push one, and then the recording begins. |
---|
0:09:46 | Because we were recruiting these claques primarily in Philadelphia, in some cases |
---|
0:09:52 | the multiple claques for a language knew each other, and there was a chance that their |
---|
0:09:56 | social networks would overlap. We wanted the callees to be distinct, and so |
---|
0:10:00 | we took some steps to ensure |
---|
0:10:02 | that the callees did not overlap within a language; |
---|
0:10:05 | where there |
---|
0:10:07 | was any overlap, we excluded those call sides from the corpus. |
---|
0:10:13 | We also required the claques to make at least some of their calls in the |
---|
0:10:17 | US. We permitted them to call overseas, |
---|
0:10:20 | and most of them did, but we also required them to make some of their calls within the |
---|
0:10:24 | US to avoid any bi-uniqueness of channel and language conditions. If all the Thai calls |
---|
0:10:31 | were originating from Thailand, |
---|
0:10:35 | then there would be a particular channel characteristic that could be associated with Thai; we |
---|
0:10:42 | wanted to obfuscate that. |
---|
0:10:44 | All of the telephone speech was |
---|
0:10:48 | collected via LDC's existing telephone collection platform: |
---|
0:10:52 | 8 kHz, 8-bit mu-law. |
---|
---|
0:10:56 | Alright, so now we have the collected data, the raw recordings, |
---|
0:11:00 | and we need to process the material for human auditing. |
---|
0:11:04 | We first run all of the selected files through a speech activity detection system |
---|
0:11:09 | in order to distinguish speech from silence, music, and other kinds of non-speech. |
---|
0:11:15 | Based on the SAD output, for the telephone speech data we extract two segments, |
---|
0:11:20 | each being thirty to thirty-five seconds in duration. |
---|
0:11:25 | But for the broadcast data we need to do an additional bandwidth filtering step, |
---|
0:11:30 | so we use a bandwidth detector that we run over the full recordings for the broadcast |
---|
0:11:37 | data, |
---|
0:11:38 | and then from the intersection of the speech |
---|
0:11:41 | and narrowband regions |
---|
0:11:43 | we identify continuous regions of thirty-three or more seconds |
---|
0:11:49 | from the broadcast data. |
---|
0:11:52 | For regions that are speech and narrowband |
---|
0:11:56 | and are greater than thirty seconds, we identify a single thirty-three-second segment within |
---|
0:12:02 | that region. |
---|
0:12:03 | We do not select multiple segments from a longer region because we want to avoid |
---|
0:12:09 | having multiple segments of speech from a |
---|
0:12:11 | single speaker in the collection. |
---|
0:12:15 | Given the large number of languages and the large number of segments needed, |
---|
0:12:19 | in some cases it was necessary for us to reduce the segment duration |
---|
0:12:24 | down to as low as ten seconds |
---|
0:12:27 | rather than the thirty-three seconds. |
---|
---|
0:12:31 | So this is just |
---|
0:12:32 | a graphical depiction of that selection process. We take the |
---|
0:12:35 | speech file, we run the SAD system, |
---|
0:12:38 | distinguishing speech from non-speech, which gives us the speech regions. Then the bandwidth detector |
---|
0:12:44 | identifies the narrowband segments, and our goal is |
---|
0:12:47 | to specifically select regions |
---|
0:12:52 | with at least thirty-three seconds of speech that are narrowband. |
---|
---|
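As a rough illustration of the interval logic just described (the interval representation and function names are my own, not LDC's actual tools), here is a short sketch that intersects the SAD speech regions with the narrowband regions and keeps a single 33-second segment per qualifying region.

```python
# Sketch of the segment-selection logic described above: intersect speech (SAD) and
# narrowband intervals, then keep one 33-second segment per qualifying region.
# Interval format and function names are illustrative, not LDC's actual tools.

def intersect(a, b):
    """Intersect two sorted lists of (start, end) intervals in seconds."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def pick_segments(speech, narrowband, min_dur=33.0, seg_dur=33.0):
    """One seg_dur-second segment per contiguous speech+narrowband region >= min_dur,
    so a long region (likely a single speaker) contributes only one segment."""
    return [(s, s + seg_dur) for s, e in intersect(speech, narrowband) if e - s >= min_dur]

# Example: one qualifying region yields exactly one 33 s segment.
speech = [(0.0, 20.0), (25.0, 80.0)]
narrowband = [(10.0, 70.0)]
print(pick_segments(speech, narrowband))   # [(25.0, 58.0)]
```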
0:13:00 | Alright, so the identified segments are then converted into an auditor-friendly format that works |
---|
0:13:06 | well with the web-based auditing tool that our auditors use: |
---|
0:13:09 | that's 16 kHz, 16-bit for the broadcast data and 8 kHz, single channel for the |
---|
0:13:15 | telephone speech. Again, we exclude the claque's call side from the auditing process. |
---|
0:13:21 | All of this processed data is also then converted to PCM WAV files so |
---|
0:13:27 | that it can be easily rendered in a browser for the auditors. |
---|
0:13:30 | Auditors are presented with entire segments for judgment, so typically they're listening to |
---|
0:13:36 | thirty-three seconds of speech |
---|
0:13:38 | for broadcast, |
---|
0:13:40 | and similar durations for telephone segments. |
---|
---|
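As an illustration of that kind of conversion step (not LDC's actual pipeline; the file names, input formats, and use of librosa/soundfile here are assumptions), a short sketch that resamples a segment and writes browser-friendly 16-bit PCM WAV at the rates mentioned above:

```python
# Illustrative conversion of an audited segment to browser-friendly PCM WAV.
# Not LDC's actual pipeline; paths, formats, and libraries here are assumptions.
import librosa
import soundfile as sf

def to_pcm_wav(src_path: str, dst_path: str, sample_rate: int) -> None:
    """Resample src_path to sample_rate, downmix to mono, write 16-bit PCM WAV."""
    audio, _ = librosa.load(src_path, sr=sample_rate, mono=True)
    sf.write(dst_path, audio, sample_rate, subtype="PCM_16")

# Broadcast segments: 16 kHz, 16-bit; telephone segments: 8 kHz, single channel.
to_pcm_wav("broadcast_segment_in.flac", "broadcast_segment.wav", 16000)
to_pcm_wav("telephone_segment_in.flac", "telephone_segment.wav", 8000)
```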
0:13:44 | So we did some additional things with the LRE data prior to presenting it to |
---|
0:13:51 | auditors for judgment, |
---|
0:13:54 | with the specific goal of being able to assess inter-auditor agreement for language judgments. |
---|
0:14:01 | What we call the baseline are segments that are |
---|
0:14:05 | expected to be in the auditor's language. So a Hindi auditor is being presented with |
---|
0:14:11 | a recording that's expected to be in Hindi, because somebody said they were a |
---|
0:14:16 | Hindi speaker and we collected their speech. |
---|
0:14:20 | For the telephone speech |
---|
0:14:23 | segments, |
---|
0:14:24 | claque auditors were only listening to segments that were from callees |
---|
0:14:31 | of another claque. |
---|
0:14:33 | This was just to minimize the chance that they would judge |
---|
0:14:37 | the segments based on knowing the person's voice. |
---|
0:14:41 | On top of this baseline that the auditors were listening to, they were also given |
---|
0:14:46 | additional distractor segments: |
---|
0:14:49 | up to ten percent additional segments were added to their auditing kit |
---|
0:14:55 | that were drawn from a non-confusable language. So let's say I'm a Thai auditor: |
---|
0:15:00 | I might have some English or some Mandarin segments thrown into my auditing |
---|
0:15:06 | kit. |
---|
0:15:07 | And really this was done to keep auditors on their toes, so that occasionally they |
---|
0:15:11 | would get a segment that was in a completely different language and they couldn't just |
---|
0:15:15 | sort of blindly accept everything as their own language. |
---|
0:15:18 | We also added up to ten percent dual segments; these are segments that were |
---|
0:15:23 | also assigned to other auditors |
---|
0:15:26 | so that we would get inter-annotator agreement |
---|
0:15:28 | numbers for that. |
---|
0:15:30 | And then for all the varieties that have another confusable language in the collection, |
---|
0:15:37 | all these buddy languages, |
---|
0:15:39 | we added additional confusable segments to the auditor's kit. |
---|
0:15:45 | For possibly confusable varieties like Polish and Slovak, |
---|
0:15:49 | the auditors judged ten percent additional over the baseline from the buddy language; |
---|
0:15:55 | for likely confusable varieties like Lao and Thai, they judged twenty-five percent over the |
---|
0:16:00 | baseline; |
---|
0:16:01 | and then for known confusable varieties like Hindi and Urdu, they judged all the segments |
---|
0:16:06 | from the buddy language. |
---|
0:16:08 | An individual kit might vary from this, because the collection was happening in sort |
---|
0:16:12 | of a nonlinear fashion, so a given kit that an auditor was working on might |
---|
0:16:17 | be all telephone speech, for instance, |
---|
0:16:20 | but this was sort of our target for the auditing kit construction. |
---|
---|
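A minimal sketch of how an auditing kit along those lines could be assembled from a baseline pool, using the rates described above (up to 10% distractors, up to 10% dual-annotation segments, and 10% / 25% / 100% extra from a confusable buddy language). The data structures and helper names are illustrative, not LDC's kit-building code; the returned dual list is the set shared with a second auditor for agreement measurement.

```python
import random

# Sketch of the audit-kit construction described above. Structures and helper
# names are illustrative; the rates follow the talk.
BUDDY_RATE = {"possibly": 0.10, "likely": 0.25, "known": 1.0}

def build_kit(baseline, distractor_pool, buddy_pool, buddy_tier=None, seed=0):
    rng = random.Random(seed)
    kit = list(baseline)

    n_extra = int(0.10 * len(baseline))
    # non-confusable distractors, to keep auditors on their toes
    kit += rng.sample(distractor_pool, min(n_extra, len(distractor_pool)))
    # dual-annotation segments, also assigned to a second auditor
    dual = rng.sample(baseline, n_extra)
    kit += dual

    if buddy_tier is not None:
        # extra segments from the confusable buddy language, by tier
        n_buddy = int(BUDDY_RATE[buddy_tier] * len(baseline))
        kit += rng.sample(buddy_pool, min(n_buddy, len(buddy_pool)))

    rng.shuffle(kit)
    return kit, dual
```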
0:16:26 | Briefly, the auditors |
---|
0:16:28 | were selected first via |
---|
0:16:30 | a preliminary online screening process. |
---|
0:16:33 | Candidates |
---|
0:16:35 | filled out a little survey asking them questions about their language background, and then |
---|
0:16:39 | took an online test listening |
---|
0:16:42 | to a spectrum of |
---|
0:16:43 | segments |
---|
0:16:44 | that included segments in the target language but also some of these distractor segments |
---|
0:16:49 | and potentially confusable language segments. |
---|
0:16:53 | Some of the feedback that we got on the screening test helped us also to point |
---|
0:16:58 | out areas where additional auditor training was needed, or where we needed to verify the |
---|
0:17:03 | language labels, |
---|
0:17:04 | in order to make the auditing task clearer. |
---|
0:17:07 | About a hundred and thirty people took the screening test; those who passed |
---|
0:17:12 | were hired and given additional training, and part of the training consisted of |
---|
0:17:16 | training their ears |
---|
0:17:18 | to distinguish narrowband from wideband |
---|
0:17:20 | speech via |
---|
0:17:21 | signal quality perception. |
---|
0:17:26 | The goal of the auditing task is to ensure that segments contain speech, |
---|
0:17:32 | are in the target variety, are narrowband, |
---|
0:17:34 | contain only one speaker, |
---|
0:17:36 | and that the audio quality is acceptable. |
---|
0:17:39 | We had also asked the question "have you heard this person's voice |
---|
0:17:44 | before in segments that you previously judged", but the reliability, I would say, |
---|
0:17:48 | was low, given the thousands and thousands of segments people were judging, |
---|
0:17:53 | so we just abandoned that question. |
---|
---|
0:17:58 | So a few words about auditing consistency. |
---|
0:18:02 | I'll just get to the bottom point, that the numbers reported here |
---|
0:18:06 | are from segments that were assigned during the normal auditing process: all the dual annotation we |
---|
0:18:12 | conducted was not done post hoc; it was done as part of the regular, everyday auditing. |
---|
0:18:20 | So let's look first |
---|
0:18:21 | at within-language agreement. This is comparing multiple judgments |
---|
0:18:25 | where the expected language of the segment was also the language of the auditors, |
---|
0:18:31 | and we're asking what is the language label agreement. So this is, for instance, a case |
---|
0:18:34 | where two auditors of the same language are judging a segment |
---|
0:18:37 | that we expect to be in that language, |
---|
0:18:39 | and, you know, naively we want this number to be close to one hundred percent. |
---|
0:18:44 | Well, it's not always a hundred percent. For the Arabic varieties, which we know are |
---|
0:18:50 | highly confusable with one another, we see very poor agreement. For instance, |
---|
0:18:56 | the Modern Standard Arabic judges only agreed with one another forty-two percent of the |
---|
0:19:02 | time |
---|
0:19:02 | on whether a segment was actually Modern Standard Arabic. |
---|
0:19:06 | The dialectal |
---|
0:19:07 | rates are higher; for Levantine Arabic, almost everyone agreed |
---|
0:19:12 | that a segment was Levantine |
---|
0:19:13 | when it was presented to them. |
---|
0:19:16 | Some other highlights here: for Hindi and Urdu |
---|
0:19:19 | we also see |
---|
0:19:21 | agreement down around ninety percent, which is not surprising given that these language pairs |
---|
0:19:27 | are related. |
---|
0:19:32 | Now looking at the dual annotation results. This is looking at the exact same segments: |
---|
0:19:37 | what is the agreement just on the language question? We had nine hundred fifty-one |
---|
0:19:42 | cases where both auditors |
---|
0:19:44 | said no, that's a non-target language; |
---|
0:19:47 | fifteen hundred cases where both said yes, that's my target language; and two hundred fourteen |
---|
0:19:52 | cases where one auditor said it's my language and the other auditor said no, it's not. |
---|
0:19:57 | If you break this number down, you'll see that the disagreement comes mostly from |
---|
0:20:00 | three languages: |
---|
0:20:02 | Modern Standard Arabic had |
---|
0:20:04 | very poor dual annotation agreement, |
---|
0:20:07 | and then agreement for Hindi and Urdu, so it's not surprising that it's these languages that are |
---|
0:20:13 | causing trouble. |
---|
---|
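To make those dual-annotation counts concrete, here is a small sketch computing the raw percent agreement implied by the figures just cited (951 both "no", 1500 both "yes", 214 split decisions); the variable names are mine.

```python
# Raw percent agreement for the dual-annotation counts cited above.
both_no, both_yes, split = 951, 1500, 214

total = both_no + both_yes + split
agreement = (both_no + both_yes) / total
print(f"dual-annotation agreement: {agreement:.1%} of {total} segments")  # ~92.0%
```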
0:20:14 | And finally, looking at cross-language agreement. This is looking at judgments where a segment was confirmed by one auditor to be in their language, where that language was the expected language, the one we believed the segment to be in, |
---|
0:20:29 | and the segment was then judged by an auditor from another language who also said that the segment was in their language. |
---|
0:20:35 | So this is like: a Hindi speaker listens to a segment that we think is in Hindi and says yes, that's Hindi; we play that same segment for an Urdu auditor and they say yes, that's Urdu. |
---|
0:20:47 | We see some interesting cross-language disagreement here. |
---|
0:20:52 | For the Arabic varieties, where the expected language is Modern Standard Arabic, Levantine auditors listening to those segments claimed around ninety percent of them as their own language, |
---|
0:21:07 | and we see similar numbers for the other dialects. |
---|
0:21:14 | Then this one down here: we see some confusion between American English and Indian English, which might seem somewhat surprising, but this is actually an asymmetrical confusion. |
---|
0:21:30 | What's going on is that if the expected language is American English but the auditor is an Indian English auditor, they're likely to claim that segment as their own language, but the reverse doesn't happen: |
---|
0:21:42 | an American English auditor does not claim an Indian English segment to be American English. |
---|
0:21:48 | We see a similar kind of asymmetry for Hindi and Urdu. |
---|
---|
0:21:58 | So wrapping up, with respect to data distribution, we distributed the data to NIST in |
---|
0:22:04 | six incremental releases. |
---|
0:22:07 | The packages contain the full audio recordings, |
---|
0:22:10 | the auditor version of the segments, |
---|
0:22:12 | and then the audit results for segments meeting particular criteria: is the segment in the |
---|
0:22:18 | target language, does it contain speech, is all the speech from one speaker; |
---|
0:22:23 | the answers to all of those needed to be yes. |
---|
0:22:26 | And then for the question of whether |
---|
0:22:28 | the entire segment sounds like a narrowband signal, we delivered both yes and no segment judgments, |
---|
0:22:34 | along with the full segment metadata tables, so that NIST could subsample the segments |
---|
0:22:39 | for the evaluation. |
---|
---|
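A minimal sketch of that delivery filter under the criteria just listed; the field names and record layout are illustrative, not LDC's actual delivery format. A segment passes only if the language, speech, and single-speaker judgments are all "yes", while the narrowband judgment is delivered either way as metadata for subsampling.

```python
# Sketch of the delivery filter described above (field names are illustrative).
from typing import Dict, List

def deliverable(judgment: Dict[str, str]) -> bool:
    """True only if all required auditing questions were answered 'yes'."""
    required = ("in_target_language", "contains_speech", "single_speaker")
    return all(judgment.get(k) == "yes" for k in required)

def build_delivery(judgments: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep deliverable segments; narrowband status is passed through as metadata."""
    return [
        {"segment_id": j["segment_id"], "narrowband": j.get("narrowband", "unknown")}
        for j in judgments
        if deliverable(j)
    ]
```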
0:22:41 | So this is just a table that summarizes the |
---|
0:22:45 | total |
---|
0:22:47 | delivery. We hit the four-hundred-segment target for all but two languages, Lao |
---|
0:22:53 | and Ukrainian, where we had a real struggle to find enough data. |
---|
0:22:58 | So in conclusion, we prepared a significant quantity of new telephone and broadcast data |
---|
0:23:04 | in twenty-four languages, which included several |
---|
0:23:07 | confusable varieties. |
---|
0:23:09 | We needed to adapt our collection strategies to support the corpus requirements. |
---|
0:23:13 | Our auditors made over twenty-two thousand |
---|
0:23:16 | audit judgments, yielding about ten thousand usable LRE segments. |
---|
0:23:21 | The auditing kits were constructed to support consistency analysis. |
---|
0:23:25 | We found that the within-language agreement was typically over ninety-five percent, with a |
---|
0:23:30 | few exceptions as noted. |
---|
0:23:32 | We did see cross-language confusion, particularly for the Arabic varieties, |
---|
0:23:37 | and an asymmetrical confusion with American English and Indian English, Hindi and |
---|
0:23:45 | Urdu, and with Farsi and Dari. |
---|
0:23:47 | And this corpus supported the LRE 2011 evaluation and will ultimately be published in |
---|
0:23:54 | LDC's catalog, in consultation with the sponsors. |
---|
0:23:57 | Okay, thank you. |
---|
---|
0:24:29 | That's right. |
---|
0:24:30 | So if we had only one auditor judgment for a segment |
---|
0:24:34 | and that judgment was positive, the segment was delivered; if we had multiple judgments and |
---|
0:24:38 | they were all in agreement, it was delivered; if we had discrepant judgments, |
---|
0:24:43 | those segments were withheld from what was delivered to NIST. |
---|
0:24:47 | Those discrepant segments will be included in the ultimate general publication for LRE11, |
---|
0:24:53 | when it appears in the LDC catalog, |
---|
0:24:56 | since that might be interesting data for research, |
---|
0:25:00 | along with the metadata. |
---|
---|
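As a small sketch of the delivery rule just described (the function and status labels are my own; how unanimous "no" judgments were handled is my assumption, not stated in the talk):

```python
# Sketch of the delivery rule: a single positive judgment or unanimous judgments
# are delivered; discrepant judgments are withheld for the later general publication.
from typing import List

def delivery_status(judgments: List[bool]) -> str:
    """judgments: per-auditor True/False decisions that the segment meets the criteria."""
    if not judgments:
        return "unaudited"
    if all(judgments):
        return "deliver"
    if any(judgments):          # mixed yes/no across auditors
        return "withhold"       # kept for the later general publication
    return "reject"             # assumption: unanimous negatives are simply not delivered

print(delivery_status([True]))          # deliver
print(delivery_status([True, True]))    # deliver
print(delivery_status([True, False]))   # withhold
```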
0:25:25 | Right, so it's |
---|
0:25:28 | somewhat asymmetrical, in that there are certain varieties that people are more accepting of if they are |
---|
0:25:33 | linguistically similar to their own. |
---|
0:25:35 | For the dialect auditors, they typically could tell: |
---|
0:25:40 | not only could they tell that a segment wasn't Moroccan, let's say, they could often tell |
---|
0:25:43 | specifically which dialect it was. |
---|
0:25:45 | The real confusion comes in with Modern Standard Arabic, |
---|
0:25:49 | which is really not spoken natively by anyone. |
---|
0:25:54 | And also, the Modern Standard Arabic spoken in the broadcast |
---|
0:25:59 | sources that we were collecting |
---|
0:26:01 | may contain some dialectal elements. So if you're doing an interview with someone from Iraq, |
---|
0:26:07 | some Iraqi dialect may be present in what was reported to be Modern Standard |
---|
0:26:12 | Arabic, so that's sort of a |
---|
0:26:16 | confounding factor |
---|
0:26:18 | in the analysis. |
---|