0:00:15 | [session chair introduces the speaker and the collection-system talk] |
---|
0:00:22 | so |
---|
0:00:38 | Okay, I'm presenting this on behalf of Kevin Walker, who |
---|
0:00:41 | wasn't able to attend |
---|
0:00:43 | due to a very normal aversion to a sixteen-hour plane ride. |
---|
0:00:49 | So, we'll see how this goes; this is kind of a departure from |
---|
0:00:52 | the other talks in the session and the conference as a whole, but I think it's |
---|
0:00:56 | of interest to this community nonetheless. So I'm going to briefly describe the RATS program |
---|
0:01:01 | and its goals, |
---|
0:01:03 | and then really delve into the data creation process for RATS. I'll talk a little |
---|
0:01:08 | bit about how we're generating the content that's used in the RATS program, the system |
---|
0:01:15 | that we built to produce degraded-audio-quality recordings for the program, |
---|
0:01:22 | talk a bit about the annotation process, and then focus on some details of the |
---|
0:01:25 | speaker ID evaluations that are just about to start |
---|
0:01:28 | within RATS. |
---|
0:01:31 | So, by way of introduction, RATS is a three-year DARPA program |
---|
0:01:36 | that's targeting speech in extremely noisy and highly distorted channels. |
---|
0:01:42 | Specifically, it's targeting noise, not background noise, |
---|
0:01:46 | but noise in the signal, of the sort that |
---|
0:01:51 | radio transmissions introduce; |
---|
0:01:53 | that's the kind of data we're targeting. |
---|
0:01:54 | There are four evaluation tasks within RATS: speech activity detection, language ID, speaker ID, and keyword spotting. |
---|
0:02:01 | There are five very challenging languages that we're targeting, |
---|
0:02:04 | and in phase one of RATS the training and the test data are based on material |
---|
0:02:09 | that LDC is providing; later phases will also test |
---|
0:02:15 | on operational data, although there won't be any training data from the operational environment. |
---|
0:02:21 | so |
---|
0:02:22 | In order to produce |
---|
0:02:25 | data that is operationally relevant, LDC needed to understand a little bit about the nature |
---|
0:02:30 | of this data. So, talking to the community, we understood the operational data to have |
---|
0:02:37 | a really wide range of noise characteristics. |
---|
0:02:40 | So in terms of the |
---|
0:02:41 | structural properties of the data, what we're thinking about is something like |
---|
0:02:46 | radio chatter from a taxicab driver: |
---|
0:02:50 | those radio channels are always on in the background, and |
---|
0:02:52 | you hear the other calls. |
---|
0:02:54 | There's also sort of ham radio data that's a good approximation of the structural properties |
---|
0:02:59 | of the data we're targeting in terms of density of talk: how long the turns |
---|
0:03:02 | are, they're very short, there's very rapid back-and-forth turn-taking, there's lots of |
---|
0:03:08 | intervening silence, and there are also occasional bursts of excited speech. |
---|
0:03:12 | In terms of the types of noise of interest to the program, air traffic control |
---|
0:03:17 | transmissions are a good approximation of the type of noise that we're |
---|
0:03:21 | interested in, so we get things like static and fading, various types of channel |
---|
0:03:26 | interference, |
---|
0:03:27 | and also the use of push-to-talk devices, which can introduce squelch. |
---|
0:03:32 | And so in our collection we also want to target data that's more or less |
---|
0:03:37 | understandable by a human, |
---|
0:03:39 | but nonetheless |
---|
0:03:41 | on the challenging side of the range. So we want data that's challenging for a human to understand but |
---|
0:03:45 | not impossible; if it's impossible for a human, |
---|
0:03:48 | you know, we can't really annotate it or pursue it beyond that. |
---|
0:03:51 | In terms of the nature of the speech, we want it to be communicative and transactional and |
---|
0:03:56 | ideally goal-oriented. |
---|
0:03:58 | It may be two-party or multiparty speech, half duplex, full duplex, or even trunked, |
---|
0:04:04 | like the kind of dispatch communication that a police department would use. |
---|
0:04:10 | We are targeting narrowband, wideband, and spread spectrum transmissions, |
---|
0:04:16 | and also a real variety of geographical and topographical environments that might affect the radio |
---|
0:04:23 | channel performance and the transmission quality, |
---|
0:04:26 | with lots of |
---|
0:04:28 | background interference as well. |
---|
0:04:30 | The speakers may be stationary or they may be in motion, and the listening post |
---|
0:04:35 | may also be in motion; you can imagine a drone flying over |
---|
0:04:39 | a surveillance area collecting data. |
---|
0:04:41 | And also the speakers may know one another. |
---|
0:04:44 | So I'll skip over the overview and jump into the types of data that we're targeting. |
---|
0:04:49 | So we make some use of found data; there is some data |
---|
0:04:53 | that you can get on the web that has the sorts of noise properties we're targeting, |
---|
0:04:57 | and this is mostly shortwave transmissions, |
---|
0:05:00 | in that a lot of ham radio operators |
---|
0:05:04 | post videos on YouTube of their setup, and so it's just a stationary |
---|
0:05:09 | image of their setup, but you get the audio track of the shortwave transmissions |
---|
0:05:13 | that they're receiving. |
---|
0:05:16 | They're really interesting. |
---|
0:05:18 | We're also doing limited collection of shortwave transmissions at LDC. |
---|
0:05:23 | We made fairly heavy use of existing data sets |
---|
0:05:27 | in this program, primarily because many of these data sets were already richly annotated with the |
---|
0:05:32 | features of interest. |
---|
0:05:34 | So, for instance, we used all of the exposed NIST speaker recognition test sets; these are |
---|
0:05:39 | primarily English, but they have speaker ID verification already and we |
---|
0:05:45 | know more or less what the languages are for these recordings. Similarly, we used the exposed NIST |
---|
0:05:50 | LRE test sets, |
---|
0:05:52 | and also several of the existing LDC corpora, like CallFriend, that exist in various languages |
---|
0:05:58 | and are at least partially verified for language and speaker ID; the Fisher Levantine Arabic |
---|
0:06:04 | corpus of telephone speech that has both language and speaker verification; |
---|
0:06:09 | and also some broadcast recordings where we know the language more or less but don't |
---|
0:06:15 | know, for instance, who the speaker is. |
---|
0:06:17 | The bulk of the data that LDC is producing for the RATS program is new data |
---|
0:06:21 | collection, either locally in Philadelphia or from vendors around the world, and this is primarily telephone |
---|
0:06:27 | speech, although we're doing some live recordings as well. |
---|
0:06:31 | We're targeting two types of data: general conversation, and also some scenario-based recordings |
---|
0:06:38 | where people are engaged in some collaborative problem-solving task, like playing a game of |
---|
0:06:41 | twenty questions |
---|
0:06:43 | or engaging in a scavenger hunt with one another. |
---|
0:06:46 | And importantly, a fundamental keystone of our system is that we always would like to |
---|
0:06:53 | have a clean recording for purposes of manual annotation, |
---|
0:06:59 | and then our idea is that this clean recording is retransmitted |
---|
0:07:03 | in order to introduce the kinds of signal degradation that the program targets. |
---|
0:07:08 | So, in order to generate that signal degradation, we developed a multi- |
---|
0:07:13 | communication-channel collection platform. We want this platform to be capable of transmitting speech over |
---|
0:07:19 | radio communication links where the transmission itself introduces the types of noise conditions and signal |
---|
0:07:26 | quality variation we are interested in for the program. |
---|
0:07:30 | The platform that we developed is capable of simultaneous transmission of up to eight different |
---|
0:07:36 | radio channels, with each channel targeting a different type and degree of noise, |
---|
0:07:42 | and again preserving the clean input channel to facilitate the manual annotation process. |
---|
0:07:48 | Now there's a wrinkle here, which is that this need to do annotation |
---|
0:07:54 | on the clean channel requires a |
---|
0:07:57 | very careful process to align |
---|
0:07:59 | and to project annotations from the clean channel onto the eight degraded channels, and |
---|
0:08:06 | that's a very challenging problem. |
---|
0:08:09 | Some other design principles: |
---|
0:08:12 | we wanted the system to be able to be used for either live sessions or |
---|
0:08:16 | retransmissions. |
---|
0:08:17 | We want a wide range of channel types, with different modulations, bandwidths, |
---|
0:08:22 | and different types of interference. |
---|
0:08:25 | We also, wherever possible, wanted the actual components of the system to have some |
---|
0:08:29 | operational relevance; we did some research into the kinds of |
---|
0:08:32 | handsets |
---|
0:08:33 | and, you know, push-to-talk devices and that sort of thing that might actually be used |
---|
0:08:39 | in an operational environment. |
---|
0:08:41 | The radio channels themselves were configured... well, first we selected transceivers |
---|
0:08:47 | whose ERP ranged from 0.5 to 12 watts. |
---|
0:08:51 | Both the transceivers and receivers are equipped with multiple omnidirectional low-gain antennas, |
---|
0:08:58 | and the transceivers we selected are designed for half-duplex analog communication, again because this |
---|
0:09:04 | is what we found was primarily used |
---|
0:09:07 | in the real-world data. |
---|
0:09:09 | And importantly, they operate on a shared-channel model, so they can either be in |
---|
0:09:13 | transmit mode or receive mode, but they can't |
---|
0:09:15 | be in both simultaneously. |
---|
0:09:18 | So these are some of the radio channels that we developed, and really |
---|
0:09:24 | this table is just to give you a feel for |
---|
0:09:27 | the range of transmitters and receivers, and in particular the bandwidth variation and the different |
---|
0:09:33 | types of modulation that we were targeting. I'm not going to have time to go into these |
---|
0:09:38 | in too much detail. |
---|
0:09:40 | Okay, so the image here is fairly complex; this is our transmit |
---|
0:09:46 | station. |
---|
0:09:47 | So I'll walk through the protocol for transmission briefly. We start with a |
---|
0:09:51 | transmit station control computer, |
---|
0:09:56 | here. |
---|
0:09:58 | There's a daemon running on the transmit station control computer that's querying the database |
---|
0:10:02 | for recordings that are available for retransmission. |
---|
0:10:06 | When it finds a recording, the control computer initiates a remote recording on the receive station |
---|
0:10:13 | control computer, |
---|
0:10:15 | and it also initiates a local reference recording |
---|
0:10:18 | that we keep just as a baseline. |
---|
0:10:22 | It also spawns a subprocess to drive a computer-controlled push-to-talk relay bank, |
---|
0:10:29 | and that is controlled based on a signal relay output; that's this portion of |
---|
0:10:36 | the device. |
---|
0:10:41 | When the system is in transmit mode, it begins playing the source recording output |
---|
0:10:47 | over the specified audio devices, |
---|
0:10:49 | and the depiction of the |
---|
0:10:50 | audio devices is down here. |
---|
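To make the transmit protocol just described concrete, here is a minimal Python sketch of such a control loop. The queue, the helper functions, and all of their names are hypothetical stand-ins for illustration, not LDC's actual software.

```python
# Hypothetical sketch of the transmit-station daemon described above.
# The queue, helper functions, and their names are placeholders, not LDC software.
import time

RETRANSMISSION_QUEUE = ["call_0001.wav"]  # stands in for the database query


def start_remote_recording(name):
    print(f"receive station: recording started for {name}")


def start_local_reference_recording(name):
    print(f"transmit station: reference recording started for {name}")


def drive_ptt_relay_bank(name):
    # In the real system this runs as a subprocess keyed off the signal-relay output.
    print(f"push-to-talk relay bank engaged for {name}")


def play_over_audio_devices(name):
    print(f"playing {name} over the configured audio devices, one per radio channel")


def daemon_loop(poll_seconds=10):
    while RETRANSMISSION_QUEUE:
        source = RETRANSMISSION_QUEUE.pop(0)
        start_remote_recording(source)           # remote recording on the receive station
        start_local_reference_recording(source)  # clean baseline kept at the transmit side
        drive_ptt_relay_bank(source)             # computer-controlled PTT relays
        play_over_audio_devices(source)          # source audio goes out over the radios
        time.sleep(poll_seconds)                 # then poll for the next available recording


if __name__ == "__main__":
    daemon_loop(poll_seconds=0)
```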
0:10:53 | The signal relay is configured for |
---|
0:10:55 | fast attack, |
---|
0:10:56 | a long sustain, and gradual release, |
---|
0:10:59 | and there's a very wide window |
---|
0:11:01 | around the utterances; this is just to sort of maximize the amount of speech that gets transmitted |
---|
0:11:06 | through the system. We also introduced a single power supply and power distribution, to |
---|
0:11:13 | avoid having battery problems with the various handsets that are part of the transmission system. |
---|
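The fast-attack, long-sustain, gradual-release behaviour of the signal relay can be pictured as a simple energy-keyed gate. The sketch below is a toy illustration of that idea under assumed thresholds and time constants; it is not the hardware relay's actual logic.

```python
# Toy illustration of fast-attack / long-sustain / gradual-release keying,
# as performed in hardware by the signal relay; all constants here are invented.
import numpy as np


def vox_gate(signal, rate, threshold=0.02, sustain_s=1.0, release_s=0.5):
    """Return a keying envelope: key up immediately on speech energy,
    hold for `sustain_s` after it stops, then ramp down over `release_s`."""
    frame = int(0.01 * rate)                      # 10 ms analysis frames
    n_frames = len(signal) // frame
    rms = np.array([np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
                    for i in range(n_frames)])
    key = np.zeros(n_frames)
    hold = 0
    sustain_frames = int(sustain_s / 0.01)
    release_frames = max(int(release_s / 0.01), 1)
    for i, level in enumerate(rms):
        if level > threshold:
            key[i] = 1.0                          # fast attack: key up immediately
            hold = sustain_frames
        elif hold > 0:
            key[i] = 1.0                          # sustain: stay keyed through short pauses
            hold -= 1
        else:
            key[i] = max(0.0, key[i - 1] - 1.0 / release_frames) if i else 0.0  # gradual release
    return np.repeat(key, frame)


if __name__ == "__main__":
    rate = 8000
    t = np.arange(rate) / rate
    demo = np.concatenate([0.1 * np.sin(2 * np.pi * 440 * t), np.zeros(2 * rate)])
    gate = vox_gate(demo, rate)
    print(gate[::8000])  # keyed during the tone, held through the pause, then ramped down
```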
0:11:19 | We also introduced an isolation transformer bank, |
---|
0:11:23 | which is here essentially to isolate the system from upstream electronic equipment. |
---|
0:11:30 | The next slide shows you sort of a similar diagram for the receive station, |
---|
0:11:34 | and this is mostly just to indicate the variety of receivers that we have |
---|
0:11:39 | on the platform. |
---|
0:11:41 | So, after recordings are generated, |
---|
0:11:45 | essentially they're uploaded to our server, and then we initiate this fairly lengthy post- |
---|
0:11:49 | processing sequence |
---|
0:11:50 | to align the files and also detect any regions of non-transmission; I'll come back to that in a |
---|
0:11:55 | second. |
---|
0:11:56 | So, to give you a feel for what the resulting recordings sound like, I'm going to play |
---|
0:12:02 | some samples from each of the channels. |
---|
0:12:04 | So first we'll go through the channels; there are eight evaluated channels. |
---|
0:12:10 | Oh, and the reference recording first. |
---|
0:12:13 | [reference audio sample plays] |
---|
0:12:19 | And there's the first channel. |
---|
0:12:29 | [channel audio sample plays] |
---|
0:12:38 | Okay, so channel B is a single-sideband channel; this one is one of the more |
---|
0:12:42 | challenging channels for the RATS systems. |
---|
0:12:48 | [channel B audio sample plays] |
---|
0:12:54 | That's the distortion of channel B, and then channel H is a narrowband channel. |
---|
0:13:07 | This channel is another [inaudible] channel. |
---|
0:13:19 | And then the next channel. |
---|
0:13:26 | [channel audio sample plays] |
---|
0:13:31 | Okay, channel F is our frequency-hopping spread spectrum channel. |
---|
0:13:35 | [channel F audio sample plays] |
---|
0:13:42 | And this channel |
---|
0:13:44 | is wideband. |
---|
0:13:53 | Right, so these present real challenges. These are actually recordings that were transmitted in their entirety; |
---|
0:13:59 | they're |
---|
0:14:00 | quite intelligible, but they take some getting used to, and there are much more difficult |
---|
0:14:06 | recordings in the set of data. |
---|
0:14:09 | So, after |
---|
0:14:11 | the clean signal is transmitted, we have nine resulting audio files: |
---|
0:14:15 | the clean channel and the eight degraded channels. We have a log |
---|
0:14:18 | file that indicates the retransmission start time |
---|
0:14:21 | and all of the source file parameters. |
---|
0:14:23 | We also have what we call a push-to-talk log, which is essentially timestamps of the push- |
---|
0:14:27 | to-talk button going on and off, for each of the individual channels, |
---|
0:14:32 | and then we have the reference recording. |
---|
0:14:34 | Annotation is done on the clean channel only, and now we need to create annotations |
---|
0:14:39 | on each of the degraded channels, |
---|
0:14:42 | projected from the clean channel, as well as very accurate cross-channel alignments. |
---|
0:14:47 | Ideally we'd also like to be able to flag any segments that are impossible for |
---|
0:14:52 | humans to understand; |
---|
0:14:53 | it's not really fair |
---|
0:14:54 | to evaluate system performance on |
---|
0:14:57 | segments that a human can't even understand. |
---|
0:15:00 | So in a perfect world this is easy, right? We start with a |
---|
0:15:04 | source recording, |
---|
0:15:06 | and we've got perfect alignment on the degraded channel recordings, |
---|
0:15:12 | and we can see the regions of non-transmission very cleanly. |
---|
0:15:15 | But that's not really the way things work. |
---|
0:15:21 | In the real world we have any number of challenges on the retransmission. So we |
---|
0:15:25 | have things like channel-specific lag; |
---|
0:15:28 | there is a bit of lag on |
---|
0:15:30 | some of the channels, |
---|
0:15:31 | so there's a delay in the segment correspondences, |
---|
0:15:34 | and it's not |
---|
0:15:35 | the same lag or the same |
---|
0:15:37 | offset for each channel, and so we have to do some channel-specific |
---|
0:15:41 | manipulation to account for that lag. |
---|
0:15:43 | We also have things like |
---|
0:15:46 | differences in the non-transmission regions: |
---|
0:15:48 | these are all regions where the transmitter wasn't engaged, but you can see |
---|
0:15:54 | that for channel A the duration is shorter |
---|
0:15:58 | than for some of the other channels, |
---|
0:15:59 | so we have to account for that. |
---|
0:16:01 | We also have the occasional failure on a particular channel for a session. These are cases |
---|
0:16:06 | where |
---|
0:16:08 | a channel just wasn't engaged during the transmission. |
---|
0:16:11 | And we have the most pernicious problem, which is these channel-specific dropouts, |
---|
0:16:17 | where everything's marching along and one channel, for some reason, just conked out for a |
---|
0:16:22 | segment. |
---|
0:16:23 | And so we have to have ways to detect all of these issues; |
---|
0:16:26 | this is a real challenge in managing the corpus. |
---|
0:16:29 | What we've done is collaborate with the RATS performers to develop a number of techniques |
---|
0:16:34 | to help better manage the data. So Dan Ellis at Columbia has developed two |
---|
0:16:39 | algorithms, skewview and a companion tool, that identify what the initial offset for each channel should be |
---|
0:16:47 | and refine the cross-channel alignment, |
---|
0:16:50 | which is quite interesting. |
---|
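The talk does not go into how those tools estimate the offsets, but the standard approach to this kind of initial offset estimation is cross-correlating each degraded channel against the clean reference. Here is a minimal sketch of that general technique, assuming both signals have already been resampled to a common rate; it illustrates the idea, not the specific skewview implementation.

```python
# Minimal sketch of per-channel offset (skew) estimation by cross-correlation
# against the clean reference recording.
import numpy as np
from scipy.signal import fftconvolve


def estimate_offset(reference, degraded, rate, max_lag_s=30.0):
    """Return the lag in seconds that best aligns `degraded` to `reference`
    (positive means the degraded channel lags the reference)."""
    max_lag = min(int(max_lag_s * rate), len(reference) - 1, len(degraded) - 1)
    # Cross-correlation computed as FFT convolution with the reference reversed.
    xcorr = fftconvolve(degraded, reference[::-1], mode="full")
    center = len(reference) - 1                      # index corresponding to zero lag
    window = xcorr[center - max_lag:center + max_lag + 1]
    return (int(np.argmax(np.abs(window))) - max_lag) / rate


if __name__ == "__main__":
    rate = 1000
    ref = np.random.randn(10 * rate)
    lagged = np.concatenate([np.zeros(2 * rate), ref])[:10 * rate]  # 2.0 s channel lag
    print(round(estimate_offset(ref, lagged, rate), 3))             # prints ~2.0
```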
0:16:52 | LDC also developed our own internal process using RMS scanners |
---|
0:16:59 | to identify long non-transmission regions on the channels, |
---|
0:17:03 | and this |
---|
0:17:07 | is sort of tuned channel by channel. |
---|
0:17:09 | The RMS scans only allow us to detect longer non-transmission regions, about two seconds or greater, |
---|
0:17:16 | and we'd really like to be able to also detect |
---|
0:17:19 | dropouts that are very short but occur quite a bit, and so the RATS community |
---|
0:17:25 | is working on a robust |
---|
0:17:27 | channel-specific energy detector, a non-transmission region detector, |
---|
0:17:31 | that can detect the shorter dropouts. |
---|
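A windowed RMS scan of the kind described, flagging stretches where the level stays below a floor for at least two seconds, can be sketched as follows; the window size and threshold are illustrative values only, not LDC's actual settings.

```python
# Minimal sketch of an RMS-based scan for long non-transmission regions:
# flag any stretch where windowed RMS stays below a floor for >= 2 seconds.
import numpy as np


def find_non_transmission(signal, rate, win_s=0.1, floor=1e-3, min_dur_s=2.0):
    win = int(win_s * rate)
    n = len(signal) // win
    rms = np.sqrt(np.mean(signal[:n * win].reshape(n, win) ** 2, axis=1))
    quiet = rms < floor
    regions, start = [], None
    for i, q in enumerate(np.append(quiet, False)):     # sentinel closes a trailing region
        if q and start is None:
            start = i
        elif not q and start is not None:
            if (i - start) * win_s >= min_dur_s:
                regions.append((start * win_s, i * win_s))  # (onset, offset) in seconds
            start = None
    return regions


if __name__ == "__main__":
    rate = 8000
    sig = np.concatenate([np.random.randn(3 * rate) * 0.1,   # 3 s of signal
                          np.zeros(4 * rate),                 # 4 s dropout
                          np.random.randn(3 * rate) * 0.1])   # 3 s of signal
    print(find_non_transmission(sig, rate))                   # ~[(3.0, 7.0)]
```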
0:17:34 | Quickly moving on to the annotation tasks: now that our channels have been retransmitted and we have |
---|
0:17:41 | alignment across the channels, now we annotate. |
---|
0:17:44 | So there are five core annotation tasks. |
---|
0:17:47 | For speech activity, we're creating an audio segmentation on the clean channel. For LID, |
---|
0:17:53 | we're simply listening to the speech segments and judging them as in or out of |
---|
0:17:57 | the target language. For keyword spotting, we're creating a time-aligned |
---|
0:18:01 | transcript of the speech segments. |
---|
0:18:03 | And then for the speaker ID task, we're listening to portions of all of the individual |
---|
0:18:08 | recordings associated with one speaker ID and verifying that it's indeed the same person. |
---|
0:18:14 | We're also, on a portion of the data, the test data in particular, |
---|
0:18:18 | doing intelligibility audits. This is where we're having our annotators, native-speaker annotators, |
---|
0:18:23 | listen to the degraded recording segments, |
---|
0:18:26 | the speech segments, and say whether they're actually intelligible or not, and this turns out |
---|
0:18:31 | to be a very hard task for humans to do, and agreement among humans on |
---|
0:18:35 | intelligibility is extremely poor. |
---|
0:18:37 | We also do manual adjudication of system outputs to identify any real problems in the annotation |
---|
0:18:43 | data. |
---|
0:18:45 | The annotation release format is really simple: we've got the file metadata, and then for each |
---|
0:18:50 | of the annotations, what the annotation is, |
---|
0:18:53 | and then, importantly, what its provenance is. Because we're reusing some existing data and sort of |
---|
0:18:58 | borrowing annotations from previously developed corpora, we indicate whether the annotation is newly created, whether |
---|
0:19:07 | it's a legacy annotation, or whether it's an automatic annotation, for instance from a speech |
---|
0:19:11 | activity detection system. |
---|
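As an illustration only, a release record along those lines might look like the following; the field names and values here are invented for the example and are not LDC's actual schema.

```python
# Illustrative example of a release record: file-level metadata, the annotation
# itself, and its provenance. All field names and values are hypothetical.
record = {
    "file_id": "example_channel_B",     # hypothetical identifier
    "language": "Levantine Arabic",
    "annotation": {"type": "speech_activity", "start": 12.40, "end": 15.85, "label": "S"},
    "provenance": "automatic",          # one of: "new", "legacy", "automatic"
}
print(record)
```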
0:19:14 | So now we've got our annotations on the clean channel, we've got alignments across the |
---|
0:19:18 | degraded channels, and now we need to project the annotations onto those degraded channels. We start |
---|
0:19:23 | out with the clean-channel segmentation: the green is |
---|
0:19:25 | speech, yellow is non-speech. |
---|
0:19:28 | We project that onto each of the degraded channels that have already been aligned. |
---|
0:19:33 | We identify the non-transmission regions as indicated by the push-to-talk |
---|
0:19:38 | logs. |
---|
0:19:39 | We adjust for the rest of the lag that happens on |
---|
0:19:43 | specific channels. |
---|
0:19:46 | We run our RMS scans and find the files that failed transmission entirely and exclude |
---|
0:19:51 | those from the corpus. |
---|
0:19:53 | And then finally we run our energy detectors, the non-transmission detectors, and find |
---|
0:19:59 | any segments where |
---|
0:20:01 | the push-to-talk button logs say there was a transmission but actually there's |
---|
0:20:05 | no signal, |
---|
0:20:06 | and so we flag those, and now we have annotations for each of the degraded |
---|
0:20:11 | channels as well. |
---|
0:20:12 | So as a result, in each file, for each segment we have one of five values. |
---|
0:20:19 | We have S for speech, meaning |
---|
0:20:21 | there was a transmission containing speech; NS means there was a transmission with non-speech; T |
---|
0:20:27 | means there was a transmission but it hasn't been labeled as to whether it contains speech |
---|
0:20:31 | or not; |
---|
0:20:31 | NT means there was no transmission; and then there's this RX |
---|
0:20:35 | setting, which is where |
---|
0:20:36 | we detected a transmission failure automatically. |
---|
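Putting the projection steps and that label set together, here is a minimal sketch of how clean-channel segments might be mapped onto one degraded channel. The interval data, the lag value, and the helper logic are invented for illustration; the T label (transmitted but unlabeled audio) applies to files without clean-channel annotation and is not produced here.

```python
# Minimal sketch of projecting clean-channel speech/non-speech segments onto a
# degraded channel, following the steps described above. Times are in seconds;
# all of the example data is invented.

def project(clean_segments, channel_lag, ptt_on_regions, detected_dropouts):
    """clean_segments: list of (start, end, 'S' or 'NS') from the clean channel.
    ptt_on_regions: intervals where the push-to-talk log says the channel transmitted.
    detected_dropouts: intervals the automatic energy detector flagged as dead air."""
    out = []
    for start, end, label in clean_segments:
        start, end = start + channel_lag, end + channel_lag   # channel-specific lag
        if not any(s <= start and end <= e for s, e in ptt_on_regions):
            out.append((start, end, "NT"))                    # no transmission per PTT log
        elif any(not (end <= s or start >= e) for s, e in detected_dropouts):
            out.append((start, end, "RX"))                    # PTT says on, but no signal
        else:
            out.append((start, end, label))                   # S or NS carried over
    return out


if __name__ == "__main__":
    clean = [(0.0, 2.0, "S"), (2.0, 3.0, "NS"), (5.0, 7.0, "S")]
    print(project(clean, channel_lag=0.25,
                  ptt_on_regions=[(0.0, 4.0)],
                  detected_dropouts=[(1.0, 1.5)]))
```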
0:20:41 | Okay, now quickly moving to SID in particular. This evaluation |
---|
0:20:46 | is just getting underway; the dry run evaluation is actually happening next week. |
---|
0:20:52 | For SID, we're defining a progress set, which is two hundred fifty speakers |
---|
0:20:57 | with ten sessions for each speaker. Nominally this is fifty speakers per language, although it |
---|
0:21:02 | won't actually play out that way. Six of the sessions per speaker are going to be |
---|
0:21:07 | sequestered by the evaluation team, which is SAIC, to be used for |
---|
0:21:10 | enrollment; |
---|
0:21:11 | the other four sessions per speaker are used for test. |
---|
0:21:15 | There's a dev test set that has the same characteristics as the progress |
---|
0:21:19 | set, |
---|
0:21:20 | and then there's this additional general-use dataset, which is two hundred fifty speakers that |
---|
0:21:24 | have just two sessions each, |
---|
0:21:26 | and the performers can do whatever they like with this general-use set. |
---|
0:21:31 | SID within RATS is being evaluated in an open-set |
---|
0:21:34 | paradigm: systems need to provide independent decisions about each of the target speakers |
---|
0:21:39 | from the set of candidate speakers, without any |
---|
0:21:43 | knowledge |
---|
0:21:44 | of which impostors are in the test data. |
---|
0:21:48 | All speakers in the test will be enrolled, and some test samples will be |
---|
0:21:52 | used as impostors |
---|
0:21:54 | for the other trials, |
---|
0:21:56 | and the performers have agreed to avoid using the enrollment samples for any |
---|
0:22:00 | purpose other than the target speaker enrollment, so they can't be used for training |
---|
0:22:05 | in the trials involving that speaker. |
---|
0:22:08 | We also distribute the NIST SRE data as sort of background modeling data |
---|
0:22:14 | for the performers; that data has been pushed through the retransmission system. |
---|
0:22:19 | So far we've delivered something like fifteen hundred |
---|
0:22:23 | single-call speakers; these are people who started out with the goal of making calls |
---|
0:22:28 | and dropped out of the collection. So most people drop out, |
---|
0:22:31 | and ninety percent of the people drop the collection after one call. We now have a hundred and |
---|
0:22:36 | thirty-seven speakers that have two to nine calls each, |
---|
0:22:41 | a hundred and eighty-three speakers that have ten calls each, and our goal is, again, two |
---|
0:22:47 | hundred fifty speakers with at least ten calls and another two hundred fifty that have at least two. |
---|
0:22:54 | This slide |
---|
0:22:55 | just summarizes the total amount of data that's been processed through the RATS system to |
---|
0:22:58 | date. This is about a month out of date, so I think |
---|
0:23:03 | we can add five hundred to the bottom line here. |
---|
0:23:05 | So we've transmitted over three thousand hours, probably closer to thirty-five hundred hours now, |
---|
0:23:11 | of source data, yielding about sixteen thousand hours or more of degraded audio channels, and |
---|
0:23:17 | this includes |
---|
0:23:19 | four hundred hours of data labeled for SAD, |
---|
0:23:21 | seven hundred twenty hours labeled for language ID, and about four hundred hours of keyword spotting |
---|
0:23:25 | transcripts. |
---|
0:23:28 | I'll come to the conclusion since I'm running out of time. So, in summary, over |
---|
0:23:33 | the past |
---|
0:23:34 | year and a half, I guess, LDC has designed and deployed this multi-radio-channel collection |
---|
0:23:39 | platform. We've undertaken a very large-scale data collection, including retransmission and annotation, |
---|
0:23:46 | of five very challenging languages. |
---|
0:23:48 | We've retransmitted over three thousand hours of data, yielding more than sixteen thousand hours |
---|
0:23:52 | of degraded signal. |
---|
0:23:55 | We've annotated over fifteen hundred hours of clean-signal data and generated corresponding degraded-channel |
---|
0:24:01 | annotations. |
---|
0:24:02 | We've developed, independently and also with lots of input from the RATS performers, several |
---|
0:24:07 | algorithms to improve the overall quality of the transmitted data. |
---|
0:24:11 | We've supported lots of new requests for new kinds of annotation and collection. |
---|
0:24:16 | The dry run evaluation is starting next week and people are very nervous, and |
---|
0:24:21 | this is really hard data, |
---|
0:24:23 | so I'm very eager to see what happens. |
---|
0:24:27 | And thank you. |
---|
0:24:28 | [applause] |
---|
0:24:41 | [audience question, inaudible] |
---|
0:24:56 | We would like to put |
---|
0:24:59 | the receivers, the listening post, in a moving vehicle; we've looked at that at one point or |
---|
0:25:04 | another, but we don't have the funding to support that model. So the transmitters and |
---|
0:25:10 | receivers are at LDC; they're about thirty meters apart, but there are significant structural barriers |
---|
0:25:16 | in between the transmit and the receive station. |
---|
0:25:20 | So there's, |
---|
0:25:21 | like, the core of the building between the transmit and receive stations. That's the best |
---|
0:25:25 | we could do with the resources available. We are proposing, for phase two, to address a |
---|
0:25:29 | novel channel selection that may involve placing the listening post |
---|
0:25:35 | in a more remote location, |
---|
0:25:37 | or even doing some of the collection with listening post motion. |
---|