0:00:15 | [session chair introduces the speaker and the collection-system talk] |
---|
0:00:22 | so |
---|
0:00:38 | Okay, I'm presenting this on behalf of Kevin Walker, who |
---|
0:00:41 | wasn't able to attend |
---|
0:00:43 | due to a very normal aversion to a sixteen-hour plane ride. |
---|
0:00:49 | So, we'll see how this goes; this is kind of a departure from |
---|
0:00:52 | the other talks in the session and the conference as a whole, but I think it's |
---|
0:00:56 | of interest to this community nonetheless. So I'm going to briefly describe the RATS program |
---|
0:01:01 | and its goals, |
---|
0:01:03 | and then really delve into the data creation process for RATS. I'll talk a little |
---|
0:01:08 | bit about how we're generating the content that's used in the RATS program, the system |
---|
0:01:15 | that we built to produce degraded-audio-quality recordings for the program, |
---|
0:01:22 | talk a bit about the annotation process, and then focus on some details of the |
---|
0:01:25 | speaker ID evaluations that are just about to start |
---|
0:01:28 | within RATS. |
---|
0:01:31 | So, by way of introduction, RATS is a three-year DARPA program |
---|
0:01:36 | that's targeting speech in extremely noisy and highly distorted channels. |
---|
0:01:42 | Specifically, it's targeting noise, not background noise, |
---|
0:01:46 | but noise in the signal, of the sort that |
---|
0:01:51 | radio transmissions introduce; |
---|
0:01:53 | that's the kind of data we're targeting. |
---|
0:01:54 | There are four evaluation tasks within RATS: speech activity detection, language ID, speaker ID, and keyword spotting. |
---|
0:02:01 | There are five very challenging languages that we're targeting, |
---|
0:02:04 | and in phase one of RATS the training and the test data are based on material |
---|
0:02:09 | that LDC is providing; later phases will also test |
---|
0:02:15 | on operational data, although there won't be any training data from the operational environment. |
---|
0:02:21 | so |
---|
0:02:22 | In order to produce |
---|
0:02:25 | data that is operationally relevant, LDC needed to understand a little bit about the nature |
---|
0:02:30 | of this data. So, talking to the community, we understood the operational data to have |
---|
0:02:37 | a really wide range of noise characteristics. |
---|
0:02:40 | So in terms of the |
---|
0:02:41 | structural properties of the data, what we're thinking about is something like |
---|
0:02:46 | radio chatter from a taxicab driver: |
---|
0:02:50 | those radio channels are always on in the background, and |
---|
0:02:52 | you hear the other calls. |
---|
0:02:54 | There's also sort of ham radio data that's a good approximation of the structural properties |
---|
0:02:59 | of the data we're targeting in terms of density of talk: how long the turns |
---|
0:03:02 | are, they're very short, there's very rapid back-and-forth turn-taking, there's lots of |
---|
0:03:08 | intervening silence, and there are also occasional bursts of excited speech. |
---|
0:03:12 | In terms of the types of noise of interest to the program, air traffic control |
---|
0:03:17 | transmissions are a good approximation of the type of noise that we're |
---|
0:03:21 | interested in, so we get things like static and fading, various types of channel |
---|
0:03:26 | interference, |
---|
0:03:27 | and also the use of push-to-talk devices, which can introduce squelch. |
---|
0:03:32 | And so in our collection we also want to target data that's more or less |
---|
0:03:37 | understandable by a human, |
---|
0:03:39 | but nonetheless |
---|
0:03:41 | on the challenging side of the range. So we want data that's challenging for a human to understand but |
---|
0:03:45 | not impossible; if it's impossible for a human, |
---|
0:03:48 | you know, we can't really annotate it or pursue it beyond that. |
---|
0:03:51 | In terms of the nature of the speech, we want it to be communicative and transactional and |
---|
0:03:56 | ideally goal-oriented. |
---|
0:03:58 | It may be two-party or multiparty speech, half duplex, full duplex, or even trunked, |
---|
0:04:04 | like the kind of dispatch communication that a police department would use. |
---|
0:04:10 | We are targeting narrowband, wideband, and spread spectrum transmissions, |
---|
0:04:16 | and also a real variety of geographical and topographical environments that might affect the radio |
---|
0:04:23 | channel performance and the transmission quality, |
---|
0:04:26 | with lots of |
---|
0:04:28 | background interference as well. |
---|
0:04:30 | The speakers may be stationary or they may be in motion, and the listening post |
---|
0:04:35 | may also be in motion; you can imagine a drone flying over |
---|
0:04:39 | a surveillance area collecting data. |
---|
0:04:41 | And also the speakers may know one another. |
---|
0:04:44 | So I'll skip over the overview and jump into the types of data that we're targeting. |
---|
0:04:49 | So we make some use of found data; there is some data |
---|
0:04:53 | that you can get on the web that has the sorts of noise properties we're targeting, |
---|
0:04:57 | and this is mostly shortwave transmissions, |
---|
0:05:00 | in that a lot of ham radio operators |
---|
0:05:04 | post videos on YouTube of their setup, and so it's just a stationary |
---|
0:05:09 | image of their setup, but you get the audio track of the shortwave transmissions |
---|
0:05:13 | that they're receiving. |
---|
0:05:16 | They're really interesting. |
---|
0:05:18 | We're also doing limited collection of shortwave transmissions at LDC. |
---|
0:05:23 | We made fairly heavy use of existing data sets |
---|
0:05:27 | in this program, primarily because many of these data sets were already richly annotated with the |
---|
0:05:32 | features of interest. |
---|
0:05:34 | So, for instance, we used all of the exposed NIST speaker recognition test sets; these are |
---|
0:05:39 | primarily English, but they have speaker ID verification already and we |
---|
0:05:45 | know more or less what the languages are for these recordings. Similarly, we used the exposed NIST |
---|
0:05:50 | LRE test sets, |
---|
0:05:52 | and also several of the existing LDC corpora, like CallFriend, that exist in various languages |
---|
0:05:58 | and are at least partially verified for language and speaker ID; the Fisher Levantine Arabic |
---|
0:06:04 | corpus of telephone speech that has both language and speaker verification; |
---|
0:06:09 | and also some broadcast recordings where we know the language more or less but don't |
---|
0:06:15 | know, for instance, who the speaker is. |
---|
0:06:17 | The bulk of the data that LDC is producing for the RATS program is new data |
---|
0:06:21 | collection, either locally in Philadelphia or from vendors around the world, and this is primarily telephone |
---|
0:06:27 | speech, although we're doing some live recordings as well. |
---|
0:06:31 | We're targeting two types of data: general conversation, and also some scenario-based recordings |
---|
0:06:38 | where people are engaged in some collaborative problem-solving task, like playing a game of |
---|
0:06:41 | twenty questions |
---|
0:06:43 | or engaging in a scavenger hunt with one another. |
---|
0:06:46 | And importantly, a fundamental keystone of our system is that we always would like to |
---|
0:06:53 | have a clean recording for purposes of manual annotation, |
---|
0:06:59 | and then our idea is that this clean recording is retransmitted |
---|
0:07:03 | in order to introduce the kinds of signal degradation that the program targets. |
---|
0:07:08 | So, in order to generate that signal degradation, we developed a multi- |
---|
0:07:13 | communication-channel collection platform. We want this platform to be capable of transmitting speech over |
---|
0:07:19 | radio communication links where the transmission itself introduces the types of noise conditions and signal |
---|
0:07:26 | quality variation we are interested in for the program. |
---|
0:07:30 | The platform that we developed is capable of simultaneous transmission of up to eight different |
---|
0:07:36 | radio channels, with each channel targeting a different type and degree of noise, |
---|
0:07:42 | and again preserving the clean input channel to facilitate the manual annotation process. |
---|
0:07:48 | Now there's a wrinkle here, which is that this need to do annotation |
---|
0:07:54 | on the clean channel requires a |
---|
0:07:57 | very careful process to align |
---|
0:07:59 | and to project annotations from the clean channel onto the eight degraded channels, and |
---|
0:08:06 | that's a very challenging problem. |
---|
0:08:09 | Some other design principles: |
---|
0:08:12 | we wanted the system to be able to be used for either live sessions or |
---|
0:08:16 | retransmissions. |
---|
0:08:17 | We want a wide range of channel types, with different modulations, bandwidths, |
---|
0:08:22 | and different types of interference. |
---|
0:08:25 | We also, wherever possible, wanted the actual components of the system to have some |
---|
0:08:29 | operational relevance; we did some research into the kinds of |
---|
0:08:32 | handsets |
---|
0:08:33 | and, you know, push-to-talk devices and that sort of thing that might actually be used |
---|
0:08:39 | in an operational environment. |
---|
0:08:41 | The radio channels themselves were configured... well, first we selected transceivers |
---|
0:08:47 | whose ERP ranged from 0.5 to 12 watts. |
---|
0:08:51 | Both the transceivers and receivers are equipped with multiple omnidirectional low-gain antennas, |
---|
0:08:58 | and the transceivers we selected are designed for half-duplex analog communication, again because this |
---|
0:09:04 | is what we found was primarily used |
---|
0:09:07 | in the real-world data. |
---|
0:09:09 | And importantly, they operate on a shared-channel model, so they can either be in |
---|
0:09:13 | transmit mode or receive mode, but they can't |
---|
0:09:15 | be in both simultaneously. |
---|
0:09:18 | So these are some of the radio channels that we developed, and really |
---|
0:09:24 | this table is just to give you a feel for |
---|
0:09:27 | the range of transmitters and receivers, and in particular the bandwidth variation and the different |
---|
0:09:33 | types of modulation that we were targeting. I'm not going to have time to go into these |
---|
0:09:38 | in too much detail. |
---|
0:09:40 | Okay, so the image here is fairly complex; this is our transmit |
---|
0:09:46 | station. |
---|
0:09:47 | So I'll walk through the protocol for transmission briefly. We start with a |
---|
0:09:51 | transmit station control computer, |
---|
0:09:56 | here. |
---|
0:09:58 | There's a daemon running on the transmit station control computer that's querying the database |
---|
0:10:02 | for recordings that are available for retransmission. |
---|
0:10:06 | When it finds a recording, the control computer initiates a remote recording on the receive station |
---|
0:10:13 | control computer, |
---|
0:10:15 | and it also initiates a local reference recording |
---|
0:10:18 | that we keep just as a baseline. |
---|
0:10:22 | It also spawns a subprocess to drive a computer-controlled push-to-talk relay bank, |
---|
0:10:29 | and that is controlled based on a signal relay output; that's this portion of |
---|
0:10:36 | the device. |
---|
0:10:41 | When the system is in transmit mode, it begins playing the source recording output |
---|
0:10:47 | over the specified audio devices, |
---|
0:10:49 | and the depiction of the |
---|
0:10:50 | audio devices is down here. |
---|
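To make the transmit protocol just described concrete, here is a minimal Python sketch of such a control loop. The queue, the helper functions, and all of their names are hypothetical stand-ins for illustration, not LDC's actual software.

```python
# Hypothetical sketch of the transmit-station daemon described above.
# The queue, helper functions, and their names are placeholders, not LDC software.
import time

RETRANSMISSION_QUEUE = ["call_0001.wav"]  # stands in for the database query


def start_remote_recording(name):
    print(f"receive station: recording started for {name}")


def start_local_reference_recording(name):
    print(f"transmit station: reference recording started for {name}")


def drive_ptt_relay_bank(name):
    # In the real system this runs as a subprocess keyed off the signal-relay output.
    print(f"push-to-talk relay bank engaged for {name}")


def play_over_audio_devices(name):
    print(f"playing {name} over the configured audio devices, one per radio channel")


def daemon_loop(poll_seconds=10):
    while RETRANSMISSION_QUEUE:
        source = RETRANSMISSION_QUEUE.pop(0)
        start_remote_recording(source)           # remote recording on the receive station
        start_local_reference_recording(source)  # clean baseline kept at the transmit side
        drive_ptt_relay_bank(source)             # computer-controlled PTT relays
        play_over_audio_devices(source)          # source audio goes out over the radios
        time.sleep(poll_seconds)                 # then poll for the next available recording


if __name__ == "__main__":
    daemon_loop(poll_seconds=0)
```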
0:10:53 | The signal relay is configured for |
---|
0:10:55 | fast attack, |
---|
0:10:56 | a long sustain, and gradual release, |
---|
0:10:59 | and there's a very wide window |
---|
0:11:01 | around the utterances; this is just to sort of maximize the amount of speech that gets transmitted |
---|
0:11:06 | through the system. We also introduced a single power supply and power distribution, to |
---|
0:11:13 | avoid having battery problems with the various handsets that are part of the transmission system. |
---|
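The fast-attack, long-sustain, gradual-release behaviour of the signal relay can be pictured as a simple energy-keyed gate. The sketch below is a toy illustration of that idea under assumed thresholds and time constants; it is not the hardware relay's actual logic.

```python
# Toy illustration of fast-attack / long-sustain / gradual-release keying,
# as performed in hardware by the signal relay; all constants here are invented.
import numpy as np


def vox_gate(signal, rate, threshold=0.02, sustain_s=1.0, release_s=0.5):
    """Return a keying envelope: key up immediately on speech energy,
    hold for `sustain_s` after it stops, then ramp down over `release_s`."""
    frame = int(0.01 * rate)                      # 10 ms analysis frames
    n_frames = len(signal) // frame
    rms = np.array([np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
                    for i in range(n_frames)])
    key = np.zeros(n_frames)
    hold = 0
    sustain_frames = int(sustain_s / 0.01)
    release_frames = max(int(release_s / 0.01), 1)
    for i, level in enumerate(rms):
        if level > threshold:
            key[i] = 1.0                          # fast attack: key up immediately
            hold = sustain_frames
        elif hold > 0:
            key[i] = 1.0                          # sustain: stay keyed through short pauses
            hold -= 1
        else:
            key[i] = max(0.0, key[i - 1] - 1.0 / release_frames) if i else 0.0  # gradual release
    return np.repeat(key, frame)


if __name__ == "__main__":
    rate = 8000
    t = np.arange(rate) / rate
    demo = np.concatenate([0.1 * np.sin(2 * np.pi * 440 * t), np.zeros(2 * rate)])
    gate = vox_gate(demo, rate)
    print(gate[::8000])  # keyed during the tone, held through the pause, then ramped down
```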
0:11:19 | We also introduced an isolation transformer bank, |
---|
0:11:23 | which is here essentially to isolate the system from upstream electronic equipment. |
---|
0:11:30 | The next slide shows you sort of a similar diagram for the receive station, |
---|
0:11:34 | and this is mostly just to indicate the variety of receivers that we have |
---|
0:11:39 | on the platform. |
---|
0:11:41 | So, after recordings are generated, |
---|
0:11:45 | essentially they're uploaded to our server, and then we initiate this fairly lengthy post- |
---|
0:11:49 | processing sequence |
---|
0:11:50 | to align the files and also detect any regions of non-transmission; I'll come back to that in a |
---|
0:11:55 | second. |
---|
0:11:56 | So, to give you a feel for what the resulting recordings sound like, I'm going to play |
---|
0:12:02 | some samples from each of the channels. |
---|
0:12:04 | So first we'll go through the channels; there are eight evaluated channels. |
---|
0:12:10 | Oh, and the reference recording first. |
---|
0:12:13 | [reference audio sample plays] |
---|
0:12:19 | And there's the first channel. |
---|
0:12:29 | [channel audio sample plays] |
---|
0:12:38 | Okay, so channel B is a single-sideband channel; this one is one of the more |
---|
0:12:42 | challenging channels for the RATS systems. |
---|
0:12:48 | [channel B audio sample plays] |
---|
0:12:54 | That's the distortion of channel B, and then channel H is a narrowband channel. |
---|
0:13:07 | This channel is another [inaudible] channel. |
---|
0:13:19 | And then the next channel. |
---|
0:13:26 | [channel audio sample plays] |
---|
0:13:31 | Okay, channel F is our frequency-hopping spread spectrum channel. |
---|
0:13:35 | [channel F audio sample plays] |
---|
0:13:42 | And this channel |
---|
0:13:44 | is wideband. |
---|
0:13:53 | Right, so these present real challenges. These are actually recordings that were transmitted in their entirety; |
---|
0:13:59 | they're |
---|
0:14:00 | quite intelligible, but they take some getting used to, and there are much more difficult |
---|
0:14:06 | recordings in the set of data. |
---|
0:14:09 | So, after |
---|
0:14:11 | the clean signal is transmitted, we have nine resulting audio files: |
---|
0:14:15 | the clean channel and the eight degraded channels. We have a log |
---|
0:14:18 | file that indicates the retransmission start time |
---|
0:14:21 | and all of the source file parameters. |
---|
0:14:23 | We also have what we call a push-to-talk log, which is essentially timestamps of the push- |
---|
0:14:27 | to-talk button going on and off, for each of the individual channels, |
---|
0:14:32 | and then we have the reference recording. |
---|
0:14:34 | Annotation is done on the clean channel only, and now we need to create annotations |
---|
0:14:39 | on each of the degraded channels, |
---|
0:14:42 | projected from the clean channel, as well as very accurate cross-channel alignments. |
---|
0:14:47 | Ideally we'd also like to be able to flag any segments that are impossible for |
---|
0:14:52 | humans to understand; |
---|
0:14:53 | it's not really fair |
---|
0:14:54 | to evaluate system performance on |
---|
0:14:57 | segments that a human can't even understand. |
---|
0:15:00 | So in a perfect world this is easy, right? We start with a |
---|
0:15:04 | source recording, |
---|
0:15:06 | and we've got perfect alignment on the degraded channel recordings, |
---|
0:15:12 | and we can see the regions of non-transmission very cleanly. |
---|
0:15:15 | But that's not really the way things work. |
---|
0:15:21 | In the real world we have any number of challenges on the retransmission. So we |
---|
0:15:25 | have things like channel-specific lag; |
---|
0:15:28 | there is a bit of lag on |
---|
0:15:30 | some of the channels, |
---|
0:15:31 | so there's a delay in the segment correspondences, |
---|
0:15:34 | and it's not |
---|
0:15:35 | the same lag or the same |
---|
0:15:37 | offset for each channel, and so we have to do some channel-specific |
---|
0:15:41 | manipulation to account for that lag. |
---|
0:15:43 | We also have things like |
---|
0:15:46 | differences in the non-transmission regions: |
---|
0:15:48 | these are all regions where the transmitter wasn't engaged, but you can see |
---|
0:15:54 | that for channel A the duration is shorter |
---|
0:15:58 | than for some of the other channels, |
---|
0:15:59 | so we have to account for that. |
---|
0:16:01 | We also have the occasional failure on a particular channel for a session. These are cases |
---|
0:16:06 | where |
---|
0:16:08 | a channel just wasn't engaged during the transmission. |
---|
0:16:11 | And we have the most pernicious problem, which is these channel-specific dropouts, |
---|
0:16:17 | where everything's marching along and one channel, for some reason, just conked out for a |
---|
0:16:22 | segment. |
---|
0:16:23 | And so we have to have ways to detect all of these issues; |
---|
0:16:26 | this is a real challenge in managing the corpus. |
---|
0:16:29 | What we've done is collaborate with the RATS performers to develop a number of techniques |
---|
0:16:34 | to help better manage the data. So Dan Ellis at Columbia has developed two |
---|
0:16:39 | algorithms, skewview and a companion tool, that identify what the initial offset for each channel should be |
---|
0:16:47 | and refine the cross-channel alignment, |
---|
0:16:50 | which is quite interesting. |
---|
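The talk does not go into how those tools estimate the offsets, but the standard approach to this kind of initial offset estimation is cross-correlating each degraded channel against the clean reference. Here is a minimal sketch of that general technique, assuming both signals have already been resampled to a common rate; it illustrates the idea, not the specific skewview implementation.

```python
# Minimal sketch of per-channel offset (skew) estimation by cross-correlation
# against the clean reference recording.
import numpy as np
from scipy.signal import fftconvolve


def estimate_offset(reference, degraded, rate, max_lag_s=30.0):
    """Return the lag in seconds that best aligns `degraded` to `reference`
    (positive means the degraded channel lags the reference)."""
    max_lag = min(int(max_lag_s * rate), len(reference) - 1, len(degraded) - 1)
    # Cross-correlation computed as FFT convolution with the reference reversed.
    xcorr = fftconvolve(degraded, reference[::-1], mode="full")
    center = len(reference) - 1                      # index corresponding to zero lag
    window = xcorr[center - max_lag:center + max_lag + 1]
    return (int(np.argmax(np.abs(window))) - max_lag) / rate


if __name__ == "__main__":
    rate = 1000
    ref = np.random.randn(10 * rate)
    lagged = np.concatenate([np.zeros(2 * rate), ref])[:10 * rate]  # 2.0 s channel lag
    print(round(estimate_offset(ref, lagged, rate), 3))             # prints ~2.0
```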
0:16:52 | LDC also developed our own internal process using RMS scanners |
---|
0:16:59 | to identify long non-transmission regions on the channels, |
---|
0:17:03 | and this |
---|
0:17:07 | is sort of tuned channel by channel. |
---|
0:17:09 | The RMS scans only allow us to detect longer non-transmission regions, about two seconds or greater, |
---|
0:17:16 | and we'd really like to be able to also detect |
---|
0:17:19 | dropouts that are very short but occur quite a bit, and so the RATS community |
---|
0:17:25 | is working on a robust |
---|
0:17:27 | channel-specific energy detector, a non-transmission region detector, |
---|
0:17:31 | that can detect the shorter dropouts. |
---|
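A windowed RMS scan of the kind described, flagging stretches where the level stays below a floor for at least two seconds, can be sketched as follows; the window size and threshold are illustrative values only, not LDC's actual settings.

```python
# Minimal sketch of an RMS-based scan for long non-transmission regions:
# flag any stretch where windowed RMS stays below a floor for >= 2 seconds.
import numpy as np


def find_non_transmission(signal, rate, win_s=0.1, floor=1e-3, min_dur_s=2.0):
    win = int(win_s * rate)
    n = len(signal) // win
    rms = np.sqrt(np.mean(signal[:n * win].reshape(n, win) ** 2, axis=1))
    quiet = rms < floor
    regions, start = [], None
    for i, q in enumerate(np.append(quiet, False)):     # sentinel closes a trailing region
        if q and start is None:
            start = i
        elif not q and start is not None:
            if (i - start) * win_s >= min_dur_s:
                regions.append((start * win_s, i * win_s))  # (onset, offset) in seconds
            start = None
    return regions


if __name__ == "__main__":
    rate = 8000
    sig = np.concatenate([np.random.randn(3 * rate) * 0.1,   # 3 s of signal
                          np.zeros(4 * rate),                 # 4 s dropout
                          np.random.randn(3 * rate) * 0.1])   # 3 s of signal
    print(find_non_transmission(sig, rate))                   # ~[(3.0, 7.0)]
```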
0:17:34 | Quickly moving on to the annotation tasks: now that our channels have been retransmitted and we have |
---|
0:17:41 | alignment across the channels, now we annotate. |
---|
0:17:44 | So there are five core annotation tasks. |
---|
0:17:47 | For speech activity, we're creating an audio segmentation on the clean channel. For LID, |
---|
0:17:53 | we're simply listening to the speech segments and judging them as in or out of |
---|
0:17:57 | the target language. For keyword spotting, we're creating a time-aligned |
---|
0:18:01 | transcript of the speech segments. |
---|
0:18:03 | And then for the speaker ID task, we're listening to portions of all of the individual |
---|
0:18:08 | recordings associated with one speaker ID and verifying that it's indeed the same person. |
---|
0:18:14 | We're also, on a portion of the data, the test data in particular, |
---|
0:18:18 | doing intelligibility audits. This is where we're having our annotators, native-speaker annotators, |
---|
0:18:23 | listen to the degraded recording segments, |
---|
0:18:26 | the speech segments, and say whether they're actually intelligible or not, and this turns out |
---|
0:18:31 | to be a very hard task for humans to do, and agreement among humans on |
---|
0:18:35 | intelligibility is extremely poor. |
---|
0:18:37 | We also do manual adjudication of system outputs to identify any real problems in the annotation |
---|
0:18:43 | data. |
---|
0:18:45 | The annotation release format is really simple: we've got the file metadata, and then for each |
---|
0:18:50 | of the annotations, what the annotation is, |
---|
0:18:53 | and then, importantly, what its provenance is. Because we're reusing some existing data and sort of |
---|
0:18:58 | borrowing annotations from previously developed corpora, we indicate whether the annotation is newly created, whether |
---|
0:19:07 | it's a legacy annotation, or whether it's an automatic annotation, for instance from a speech |
---|
0:19:11 | activity detection system. |
---|
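As an illustration only, a release record along those lines might look like the following; the field names and values here are invented for the example and are not LDC's actual schema.

```python
# Illustrative example of a release record: file-level metadata, the annotation
# itself, and its provenance. All field names and values are hypothetical.
record = {
    "file_id": "example_channel_B",     # hypothetical identifier
    "language": "Levantine Arabic",
    "annotation": {"type": "speech_activity", "start": 12.40, "end": 15.85, "label": "S"},
    "provenance": "automatic",          # one of: "new", "legacy", "automatic"
}
print(record)
```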
0:19:14 | So now we've got our annotations on the clean channel, we've got alignments across the |
---|
0:19:18 | degraded channels, and now we need to project the annotations onto those degraded channels. We start |
---|
0:19:23 | out with the clean-channel segmentation: the green is |
---|
0:19:25 | speech, yellow is non-speech. |
---|
0:19:28 | We project that onto each of the degraded channels that have already been aligned. |
---|
0:19:33 | We identify the non-transmission regions as indicated by the push-to-talk |
---|
0:19:38 | logs. |
---|
0:19:39 | We adjust for the rest of the lag that happens on |
---|
0:19:43 | specific channels. |
---|
0:19:46 | We run our RMS scans and find the files that failed transmission entirely and exclude |
---|
0:19:51 | those from the corpus. |
---|
0:19:53 | And then finally we run our energy detectors, the non-transmission detectors, and find |
---|
0:19:59 | any segments where |
---|
0:20:01 | the push-to-talk button logs say there was a transmission but actually there's |
---|
0:20:05 | no signal, |
---|
0:20:06 | and so we flag those, and now we have annotations for each of the degraded |
---|
0:20:11 | channels as well. |
---|
0:20:12 | So as a result, in each file, for each segment we have one of five values. |
---|
0:20:19 | We have S for speech, meaning |
---|
0:20:21 | there was a transmission containing speech; NS means there was a transmission with non-speech; T |
---|
0:20:27 | means there was a transmission but it hasn't been labeled as to whether it contains speech |
---|
0:20:31 | or not; |
---|
0:20:31 | NT means there was no transmission; and then there's this RX |
---|
0:20:35 | setting, which is where |
---|
0:20:36 | we detected a transmission failure automatically. |
---|
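Putting the projection steps and that label set together, here is a minimal sketch of how clean-channel segments might be mapped onto one degraded channel. The interval data, the lag value, and the helper logic are invented for illustration; the T label (transmitted but unlabeled audio) applies to files without clean-channel annotation and is not produced here.

```python
# Minimal sketch of projecting clean-channel speech/non-speech segments onto a
# degraded channel, following the steps described above. Times are in seconds;
# all of the example data is invented.

def project(clean_segments, channel_lag, ptt_on_regions, detected_dropouts):
    """clean_segments: list of (start, end, 'S' or 'NS') from the clean channel.
    ptt_on_regions: intervals where the push-to-talk log says the channel transmitted.
    detected_dropouts: intervals the automatic energy detector flagged as dead air."""
    out = []
    for start, end, label in clean_segments:
        start, end = start + channel_lag, end + channel_lag   # channel-specific lag
        if not any(s <= start and end <= e for s, e in ptt_on_regions):
            out.append((start, end, "NT"))                    # no transmission per PTT log
        elif any(not (end <= s or start >= e) for s, e in detected_dropouts):
            out.append((start, end, "RX"))                    # PTT says on, but no signal
        else:
            out.append((start, end, label))                   # S or NS carried over
    return out


if __name__ == "__main__":
    clean = [(0.0, 2.0, "S"), (2.0, 3.0, "NS"), (5.0, 7.0, "S")]
    print(project(clean, channel_lag=0.25,
                  ptt_on_regions=[(0.0, 4.0)],
                  detected_dropouts=[(1.0, 1.5)]))
```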
0:20:41 | Okay, now quickly moving to SID in particular. This evaluation |
---|
0:20:46 | is just getting underway; the dry run evaluation is actually happening next week. |
---|
0:20:52 | For SID, we're defining a progress set, which is two hundred fifty speakers |
---|
0:20:57 | with ten sessions for each speaker. Nominally this is fifty speakers per language, although it |
---|
0:21:02 | won't actually play out that way. Six of the sessions per speaker are going to be |
---|
0:21:07 | sequestered by the evaluation team, which is SAIC, to be used for |
---|
0:21:10 | enrollment; |
---|
0:21:11 | the other four sessions per speaker are used for test. |
---|
0:21:15 | There's a dev test set that has the same characteristics as the progress |
---|
0:21:19 | set, |
---|
0:21:20 | and then there's this additional general-use dataset, which is two hundred fifty speakers that |
---|
0:21:24 | have just two sessions each, |
---|
0:21:26 | and the performers can do whatever they like with this general-use set. |
---|
0:21:31 | SID within RATS is being evaluated in an open-set |
---|
0:21:34 | paradigm: systems need to provide independent decisions about each of the target speakers |
---|
0:21:39 | from the set of candidate speakers, without any |
---|
0:21:43 | knowledge |
---|
0:21:44 | of which impostors are in the test data. |
---|
0:21:48 | All speakers in the test will be enrolled, and some test samples will be |
---|
0:21:52 | used as impostors |
---|
0:21:54 | for the other trials, |
---|
0:21:56 | and the performers have agreed to avoid using the enrollment samples for any |
---|
0:22:00 | purpose other than the target speaker enrollment, so they can't be used for training |
---|
0:22:05 | in the trials involving that speaker. |
---|
0:22:08 | We also distribute the NIST SRE data as sort of background modeling data |
---|
0:22:14 | for the performers; that data has been pushed through the retransmission system. |
---|
0:22:19 | So far we've delivered something like fifteen hundred |
---|
0:22:23 | single-call speakers; these are people who started out with the goal of making calls |
---|
0:22:28 | and dropped out of the collection. So most people drop out, |
---|
0:22:31 | and ninety percent of the people drop the collection after one call. We now have a hundred and |
---|
0:22:36 | thirty-seven speakers that have two to nine calls each, |
---|
0:22:41 | a hundred and eighty-three speakers that have ten calls each, and our goal is, again, two |
---|
0:22:47 | hundred fifty speakers with at least ten calls and another two hundred fifty that have at least two. |
---|
0:22:54 | This slide |
---|
0:22:55 | just summarizes the total amount of data that's been processed through the RATS system to |
---|
0:22:58 | date. This is about a month out of date, so I think |
---|
0:23:03 | we can add five hundred to the bottom line here. |
---|
0:23:05 | So we've transmitted over three thousand hours, probably closer to thirty-five hundred hours now, |
---|
0:23:11 | of source data, yielding about sixteen thousand hours or more of degraded audio channels, and |
---|
0:23:17 | this includes |
---|
0:23:19 | four hundred hours of data labeled for SAD, |
---|
0:23:21 | seven hundred twenty hours labeled for language ID, and about four hundred hours of keyword spotting |
---|
0:23:25 | transcripts. |
---|
0:23:28 | I'll come to the conclusion since I'm running out of time. So, in summary, over |
---|
0:23:33 | the past |
---|
0:23:34 | year and a half, I guess, LDC has designed and deployed this multi-radio-channel collection |
---|
0:23:39 | platform. We've undertaken a very large-scale data collection, including retransmission and annotation, |
---|
0:23:46 | of five very challenging languages. |
---|
0:23:48 | We've retransmitted over three thousand hours of data, yielding more than sixteen thousand hours |
---|
0:23:52 | of degraded signal. |
---|
0:23:55 | We've annotated over fifteen hundred hours of clean-signal data and generated corresponding degraded-channel |
---|
0:24:01 | annotations. |
---|
0:24:02 | We've developed, independently and also with lots of input from the RATS performers, several |
---|
0:24:07 | algorithms to improve the overall quality of the transmitted data. |
---|
0:24:11 | We've supported lots of new requests for new kinds of annotation and collection. |
---|
0:24:16 | The dry run evaluation is starting next week and people are very nervous, and |
---|
0:24:21 | this is really hard data, |
---|
0:24:23 | so I'm very eager to see what happens. |
---|
0:24:27 | And thank you. |
---|
0:24:28 | [applause] |
---|
0:24:41 | [audience question, inaudible] |
---|
0:24:56 | We would like to put |
---|
0:24:59 | the receivers, the listening post, in a moving vehicle; we've looked at that at one point or |
---|
0:25:04 | another, but we don't have the funding to support that model. So the transmitters and |
---|
0:25:10 | receivers are at LDC; they're about thirty meters apart, but there are significant structural barriers |
---|
0:25:16 | in between the transmit and the receive station. |
---|
0:25:20 | So there's, |
---|
0:25:21 | like, the core of the building between the transmit and receive stations. That's the best |
---|
0:25:25 | we could do with the resources available. We are proposing, for phase two, to address a |
---|
0:25:29 | novel channel selection that may involve placing the listening post |
---|
0:25:35 | in a more remote location, |
---|
0:25:37 | or even doing some of the collection with listening post motion. |
---|