Okay, I'm presenting this on behalf of Kevin Walker, who wasn't able to attend due to a very normal aversion to a sixteen-hour plane ride.
So this is kind of a departure from the other talks in the session, and in the conference as a whole, but I think it's of interest to this community nonetheless. I'm going to briefly describe the RATS program and its goals, and then really delve into the data creation process for RATS. I'll talk a little bit about how we're generating the content that's used in the RATS program, the system that we built to produce degraded-audio-quality recordings for the program, talk a bit about the annotation process, and then focus on some details of the speaker ID evaluation that's just about to start within RATS.
So by way of introduction, RATS is a three-year DARPA program that's targeting speech in extremely noisy and highly distorted channels. Specifically, it's targeting noise that isn't background noise but noise in the signal itself; degraded radio transmissions are the target kind of channel. The four evaluation tasks within RATS are speech activity detection, language ID, speaker ID, and keyword spotting, and there are five very challenging languages that we're targeting. In phase one of RATS, the training and the test data are based on material that LDC is providing; later phases will also test on operational data, although there won't be any training data from the operational environment.
So in order to produce data that is operationally relevant, LDC needed to understand a little bit about the nature of this data. Talking to the community, we understood the operational data to have a really wide range of noise characteristics. For the structural properties of the data, what we're thinking about is something like radio chatter from a taxicab driver, where the radio channel is always on in the background. Ham radio data is also a good approximation of the structural properties of the data we're targeting in terms of the density of talk and how long the turns are: the turns are very short, there's very rapid back-and-forth turn-taking, there's lots of intervening silence, and there are also occasional bursts of excited speech. In terms of the types of noise of interest to the program, air traffic control transmissions are a good approximation of the type of noise that we're interested in, so we get things like static and various types of channel interference, and also the use of push-to-talk devices, which can introduce squelch.
And so in our collection we also want to target data that's more or less understandable by a human, but at the hard end of that range. We want data that's challenging for a human to understand but not impossible; if it's impossible for a human, we can't really pursue it beyond that.
In terms of the nature of the speech, we want it to be communicative and transactional, and ideally goal-oriented. It may be two-party or multiparty speech, half duplex, full duplex, or even trunked, like the dispatch communication that a police department uses. We're targeting narrowband, wideband, and spread spectrum transmissions, and also a real variety of geographical and topographical environments that may affect the radio channel performance and the transmission quality, with lots of background interference as well. The speakers may be stationary or they may be in motion, and the listening post may also be in motion; you can imagine a drone flying over a surveillance area collecting data. And the speakers may or may not know one another.
So I'll skip over the overview and jump into the types of data that we're targeting. We made some use of found data: there is data you can get on the web that has the sorts of noise properties we're targeting. This is mostly shortwave transmissions, in that a lot of ham radio operators post videos on YouTube of their setup; it's just a stationary image of their setup, but you get the audio track of these shortwave transmissions that they're receiving, which are really interesting.
We're also doing a limited collection of shortwave transmissions at LDC. We made fairly heavy use of existing data sets in this program, primarily because many of these data sets were already richly annotated with the features of interest. So for instance, we're using all of the exposed NIST speaker recognition test sets; this is primarily English, but they have speaker identities verified already, and we know more or less what the languages of these recordings are. Similarly, we use the exposed NIST LRE test sets, and also several of the existing LDC corpora like CALLFRIEND that exist in various languages and are at least partially verified for language and speaker ID, the Fisher Levantine Arabic corpus of telephone speech that has both language and speaker verification, and also some broadcast recordings where we know the language more or less but don't know, for instance, the speaker.
The bulk of the data that LDC is producing for the RATS program is new data collection, either locally in Philadelphia or from vendors around the world, and this is primarily telephone speech, although we're doing some live recordings as well. We're targeting two types of data: general conversation, and also some scenario-based recordings where people are engaged in some collaborative problem-solving task, like playing a game of twenty questions or engaging in a scavenger hunt with one another. And importantly, a fundamental keystone of our system is that we always would like to have a clean recording for purposes of manual annotation; the idea is that this clean recording is then retransmitted in order to introduce the kinds of signal degradation that the program targets.
So in order to generate that signal degradation, we developed a multi-channel communication collection platform. We wanted this platform to be capable of transmitting speech over radio communication links where the transmission itself introduces the types of noise conditions and signal quality variation we're interested in for the program. The platform that we developed is capable of simultaneous transmission over up to eight different radio channels, with each channel targeting a different type and degree of noise, and again preserving the clean input channel to facilitate the manual annotation process. Now there's a wrinkle here, which is that this need to do annotation on the clean channel requires a very careful process to align the clean channel with, and to project annotations from the clean channel onto, the eight degraded channels, and that's a very challenging problem.
Some other design principles: we wanted the system to be able to be used for either live sessions or retransmissions, and we wanted a wide range of channel types with different modulations, bandwidths, and different types of interference. We also, wherever possible, wanted the actual components of the system to have some operational relevance, so we did some research into the kinds of handsets and push-to-talk devices and that sort of thing that might actually be used in an operational environment.
For the radio channels themselves, first we selected transceivers whose ERP ranged from 0.5 to 12 watts. The transceivers and receivers are equipped with multiple omnidirectional low-gain antennas, and the transceivers we selected are designed for half-duplex analog communication, because this is what we found was primarily used in the real-world data. Importantly, they operate on a shared-channel model, so they can either be in transmit mode or receive mode, but they can't be in both simultaneously.
So these are some of the radio channels that we developed, and really this table is just to give you a feel for the range of transmitters and receivers, in particular the bandwidth variation and the different types of modulation that we were targeting. I'm not going to have time to go into these in too much detail.
Okay, so the image here is fairly complex; this is the RATS transmit station. I'll walk through the protocol for transmission briefly. We start with the transmit station control computer, here. There's a daemon running on the transmit station control computer that's querying the database for recordings that are available for retransmission. When it finds a recording, the control computer initiates a remote recording on the receive station control computer, and it also initiates a local reference recording that we keep just as a baseline. It also spawns a subprocess to drive a computer-controlled push-to-talk relay bank, which is controlled based on a signal relay output; that's this portion of the device. When the system is in transmit mode, it begins playing the source recording output over the specified audio devices, and the depiction of the audio devices is down here. The signal relay is configured for a fast attack, long sustain, and gradual release, with a very wide margin around utterances; this is just to maximize the amount of speech that gets transmitted through the system. We also introduced a single power supply and power distribution, to avoid having battery problems with the various handsets that are part of the transmission system. And we also introduced an isolation transformer bank, which is here essentially to isolate the system from upstream electronic equipment.
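To make that relay keying behavior concrete, here is a minimal sketch of a fast-attack, hold-then-release gate applied to frame energies. This is an illustration of the general technique, not LDC's actual implementation; the threshold and timing values are assumptions, and the sustain and gradual release are collapsed into a single hold time.

```python
# Sketch of "fast attack, long sustain, gradual release" PTT keying.
# All constants are illustrative assumptions, not LDC's values.
import numpy as np

def ptt_gate(frame_rms, frame_dur=0.01, thresh=0.02, hold_s=1.5):
    """Return a per-frame on/off keying decision for the PTT relay."""
    keyed = np.zeros(len(frame_rms), dtype=bool)
    hold = 0.0
    for i, rms in enumerate(frame_rms):
        if rms >= thresh:      # fast attack: key up immediately on signal
            keyed[i] = True
            hold = hold_s      # hold models sustain plus gradual release
        elif hold > 0.0:       # keep the channel keyed through short pauses
            keyed[i] = True
            hold -= frame_dur
    return keyed
```

The wide hold window is what keeps the relay keyed across brief pauses, so as little speech as possible is clipped by the transmitter.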
The next slide shows you a similar diagram for the receive station, and this is mostly just to indicate the variety of receivers that we have. So after recordings are generated, they're uploaded to our server, and then we initiate a rather lengthy post-processing sequence to align the files and also detect any regions of non-transmission; I'll come back to that in a second.
So to give you a feel for what the resulting recordings sound like, I'm going to play samples from each of the channels, A through H, starting with the clean reference recording.
[plays reference recording]
There's channel A. [plays audio sample]
Okay, so channel B is a single-sideband channel; this one is one of the more challenging channels for the RATS systems.
[plays channel B sample]
That's the distortion on channel B. Then channel H is a narrowband channel; that channel is another challenge. [plays channel H sample]
Okay, channel F is our frequency-hopping spread spectrum channel, and there's a wideband channel as well. [plays remaining channel samples]
So there are real challenges here for the systems. These are actually recordings that were transmitted in their entirety; they're quite intelligible, but they take some getting used to, and there are much more difficult recordings in the data set.
So after the clean signal is transmitted, we have nine resulting audio files: the clean channel and the eight degraded channels. We have a log that indicates the retransmission start time and all of the source file parameters. We also have what we call a slog, which is essentially timestamps of the push-to-talk button going on and off, for each of the individual channels. And then we have the reference recording.
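Since the slog files drive the later projection steps, here is an illustrative sketch of reading one. The talk doesn't show the concrete file format, so the line layout assumed here (channel, key-on time, key-off time) is hypothetical.

```python
# Hypothetical parser for a per-channel push-to-talk timestamp log ("slog").
# Assumes simple whitespace-separated lines: "channel on_time off_time".
def parse_slog(path):
    """Return {channel: [(on_s, off_s), ...]} keying intervals."""
    intervals = {}
    with open(path) as f:
        for line in f:
            chan, on_s, off_s = line.split()
            intervals.setdefault(chan, []).append((float(on_s), float(off_s)))
    return intervals

# e.g. a line like "B 12.480 17.935" would mean channel B's PTT relay
# was keyed from 12.48 s to 17.935 s in the session.
```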
Annotation is done on the clean channel only, and then we need to create annotations on each of the degraded channels, projected from the clean channel, as well as very accurate cross-channel alignments. Ideally we'd also like to be able to flag any segments that are impossible for humans to understand; it's not really fair to evaluate system performance on segments that a human can't even understand.
So in a perfect world it's easy, right? We start with a source recording, we've got perfect alignment on the degraded channel recordings, and we see the regions of non-transmission very cleanly. But that's not really the way things work.
In the real world we have any number of challenges on the retransmission. We have things like channel-specific lag: there is a bit of lag on some of the channels, so there's a skew in the segment correspondences, and the lag is not the same offset for each channel, so we have to do some channel-specific manipulation to account for that lag.
We also have things like varied duration in the non-transmission regions. These are all regions where the transmitter wasn't engaged, but you can see that for channel A the duration is shorter than for some of the other channels, so we have to account for that.
We also have the occasional failure on a particular channel for a session; these are cases where the channel just wasn't engaged during the transmission. And we have the most pernicious problem, which is these channel-specific dropouts, where everything is marching along fine and then, for some reason, a channel just conked out for a segment. So we have to have ways to detect all of these issues, and this has been a real challenge in managing the corpus.
What we've done is collaborate with the RATS performers to develop a number of techniques to help better manage the data. Dan Ellis at Columbia has developed a couple of algorithms, skewview among them, that identify what the initial offsets for each channel should be, improving the cross-channel alignment.
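In the same spirit as those alignment tools (though not their actual code), a per-channel initial offset can be estimated by cross-correlating energy envelopes of the clean and degraded recordings. A minimal sketch under that assumption:

```python
# Estimate a degraded channel's offset against the clean source by
# cross-correlating RMS energy envelopes. Illustrative, not Dan Ellis's code.
import numpy as np

def estimate_skew(clean, degraded, sr, frame=0.01, max_skew_s=5.0):
    """Return the degraded channel's lag (seconds) relative to clean."""
    hop = int(sr * frame)
    def envelope(x):
        n = len(x) // hop
        return np.sqrt((x[:n * hop].reshape(n, hop) ** 2).mean(axis=1))
    e1, e2 = envelope(clean), envelope(degraded)
    n = min(len(e1), len(e2))
    e1, e2 = e1[:n] - e1[:n].mean(), e2[:n] - e2[:n].mean()
    max_lag = int(max_skew_s / frame)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.dot(e1[max(0, -l):n - max(0, l)],
                     e2[max(0, l):n - max(0, -l)]) for l in lags]
    return lags[int(np.argmax(scores))] * frame   # best-matching offset
```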
LDC also developed our own internal process using RMS scans to identify long non-transmission regions on the channels, and this is tuned channel by channel. The RMS scans only allow us to detect longer non-transmission regions, about two seconds or greater, and we'd really like to be able to also detect dropouts that are very short but occur quite a bit. So the RATS community is working on a robust channel-specific energy detector, a non-transmission region detector, that can detect these shorter dropouts.
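The RMS scan idea can be sketched as follows: compute frame-level RMS energy and flag sustained stretches below an energy floor as candidate non-transmission regions. The floor and durations here are illustrative assumptions; the per-channel tuning just described is not shown.

```python
# Toy RMS scan: flag windows whose energy stays below a floor for at
# least min_dur seconds. Thresholds are illustrative assumptions.
import numpy as np

def find_non_transmission(signal, sr, floor=1e-3, min_dur=2.0, frame=0.05):
    """Return [(start_s, end_s)] regions of sustained near-silence."""
    hop = int(sr * frame)
    n = len(signal) // hop
    rms = np.sqrt((signal[:n * hop].reshape(n, hop) ** 2).mean(axis=1))
    quiet = rms < floor
    regions, start = [], None
    for i, q in enumerate(np.append(quiet, False)):  # sentinel closes runs
        if q and start is None:
            start = i
        elif not q and start is not None:
            if (i - start) * frame >= min_dur:
                regions.append((start * frame, i * frame))
            start = None
    return regions
```

Shrinking min_dur naively would also flag ordinary pauses, which is why the shorter dropouts call for a more robust, channel-specific detector.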
Quickly moving on to the annotation tasks: now that we've got the channels and accurate alignment across the channels, we annotate. There are five core annotation tasks. For speech activity, we're creating an audio segmentation on the clean channel. For LID, we're simply listening to the speech segments and judging them as in or out of the target language. For keyword spotting, we're creating a time-aligned transcript for the speech segments. And then for the speaker ID task, we're listening to portions of all the individual channel recordings associated with one speaker ID and verifying that it's indeed the same person.
We're also, on a portion of the data, the test data in particular, doing intelligibility auditing. This is where we're having our annotators, native speaker annotators, listen to the degraded speech segments and say whether they're actually intelligible or not. This turns out to be a very hard task for humans to do, and agreement among humans on intelligibility is extremely low. We also do post hoc adjudication of system outputs to identify any real problems in the annotation data.
The annotation release format is really simple: we've got the file metadata, and then for each of the annotations, what the annotation is and then, importantly, what its provenance is. Because we're reusing some existing data and borrowing annotations from previously developed corpora, we indicate whether the annotation is newly created, whether it's a legacy annotation, or whether it's an automatic annotation, for instance from a speech activity detection system.
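As a hypothetical illustration of such a record layout (the talk names the fields but not the syntax, so the column format and values here are invented):

```
file_id   channel  start_s  end_s   label  provenance
10234_a   A        12.48    17.93   S      new
10234_a   A        17.93    19.10   NS     legacy
10234_a   A        19.10    25.02   S      automatic
```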
So now we've got our annotations on the clean channel and we've got alignments across the degraded channels; now we need to project the annotations onto those degraded channels. We start out with the clean-channel annotation, where green is speech and yellow is non-speech. We project that onto each of the degraded channels that have already been aligned. We identify the non-transmission regions, as indicated by the push-to-talk slogs. We adjust for the residual lag that happens on specific channels. We run our RMS scans, find the files that failed transmission entirely, and exclude those from the corpus. And then finally we run our energy detectors, the non-transmission detectors, and find any segments where the push-to-talk logs say there was a transmission but actually there's no signal. We flag those, and now we have annotations for each of the degraded channels as well.
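A simplified sketch of that projection step for one channel might look like the following. The segment and interval representations are assumptions, and the pipeline is reduced to a single per-channel offset plus PTT masking; the RMS-scan and dropout-flagging steps would run on top of this.

```python
# Project clean-channel segments onto a degraded channel: shift by the
# channel's estimated lag, then keep only portions inside keyed (PTT-on)
# intervals; everything outside them is non-transmission. Illustrative only.
def project_segments(clean_segments, lag_s, ptt_intervals):
    """clean_segments: [(start, end, label)]; returns projected segments."""
    projected = []
    for start, end, label in clean_segments:
        start, end = start + lag_s, end + lag_s   # channel-specific lag
        for on_s, off_s in ptt_intervals:
            s, e = max(start, on_s), min(end, off_s)
            if s < e:                             # overlap with keyed region
                projected.append((s, e, label))
    return projected
```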
So as a result, in each file every segment has one of five values. S means there was a transmission of speech. NS means there was a transmission of non-speech. T means there was a transmission, but it hasn't been labeled as to whether it contains speech or not. NT means there was no transmission. And then there's the RX value, which means we detected a transmission failure automatically.
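For reference, the five values as a simple lookup (the labels come from the talk; the dictionary itself is just illustrative):

```python
# The five per-segment values used in the degraded-channel annotations.
SEGMENT_LABELS = {
    "S":  "transmission, speech",
    "NS": "transmission, non-speech",
    "T":  "transmission, speech status unlabeled",
    "NT": "no transmission (PTT off)",
    "RX": "transmission failure detected automatically",
}
```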
Okay, now quickly moving to the SID task in particular. This evaluation is just getting underway; the dry run evaluation is actually happening next week. For SID we're defining a progress set, which is two hundred fifty speakers with ten sessions for each speaker. Nominally this is fifty speakers per language, although it won't actually play out exactly that way. Six of the sessions per speaker are going to be sequestered by the evaluation team, which is SAIC, to be used for enrollment; the other four sessions per speaker are used for test. There's a dev-test set that has the same characteristics as the progress set, and then there's this additional general-use dataset, which is two hundred fifty speakers that have just two sessions each, and the performers can do whatever they like with this general-use data.
SID within RATS is being evaluated in an open-set paradigm: systems need to provide independent decisions about each of the target speakers from among the candidate speakers, without any knowledge of which impostors are in the test data. All speakers in the test will be enrolled, and some samples will be used as impostors for the other trials. And the performers have agreed to avoid using the enrollment samples for any purpose other than target speaker enrollment, so they can only be used in the trials involving that speaker.
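A minimal sketch of what "independent decisions per trial" means in that open-set setting, with placeholder model, scoring function, and threshold (all assumptions, not the SAIC evaluation code):

```python
# Open-set trials: each (target, test-sample) pair gets an independent
# accept/reject decision, with no knowledge of which samples are impostors
# and no cross-trial normalization. Names here are hypothetical.
def run_trials(targets, test_samples, score, threshold=0.0):
    """targets: {speaker_id: model}; returns {(speaker_id, sample_id): bool}."""
    decisions = {}
    for spk, model in targets.items():
        for sample_id, audio in test_samples.items():
            decisions[(spk, sample_id)] = score(model, audio) > threshold
    return decisions
```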
We also distribute the NIST SRE data as background and modeling data for the performers; that data has been pushed through the retransmission system. So far we've delivered something like fifteen hundred single-speaker calls; these are people who started out with the goal of making ten calls and dropped out of the collection, and roughly ninety percent of the people who drop out do so after one call. We have a hundred and thirty-seven speakers with two to nine calls each and a hundred eighty-three speakers with ten calls each, and our goal again is two hundred fifty speakers with ten calls, plus the other two hundred fifty with two.
This slide just summarizes the total amount of data that's been processed through the RATS system to date. This is about a month out of date, so I think we can add five hundred hours to the bottom line here. We've transmitted over three thousand hours, probably closer to thirty-five hundred hours now, of source data, yielding about sixteen thousand hours or more of degraded audio channels. This includes four hundred hours of data labeled for speech activity detection, seven hundred twenty hours labeled for language ID, and about four hundred hours of keyword spotting transcripts.
I'll come to the conclusion since I'm running out of time. So in summary: over the past year and a half, LDC has designed and deployed this multi-radio-channel collection platform. We've undertaken a very large-scale data collection, including retransmission and annotation, for five very challenging languages. We've retransmitted over three thousand hours of data, yielding more than sixteen thousand hours of degraded signal; we've annotated over fifteen hundred hours of clean signal data and generated the corresponding degraded channel annotations. We've developed, independently and also with lots of input from the RATS performers, several algorithms to improve the overall quality of the transmitted data. We've supported lots of requests for new kinds of annotation and collection. The dry run evaluation is starting next week; people are very nervous, and since this is really our data, I'm very eager to see how it goes. Thank you.
[applause; question from the audience, partly inaudible]
We would like to put the receivers, the listening post, in a moving vehicle and look at that kind of assessment, but we don't have the funding to support that model. So the transmitters and receivers are at LDC; they're about thirty meters apart, but there are significant structural barriers in between the transmit and the receive station. The core of the building is between the transmit and receive stations; that's the best we could do with the resources available. We are pursuing, for phase two, a novel channel selection that may involve placing the listening post in a more remote location, or even doing some of the collection with the listening post in motion.