So thank you very much for the introduction. I'd like to say that this work was focused primarily on my two students, Qian Zhang and Gang Liu, who worked on this as part of the LRE efforts we've been looking at. They were both supposed to be here, but unfortunately the process for getting a visa to Finland is a little bit more elaborate from the States and they weren't able to get here, so this represents their work and the credit goes to them.
I was told someone was going to pass the baton over to me and say something about, I don't know, passing on the highway or something like that, and I was afraid I was going to get into a bad spot here, so I'll start off the talk by thanking the organisers for last night. I pulled up a bunch of pictures: you can see we have one colleague sitting out here kind of waving to everyone, and Tomi sits here with plenty of energy. And even though this city is named after Joe, I wasn't expecting people to go diving into the lake and paddle around; I took the gentle approach of just sticking a toe in the water.
Right, so now that we've adjusted for the event pairing for the morning, here's the outline of the talk. First I'll talk about robust language recognition and some ideas we're looking at in this area; the focus of this talk will be a little more on feature characterisation, since we have a number of different features we're exploring. From that, we'll talk about a proposed fusion system we're looking at. Then our evaluations are on two different corpora: the DARPA RATS corpus, which is a very noisy corpus, and the NIST LRE, which we just heard about; it's the '09 test set that we're working with.
And then some performance analysis and conclusions. So to begin with the focus: when you look at language ID, you could simply say the purpose is to distinguish one language from a set of languages, or from multiple languages, but the type of task you're looking at might be different depending on the context.
You probably know that in the NIST LRE there are a number of different scenarios. You're looking, for example, at Urdu and Hindi, or let's say Russian and Ukrainian; these are languages that are close to each other, and while they are unique, separate languages, they may be only a little bit different, like dialects of a particular language. On the other hand, you could have very distinct languages that are really far apart. The classifiers and features that you might use for languages that are spaced really far apart may not necessarily be the best when the scenario you're looking at is closely spaced languages, or dialects of the same language.
Now, the challenge that I think is becoming more and more relevant in the language ID space is not just the space between the languages, but the space between the different characteristics that you might see in the audio streams you're going to be using. It's much more likely these days that you use found data to help build acoustic models, particularly for the out-of-set languages, and not knowing the context in which the audio was captured for those out-of-set languages introduces a lot of challenges.
We had a paper at Interspeech two years back entitled "Dialect ID: is the secret in the silence?", and this was by no means an indictment of LDC's strong efforts to collect a wide variety of language data for both dialect and language ID. We had done some studies on an Arabic corpus, a five-dialect set for Arabic, and compared that against four Arabic corpora available from LDC. We found that, in fact, if you threw away all the speech from those corpora and focused only on the silence sections, you actually did better at language ID, or dialect ID. What that actually tells us is
that if you're not sure how the data was collected, you're probably doing channel, handset, or microphone ID, and not necessarily dialect ID. So the work we're looking at here is actually to see if we can improve performance and robustness. I'll note that in some previous work we did a lot of joint efforts with the IBM, SRI, and BBN teams working on the DARPA RATS language ID task, which is very noisy. More recently our work has focused a little bit more on improving open-set, out-of-set language rejection, primarily because we were interested in seeing how we can come up with more efficient ways to develop background models for when we don't have all of the rejection-language information we're trying to train against. In this study we're going to focus a little more on alternate features, as well as various backend classifiers and fusion.
So three different sets of features are being considered here. First, the classical features that you might expect to see in a typical speech application; we have four different feature sets there. Then there are what I'll call innovative features: the power-normalized cepstral coefficients, PNCC, from the CMU group, and the perceptual minimum variance distortionless response cepstra, PMVDR, a set of features we had presented maybe ten years back at one of the Interspeech meetings and used for speech recognition. And then a number of extension features; we refer to these that way primarily because there's additional processing associated with them, as opposed to simply extracting a base feature set. These include various versions of MFCC features depending on the window used, as well as LFCC and RASTA-PLP type features. Those are the three classes of features we've been working with.
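Since we're talking about front ends, here is a minimal sketch of what extracting one of the classical feature streams might look like in Python. The file name, the 8 kHz telephone-band rate, and the window settings are illustrative assumptions, not the configuration from the paper.

```python
import librosa

# File name, sample rate, and 25 ms / 10 ms windowing are illustrative only.
y, sr = librosa.load("utterance.wav", sr=8000)

# Keep 7 static coefficients per frame, matching the N=7 used for the
# shifted-delta-cepstra stacking described below.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=7, n_fft=200, hop_length=80).T
# mfcc now has shape (n_frames, 7): one cepstral vector per 10 ms hop.
```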
To give you a flow diagram of how the data is being extracted: in the paper we summarise all the different aspects, but these are the various sets of features coming out of our system, and in the next part we'll look at how we actually extract them. In the front end we have a speech activity detector that uses a common setup we developed for the RATS program, and standard shifted delta cepstra features with a 7-1-3-7 configuration.
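For those unfamiliar with shifted delta cepstra, here is a minimal sketch of the 7-1-3-7 stacking (N=7 cepstra, delta spread d=1, shift P=3, k=7 blocks). Clamping at the edges is my assumption, not necessarily the paper's handling.

```python
import numpy as np

def shifted_delta_cepstra(C, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra with N-d-P-k (here 7-1-3-7) stacking.

    C: (T, n_ceps) array of frame-level cepstra; the first N coefficients
    are used. For each frame t, k delta blocks are stacked, the i-th block
    being c[t + i*P + d] - c[t + i*P - d]. Edge frames are clamped.
    """
    T = C.shape[0]
    C = C[:, :N]
    sdc = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            hi = min(t + i * P + d, T - 1)
            lo = max(t + i * P - d, 0)
            sdc[t, i * N:(i + 1) * N] = C[hi] - C[lo]
    return sdc  # shape (T, N*k), i.e. 49-dimensional frames for 7-1-3-7
```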
We have a UBM and a state-of-the-art i-vector based system, and we use an LDA-based setup for dimensionality reduction. On the backend processing we do duration and length normalisation, and we have two different setups for the two classifiers: one a generative Gaussian backend, and the other a Gaussianized cosine distance scoring strategy.
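As a sketch of the cosine distance scoring side: length-normalize the i-vectors, score each test i-vector against per-language mean i-vectors, then normalize the score distributions. The per-language z-normalization here is my assumption for the "Gaussianized" step; the paper's exact transform may differ.

```python
import numpy as np

def length_norm(x):
    """Project vectors onto the unit sphere (standard i-vector length norm)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cds_scores(test_ivectors, language_means):
    """Cosine distance scoring: similarity of each test i-vector to each
    language-mean i-vector. Shapes: (n_test, dim) and (n_lang, dim)."""
    return length_norm(test_ivectors) @ length_norm(language_means).T

def gaussianize(scores):
    """Z-normalize each language's scores across trials so per-language
    score distributions are comparable before fusion (an assumption)."""
    return (scores - scores.mean(axis=0)) / scores.std(axis=0)
```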
The system flow diagram looks like this. We have our input audio data here; the two audio datasets you see basically represent the raw data for the UBM construction and for the total variability matrix needed for the i-vector setup, and these two datasets are actually the same as what we use for our training set. The generative Gaussian backend is on this side, and the cosine distance scoring setup is here. Then we do score fusion: score processing first, and then we fuse the setups.
So for system fusion, our setup looks like this. We can do feature concatenation; that's one of the approaches we look at, where you just concatenate the feature sets. For backend fusion we use FoCal to fuse the backend systems, and those come together in the final decision surface.
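FoCal itself learns a calibrated linear fusion of the subsystem score vectors; as a simplified stand-in (not the FoCal toolkit's exact objective), one can fit a multiclass logistic regression on the horizontally stacked per-system scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_score_fusion(score_list, labels):
    """Fit a linear logistic-regression fuser over per-system score matrices.

    score_list: list of (n_trials, n_langs) arrays, one per subsystem.
    labels: (n_trials,) true language indices.
    A simplified stand-in for FoCal-style linear score fusion.
    """
    X = np.hstack(score_list)
    return LogisticRegression(max_iter=1000).fit(X, labels)

def fuse(fuser, score_list):
    """Return fused per-language posterior scores for new trials."""
    return fuser.predict_proba(np.hstack(score_list))
```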
For the evaluation corpora, as I noted, we have two different corpora we're working with. The NIST LRE-09 set, which we just heard about, is a large-scale setup of twenty-three different languages; we're using a subset as the in-set, and for the duration mismatch we looked at the three test-duration sets you would typically see. For the DARPA program, I know some of you may not be familiar with the setup, but there are five languages under the DARPA language ID task: Arabic, Farsi, Urdu, Pashto, and Dari, and there are ten out-of-set languages included as well. It's extremely noisy; I'll play just an audio clip here so you get some sense of how bad the data is.
You can clearly hear that that's not your typical telephone call that you might be picking up, and in that context the language ID task is quite challenging. So one of the things we wanted to see in our setup, at least for the DARPA RATS corpus, was whether the channels were somehow dependent on each other: whether everything was fairly uniform, or whether there was some variability across the channels. We considered seven of the channels in the system (channel D was left out), so we have seven channel IDs here, and we look at six language classes: the five target languages plus the ten out-of-set languages treated as a single class.
We scored the files across all of these channel-language classes, and the idea is that you look at the channel confusion setup. If there were no dependency, we would expect to see clean diagonal lines here; the fact that we see these off-diagonal effects tells us that there are clearly some channel dependencies in here. What that is telling us is that there are a lot of transmission channel factors influencing all the data, and influencing what we would expect from the classifier setup. That was the reason I pointed to the previous study, where we were looking at the Arabic task: to see whether we could try to do some type of normalisation of the channel characteristics.
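A hypothetical sketch of the diagonal check being described: build a channel-to-channel confusion matrix and see how much of the mass sits on the diagonal. The matrix construction and threshold interpretation here are my assumptions for illustration.

```python
import numpy as np

def channel_diagonal_mass(conf):
    """conf[i, j]: how often audio from channel i scores highest against
    models built from channel j (within the same language). After row
    normalization, diagonal mass near 1/n_channels suggests the channels
    behave independently; a much heavier diagonal means transmission-channel
    factors are leaking into the classifier."""
    rows = conf / conf.sum(axis=1, keepdims=True)
    return np.trace(rows) / rows.shape[0]

# Hypothetical 7-channel example: a strongly diagonal matrix flags dependency.
conf = np.eye(7) * 80 + np.ones((7, 7)) * 3
print(channel_diagonal_mass(conf))   # well above 1/7 here
```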
So looking at the two corpora, we did our evaluation across the various feature sets. This shows the RATS results here and the LRE-09 results here, for the three broad classes of features: the classical features, the innovative features, and the extension features, with the performance listed for each of the different feature sets. You can see the individual scores with the Gaussianized cosine distance scoring, and if you look at the backend fusion strategy we get a performance improvement, so we can see that fusion clearly ends up helping in all these conditions. It's also very striking how much better the performance on the clean datasets is than the performance on the noisy sets.
Next we wanted to look at rank ordering: which features actually show better improvement. Here we just plot the two classifiers and the backend fusion setup, which gives you a relative comparison across the RATS and LRE-09 datasets. Basically, backend fusion beats the various feature concatenation strategies in almost all combinations: we get a thirty-three percent relative improvement in LID performance on the RATS data, and a thirty-four percent relative improvement on the LRE set.
So next we wanted to look a little bit more at the test duration aspects. The baseline system shows how performance varies with test duration on the LRE data, and you can see that as the test duration increases we obviously get better performance. If you look at the hybrid fusion, it also has a nice improvement here; the relative improvement is quite substantial, so hybrid fusion clearly does improve LID performance. That relative improvement is actually much stronger for the longer-duration sets, but you can see that we're almost cutting the error rates in half, forty percent at least, even on the shorter three-second duration sets, which is quite nice.
Finally, we wanted to look at the various features and ask a couple of basic questions in terms of how each of these features might be contributing to improved system performance. One question might be: how do we calibrate the contribution of each feature in the fusion set, and is that contribution similar across the different tasks, for RATS versus the LRE? The idea is that if you look at the rank ordering on clean data versus noisy data, do you actually get a different set of features that are better for that particular task? So we use a relative significance factor here, where we take the leave-one-out system ranking for each particular feature and normalize it by the individual system's ranking for that feature. That allows us to look at the relative rank of each feature, and this shows the rank-order setups for the different features for RATS and for LRE.
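The paper has the exact definition; what follows is only a minimal sketch of the ratio as described in the talk, with each feature's leave-one-out fusion rank normalized by its individual-system rank (both assumed to use 1 = best).

```python
def relative_significance(loo_rank, solo_rank):
    """Relative significance of a feature: the rank of the fused system when
    this feature is left out (loo_rank), normalized by the rank of the
    feature's individual system (solo_rank). Ranks use 1 = best. A sketch of
    the idea described in the talk, not necessarily the paper's formula."""
    return loo_rank / solo_rank

# Hypothetical example: removing a feature drops the fused system to rank 5
# while that feature's individual system ranks 2nd, giving a factor of 2.5.
print(relative_significance(loo_rank=5, solo_rank=2))
```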
What we see here, if you look closely, is that it says "pasta-LP"; I guess my students got hungry, that should be RASTA-PLP. In any case, on both RATS and LRE the RASTA-PLP feature actually gave us the strongest contribution to improved LID performance, and you can see the various other features ranked lower. What's interesting to note is that if you look at the relative significance factor for the clean data, RASTA-PLP far surpasses all the other features, but on the noisy task that relative impact reduces quite significantly: it still ranks first, but the impact of that single feature when the data becomes extremely noisy is a whole lot less. What that's telling us is that on noisy tasks you actually need to leverage performance across multiple features in order to hope to get similar levels of performance on the LID task in noisy conditions.
So, in conclusion: by fusing various types of acoustic features and backend classifiers, we can contribute to stronger LID performance across various corpora. The proposed Gaussianized cosine distance scoring backend was shown to outperform the generative Gaussian backend. For the DARPA RATS scenario we saw a thirty-eight percent improvement for that particular task, and for NIST LRE we have some additional experiments in the paper that show a forty-six percent relative improvement. For the rank-ordered features, the RASTA-PLP feature turned out to be the most significant feature set for the two corpora we considered, but we found that you need to fuse multiple features, particularly in the noisy conditions, in order to hope to get similar levels of performance gain. Thank you, and are there any questions?
Q: You presented both the RATS and the LRE results; what explains the difference between them, given that RATS is so noisy?

A: There's always a challenge in explaining why something works the way it does. Looking at the LRE data, I think you have different levels of noise than on RATS, and I think for us the rejection task matters: on the RATS data you've got the ten out-of-set languages, and those in some sense might be a little bit easier. We have done a test on the LRE sets where we generated a five-language in-set task that was as close as possible to the five languages we start from on RATS, and the performance there was actually fairly different from what we were seeing on the LRE-09 set. I wish I could give you more insight as to why the performance differs, but I can say that using more features actually helps.
Q: Did you look at an unseen channel in RATS? As I understood it, you trained on data that went through all the channels and then tested on those. Did you pull one channel out and test on it unseen?

A: For the unseen case you would hold one channel out, and that's what you need to do to be careful that you have never actually seen the channel. We have done tests in that context, but not against all of these features. I can say we did a fair amount of testing when we were looking at a couple of other feature and front-end enhancement techniques at the last ICASSP for the LID task, and there we did hold out one of the channels, just to see if we could handle an unseen channel and whether that might help.
Q: You used shifted delta cepstra for the MFCC-type features, but did you also use them for PLP, so that you have more long-term information there as well?

A: Yes, we used the shifted delta cepstra setup on all of the features, including PLP; 7-1-3-7 is the configuration.
Q: Excellent talk. Earlier you mentioned the study where the setup was recognising the channel rather than the language. Could you comment on your findings there, and on which features were simple and effective enough?

A: I can answer that question, but let me make one comment first. When Joe was giving his keynote talk, there was one comment I didn't get a chance to make, so I'll make it now. When you're doing language ID, or speaker ID for that matter, but particularly language ID, you're much more likely to use found data, and you may not know the channel conditions. So there's one check that's actually a really good thing to do; it may not be something you want to report, but it's something I think everyone should do.
Typically when you're looking at LID you would run a speech activity detector, so you're going to have your silence, or low-energy and noise, and your speech. What's a really good test is to run your language ID task on the speech, and then run it on the silence, all the data that you pulled out. If you run it on the silence and you find you're getting basically chance across all your setups, then you know the channels are not really dependent on each other. But if you're getting really good performance, actually better performance than when you're using the speech, then you know your classifier is not really targeting the speech; it's actually targeting the channel characteristics.
And that's what we found. In a previous paper we actually tried a number of ways to do long-term channel normalisation, techniques where we made the long-term channel spectrum exactly the same across those different corpora, and even then we still could not drive the performance on the silence down to chance.
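Here is a minimal sketch of that speech-versus-silence sanity check, using random placeholder arrays standing in for real i-vectors; all names and data below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: in practice these would be i-vectors extracted from the
# speech regions and from the SAD-removed (silence) regions of each file.
rng = np.random.default_rng(0)
n_files, dim, n_langs = 200, 100, 5
labels = rng.integers(0, n_langs, n_files)
speech_iv = rng.normal(size=(n_files, dim))
silence_iv = rng.normal(size=(n_files, dim))

# Train the language classifier on speech, then score the silence.
clf = LogisticRegression(max_iter=1000).fit(speech_iv, labels)
silence_acc = clf.score(silence_iv, labels)

# Accuracy near chance (1/n_langs) on silence is what you want; much higher
# accuracy means the system is keying on channel characteristics, not language.
print(f"silence accuracy: {silence_acc:.2f} (chance = {1 / n_langs:.2f})")
```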
On a personal note, from looking at NIST, I really would like to see a performance benchmark, especially for LID, not necessarily for speaker ID: a benchmark that looked at your performance on all of the speech and balanced that against your performance on the silence. The idea is that if you get great performance on the speech and just a little improvement on the silence, then your gain really is coming from the speech; but if the performance on the silence is the really big part, then in effect that means you're cheating. So that kind of says more about the speaker...