So thank you very much for the introduction. I'd like to say that this work was focused primarily on my two students, Qian Zhang and Gang Liu, who worked on this as part of the LRE efforts we've been looking at. They were both supposed to be here, but unfortunately the process for getting a visa to Finland is a little bit more elaborate from the States and they weren't able to get here, so this represents their work and the credit goes to them.
I was told someone was going to pass the baton over to me and say something about, I don't know, passing on the highway or something like that, and I was afraid I was going to get into a bad spot here, so I'll start off the talk by thanking the organisers for last night. I pulled up a bunch of pictures: you can see we have one colleague sitting out here kind of waving to everyone, and Tomi sits here with plenty of energy. And even though this city is named after Joe, I wasn't expecting people to go diving into the lake and paddle around; I took the gentle approach of just sticking a toe in the water.
Right, so now that we've adjusted for the event pairing for the morning, here's the outline of the talk. First I'll talk about robust language recognition and some ideas we're looking at in this area; the focus of this talk will be a little more on feature characterisation, since we have a number of different features we're exploring. From that, we'll talk about a proposed fusion system we're looking at. Then our evaluations are on two different corpora: the DARPA RATS corpus, which is a very noisy corpus, and the NIST LRE, which we just heard about; it's the '09 test set that we're working with.
And then some performance analysis and conclusions. So to begin with the focus: when you look at language ID, you could simply say the purpose is to distinguish one language from a set of languages, or from multiple languages, but the type of task you're looking at might be different depending on the context.
You probably know that in the NIST LRE there are a number of different scenarios. You're looking, for example, at Urdu and Hindi, or let's say Russian and Ukrainian; these are languages that are close to each other, and while they are unique, separate languages, they may be only a little bit different, like dialects of a particular language. On the other hand, you could have very distinct languages that are really far apart. The classifiers and features that you might use for languages that are spaced really far apart may not necessarily be the best when the scenario you're looking at is closely spaced languages, or dialects of the same language.
Now, the challenge that I think is becoming more and more relevant in the language ID space is not just the space between the languages, but the space between the different characteristics that you might see in the audio streams you're going to be using. It's much more likely these days that you use found data to help build acoustic models, particularly for the out-of-set languages, and not knowing the context in which the audio was captured for those out-of-set languages introduces a lot of challenges.
We had a paper at Interspeech two years back entitled "Dialect ID: is the secret in the silence?", and this was by no means an indictment of LDC's strong efforts to collect a wide variety of language data for both dialect and language ID. We had done some studies on an Arabic corpus, a five-dialect set for Arabic, and compared that against four Arabic corpora available from LDC. We found that, in fact, if you threw away all the speech from those corpora and focused only on the silence sections, you actually did better at language ID, or dialect ID. What that actually tells us is
that if you're not sure how the data was collected, you're probably doing channel, handset, or microphone ID, and not necessarily dialect ID. So the work we're looking at here is actually to see if we can improve performance and robustness. I'll note that in some previous work we did a lot of joint efforts with the IBM, SRI, and BBN teams working on the DARPA RATS language ID task, which is very noisy. More recently our work has focused a little bit more on improving open-set, out-of-set language rejection, primarily because we were interested in seeing how we can come up with more efficient ways to develop background models for when we don't have all of the rejection-language information we're trying to train against. In this study we're going to focus a little more on alternate features, as well as various backend classifiers and fusion.
So three different sets of features are being considered here. First, the classical features that you might expect to see in a typical speech application; we have four different feature sets there. Then there are what I'll call innovative features: the power-normalized cepstral coefficients, PNCC, from the CMU group, and the perceptual minimum variance distortionless response cepstra, PMVDR, a set of features we had presented maybe ten years back at one of the Interspeech meetings and used for speech recognition. And then a number of extension features; we refer to these that way primarily because there's additional processing associated with them, as opposed to simply extracting a base feature set. These include various versions of MFCC features depending on the window used, as well as LFCC and RASTA-PLP type features. Those are the three classes of features we've been working with.
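Since we're talking about front ends, here is a minimal sketch of what extracting one of the classical feature streams might look like in Python. The file name, the 8 kHz telephone-band rate, and the window settings are illustrative assumptions, not the configuration from the paper.

```python
import librosa

# File name, sample rate, and 25 ms / 10 ms windowing are illustrative only.
y, sr = librosa.load("utterance.wav", sr=8000)

# Keep 7 static coefficients per frame, matching the N=7 used for the
# shifted-delta-cepstra stacking described below.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=7, n_fft=200, hop_length=80).T
# mfcc now has shape (n_frames, 7): one cepstral vector per 10 ms hop.
```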
To give you a flow diagram of how the data is being extracted: in the paper we summarise all the different aspects, but these are the various sets of features coming out of our system, and in the next part we'll look at how we actually extract them. In the front end we have a speech activity detector that uses a common setup we developed for the RATS program, and standard shifted delta cepstra features with a 7-1-3-7 configuration.
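For those unfamiliar with shifted delta cepstra, here is a minimal sketch of the 7-1-3-7 stacking (N=7 cepstra, delta spread d=1, shift P=3, k=7 blocks). Clamping at the edges is my assumption, not necessarily the paper's handling.

```python
import numpy as np

def shifted_delta_cepstra(C, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra with N-d-P-k (here 7-1-3-7) stacking.

    C: (T, n_ceps) array of frame-level cepstra; the first N coefficients
    are used. For each frame t, k delta blocks are stacked, the i-th block
    being c[t + i*P + d] - c[t + i*P - d]. Edge frames are clamped.
    """
    T = C.shape[0]
    C = C[:, :N]
    sdc = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            hi = min(t + i * P + d, T - 1)
            lo = max(t + i * P - d, 0)
            sdc[t, i * N:(i + 1) * N] = C[hi] - C[lo]
    return sdc  # shape (T, N*k), i.e. 49-dimensional frames for 7-1-3-7
```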
We have a UBM and a state-of-the-art i-vector based system, and we use an LDA-based setup for dimensionality reduction. On the backend processing we do duration and length normalisation, and we have two different setups for the two classifiers: one a generative Gaussian backend, and the other a Gaussianized cosine distance scoring strategy.
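As a sketch of the cosine distance scoring side: length-normalize the i-vectors, score each test i-vector against per-language mean i-vectors, then normalize the score distributions. The per-language z-normalization here is my assumption for the "Gaussianized" step; the paper's exact transform may differ.

```python
import numpy as np

def length_norm(x):
    """Project vectors onto the unit sphere (standard i-vector length norm)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cds_scores(test_ivectors, language_means):
    """Cosine distance scoring: similarity of each test i-vector to each
    language-mean i-vector. Shapes: (n_test, dim) and (n_lang, dim)."""
    return length_norm(test_ivectors) @ length_norm(language_means).T

def gaussianize(scores):
    """Z-normalize each language's scores across trials so per-language
    score distributions are comparable before fusion (an assumption)."""
    return (scores - scores.mean(axis=0)) / scores.std(axis=0)
```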
The system flow diagram looks like this. We have our input audio data here; the two audio datasets you see basically represent the raw data for the UBM construction and for the total variability matrix needed for the i-vector setup, and these two datasets are actually the same as what we use for our training set. The generative Gaussian backend is on this side, and the cosine distance scoring setup is here. Then we do score fusion: score processing first, and then we fuse the setups.
So for system fusion, our setup looks like this. We can do feature concatenation; that's one of the approaches we look at, where you just concatenate the feature sets. For backend fusion we use FoCal to fuse the backend systems, and those come together in the final decision surface.
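FoCal itself learns a calibrated linear fusion of the subsystem score vectors; as a simplified stand-in (not the FoCal toolkit's exact objective), one can fit a multiclass logistic regression on the horizontally stacked per-system scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_score_fusion(score_list, labels):
    """Fit a linear logistic-regression fuser over per-system score matrices.

    score_list: list of (n_trials, n_langs) arrays, one per subsystem.
    labels: (n_trials,) true language indices.
    A simplified stand-in for FoCal-style linear score fusion.
    """
    X = np.hstack(score_list)
    return LogisticRegression(max_iter=1000).fit(X, labels)

def fuse(fuser, score_list):
    """Return fused per-language posterior scores for new trials."""
    return fuser.predict_proba(np.hstack(score_list))
```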
For the evaluation corpora, as I noted, we have two different corpora we're working with. The NIST LRE-09 set, which we just heard about, is a large-scale setup of twenty-three different languages; we're using a subset as the in-set, and for the duration mismatch we looked at the three test-duration sets you would typically see. For the DARPA program, I know some of you may not be familiar with the setup, but there are five languages under the DARPA language ID task: Arabic, Farsi, Urdu, Pashto, and Dari, and there are ten out-of-set languages included as well. It's extremely noisy; I'll play just an audio clip here so you get some sense of how bad the data is.
You can clearly hear that that's not your typical telephone call that you might be picking up, and in that context the language ID task is quite challenging. So one of the things we wanted to see in our setup, at least for the DARPA RATS corpus, was whether the channels were somehow dependent on each other: whether everything was fairly uniform, or whether there was some variability across the channels. We considered seven of the channels in the system (channel D was left out), so we have seven channel IDs here, and we look at six language classes: the five target languages plus the ten out-of-set languages treated as a single class.
We scored the files across all of these channel-language classes, and the idea is that you look at the channel confusion setup. If there were no dependency, we would expect to see clean diagonal lines here; the fact that we see these off-diagonal effects tells us that there are clearly some channel dependencies in here. What that is telling us is that there are a lot of transmission channel factors influencing all the data, and influencing what we would expect from the classifier setup. That was the reason I pointed to the previous study, where we were looking at the Arabic task: to see whether we could try to do some type of normalisation of the channel characteristics.
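A hypothetical sketch of the diagonal check being described: build a channel-to-channel confusion matrix and see how much of the mass sits on the diagonal. The matrix construction and threshold interpretation here are my assumptions for illustration.

```python
import numpy as np

def channel_diagonal_mass(conf):
    """conf[i, j]: how often audio from channel i scores highest against
    models built from channel j (within the same language). After row
    normalization, diagonal mass near 1/n_channels suggests the channels
    behave independently; a much heavier diagonal means transmission-channel
    factors are leaking into the classifier."""
    rows = conf / conf.sum(axis=1, keepdims=True)
    return np.trace(rows) / rows.shape[0]

# Hypothetical 7-channel example: a strongly diagonal matrix flags dependency.
conf = np.eye(7) * 80 + np.ones((7, 7)) * 3
print(channel_diagonal_mass(conf))   # well above 1/7 here
```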
So looking at the two corpora, we did our evaluation across the various feature sets. This shows the RATS results here and the LRE-09 results here, for the three broad classes of features: the classical features, the innovative features, and the extension features, with the performance listed for each of the different feature sets. You can see the individual scores with the Gaussianized cosine distance scoring, and if you look at the backend fusion strategy we get a performance improvement, so we can see that fusion clearly ends up helping in all these conditions. It's also very striking how much better the performance on the clean datasets is than the performance on the noisy sets.
Next we wanted to look at rank ordering: which features actually show better improvement. Here we just plot the two classifiers and the backend fusion setup, which gives you a relative comparison across the RATS and LRE-09 datasets. Basically, backend fusion beats the various feature concatenation strategies in almost all combinations: we get a thirty-three percent relative improvement in LID performance on the RATS data, and a thirty-four percent relative improvement on the LRE set.
So next we wanted to look a little bit more at the test duration aspects. The baseline system shows how performance varies with test duration on the LRE data, and you can see that as the test duration increases we obviously get better performance. If you look at the hybrid fusion, it also has a nice improvement here; the relative improvement is quite substantial, so hybrid fusion clearly does improve LID performance. That relative improvement is actually much stronger for the longer-duration sets, but you can see that we're almost cutting the error rates in half, forty percent at least, even on the shorter three-second duration sets, which is quite nice.
Finally, we wanted to look at the various features and ask a couple of basic questions in terms of how each of these features might be contributing to improved system performance. One question might be: how do we calibrate the contribution of each feature in the fusion set, and is that contribution similar across the different tasks, for RATS versus the LRE? The idea is that if you look at the rank ordering on clean data versus noisy data, do you actually get a different set of features that are better for that particular task? So we use a relative significance factor here, where we take the leave-one-out system ranking for each particular feature and normalize it by the individual system's ranking for that feature. That allows us to look at the relative rank of each feature, and this shows the rank-order setups for the different features for RATS and for LRE.
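The paper has the exact definition; what follows is only a minimal sketch of the ratio as described in the talk, with each feature's leave-one-out fusion rank normalized by its individual-system rank (both assumed to use 1 = best).

```python
def relative_significance(loo_rank, solo_rank):
    """Relative significance of a feature: the rank of the fused system when
    this feature is left out (loo_rank), normalized by the rank of the
    feature's individual system (solo_rank). Ranks use 1 = best. A sketch of
    the idea described in the talk, not necessarily the paper's formula."""
    return loo_rank / solo_rank

# Hypothetical example: removing a feature drops the fused system to rank 5
# while that feature's individual system ranks 2nd, giving a factor of 2.5.
print(relative_significance(loo_rank=5, solo_rank=2))
```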
What we see here, if you look closely, is that it says "pasta-LP"; I guess my students got hungry, that should be RASTA-PLP. In any case, on both RATS and LRE the RASTA-PLP feature actually gave us the strongest contribution to improved LID performance, and you can see the various other features ranked lower. What's interesting to note is that if you look at the relative significance factor for the clean data, RASTA-PLP far surpasses all the other features, but on the noisy task that relative impact reduces quite significantly: it still ranks first, but the impact of that single feature when the data becomes extremely noisy is a whole lot less. What that's telling us is that on noisy tasks you actually need to leverage performance across multiple features in order to hope to get similar levels of performance on the LID task in noisy conditions.
So, in conclusion: by fusing various types of acoustic features and backend classifiers, we can contribute to stronger LID performance across various corpora. The proposed Gaussianized cosine distance scoring backend was shown to outperform the generative Gaussian backend. For the DARPA RATS scenario we saw a thirty-eight percent improvement for that particular task, and for NIST LRE we have some additional experiments in the paper that show a forty-six percent relative improvement. For the rank-ordered features, the RASTA-PLP feature turned out to be the most significant feature set for the two corpora we considered, but we found that you need to fuse multiple features, particularly in the noisy conditions, in order to hope to get similar levels of performance gain. Thank you, and are there any questions?
Q: You presented both the RATS and the LRE results; what explains the difference between them, given that RATS is so noisy?

A: There's always a challenge in explaining why something works the way it does. Looking at the LRE data, I think you have different levels of noise than on RATS, and I think for us the rejection task matters: on the RATS data you've got the ten out-of-set languages, and those in some sense might be a little bit easier. We have done a test on the LRE sets where we generated a five-language in-set task that was as close as possible to the five languages we start from on RATS, and the performance there was actually fairly different from what we were seeing on the LRE-09 set. I wish I could give you more insight as to why the performance differs, but I can say that using more features actually helps.
Q: Did you look at an unseen channel in RATS? As I understood it, you trained on data that went through all the channels and then tested on those. Did you pull one channel out and test on it unseen?

A: For the unseen case you would hold one channel out, and that's what you need to do to be careful that you have never actually seen the channel. We have done tests in that context, but not against all of these features. I can say we did a fair amount of testing when we were looking at a couple of other feature and front-end enhancement techniques at the last ICASSP for the LID task, and there we did hold out one of the channels, just to see if we could handle an unseen channel and whether that might help.
Q: You used shifted delta cepstra for the MFCC-type features, but did you also use them for PLP, so that you have more long-term information there as well?

A: Yes, we used the shifted delta cepstra setup on all of the features, including PLP; 7-1-3-7 is the configuration.
Q: Excellent talk. Earlier you mentioned the study where the setup was recognising the channel rather than the language. Could you comment on your findings there, and on which features were simple and effective enough?

A: I can answer that question, but let me make one comment first. When Joe was giving his keynote talk, there was one comment I didn't get a chance to make, so I'll make it now. When you're doing language ID, or speaker ID for that matter, but particularly language ID, you're much more likely to use found data, and you may not know the channel conditions. So there's one check that's actually a really good thing to do; it may not be something you want to report, but it's something I think everyone should do.
Typically when you're looking at LID you would run a speech activity detector, so you're going to have your silence, or low-energy and noise, and your speech. What's a really good test is to run your language ID task on the speech, and then run it on the silence, all the data that you pulled out. If you run it on the silence and you find you're getting basically chance across all your setups, then you know the channels are not really dependent on each other. But if you're getting really good performance, actually better performance than when you're using the speech, then you know your classifier is not really targeting the speech; it's actually targeting the channel characteristics.
And that's what we found. In a previous paper we actually tried a number of ways to do long-term channel normalisation, techniques where we made the long-term channel spectrum exactly the same across those different corpora, and even then we still could not drive the performance on the silence down to chance.
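Here is a minimal sketch of that speech-versus-silence sanity check, using random placeholder arrays standing in for real i-vectors; all names and data below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: in practice these would be i-vectors extracted from the
# speech regions and from the SAD-removed (silence) regions of each file.
rng = np.random.default_rng(0)
n_files, dim, n_langs = 200, 100, 5
labels = rng.integers(0, n_langs, n_files)
speech_iv = rng.normal(size=(n_files, dim))
silence_iv = rng.normal(size=(n_files, dim))

# Train the language classifier on speech, then score the silence.
clf = LogisticRegression(max_iter=1000).fit(speech_iv, labels)
silence_acc = clf.score(silence_iv, labels)

# Accuracy near chance (1/n_langs) on silence is what you want; much higher
# accuracy means the system is keying on channel characteristics, not language.
print(f"silence accuracy: {silence_acc:.2f} (chance = {1 / n_langs:.2f})")
```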
On a personal note, from looking at NIST, I really would like to see a performance benchmark, especially for LID, not necessarily for speaker ID: a benchmark that looked at your performance on all of the speech and balanced that against your performance on the silence. The idea is that if you get great performance on the speech and just a little improvement on the silence, then your gain really is coming from the speech; but if the performance on the silence is the really big part, then in effect that means you're cheating. So that kind of says more about the speaker...