Thank you very much, Isabel; the feelings are shared. I'd like to thank the organisers here for inviting me to be a keynote speaker. It's really an honor. It's also a big challenge, so I hope I will get some messages across, well, to at least some of you.

As Isabel said, I've been working on speech and speech processing for many years now, and today I'll focus mostly on the task of speech recognition. But first, a little bit of context.

So.

At LIMSI - being in Europe, of course - we try to work on speech recognition in a multilingual context, processing at least a fair number of the European languages. This isn't really new. We are seeing a sort of regrowth in wanting to do speech recognition in different languages, but if you go back a long time, there was already some research there; we just didn't hear about it as much.

And that's probably because there weren't too many common corpora and benchmark tests to compare results on, and papers tended to be accepted more easily if you used common data, which is still the case. And it's logical: you want to compare your results to other people's results. But now there is more and more data out there, there are more test sets out there, and so we can do more comparisons and we're seeing more languages covered. So I think that's really nice.

So I'll speak about some of our research results in the Quaero and Babel programs. Sure. Is that better? Sorry, okay - the microphone was popping a bit this morning, so I wanted to not be too close. So, I'll speak about highlights and research results from Quaero and Babel.

And then I want to touch upon some activities I did with colleagues at LIMSI and at the Laboratory of Phonetics in Paris, trying to use speech technologies for other applications: to carry out linguistic and corpus-based studies. I'll briefly mention a couple of perceptual experiments that we've done, and then finally some concluding remarks.

So I guess we probably all agree in this community that we've seen a lot of progress over the last decade or two. We're actually seeing some technologies out there that use speech, or that are working on it. I think that's kind of fun; it's really nice. But we see it for only a few languages, and as we heard yesterday from Haizhou, about 1% of the world's languages actually appear in our proceedings, so we have something about them. That's pretty low, but it's up from the one or two that we did maybe twenty years ago. We're soaking up the sun.

One of the problems, as I mentioned before, is that our technology typically relies on having a lot of language resources, and these are difficult and expensive to get. Therefore it's harder to get them for less-resourced languages, and current practice still takes several months to years to bring up systems for a language, if you count the time for collecting data, processing it, transcribing it and things like that.

So this is sort of a step back in time. If we go back to, say, the late eighties and early nineties, we had systems that were basically command and control, dictation, and some dialogue systems, usually limited to a specific task like ATIS (Air Travel Information System), travel reservation, things like that. Some of you here probably know them well, and some of you are maybe too young and don't even know them, because the publications are not necessarily scanned and online, so you don't see them.

But in fact we're now seeing a regrowth of the same activities, when you look at voice mail, the dictation capability in your phones, and the personal assistants that are finally coming out now. It's really exciting to see this pick up again after what we saw in the past.

And then of course we have some new applications - or applications that are new for some - that are growing as we've got better processing capability, both in terms of computers and in terms of the data out there. So we have speech analytics of call center data or meeting-type data, lectures, a bunch of things. We have speech-to-speech translation, and also indexation of audio and video documents, which is used for media mining tasks, and companies are very interested in that. And of course, with speech analytics, people are really interested in finding out what people want to buy and trying to sell them things.

So let me back up a little bit and talk about something: why is speech processing difficult? All of us speak easily, and as was mentioned, it seems natural - we learn language - but I think any of us who learned a foreign language, at least as an adult, understands that it's a little bit harder than it seems. I learned French when I was ... after my PhD; I won't say my age. And it wasn't so easy to learn, it wasn't so natural, whereas my daughter, who grew up in a bilingual environment, speaks French and English fluently, and she's better at other languages than I am. I was a good monolingual American speaker, with no contact with other languages.

We need context to understand what's being said, and you speak differently if you're speaking in public than if you're speaking to someone you know. We all know this.

?? As you can see on the screen, speech is continuous, and so if I'm talking to you I might say "it is not easy" or "it's not easy", but if I'm talking to my mother I reduce it further: "it's not easy". Then it's not so clear where the words are, and I think we all know, once again, that humans reduce the pronunciation in regions of low information, where it's not very important: you put the minimum effort into saying what you want to say. And of course there are other variability factors: the speaker's characteristics and accent, the context we're in. Humans do very well at adapting to this. Machines, in general, don't.

So here, since I am taking a step back in time, I wanted to play a couple of very simple samples. Is there anyone in this room who doesn't know this type of sentence? You've all heard it. Good! Okay. That's TIMIT; that's going back a really, really long time. I was involved in some of the sentence selection for TIMIT, but not in these sentences, which were selected to elicit pronunciation variants for different styles of speaking. And you can hear in the sample here that even in very, very simple read text we have different realisations of the word "greasy": in the first case we have an S, in the second case we have a Z.

And we can see that ... I can't really point to the screen there; I don't have a pointer. Do you have a pointer? I think I refused it before. In any case, you can see in blue there that the S is quite a bit longer, and you can see the voicing in the Z. And is everyone here familiar with spectrograms? I sort of assumed the worst. It's okay, good.

So here's another example, of more conversational-type speech, and we'll see that people interrupt each other; you can hear some of the hesitations. In this example, from a telephone corpus where the participants called each other, they were supposed to talk about a topic that they were given - they have a mutual topic. You're supposed to talk about it, but not everybody does. But even here, although they don't know each other very well, they still interrupt each other, they do some turn taking, and you can hear the "hmm"s and the laughter. There was another example of this in another presentation; I don't remember where it was.

Now I'm going to play an example from Mandarin, and I'm trusting that Billy Hartman, who is probably in ?? with his wife, gave me the correct translation and correct text here, because I don't understand it. This one is even more spontaneous: it's an example taken from the CallHome Mandarin corpus, where we think it's a mother and a daughter talking to each other about the daughter's job interview. So, for those who speak Mandarin.

If I understood correctly and the translation is correct, they're basically talking about the interview, and the mother doesn't understand what the job interview is about, so she says: "Don't speak to me in another language, speak to me in Chinese." And the daughter says: "You wouldn't have understood anyway, even if I had spoken in Chinese." I've had some similar situations speaking with my own mother. That's what I do.

So now I'm going to switch gears a little bit and talk about the Quaero program, which is one of the two topics I mainly want to focus on here when talking about speech recognition in different languages. This was a large research and innovation project in France, funded by OSEO, a French innovation agency. It was initiated in 2004 but didn't start until 2008, and then ran for almost six years until the end of 2013, so it finished relatively recently. It was really fun, but when we started putting it together the web was a lot smaller than it is now. As we also heard, I think it was this morning, there was no YouTube, no Facebook, no Twitter, no Google Books, no iPhones - all of that didn't exist. So life was boring; what did we do with our free time, right? Instead of spending your time on the ??.

I think it's hard to put ourselves in the position of young people who don't know life without all of this - my daughter grew up with all of it - and so it's very hard to relate to what that situation really is. But in any case, to get back to processing this data: we have tons and tons of data. I read that there are roughly 100 hours of video uploaded to YouTube every minute. That's a huge amount of data, and in 61 languages. So if we are treating about 7 of them, we are not doing so badly; maybe we cover the languages most of the videos are in. But we don't know how to organise this data, and we don't know how to access it.

And so Quaero was aiming at this: how can we organize the data, how can we access it, how can we index it, how can we build applications that make use of today's technology and do something interesting with it? I'm not going to talk about all of that; if you're interested, I suggest you go to the Quaero website, where you can find demos, links and things like that. I'm going to focus on the work that we did in speech processing. At LIMSI we worked mostly on speech and text processing, including text processing applied to speech data: named entity detection in searchable text and speech, and translation of both text and speech.

So here is a diagram showing the speech processing technologies that we worked on in the project. The first box we have is audio and speaker segmentation: chopping up the signal, deciding which regions are speech and non-speech, and dividing it into segments corresponding to different speakers by detecting speaker changes. Then we may or may not know the identity of the language being spoken, so we have a language identification box if we don't know it.

Most of the time we want to transcribe the speech data, because speech is ubiquitous - there's speech all over the place - and it has a high information content, so we believe it is the most useful. We work on speech and not on images; image people might tell us that the image is more useful for indexing this type of data. One advantage speech has relative to images is that speech has an underlying written representation that we pretty much all agree upon, more or less: we're able to decide where the words are. We might differ a little bit, but we pretty much agree on it. That's not the case for images: if you give an image to two different people, one will tell you it's a blue house, the other will tell you it's trees in a park with a little blue cabin in it, something like that. You get very different descriptions depending on what people are interested in and their manner of expressing things. For speech, in general, we're a little bit more normalized; we pretty much agree on what would be there.

Then you might want to do other types of processing, such as speaker diarization. This morning Dr ?? spoke about the Twitter statistics during the presidential elections, and that was something we actually worked on in Quaero: looking at a corpus of recordings - you might have hundreds or thousands of hours of recordings - and looking at speaker times within it, at how many speakers are speaking when and how much time is allocated to each. That's actually something with a potential use, at least in France, where they check that during the election period all the parties get the same amount of speaking time. You want very accurate measures of who is speaking when, so that everybody gets fair treatment during the elections.

Other things that we worked on were adding metadata to the transcriptions: you might add punctuation or markers to make them more readable, or you might want to transform numbers from words into digit sequences, as in newspaper text - 1997, for example. And you might want to identify entities or speakers or topics, which can be useful for automatic processing, so you can put in tags where the same entities appear. And then finally, the other box there is speech translation, typically based on the speech transcription; but we're also trying to work on having a tighter link between the speech and translation portions, so that you don't just transcribe and then translate - we're trying to have a tighter relation between the two.

Let me talk a little bit now about speech recognition. I think everybody knows the standard box diagram; the main point is just that we have three important models: the language model, the pronunciation model and the acoustic model. These are all typically estimated on very large corpora, and this is where we get into problems with the low-resource languages.
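Just as a reminder of how these three models fit together, the standard decision rule can be written as follows, where O is the acoustic observation sequence, W a word sequence, and the pronunciation dictionary provides the phone sequences used to score the acoustic term:

\hat{W} = \arg\max_{W} \; P(W)\,P(O \mid W)

Here P(W) comes from the language model and P(O | W) is computed with the acoustic model over the pronunciations listed in the dictionary.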

And I want to give a couple of illustrations of why - at least I believe this, and I've spent effort on it - the pronunciation model is really important: we need the right pronunciations in the dictionary. So let's take these two examples: on the left we have two versions of "coupon", and on the right we have two versions of "interest".

In the case of "coupon", one version has a "y" glide inserted, and our models for speech recognition typically model phones in context. So if the dictionary transcription is just "k u p ...", that glide is there in the signal anyway, and this case is not going to be a very good match to the second one. That's a really big difference, and the "u" also becomes almost a fronted "eu", which is technically not a distinction made in English. And the same thing for "interest": we have "in-ter-est" or "in-trest". In one case the N is followed by a vowel, and in the other case you have the TR cluster. These are very different, and you can imagine that, since our acoustic models are based on aligning these transcriptions with the audio signal, if we have more accurate pronunciations we're going to have better acoustic models at the end - and that's our goal.
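To make this concrete, the variants look roughly like this in an ARPAbet-style dictionary (illustrative entries in the style of CMUdict, with stress omitted and schwa written "ax"; the exact symbol sets of the actual dictionaries differ):

coupon    k uw p aa n
coupon    k y uw p aa n
interest  ih n t r ax s t
interest  ih n t ax r ax s t

Since context-dependent acoustic models see the neighbouring phones, the "uw" in "k-uw+p" and in "y-uw+p" are trained as different units, so aligning the audio against the wrong entry degrades the models on both sides.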

So now I want to speak a little bit about so-called lightly supervised learning - there are many terms being used for it now: unsupervised training, semi-supervised training, lightly supervised training. Basically, one goal - and Ann mentioned something about this yesterday - is that maybe machines can just learn on their own. So here we have a machine: it's reading the newspaper, it's watching the TV, and it's learning. Okay, that's great; that's something we would like to happen. But we still believe that we need to put some guidance in there, and so this is a researcher here trying to give some information and supervision to the machine that's learning.

For traditional acoustic modeling we typically use between several hundred and several thousand hours of carefully annotated data, and as I said before, this is expensive. So people are looking into ways to reduce the amount of supervision in the training process - and I believe some people in this room are doing this - to automate the process of collecting the data, and to automate the iterative learning of the systems by themselves, even including the evaluation: having some data to evaluate on that is not necessarily carefully annotated. Most of the time it is, but there's been some work trying to use unannotated data to improve the system, which I think is really exciting.

So let's talk about reduced supervision and unsupervised training - again, a lot of different names are used. The basic idea is to use some existing speech recognizer to transcribe some data, assume that this transcription is true, then build new models estimated on this transcription, and reiterate. There's been a lot of work on this for about fifteen years now, and many different variants have been explored: whether to filter the data, whether to use confidence measures, whether you train only on the segments that look good or also take things in the middle range - there are many things you can read about.
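Just to make the loop explicit, here is a minimal sketch in Python; the transcribe and train callables are hypothetical placeholders for whatever recognizer and training tools are actually used, and the confidence filtering is only one of the variants mentioned above.

from typing import Callable, List, Tuple

def self_training(seed_model,
                  untranscribed_audio: List[str],
                  transcribe: Callable,   # transcribe(model, audio) -> [(segment, hypothesis, confidence)]
                  train: Callable,        # train([(segment, transcript), ...]) -> new model
                  confidence_threshold: float = 0.7,
                  iterations: int = 4):
    """Transcribe untranscribed audio with the current model, keep the
    hypotheses that look reliable, retrain on them, and iterate."""
    model = seed_model
    for _ in range(iterations):
        training_set: List[Tuple[object, str]] = []
        for audio in untranscribed_audio:
            for segment, hypothesis, confidence in transcribe(model, audio):
                # Treat the automatic transcript as if it were the truth,
                # but only where the recognizer is reasonably confident.
                if confidence >= confidence_threshold:
                    training_set.append((segment, hypothesis))
        model = train(training_set)   # re-estimate the models on the selected data
    return model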

Something that's pretty exciting, which we see in the Babel work, is that even if we apply this to systems starting with a very high word error rate, it still seems to converge, and that's really nice. The first things I'll talk about are for broadcast news data, where we have a lot of data we can use as supervision. By this I mean we're using language models that are trained on many millions of words of text, and this gives some information to the systems. It's not completely unsupervised, which is why you hear these different names for what's being done by different researchers; it's all about the same thing, but it's called by different names.

And so here I wanted to illustrate this with the case study for Hungarian that we did in the Quaero program; it was presented at last year's Interspeech by A. Roy, so maybe some of you saw it. We started off with seed models - this point up here, at about eighty percent word error rate - seed models that come from other languages; there were five languages we took them from, so we did what most people would call cross-language transfer. These models came from, if I have it correctly, English, French, Russian, Italian and German, and we simply tried to choose the best match between the phone set of Hungarian and one of these languages.

Then we used this model to transcribe about 40 hours of data, which is this point here. The size of the circle shows roughly how much data is used - this is forty hours, then we double it again here and go to about eighty hours. This axis is the word error rate and this is the iteration number. So as we go, we increase the amount of data and increase the model size; we have more parameters in the models.

The second point here uses the same amount of data but more context, so we built a bigger model: we once again took this model, redecoded all the data - the forty hours - and built another model, and now we went down to about sixty percent, so we're still kind of flying. We doubled the data again, to probably about a hundred and fifty hours or something like that, and got down to about fifty percent. These all use the same language model; that wasn't changed in this study. And then finally, here, we used about three hundred hours of training data and we're down to about thirty or thirty-five percent.

And of course, as everybody knows, these systems used just standard PLP plus F0 features, while pretty much everybody is now using features generated by an MLP. So we took our English MLP and generated features on the Hungarian data - a cross-lingual transfer of the models there - and we see a small gain, with the amount of data fixed. Then we took the transcripts that were generated by this system here and trained an MLP for the Hungarian language, and there we got about a two or three percent absolute gain. We're down to a word error rate of about twenty-five percent, which isn't wonderful - it's still relatively high - but it's good enough for some applications such as media monitoring and things like that.

And so this was done completely without manual transcripts, and we did it for a bunch of languages. Now let me show you some results for - I think it's about nineteen languages - that we did in Quaero; we did more, twenty-three, but this is only for nineteen of them. If we look here, the languages up to Czech were trained in a standard supervised manner, with somewhere between one hundred and five hundred hours of data depending upon the language. The ones with the blue shading were trained in an unsupervised manner. Once again we have the word error rate on the left, and this is the average error rate across three to four hours of data per show - per language, sorry.

And so we can see that while in general the error rates are a little bit lower for the supervised training, these aren't so bad; some of them are really in about the same range. You have to take this result with a grain of salt, because some of these languages here might be a little less well trained, or a little less advanced, than the lower-scoring languages; they might do a little better if we worked more on them.

But this isn't the full story, so now I'm going to complicate the figure. In green you have the word error rate on the lowest file - the audio file that had the lowest word error rate per language. These files are easy; they're probably news-like files. And it gets very low: even for Portuguese we're down to around three percent for one of these segments. Then in yellow we have the worst-scoring file, and these were typically more interactive spontaneous speech: talk shows, debates, noisy recordings made outside the studio - a lot of variability factors come in. So even though this blue curve is kind of nice, we really see that we have a lot of work to do if we want to be able to process all the data up here.

So now I'm going to switch gears, stop talking about Quaero, and talk a little bit about Babel, where it's a lot harder to have supervision from the language model, because in Babel you are working on languages that have very little data, or data that is hard to get - typically they have little data, although not all of them are really in that situation.

This is a sentence I took from Mary Harper's slides that she presented at ??, and the idea being investigated is to apply different techniques from linguistics, machine learning and speech processing to be able to do speech recognition for keyword search. For people who are not familiar with Mary's talks, I highly recommend that you see them. I know that the ASRU one is online on Superlectures, and the ?? one I don't know - people here probably know better than me if ?? is there. But they are really interesting talks, and if you're interested in this topic I suggest you go and watch them.

So, keyword spotting. Yesterday Ann spoke about how children can do keyword spotting from very young, so I want to do a first test with you. By basic keyword spotting, what I mean is that you're going to localise points in the audio signal where you have detected your keyword. So these two you detected right; here you missed it - it's the same keyword, it occurred, but you didn't get it; and here you detected a keyword that didn't actually occur, so that's a false alarm. So here you have misses, false alarms and correct detections.
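As a toy illustration of that scoring (hypothetical occurrence times, with a simple tolerance rule standing in for the official alignment), the counting looks roughly like this:

def score_keyword(reference_times, detected_times, tolerance=0.5):
    """Count hits, misses and false alarms for one keyword; times are
    occurrence midpoints in seconds, and a detection within `tolerance`
    seconds of an unmatched reference occurrence counts as a hit."""
    remaining = list(reference_times)
    hits, false_alarms = 0, 0
    for t in detected_times:
        match = next((r for r in remaining if abs(r - t) <= tolerance), None)
        if match is None:
            false_alarms += 1          # detected, but nothing was there
        else:
            hits += 1
            remaining.remove(match)    # each occurrence can only be hit once
    misses = len(remaining)            # occurrences that were never detected
    return hits, misses, false_alarms

# e.g. reference occurrences at 3.2s, 10.5s and 47.0s, detections at 3.3s and 52.0s
print(score_keyword([3.2, 10.5, 47.0], [3.3, 52.0]))   # -> (1, 2, 1)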

So now let me play you a couple of samples, and this is actually a test of two things at the same time. One is language ID: I'm going to play samples in different languages - two sets of six different languages - and there's a common word in all of these samples. I'd like you to let me know if you can detect this word, to see if we as adults can do what children can do.

Do you want to hear it again? And can we make it a little louder? Is it possible to make the audio a little bit louder, because I can't control it here. I don't think it goes any louder; I have it on the loudest setting. Okay, so I'll show you the languages first. Did anyone get the languages? There's probably a speaker of each language here, so you probably recognised your own language. The languages were: Tagalog, Arabic, French, Dutch, Haitian and Lithuanian.

Shall I play it again?

It's okay?

Alright, so.

So here's the second set of languages; the last one is Tamil. I'm not really sure about the end, where "tax" appeared in different places - Google Translate told us it did, but there might be some native speakers here who can tell us whether that's right or not. To me it sounded like "income taxes" and "sales tax", but I don't really know; Google told us it was "income from taxes and sale of taxes", or something like that. So anyway - did everyone catch the word "taxes", or only some of you? "Taxes" is one of those words that seems to be relatively common and pretty much the same in many languages.

Before talking about keyword spotting - and I'm actually not going to talk about it too much - I wanted to show some results on conversational telephone speech. We'll talk about the term error rate here rather than the word error rate, because for Mandarin we measure the character error rate rather than the word error rate; for English and Arabic we're measuring word error rate, and for Mandarin it's characters. These results are, I believe, on the NIST evaluation data for the transcription task, and the English systems are trained on about two thousand hours of annotated data.

The Arabic and Mandarin systems were probably trained on about two or three hundred hours of data - quite a bit less - and we can see that the English system gets pretty good: we're down to about an eighteen percent word error rate. The Arabic is really quite high, about forty-five percent, maybe in part due to the different dialects, and also maybe in part due to pronunciation modeling, because that's very difficult in Arabic if you don't have the diacritized form.

At LIMSI we also work on some other languages, including French, Spanish, Russian and Italian, and these are just some results to show you that we're in the same ballpark of error rates for these systems - once again for conversational speech - trained on about one hundred to two hundred hours of data. Now let's go to Babel, which is very challenging compared to what we see here, and this is already harder than what we had for the broadcast-type data.

Before that, I just want to say a few words about what we mean by a low-resource language. In general, these days it means the language has a low presence on the Internet. That's probably not a definition ethnologists or linguists would agree with, but I think from the technology community's point of view, if you cannot get any data, it's a low-resource language. It has limited text resources, at least in electronic form; there is little - or some, but not too much - audio data; you may or may not find pronunciation dictionaries; and it can be difficult to find reliable knowledge about the language. If you google different things and find some characteristics of the language, you get three different people telling you three different things, and you don't really know what to believe.

And one point I'd like to make is that this is true for what we're calling these low-resource languages, but it has also often been true for different types of applications that we've dealt with in the past, even in well-resourced languages: you might not have any data for the type of task you're addressing.

So here's an overview of the Babel languages for the first two years of the program. I'm roughly trying to give an idea of the characteristics of the languages - I'm sure these are not one hundred percent correct. I tried to classify the characteristics into general classes, to give something we can easily understand.

So, for example, we see in the list of languages that Bengali and Assamese are relatively closely related, and that Cantonese, Lao and Vietnamese use different scripts, while Bengali and Assamese share the same written script. We also have Pashto, which uses the Arabic script, where we have the problem of diacritization. And then we have Turkish, Tagalog, Vietnamese and Zulu, which was actually very challenging because there we had clicks we needed to deal with.

So they use different scripts; some of the languages have tone - in this case we had four with tone. We also tried to classify the morphology as easy, medium or hard - okay, I'm sure this is not very reliable - but basically three of them we considered to have a difficult morphology: Pashto, Turkish and Zulu. For the others it's less of a problem.

The next column is the number of dialects, and this is not the number of dialects in the language; it's the number of dialects in the corpus collected in the context of Babel. In some cases we only had one, as for Lao and Zulu, but in other cases we had as many as five for Cantonese, and as many as seven for Turkish. And then, once again, whether the G2P is easy or difficult: some of them are easy, some of them seem to be hard, in particular Pashto; and for Cantonese it's basically a dictionary lookup over a limited character set.

So here, in the last column, I'm showing the word error rates for the Babel languages, and it's drawn in a different style. If you look at the top of the blue bar, that's the word error rate of the worst language - in fact for both of them the top of the blue, this language here, is about fifty-some percent and sixty-some percent, and that's Pashto - and the top of the orange just shows you the range of the word error rates across the different languages. Sorry, I said that backwards: this is the best language and this is the worst language. The top here is Pashto, which is about seventy percent in one case and fifty-five percent in the other, and the best is, I believe, Vietnamese and Cantonese. Sorry if I confused you there. And I'm wrong again with that too - I mixed it up with the keyword spotting. So, I should have read my notes: the lowest word error rate was for Haitian and Tagalog, and the highest was for Pashto.

In this case we had what's called in our community - you'll see it in other papers - the Full LP condition, which means you have somewhere between sixty and ninety hours of annotated data for training; and there's the LLP, the limited language pack, which has only ten hours of annotated data per language, but you can use the additional data in an unsupervised or semi-supervised manner.

Some of the research directions - you've probably seen a fair number of talks about them here - are looking into language-independent methods to develop speech-to-text and keyword spotting for these languages, and into multilingual acoustic modeling; yesterday there was a talk by the Cambridge people and there was also a talk from MIT. There's work on trying to improve model accuracy under these limited training conditions, using unsupervised or semi-supervised techniques for the conversational data, for which we don't have much information coming from the language model - it's a very weak language model that we have - and on exploring multilingual and unsupervised MLP training. Both of those have been pretty successful, whereas multilingual acoustic modeling using standard ?? HMMs has been a little bit less successful.

One other thing that we're seeing interest in is using graphemic models, because these can avoid the problem of having to do grapheme-to-phoneme conversion, and they reduce the problem of pronunciation modeling to something closer to the text normalization you have to do anyway for language modeling.

So now I want to talk just briefly about something that didn't work that we tried at LIMSI. One of the languages is Haitian Creole, so this is great, you know: we work in French, we've developed a decent French system, so why not try using the French models to help our Haitian system? The first thing we did was to try running our French system on the Haitian data - it was a disaster, it was really bad. Then we took the French acoustic models with a language model for the Haitian data, but that also wasn't very good.

Then we said, okay, let's try adding varying amounts of French data to the Haitian system. This is the Haitian baseline, about seventy-some percent word error rate, about seventy-two percent. If we add ten hours of French, we get worse: about seventy-four or seventy-five. We add twenty hours, it gets worse again. We add fifty hours, it gets worse again, and we said: oops! This is not working, stop.

This is work that we never really got back to. We wanted to look a little more into understanding why this was happening; we don't know if it's due to the fact that the recording conditions were very different, and we don't know if there were really phonetic or phonological differences between the languages. Then we had another bright idea: okay, let's not use standard French data - we also have some accented French data from Africa, some data from North Africa, and I don't remember where the other set was from - so we said let's try that. Same result: we took the ten hours of data we had, and it basically degraded in the same way. So we were kind of disappointed by the results and dropped working on it for a while.

We hope to get back to some of this again. There was a paper from KIT talking about using multilingual and bilingual models for recognition of non-native speech, and that actually was getting some gain, so I thought that was a positive result, in contrast to our negative result here.

Let me mention one more thing. We also tried to build some joint models for Bengali and Assamese, because, being naive and not speaking these languages, we decided this was something we could try: put them together and see if we could get some gain. In one condition we got a tiny gain from training the language model on the pooled data - but really tiny - and the acoustic model once again didn't help us. I heard that yesterday somebody commented on this, saying that they really are quite different languages and we shouldn't assume, just because we don't understand them, that they are very close. But we did have Bengali speakers in our lab, and they told us the two were pretty close, so it wasn't based on nothing.

So let me just give a couple of results on keyword spotting, to give you an idea of what kind of things we're talking about and what the results are. On the left part of the graph I give results from 2006 - the spoken term detection task that was run by NIST, which was done on more than one data type: this is on broadcast news and conversational data. You can see that the measure used here is the MTWV, the Maximum Term-Weighted Value. I don't want to go into it, but basically it's a measure of false alarms and misses, with a penalty you can set. The higher the number the better - so on the other slide we wanted a lower number, because it was the word error rate, and on these ones we want a high number.
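For reference, the underlying term-weighted value is defined roughly as follows (a sketch of the NIST definition; here θ is the detection threshold and β a penalty weighting the false alarms):

\mathrm{TWV}(\theta) = 1 - P_{\mathrm{miss}}(\theta) - \beta \, P_{\mathrm{FA}}(\theta),
\qquad \mathrm{MTWV} = \max_{\theta} \mathrm{TWV}(\theta)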

We can see that for the broadcast news data it was about eighty-two or eighty-five percent, and for the CTS data it's pretty close, up around eighty. But if we look at the Babel languages now, we're down to around forty-five percent for the worst language - that's for the full training condition, with sixty to ninety hours of supervised training - and the best one goes up to about seventy-two percent. Let me look at my notes so I get these right: the worst language was Pashto, and the best languages were Cantonese and Vietnamese.

And this is the limited condition, where you can see that you take a really big hit for the worst language, but in fact for the best ones we're not doing so much worse. These systems were trained on the ten hours, and the additional data was then used in an unsupervised manner. There are a bunch of bells and whistles and a bunch of techniques used to get these keyword spotting results that I didn't and won't talk about, but there are a lot of talks on this here that you can go to - I think there are two sessions tomorrow and maybe another poster. And once again there are Mary Harper's talks if you're interested in finding out more.

So, some findings from Babel. You've seen that unsupervised training helps at least a little bit, even though we have very poor language models. The multilingual acoustic models don't seem to be very successful, but there is some hope from research that's going on. The multilingual MLPs are a bit more successful - there are quite a few papers talking about that. Something that we've used at LIMSI for a while, but that was also shown in the Babel program, is that pitch features are useful even for non-tonal languages. In the past we used pitch for work on tonal languages, and we didn't use it all the time; now I think a lot of people just use it systematically in their systems.

Graphemic models are once again becoming popular, and they give results very close to the phonemic ones. And then for keyword spotting there are a bunch of important things. Score normalization is extremely important - there was a talk at the last ASRU meeting - and so is dealing with out-of-vocabulary keywords: when you get a keyword, you don't necessarily know all of its words, particularly when you have only ten hours of data and its transcripts, so you've got a very small vocabulary. You have no idea what type of query a person will give, and you need to do something - do tricks - to be able to recognize and find these keywords in the audio. Typically what's being investigated now is subword units and proxy-type approaches, and I'm sure you'll find papers on that here.

So let me switch gears now in my last fifteen - ten - minutes, okay, to talk about some linguistic studies. The idea is to use speech technologies as tools to study language variation and to do error analysis; there are two recent workshops on this that I listed on the slide.

I'm going to take a case study from Luxembourgish, the language of Luxembourg. This was done working closely with Martine Adda-Decker, who is Luxembourgish, for those who don't know her. She says that Luxembourg is really a true multilingual environment, sort of like Singapore, and in fact it seems a lot like Singapore: the capital city has the same name as the country for both of them. Well, there is a bit of a difference - it's a little warmer here. But Luxembourg is about three times the size of Singapore, and Singapore has about ten times as many people, so it's not quite the same.

The basic question we're asking for Luxembourgish is: given that it has a lot of contact with English, French and German, which language is the closest? There were a couple of papers on this that Martine is first author of, at different workshops; the most recent one was at the last SLTU.

This is a plot showing the number of shared words between Luxembourgish and French, English and German. The bottom curve is English, the middle one is German and the top one is French. Along the x-axis is the size of the word list, sorted by frequency, and on the y-axis is the number of shared words. You can see that at the low end we've got the function words - as we expect, since those are the most frequent in the languages - then you get more general content words, and higher up you get technical terms and proper names. You can see that in general there's more sharing with French than with German or English, at least at the lexical level, and once again the highest amount of sharing is when you get to technical terms, because these are shared across languages more generally.

So what we tried to do - that's the question - is, given this similarity at the lexical level, to see whether there is the same type of similarity at the phonological level. We took acoustic models from English, French and German, and we tried to make an equivalence between their IPA symbols and those of Luxembourgish. Martine defined the set of phones for Luxembourgish, and then we made a hacked-up pronunciation dictionary that allows a language change to happen after any phoneme. This can get pretty big - you end up with a lot of pronunciations - because if you have, say, a three-phone word, at each point you can decide to go to any of the other languages; you can see the illustration here with the paths, you can go anywhere. We also trained a multilingual model on the three languages' data together: we took a subset of the English, French and German data and built what we called a pooled model.
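Here is a toy sketch (made-up word and phones) of what that hacked-up dictionary expansion amounts to: every phone of a pronunciation can be realized by the English (EN), French (FR), German (DE) or pooled (XX) models, so a language switch is allowed after any phone and the alignment simply picks whichever model fits best.

from itertools import product

LANGS = ["EN", "FR", "DE", "XX"]

def expand_pronunciation(phones):
    """Return all language-tagged variants of one pronunciation."""
    return [[f"{p}_{lang}" for p, lang in zip(phones, choice)]
            for choice in product(LANGS, repeat=len(phones))]

# A three-phone pronunciation already yields 4**3 = 64 tagged variants,
# which is why the dictionary gets pretty big.
for variant in expand_pronunciation(["h", "a", "s"])[:5]:
    print(" ".join(variant))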

So in the first experiment we tried to align the audio data with these models in parallel, so that the system could choose which acoustic model it liked best: the English, French, German or pooled one. Then we did a second experiment: we trained a Luxembourgish model in an unsupervised manner, just as I showed for Hungarian, and we said, now let's use that, and we replaced the pooled model with the Luxembourgish model. Of course our expectation was that once we put the Luxembourgish model in there, it should get the data - the alignment should go to that model. That's what we expected.

And here's what we got. On the left is experiment 1, the one with the pooled model; on the right we have the Luxembourgish model. The top is German, then we have French, English, and pooled or Luxembourgish. We were really disappointed: the first thing we see is that Luxembourgish doesn't take everything, and second, we have pretty much the same distribution - there's very little change.

So we said, okay, let's look at this a little bit more. Martine said, let's look at it more carefully, because she knows the language, and so we looked at some diphthongs which only exist in Luxembourgish - we had this ?? base - where we had tried to choose something close when we took English and French. And now we see the effect that we want: originally these went to English, which has diphthongs, or more diphthongs, and now they go to Luxembourgish. So we were happy; we got the result we wanted. We should do some more work and look at more things, but we're happy with this result.

The second thing I wanted to mention is a bit about language change. This was an associated phonetic corpus-based study that was also presented at last year's Interspeech, by Maria Candeias.

We were looking at three different phenomena that appear to be growing in society. First, there is consonant cluster reduction, as in "explique" / "exclame": you get rid of the ?? - there are too many things to pronounce.

Second, the palatalization and affrication of dental stops, which is a marker of social status in immigrant populations. In fact, when you hear the "cha" or "ja", they sound very normal to me, because we have them in English and I'm used to them. And the third one is fricative epithesis, where at the end of a word you get this ??-type sound. Sorry.

I'll play you an example. That was something I remember very distinctly from when I first came to France: I heard it all the time, and women did this - it was a characteristic of women's speech that at the end there's this "eesh". It's very common, but in fact it's now growing even in male speech. These examples were taken from broadcast data, so this is people talking on the radio and the television, and you can imagine that if they're doing it, it's something that is really accepted by the community now - it really is a growing trend.

So Maria was looking at these phenomena over the last decade, and what we did was the same type of thing: we took a dictionary and allowed this "eesh"-type sound after final E's - we allowed different phonemes to go there - and then looked at the alignments and at how many counts we got of the different variants.
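A minimal sketch (hypothetical data layout) of that counting step: the forced alignment picks, for each word token, one of the pronunciation variants we offered - with or without the word-final fricative - and we simply count how often each one is chosen per year.

from collections import Counter

# Each alignment entry: (year, word, chosen_variant), taken from the forced alignments.
alignments = [
    (2003, "oui", "w i"),
    (2003, "oui", "w i S"),   # variant ending in the epithetic fricative
    (2007, "oui", "w i S"),
    (2007, "oui", "w i S"),
]

counts = Counter((year, variant.endswith("S")) for year, word, variant in alignments)
for (year, has_fricative), n in sorted(counts.items()):
    print(year, "with fricative" if has_fricative else "plain", n)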

And here we're just showing that between 2003 and 2007 this sound is becoming longer, and it has also increased in frequency by about twenty percent.

So now, the last thing I wanted to talk about is human performance. We all know that humans do better than machines on transcription tasks, and that machines have trouble dealing with variability that humans handle much better.

Here is a plot based on some work by Dr ?? and his colleagues. We took stimuli from what the recognizer got wrong, so everything you see here has a 100% word error rate for the recognizer; these were very confusable little function words - like "a" and "an" in English - things like that. And we played these stimuli to listeners: 14 native French subjects and 70 English subjects.

Everyone listened to the stimuli, and here you can see that if we give the humans just a local context - a trigram context, which is what many recognisers have - they make thirty percent errors on these items, whereas the system was a hundred percent wrong. If we widen the context to five-grams, so we've got two words on each side, the human error rate goes down to about twenty percent. So this is going in the right direction; the context, it seems, is helping a little bit. And if we go up to seven- or nine-grams, we do a little bit better, but we still have about a fifteen percent error rate by humans on this task. So our feeling is that these items are intrinsically ambiguous given only a small context; we need a larger one.
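Just to make the context sizes concrete, here is a toy illustration (made-up sentence) of the window presented to the listeners for each n:

def context_window(words, target_index, n):
    """Return the n-gram window centred on words[target_index]."""
    half = (n - 1) // 2
    return words[max(0, target_index - half): target_index + half + 1]

sentence = "and then a friend of mine came over to help".split()
print(context_window(sentence, 2, 3))  # trigram: ['then', 'a', 'friend']
print(context_window(sentence, 2, 5))  # 5-gram:  ['and', 'then', 'a', 'friend', 'of']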

And just to have a control, we also put in some samples where the recognizer was correct - zero percent word error rate for the recogniser - and you see that the humans make very few errors on those as well, which reassures us that it's not an experimental problem causing the higher error rates for humans.

So I just want to play one more example, this time of a human misunderstanding. This comes from a French talk show; I think there are enough French people here who will follow it. The error that happens is that one speaker said "là aussi", which is very close to what the other person understood, "là aussi en" - I pronounce it poorly. What was really interesting about this is that the correction came about twenty words later than the point where the person actually said it. The two people talking each had their own mindset, and they weren't really completely listening to each other - and this is, once again, a broadcast talk show. I can play the longer passage for people later if you're interested.

So, my last slide. As a community we are processing more languages and a wider variety of data. We are able to get by with less supervision, at least for some of the training data. We're seeing some successful applications with this imperfect technology. Something we would like to do is extend the use of the technology for other purposes. We still have little semantic and world knowledge in our models, and we still have a lot of progress to make, because those word error rates are still high and there are a lot of tasks out there - so maybe we need to do some deep thinking about how to deal with this.

So that's all.

Thank you.

We have time for some questions?

No questions.

Hi Lori.

Hi Malcolm. - In the semi-supervised learning cases, do you have any sense of when things fail? Why things converge or diverge?

We had some problems with some languages - this is on broadcast data? We had problems if you had poor text normalisation, or if you didn't do good filtering to make sure that the text data really were from the language you were targeting; then it can fail, it just doesn't converge. So that's one case, and in fact we had two languages where the problem was like that - basically the word segmentation wasn't good. I think if you have too much garbage in your language model, the information you're giving the system is poor.

What amazes me - and we haven't done too much of this work ourselves at LIMSI yet - is that it still seems to work to some degree for the Babel data, where we're flying with these word error rates and we have very little language model data; but probably what we have is correct, because we're using the manual transcripts for it, as opposed to the case where you're downloading data from the web and you don't really know what you're getting. So if you put garbage in, you get garbage out. That's why we need a human to supervise what's going on, at least to some extent.

So was it quantified to some extent? - I don't really have enough information. I know that for one of the languages we tried, you'd basically get some improvement, but you'd stagnate, maybe at the second or third iteration, and fail to improve further. It didn't happen too often, and it's something I don't really have a good feeling for. Something I didn't talk about is text normalisation, which really is an important part of our work. It's something that's considered, I think, front-end work, and people don't talk about it too much.

Any more questions? Well, if not, I would like to invite the organiser, our chairman, to give a token of our appreciation to Lori. Let's thank her again.