Thank you very much, Isabel; the feelings are shared. I'd like to thank the organisers here for inviting me to be a keynote speaker. It's really an honor. It's also a big challenge, so I hope I will get some messages across, well, to ... at least some of you.
As Isabel said, I've been working in speech and speech processing for many years now, and today I'll focus mostly on the task of speech recognition. But I'll talk about a little bit of context first.
So.
At LIMSI - being in Europe of course - we try to work on speech recognition in a multilingual context, processing at least a fair number of the European languages.
This isn't really new. We are seeing sort of a regrowth in wanting to do speech recognition in different languages, but if you go back a long time there was already some research there; we just didn't hear about it as much. And that's probably because there weren't too many common corpora and benchmark tests to compare results on, and so papers tended to be accepted more easily if you used common data. Which is still the case.
And it's logical: you want to compare results to other people's results. But now there's more and more data out there, there are more test sets out there, and so we can do more comparisons and we're seeing more languages covered. So I think that's really nice.
So I'll speak about some of our research results in the Quaero and Babel programs.
Sure. Is that better? Sorry, okay. The microphone was popping a bit this morning, so I wanted to not be too close.
So, I'll present highlights and research results from Quaero and Babel. And then I want to touch upon some activities I did with colleagues at LIMSI and at the Laboratory of Phonetics in Paris, trying out speech technologies for other applications, to carry out linguistic and corpus-based studies. I'll mention briefly a couple of perceptual experiments that we've done, and then finally some concluding remarks.
So I guess we probably all agree in this community that we've seen a lot of progress over the last decade or two. We're actually seeing some technologies out there that use speech and that are more or less working. I think that's kind of fun, it's really nice. But we see it for only a few languages, and as we heard yesterday from Haizhou, about 1% of the world's languages are actually seen in our proceedings, so we have at least something for them. That's pretty low, but it's up from the one or two that we did maybe twenty years ago, so things are moving along.
One of the problems, as I mentioned before, is that our technology typically relies on having a lot of language resources, and these are difficult and expensive to get. So it's harder to get them for lesser-resourced languages, and current practice still takes several months to years to bring up systems for a language if you count the time for collecting data, processing it, transcribing it and things like that.
So this is sort of a step back in time: if we go back to, say, the late eighties or early nineties, we had systems that were basically command and control, dictation, and we had some dialogue systems, usually pretty limited to a specific task like ATIS (Air Travel Information System), travel reservation, things like that. Some of you here probably know them well, and some of you are maybe too young and don't even know about them, because the publications are not necessarily scanned and online, so you don't see them.
But in fact we're now seeing a regrowth of the same activities, when you look at voice mail, dictation applications on your phones and the various personal assistants that are finally coming out now. And so it's really exciting to see this pick-up again of things we saw in the past.
And then of course we have some new applications, or applications that are new for some of us, that are growing as we've got better processing capability, both in terms of computers and of data out there. So we have speech analytics of call center data or meeting-type data, lectures, there's a bunch of things. We have speech-to-speech translation, and also indexation of audio and video documents, which is used for media mining tasks, and companies are very interested in that. And of course the speech analytics people are really interested in finding out what people want to buy, trying to sell them things and stuff like that.
So let me back up a little bit and talk about something: why is speech processing difficult? All of us speak easily, and as was sort of mentioned, it's sort of natural, we learn language. But I think any of us who learned a foreign language, at least as an adult, understands that it's a little bit harder than it seems. I learned French when I was ... after my PhD, I won't say my age. And it wasn't so easy to learn, it wasn't so natural, and my daughter, who grew up in a bilingual environment, speaks French and English fluently, and she's better in other languages than I am. I was a good unilingual American speaker, with no other contact with other languages.
We need context to understand what's being said, so you speak differently if you're speaking in public than if you're speaking to someone you know. We all know this. You can see on the screen that speech is continuous, and so if I'm talking to you I might say "it is not easy" or "it's not easy", but if I'm talking to my mother it's much more reduced. I reduce that "it's not easy", and then it's not so clear where the words are. And I think we all know, once again, that humans reduce their pronunciation in regions of low information: where it's not very important, you put the minimum effort into saying what you want to say. And of course there are other variability factors; I'll also mention the speaker's characteristics, accent, the context we are in. Humans do very well at adapting to this. Machines, in general, don't.
So here, since I am taking a step back in time, I wanted to play a couple of very simple samples. Is there anyone in this room who doesn't know this type of sentence? You've all heard it. Good! Okay. That's TIMIT, and that's going back a really, really long time. I was involved in some of the selection for TIMIT, but not of these sentences. Those were selected to elicit pronunciation variants for different styles of speaking, and you can hear in the sample here that, even in a very, very simple read text, we have different realisations of the word "greasy". So in the first case we have an S, in the second case we have a Z.
And we can see that ... I can't really point to the screen there, I don't have a pointer. Do you have a pointer? I think I refused it before. So in any case, you can see in blue there that the S is quite a bit longer, and you can see the voicing in the Z. And is everyone here familiar with spectrograms? I sort of assumed the worst. It's okay, good.
So here's another example, of more conversational-type speech, and we'll see that people interrupt each other; you can hear some of the hesitations. In this example, from a conversational telephone corpus, the participants called each other and were supposed to talk about a topic they were given, a mutual topic. You're supposed to talk about it, but not everybody does. The participants don't know each other very well, but they still interrupted each other, they did some turn taking, and you can hear the "hmm"s and the laughter, and there's someone else in the background in another conversation; I don't know where that is.
Now I'm gonna play an example from Mandarin, and I'm trusting that Billy Hartman, who is probably in ?? with his wife, gave me the correct translation, the correct text here, because I don't understand it. This one is even more spontaneous: it's an example taken from the CallHome Mandarin corpus, where we think it's a mother and a daughter who are talking to each other about the daughter's job interview.
So, for those who speak Mandarin.
If I understood correctly and the translation is correct, they're basically talking about the interview, and the mother doesn't understand what the job interview is about, so she says: Don't speak to me in another language, speak to me in Chinese. And the daughter says: You wouldn't have understood anyway, even if I had spoken in Chinese. I've had some similar situations speaking with my mother. That's what I do.
So now I'm gonna switch gears a little bit and talk about the Quaero program, which is one of the two topics I mainly want to focus on here when we talk about speech recognition in different languages. This was a large research and innovation project in France, funded by OSEO, the French innovation agency. It was initiated in 2004 but didn't start until 2008, and then ran for almost six years until the end of 2013, so it finished relatively recently. It was really fun, but when we started putting it together the web was a lot smaller than it is now. As we also heard, I think it was this morning, there was no YouTube, no Facebook, no Twitter, no Google Books, no iPhones; all that didn't exist. So life was boring - what did we do with our free time, right? Instead of spending your time on the ??.
I think it's hard to be in the position of young people who don't know life without all of this - my daughter grew up with all of it - and so it's very hard for them to relate to what that situation really was.
But in any case, to get back to the processing of this data: we have tons and tons of data. I read that there are roughly 100 hours of video uploaded to YouTube every minute. That's a huge amount of data, in 61 languages. So if we are treating about 7 of them, we are not doing so badly; maybe we cover the languages of most of the videos there. But we don't know how to organise this data, we don't know how to access this data. And that is what Quaero was aiming at: how can we organize the data, how can we access it, how can we index it, how can we build applications that make use of today's technology and do something interesting with it. I'm not gonna talk about all of that; if you're interested I suggest you go to the Quaero website, where you can find some demos and links and things like that. I'm gonna focus on the work that we did in speech processing. At LIMSI we worked mostly on speech and text processing, including text processing applied to speech data, so named entities in both text and speech, and translation, both of text and of speech.
So here is a figure showing the speech processing technologies that we worked on in the project. The first box is audio and speaker segmentation: chopping up the signal and deciding which regions are speech and non-speech, dividing it into segments corresponding to different speakers, detecting speaker changes. Then we may or may not know the identity of the language being spoken, so we have a language identification box if we don't know it.
Most of the time we want to transcribe the speech data, because speech is ubiquitous - there's speech all over the place - and it has a high amount of information content, so we believe it is the most useful. We work on speech and not on images; image people might tell us that images are more useful for indexing this type of data. One advantage speech has relative to images is that speech has an underlying written representation that we pretty much all agree upon, more or less: we are able to decide where the words are. We might differ a little bit, but we pretty much agree on it. That is not the case for images: if you give an image to two different people, one will tell you it's a blue house, the other will tell you it's trees in a park with a little blue cabin, something like that. You get very different descriptions based on what people are interested in and their manner of expressing things. For speech, in general, we're a little bit more normalized: we pretty much agree on what is there.
Then you might want to do other types of processing, such as speaker diarization. This morning doctor ?? spoke about the Twitter statistics during the presidential elections, and that was something we actually worked on in Quaero: to take a corpus of recordings - you might have hundreds or thousands of hours - and look at the speaking times within it, at how many speakers are speaking, when, and how much time is allocated to each. That's actually something that has a potential use, at least in France, where they verify that during the election period all the parties get the same amount of speaking time. So you want very accurate measures of who is speaking when, so that everybody gets a fair deal during the elections.
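(As a side note, the bookkeeping behind such a measurement is simple once the diarization is done. Here is a minimal sketch, assuming hypothetical diarization output in the form of (file, speaker, start, end) segments; the names and times are invented.)

```python
from collections import defaultdict

# Hypothetical diarization output: (recording, speaker label, start time, end time) in seconds.
segments = [
    ("debate_part1.wav", "spk_A", 12.4, 58.0),
    ("debate_part1.wav", "spk_B", 58.0, 81.5),
    ("debate_part2.wav", "spk_A", 3.0, 47.2),
]

# Total speaking time per speaker across the whole corpus of recordings.
totals = defaultdict(float)
for _recording, speaker, start, end in segments:
    totals[speaker] += end - start

for speaker, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{speaker}: {seconds / 60.0:.1f} minutes")
```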
Other things that we worked on were adding metadata to the transcriptions: you might add punctuation or markers to make it more readable, or transform number words into digit sequences, as in newspaper text, for example "1997". And you might want to identify entities or speakers or topics that can be useful for automatic processing, so you can put in tags where the same entities occur. And then the final box there is speech translation, typically based on the speech transcription; but we're also trying to work on having a tighter link between the speech and translation portions, so that you don't just transcribe and then translate, but have a tighter relation between the two.
Let me talk a little bit now about speech recognition. Everybody, I think, knows the block diagram, so basically the main point is just that we have three important models: the language model, the pronunciation model and the acoustic model. These are all typically estimated on very large corpora, and this is where we get into problems with the low-resource languages.
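(For reference, the standard textbook way these three models combine in decoding - not specific to any system discussed here - is
$$\hat{W} = \arg\max_{W} \; P(W) \sum_{\Phi \in \mathrm{pron}(W)} p(X \mid \Phi)\, P(\Phi \mid W),$$
where $P(W)$ is the language model, $P(\Phi \mid W)$ the pronunciation model mapping the word sequence to phone sequences $\Phi$, and $p(X \mid \Phi)$ the acoustic model scoring the audio $X$; in practice the sum is often approximated by a max.)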
And I want to give a couple of illustrations of why, at least I believe - and I've spent effort on this - the pronunciation model is really important: that we have the right pronunciations in the dictionary. So take these two examples: on the left we have two versions of "coupon", and on the right we have two versions of "interest". In the case of "coupon", in one version we have the "y" glide inserted there, and our models for speech recognition are typically modeling phones in context. So we can see that if we have a transcription of just "k u p ..." for it, we're gonna have this "y" there and it's not gonna be a very good match to the one we have in the second case. That's really a very big difference, and also the "u" becomes almost a fronted "eu", which is technically not distinctive in English. And the same thing for "interest": we have "in-trist" or "in-te-rest"; in one case you have the N, in the other case you have the TR cluster. These are very different, and you can imagine that, since our acoustic models are based on aligning these transcriptions with the audio signal, if we have more accurate pronunciations we're going to have better acoustic models in the end, and that's our goal.
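(To make the idea concrete, a pronunciation dictionary with explicit variants might look like the following minimal sketch; the phone symbols and variant priors are illustrative, not LIMSI's actual lexicon.)

```python
# Toy pronunciation lexicon: each word maps to a list of (phone string, prior) variants.
lexicon = {
    "coupon":   [("k uw p aa n", 0.5), ("k y uw p aa n", 0.5)],
    "interest": [("ih n t r ih s t", 0.4), ("ih n t er ih s t", 0.4), ("ih n er ih s t", 0.2)],
}

def pronunciations(word):
    """Return the pronunciation variants the aligner can choose between for a word."""
    return lexicon.get(word.lower(), [])

print(pronunciations("coupon"))
```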
So now I want to speak a little bit about what is called lightly supervised learning; there are many terms being used for it now: unsupervised training, semi-supervised training, lightly supervised training. Basically, one goal is that - and Ann mentioned something about this yesterday - maybe machines can just learn on their own. So here we have a machine: he's reading the newspaper, he's watching the TV and he's learning. Okay, that's great; that's something we would like to happen. But we still believe that we need to provide some guidance, and so this is a researcher here trying to give some information and supervision to the machine that is learning.
When we look at traditional acoustic modeling, we typically use between several hundred and several thousand hours of carefully annotated data, and once again, as I said before, this is expensive. So people are looking into ways to reduce this, to reduce the amount of supervision in the training process. And I believe that some people in this room are doing it: automating the process of collecting the data, automating the iterative learning of the systems by themselves, even including the evaluation - so having some data used to evaluate on that is not necessarily carefully annotated. Most of the time it is, but there's been some work trying to use unannotated data to improve the system, which I think is really exciting.
So we talk about reduced supervision and unsupervised training; a lot of different names are used. The basic idea is to use some existing speech recognizer: you transcribe some data, you assume that this transcription is true, then you build new models estimated on this transcription, and you reiterate. There's been a lot of work on this for about fifteen years now, and many different variants have been explored: whether to filter the data, whether to use confidence factors, do you train only on things that are good, do you take things in the middle range - there are many things you can read about.
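(A minimal sketch of that loop, with the system-specific pieces - decoder, confidence scoring, trainer - passed in as callables, since those are assumptions of mine rather than anything described in the talk.)

```python
def semi_supervised_training(seed_model, untranscribed_audio, language_model,
                             decode, confidence, train_acoustic_model,
                             n_iterations=4, conf_threshold=0.7):
    """Decode unlabeled audio, keep confident hypotheses as if they were truth, retrain, repeat."""
    model = seed_model
    for _ in range(n_iterations):
        pseudo_labeled = []
        for utterance in untranscribed_audio:
            hypothesis = decode(utterance, model, language_model)
            if confidence(hypothesis) >= conf_threshold:
                # the automatic transcript is treated as the reference for retraining
                pseudo_labeled.append((utterance, hypothesis))
        model = train_acoustic_model(pseudo_labeled)
    return model
```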
Something that's pretty exciting, which we see in the Babel work, is that even when we apply these techniques to systems starting with very high word error rates, the process still seems to converge, and that's really nice.
The first things I'll talk about are for the case of broadcast news data, where we have a lot of data we can use as supervision. By this I mean we're using language models that are trained on many millions of words of text, and this gives some information to the system. It's not completely unsupervised, which is why you see these different names for what's being done; different researchers call it different things. It's all about the same, but it goes by different names.
And so here I wanted to illustrate this with a case study on Hungarian that we did in the Quaero program; it was presented at last year's Interspeech by A. Roy, so maybe some of you saw it. We started off with seed models - this point up here, at about eighty percent word error rate - that come from other languages; there were five languages we took them from, so we did what most people would call cross-language transfer. These models came from, if I have it correctly: English, French, Russian, Italian and German, and we tried to simply choose the best match between the phone set of Hungarian and one of these languages.
Then we used this model to transcribe about 40 hours of data, which is this point here, and the size of the circle shows roughly how much data is used - this is forty hours, then we double it again here and go to about eighty hours. This axis is the word error rate and this is the iteration number. So we progressively use an increased amount of data and increase the model size; we have more parameters as the models grow. The second point here uses the same amount of data but more context, so we built a bigger model: we took this model again, re-decoded all the data, the forty hours, and built another model, and now we went down to about sixty percent, so we're still kind of flying high. We doubled the data again - we're probably at about a hundred and fifty hours, something like that - and we got down to about fifty percent. These all use the same language model; that wasn't changed in this study. And then finally, here, we used about three hundred hours of training data and we're down to about thirty or thirty-five percent.
And of course, as everybody knows, these were done with just standard PLP plus F0 features, and pretty much everybody is now using features generated by an MLP. So we took our English MLP and generated features on the Hungarian data - a cross-lingual transfer of the MLP - and we see a small gain there, a little bit, with the amount of data fixed. Then we took the transcripts generated by this system here and trained an MLP for Hungarian, and there we get another two or three percent absolute gain, and we're down to a word error rate of about twenty-five percent. That isn't wonderful - it's still relatively high - but it's good enough for some applications such as media monitoring and things like that.
And so this was done completely without transcriptions, and we did it for a bunch of languages. So now let me show you some results for - I think it's about nineteen languages we did in Quaero; we actually did more, twenty-three, but this is only for nineteen of them. If we look here, the languages up to this point were trained in a standard supervised manner, with somewhere between a hundred and five hundred hours of data depending upon the language, and the ones with the blue shading were trained in an unsupervised manner. Once again we have the word error rate on the left, and this is the average error rate across three to four hours of data per show - per language, sorry. And we can see that while in general the error rates are a little bit lower for supervised training, these aren't so bad; some of them are really in about the same range. You have to take the result with a little grain of salt, because some of these languages here might be a little bit less well trained or a little bit less advanced than the lower-scoring languages; these might do a little bit better if we worked more on them.
But this isn't the full story, so now I'm going to complicate the figure. In green you have the word error rate on the lowest file - the audio file that had the lowest word error rate for each language. Okay, so these files are easy; they're probably news-like files, and the error rate gets very low: even for Portuguese we're down around three percent for one of these segments. Then in yellow we have the worst-scoring one, and these worst-scoring files are typically more interactive, spontaneous speech: talk shows, debates, noisy recordings made outside - a lot of variability factors come in. So even though this blue curve is kind of nice, we really see that we have a lot of work to do if we want to be able to process all the data up here.
So now I'm going to switch gears, stop talking about Quaero, and talk a little bit about Babel, where it's a lot harder to have supervision from the language model, because in Babel you are working on languages that have very little data, or data that is hard to get - typically they have little data, though not all of them are really in that situation. This is a sentence I took from Mary Harper's slides that she presented at ??, and the idea being investigated is to apply different techniques from linguistics, machine learning and speech processing to be able to do speech recognition for keyword search. I highly recommend, for people who are not familiar with Mary's talks, that you see them. I know that the ASRU one is online on Superlectures, and the ?? one I don't know - people here probably know better than me whether ?? is there. But they're really interesting talks, and if you're interested in this topic I suggest you go and watch them.
So, keyword spotting. Yesterday Ann spoke about the fact that children can do keyword spotting very young, and so I want to do a first test with you. By basic keyword spotting, what I mean is that you're going to localise in the audio signal the points where you have detected your keyword. So these two you detected right; here you missed it - it's the same keyword, it occurred, but you didn't get it; and here you detected a keyword that isn't there, so that's a false alarm. So there you have the misses, the false alarms and the correct detections.
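(A toy sketch of how such detections get scored against a reference: a detection overlapping a reference occurrence in time is a hit, an unmatched detection is a false alarm, and an unmatched reference occurrence is a miss. The times below are invented.)

```python
def score_keyword(reference, detections, tolerance=0.5):
    """reference and detections are lists of (start, end) times in seconds for one keyword."""
    hits, false_alarms = 0, 0
    matched = [False] * len(reference)
    for d_start, d_end in detections:
        hit = False
        for i, (r_start, r_end) in enumerate(reference):
            if not matched[i] and d_start < r_end + tolerance and d_end > r_start - tolerance:
                matched[i], hit = True, True
                hits += 1
                break
        if not hit:
            false_alarms += 1
    misses = matched.count(False)
    return hits, misses, false_alarms

print(score_keyword(reference=[(3.2, 3.6), (10.1, 10.5), (22.0, 22.4)],
                    detections=[(3.25, 3.55), (15.0, 15.4)]))  # -> (1, 2, 1)
```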
So now let me play you a couple of samples, and this is actually a test of two things at the same time. One is language ID, so I'm gonna play samples of different languages - there are two sets of six different languages - and there's a common word in all of these samples. I'd like people to let me know if you can detect this word, to see if we as adults can do what children can do.
Do you want to hear it again? And can we make it a little louder? Is it possible to be a little bit louder on the audio, because I can't control it here. I don't think it goes any louder; I have it on the loudest.
Okay, so I'll show you the languages first. Did anyone get the languages? There's probably a speaker of each language here, so you probably recognised your own language. So the languages were: Tagalog, Arabic, French, Dutch, Haitian and Lithuanian. Shall I play it again? It's okay? Alright, so.
So here's the second set of languages; the last one is Tamil. I'm not really sure about the end, where "taxes" occurred in different places - Google Translate told us it did, but there might be some native speakers here who can tell us whether that's right or not. To me it sounded like income taxes and sales tax, but I don't really know; Google told us it was "income from taxes and sale of taxes", or something like that. So anyway, did everyone catch the word "taxes", or only some of you? "Taxes" is one of those words that seems to be relatively common, and in many languages it's pretty much the same.
Before talking about keyword spotting - I'm not gonna talk about it too much, actually - I wanted to show some results on conversational telephone speech. We'll talk about term error rate here rather than word error rate, because for Mandarin we measure the character error rate rather than the word error rate. So for English and Arabic we're measuring word error rate, and for Mandarin it's characters. These results are, I believe, from the NIST evaluations of the transcription task. The English systems are trained on about two thousand hours of data with annotations; the Arabic and Mandarin systems were probably trained on about two hundred or three hundred hours of data, which is quite a bit less.
And we can see that the English system gets pretty good: we're down to about eighteen percent word error rate. The Arabic is really quite high, about forty-five percent, maybe in part due to the different dialects, and also maybe in part due to pronunciation modeling, because that's very difficult in Arabic if you don't have the diacritised form. At LIMSI we also work on some other languages, including French, Spanish, Russian and Italian, and these are just some results to show you that we're in roughly the same ballpark of error rates for these systems, once again for conversational speech; these are trained on about a hundred to two hundred hours of data.
Now let's go to Babel, which is very challenging compared to what we see here, which was already harder than what we had for the broadcast-type data. Before that, I just want to say a few words about what we mean by a low-resource language. In general, these days it means the language has a low presence on the Internet. That's probably not a definition ethnologists or linguists would agree with, but I think from the technology community's point of view we'd say: if you cannot get any data, it's a low-resource language.
A low-resource language has limited text resources, at least in electronic form; there is little or some, but not too much, audio data; you may or may not find pronunciation dictionaries; and it can be difficult to find reliable knowledge about the language - if you google different things and find some characteristics of the language, you get three different people telling you three different things and you don't really know what to believe. And one point I'd like to make is that while this is true for what we're calling these low-resource languages, it has also been true many times, in the past, for different types of applications that we dealt with, even in well-resourced languages: you might not have any data for the type of task you're addressing.
So here's an overview of the Babel languages for the first two years of the program, where I'm roughly trying to give an idea of the characteristics of each language; I'm sure these are not a hundred percent correct. I tried to classify the characteristics into general classes, to give something we can easily understand. So, for example, we see in the list of languages that Bengali and Assamese are relatively closely related and share the same written script, while Cantonese, Lao and Vietnamese use different scripts.
We also have Pashto, which uses the Arabic script, where we have the diacritization problem. And then we have Turkish, Tagalog, Vietnamese and Zulu, which was actually very challenging because there we had clicks to deal with. So they use different scripts; some of the languages have tones - in this case we had four with tone - and we tried to classify the morphology as easy, medium or hard. Okay, I'm sure this is not very reliable, but basically three of them we consider to have a difficult morphology: Pashto, Turkish and Zulu. For the others it is low.
The next column is the number of dialects - and this is not the number of dialects in the language, it's the number of dialects in the corpus collected in the context of Babel. So in some cases we only had one, as for Lao and Zulu, but in other cases we had as many as five for Cantonese and as many as seven for Turkish. And then, once again, whether the grapheme-to-phoneme conversion is easy or difficult: some of them are easy, some seem to be hard, in particular Pashto; for Cantonese it's basically dictionary lookup over a limited character set.
And in the last column I'm showing the word error rates for the Babel languages, drawn in a different style. If you look at the top of the blue bar, that's the word error rate of the worst language - so in this case, in fact for both conditions, the top of the blue is this language here, at fifty-some percent and sixty-some percent, that's Pashto - and the top of the orange is just showing you the range of the word error rates across the different languages. Sorry, I said that backwards: this is the best language and this is the worst language. The top here is Pashto, which is about seventy percent in one case and fifty-five percent in the other, and the best, I believe, is Vietnamese and Cantonese. Sorry if I confused you there. And I'm wrong again with that too - I mixed it up with the keyword spotting results. So, I should have read my notes: the lowest word error rate was for Haitian and Tagalog, and the highest was for Pashto.
In this case we had what's called in our community - you'll see it in other papers - the Full LP condition, which means you have somewhere between sixty and ninety hours of annotated data for training; and there's the LLP, the limited, low-resource condition, which has only ten hours of annotated data per language, though you can use the additional data in an unsupervised or semi-supervised manner.
So some of the research directions, which you've probably seen a fair number of talks about here, are: looking into language-independent methods to develop speech-to-text and keyword spotting for these languages; looking into multilingual acoustic modeling - yesterday there was a talk by the Cambridge people and there was also a talk from MIT; trying to improve model accuracy under these limited training conditions, using unsupervised or semi-supervised techniques for the conversational data, for which we don't have much information coming from the language model - it's a very weak language model that we have; and exploring multilingual and unsupervised MLP training. Both of those have been pretty successful, whereas multilingual acoustic modeling using standard ?? HMMs has been a little bit less successful.
One other thing that we're seeing interest in is using graphemic models, because these can avoid the problem of having to do grapheme-to-phoneme conversion, and they reduce the problem of pronunciation modeling to something closer to the text normalization you have to do anyway for language modeling.
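(A hedged sketch of what a graphemic "pronunciation" amounts to: after some simple text normalization, each word is just its sequence of characters, so no grapheme-to-phoneme rules are needed. The normalization shown is only an example, not what any particular Babel system does.)

```python
import unicodedata

def graphemic_entry(word):
    """Map a word to grapheme units: here simply its normalized, lowercased characters."""
    word = unicodedata.normalize("NFC", word).lower()
    word = "".join(ch for ch in word if ch.isalpha() or ch == "'")
    return list(word)

print(graphemic_entry("Bonjour"))  # ['b', 'o', 'n', 'j', 'o', 'u', 'r']
```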
So now I wanted to talk briefly about something we tried at LIMSI that didn't work. One of the languages is Haitian, so this is great, you know: we work in French, we've developed a decent French system, so why not try using French models to help our Haitian system. The first thing we did was try to run our French system on the Haitian data; it was a disaster, it was really bad. Then we took the French acoustic models with the language model for the Haitian data, but that also wasn't very good. Then we said, okay, let's try adding varying amounts of French data to the Haitian system. So this is the Haitian baseline: we have about seventy-some percent word error rate, about seventy-two. If we add ten hours of French we get worse, about seventy-four or seventy-five. We add twenty hours, it gets worse again. We add fifty hours, it gets worse again, and we said oops! This is not working, stop.
This was work that we never really got back to. We wanted to look a little more into it, to try to understand better why this was happening: we don't know if it's due to the fact that the recording conditions were very different, and we don't know if there were really phonetic or phonological differences between the languages. Then we had another bright idea: okay, let's not use standard French data - we also have some accented French data from Africa, some data from North Africa, and I don't remember where the other set was from. So we said let's try doing that. Same result: we took the ten hours of data we had, and it basically degraded in the same way. So we were kind of disappointed by the results and dropped working on it for a while.
We hope to get back to some of this again. There was a paper from KIT talking about using multilingual and bilingual models for recognition of non-native speech, and that was actually getting some gains, so I thought that was a positive result, in contrast to our negative result here.
Let me mention one more thing. We also tried to build some joint models for Bengali and Assamese, because - being naive and not speaking these languages - we decided this was something we could try: put them together and see if we can get some gain. In one condition we got a tiny little gain from the language model trained on the pooled set of data, but really tiny, and the acoustic model once again didn't help us. I heard that yesterday somebody commented on this, saying that they really are quite different languages and we shouldn't assume, just because we don't understand them, that they are very close. But we did have Bengali speakers in our lab and they told us the languages were pretty close, so it wasn't based on nothing.
So let me just give a couple of results on keyword spotting, to give you an idea of what kinds of things we're talking about and what the results are. On the left part of the graph I give results from 2006 - the spoken term detection task that was run by NIST - which was done on broadcast news and conversational data. You can see that the measure used here is MTWV, the Maximum Term-Weighted Value; I don't want to go into it, but basically it's a measure combining false alarms and misses, with a penalty you can set. The higher the number the better - on the earlier slides we wanted a lower number because it was the word error rate, whereas on these we want a high number.
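(Roughly, for those who want the formula: the term-weighted value at a given detection threshold averages a miss/false-alarm penalty over the keywords, and MTWV is that value at the best threshold. The sketch below assumes per-keyword counts are already available; beta around 999.9 is the value commonly cited for these evaluations.)

```python
def term_weighted_value(per_keyword_stats, speech_seconds, beta=999.9):
    """per_keyword_stats: list of (n_true, n_missed, n_false_alarm) per keyword at one threshold."""
    penalties = []
    for n_true, n_missed, n_fa in per_keyword_stats:
        if n_true == 0:
            continue  # keywords with no true occurrences are typically excluded
        p_miss = n_missed / n_true
        p_fa = n_fa / (speech_seconds - n_true)  # roughly one non-target trial per second of speech
        penalties.append(p_miss + beta * p_fa)
    return 1.0 - sum(penalties) / len(penalties)

# e.g. ten hours of speech (36000 s) and two keywords:
print(term_weighted_value([(5, 1, 2), (12, 3, 0)], speech_seconds=36000.0))
```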
And so we can see that for the broadcast news data it was about eighty-two to eighty-five percent, and for the CTS data it's pretty close, up around eighty. But if we look at the Babel languages now, we're down to around forty-five: once again the worst language is here, around forty-five percent, for the full condition with sixty to ninety hours of supervised training, and the best one goes up to about seventy-two percent. Now I'm looking at my notes so I get these right: the worst language was Pashto and the best languages were Cantonese and Vietnamese. And this is the limited condition, and you can see that you take a really big hit for the worst language here, but in fact on the best ones we're not doing so much worse. These systems were trained on the ten hours, and then the additional data was used in an unsupervised manner. There are a bunch of bells and whistles, a bunch of techniques used to get these keyword spotting results, that I didn't and won't talk about, but there are a lot of talks on it that you'll see here - I think there are two sessions tomorrow and maybe another poster. And once again there are the talks from Mary Harper if you're interested in finding out more.
So, some findings from Babel. You've seen that unsupervised training is helping at least a little bit, even though we have very poor language models. The multilingual acoustic models don't seem to be very successful, but there is some hope from research that's going on. The multilingual MLPs are a bit more successful, and there are quite a few papers talking about that. Something that we've used at LIMSI for a while, but that was also shown in the Babel program, is that pitch features are useful even for non-tonal languages. In the past we used pitch for work on tonal languages but didn't use it all the time; now I think a lot of people are just systematically using it in their systems. Graphemic models are once again becoming popular, and they give results very close to phonemic ones.
And then for keyword spotting there are a bunch of important things. Score normalization is extremely important - there was a talk at the last ASRU meeting - and so is dealing with out-of-vocabulary keywords: when you get a keyword you don't necessarily know all of its words, and particularly when you have only ten hours of data and its transcripts you've got a very small vocabulary. You have no idea what type of query a person will give, and you need to do something, some tricks, to be able to recognize and find these keywords in the audio. Typically what's being investigated now is subword units and proxy-type approaches, and I'm sure you'll find papers on that here.
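(One commonly reported normalization, as a generic sketch and not the exact scheme of any system mentioned here, is "sum-to-one": rescale each keyword's detection scores so they sum to one, so that a single global threshold behaves more like a keyword-specific one.)

```python
def sum_to_one(detections):
    """detections: dict keyword -> list of raw detection scores; returns a normalized copy."""
    normalized = {}
    for keyword, scores in detections.items():
        total = sum(scores)
        normalized[keyword] = [s / total if total > 0 else 0.0 for s in scores]
    return normalized

print(sum_to_one({"taxes": [0.9, 0.3, 0.05], "budget": [0.2]}))
```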
So let me switch gears now for my last fifteen - ten - minutes, okay, to talk about some linguistic studies. The idea is to use speech technologies as tools to study language variation and to do error analysis; there are two recent workshops on this that I listed on the slide. I'm gonna take a case study on Luxembourgish, the language of Luxembourg. This was done working closely with Martine Adda-Decker, who is Luxembourgish, for those who don't know her. She says that Luxembourgish is a really true multilingual environment, sort of like Singapore, and in fact it seems a lot like Singapore: the capital city has the same name as the country for both of them. Well, it's a little bit different - it's a little bit warmer here. And Luxembourg is about three times the size of Singapore, while Singapore has about ten times as many people, so it's not quite the same.
So the basic question we're asking for Luxembourgish is: given that it has a lot of contact with English, French and German, which language is the closest? There were a couple of papers on this that Martine is first author of, at different workshops; the most recent one was at the last SLTU. This is a plot showing the number of shared words between Luxembourgish and French, English and German: the bottom curve is English, the middle one is German and the top one is French. Along the x-axis is the size of the word list, sorted by frequency, and on the y-axis is the number of shared words. You can see that at the low end we've got the function words, as we expect, since those are the most frequent in the languages; then you get more general content words; and higher up you get technical terms and proper names. You can see that in general there's more sharing with French than with German or English, at least at the lexical level, and once again the highest amount of sharing is when you get to technical terms, because these are shared across languages more generally.
So what we tried to look at is the question of whether, given that we have this similarity to some extent at the lexical level, there is the same type of similarity at the phonological level. What we did was take acoustic models from English, French and German and try to set up an equivalence between their IPA symbols and those of Luxembourgish - Martine defined the set of phones for Luxembourgish - and then we made a hacked-up pronunciation dictionary that would allow a language change to happen after any phoneme. This can get pretty big - you have a lot of pronunciations, because if you have three languages you can decide at each point to go to one of the other ones; you can see the illustration here, with the paths you can go anywhere. And then we also trained a model on the three languages together - we took a subset of the English, French and German data and did what we called a pooled model.
So in the first experiment we tried to align the audio data with these models in parallel, so that the system could choose which acoustic model it likes best: the English, French, German or pooled one. Then we did a second experiment: we trained a Luxembourgish model in an unsupervised manner, just like I showed for Hungarian, and we said now let's use that, replacing the pooled model with the Luxembourgish model. And of course our expectation is that once we put the Luxembourgish model in there, it should take the data - the alignment should go to that model; that's what we expect.
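(To give a feel for the kind of flexible multilingual lexicon this implies, here is a hypothetical sketch in which each Luxembourgish word gets one variant per source language via an IPA-based mapping; the aligner can then pick a different language's models per word, or, in the fuller version just described, after any phone. The word, phones and mappings are invented.)

```python
word_to_ipa = {"haus": ["h", "au", "s"]}  # illustrative Luxembourgish word in IPA-like symbols

ipa_to_language_phones = {               # per-language phone mappings (made up)
    "en":     {"h": "hh", "au": "aw", "s": "s"},
    "fr":     {"h": "",   "au": "o",  "s": "s"},
    "de":     {"h": "h",  "au": "aU", "s": "s"},
    "pooled": {"h": "h",  "au": "au", "s": "s"},
}

def multilingual_lexicon(words, mappings):
    """Build one pronunciation variant per source language for each word."""
    lexicon = {}
    for word, ipa in words.items():
        variants = []
        for lang, table in mappings.items():
            phones = " ".join(p for p in (table.get(sym, sym) for sym in ipa) if p)
            variants.append((lang, phones))
        lexicon[word] = variants
    return lexicon

print(multilingual_lexicon(word_to_ipa, ipa_to_language_phones))
```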
And so here's what we got. On the left is experiment 1, the one with the pooled model; on the right we have the Luxembourgish model. The top is German, then we have French, English, and pooled or Luxembourgish. We were really disappointed: the first thing we see is that the Luxembourgish model doesn't take everything, and second, we have pretty much the same distribution - there's very little change. So we said okay, let's look at this a little bit more. Martine said let's look at it more carefully, because she knows the language, whereas I was just looking at the numbers. So we looked at some diphthongs that only exist in Luxembourgish - we had tried to choose something close when we took the English and French equivalents - and now we see the effect that we want: originally those segments went to English, which has diphthongs, or more diphthongs, and now they go to Luxembourgish. So we are happy, we've got the result we wanted. We should do some more work and look at more things, but we're happy with this result.
The second thing I wanted to mention is a bit about language change. This was an associated phonetic corpus-based study that was also presented at last year's Interspeech, by Maria Candeias. We were looking at three different phenomena that seem to be growing in French. The first is consonant cluster reduction - "explique", for example: you have the consonant cluster and you get rid of part of it; there are too many things to pronounce. The second is the palatalization and affrication of dental stops, which is a marker of social status in immigrant populations - and in fact, for me, when you hear the "tch" or "dj" they sound very normal, because we have them in English and I'm used to them. And the third one is fricative epithesis, which is at the end of a word: you get this "eesh"-type sound. Sorry. And I'll play you an example.
That was something I remember very distinctly from when I first came to France: I heard it all the time, and women did it - it was a characteristic of women's speech that at the end there was this "eesh". It's very common, but in fact it's now growing even in male speech. These examples were taken from broadcast data - people talking on the radio and the television - so you can imagine that if they're doing it, it's something that is now really accepted by the community; it really is a growing trend. Maria was looking at these phenomena over the last decade, and what we did was the same type of thing: we took a dictionary, and after final E's we allowed this "eesh"-type sound - we allowed different phonemes to go there - and then looked at the alignments and how many counts we got of the different occurrences. And so here we're just showing that between 2003 and 2007 this sound is becoming longer, and it has also increased in frequency by about twenty percent.
So now, the last thing I wanted to talk about is human performance. We all know that humans do better than machines on transcription tasks, and that machines have trouble dealing with variability that humans handle much better. So here is a plot based on some work by doctor ?? and his colleagues. We took stimuli sampled from what the recognizer got wrong - so everything you see here had a 100% word error rate for the recognizer; these were very confusable little function words, like "a" and "an" in English, things like that. And we played these stimuli to listeners: 14 native French subjects and 70 English subjects.
Each subject listened to the stimuli, and here you can see that if we give the humans just a local context - a three-gram context, which is what many recognisers have - they make thirty percent errors on these items, where the system was a hundred percent wrong. If we increase that context to five-grams, so two words on each side, the errors go down to about twenty percent. So this is nice, it's going in the right direction; the context seems to be helping a little bit. If we go up to seven- or nine-grams we do a little better still, but we still have about a fifteen percent error rate by humans on this task, so our feeling is that these items are intrinsically ambiguous given only a small context; we need a larger one. And just as a control we also put in some samples where the recognizer was correct - so now zero percent word error rate for the recogniser - and you see that the humans make very few errors there too, which reassures us that the higher human error rates are not just an experimental problem.
So I just want to play one more example, of a human misunderstanding. This comes from a French talk show; I think there are enough French people here who will follow it.
So the error that happens is that one speaker said something very close to "là aussi en" - I pronounce it poorly - and the other person misheard it. What was really interesting about this is that the correction came about twenty words later than when the person actually said it. The two people talking each had their own mindset and weren't really listening to the other one completely - and this is, once again, a broadcast talk show. I can play the longer sentence for people later if you're interested.
And so, my last slide. As a community, we are processing more languages and a wider variety of data. We are able to get by with less supervision, at least for some of the training data. We're seeing some successful applications with this imperfect technology. Something we would like to do is extend the use of the technology to other purposes. We still have little semantic and world knowledge in our models, and we still have a lot of progress to make, because those word error rates are still flying high and there are a lot of tasks out there. So maybe we need to do some deep thinking about how to deal with this. So that's all. Thank you.
We have time for some questions. No questions?
Hi Lori.
Hi Malcolm.
In the semi-supervised and unsupervised learning cases, do you have any sense of when things fail - why things converge or diverge?
We had some problems with some languages ... this is on broadcast data? We had some problems if you had poor text normalisation, or if you didn't do good filtering to make sure that the text data were really from the language you were targeting; then it can fail, it just doesn't converge. So that's one case, and in fact we had two languages where the problem was like that - basically the word segmentation wasn't good. I think if you have too much garbage in your language model, you're giving the system poor information. What amazes me - and we haven't done too much of this work ourselves at LIMSI yet - is that it still seems to work to some degree for the Babel data, where we're flying high with these word error rates and we have very little language model data; but what we have is probably correct, because it's the manual transcripts we're using, as opposed to the case where you're downloading data from the web and you don't really know what you are getting. So if you put garbage in, you get garbage out. That's why we need a human to supervise what's going on, at least to some extent.
So was that quantified to some extent?
I don't really have enough information. I know that for one of the languages we tried, basically you'd get some improvement but then you'd stagnate, maybe at the level of the second or third iteration, and fail to improve further. It didn't happen too often, and it's something I don't really have a good feeling for. Something I didn't talk about is text normalisation, which really is an important part of our work. It's considered, I think, grunt work, and people don't talk about it much.
Any more questions?
Well, if not, I would like to invite the organiser, our chairman, to express our appreciation to Lori. Let's thank her again.