0:00:15Thank you very much, Isabel; our feelings are shared. I'd like to thank the organisation here for inviting me to be a keynote speaker. It's really an honor. It's also a big challenge, so I hope I will get some messages to come across, well... to at least some of you.
0:00:35As Isabel said, I've been working in speech and speech processing for many years now, and today I'll focus mostly on the task of speech recognition. But first, a little bit of context.
0:00:46So.
0:00:48At LIMSI - being in Europe of course - we try to work on speech recognition in a multilingual context, processing at least a fair number of the European languages.
0:00:59This isn't really new. We are seeing sort of a regrowth in wanting to do speech recognition in different languages, but if you go back a long time ago there was some research there; we just didn't hear about it as much. And that's probably because there weren't too many common corpora and benchmark tests to compare results on, so papers were accepted more easily if you used common data. Which is still the case.
0:01:23And it's logical: you want to compare results to other people's results. But now there's more and more data out there, there are more test sets out there, and so we can do more comparisons and we're seeing more languages covered. So I think that's really nice.
0:01:37So I'll speak about some of our research results in the Quaero and Babel programs.
0:01:44Sure. Is that better? Sorry, OK. It was popping a bit this morning so I wanted to not be too close.
0:01:54So, I'll speak about highlights and research results from Quaero and Babel.
0:01:57And then I want to touch upon some activities I did with some colleagues at LIMSI and at the Laboratory of Phonetics in Paris, trying to use speech technologies for other applications, to carry out linguistic and corpus-based studies. I'll mention briefly a couple of perceptual experiments that we've done, and then finally some concluding remarks.
0:02:18So I guess we probably all agree in this community that we've seen a lot of progress over the last decade or two. We're actually seeing some technologies that are using speech and that are working; I think that's kind of fun, it's really nice.
0:02:32But we see it for a few languages, and as we heard yesterday from Haizhou, about 1% of the world's languages are actually seen in our proceedings, so we have something about them. That's pretty low, but it's up from the one or two that we did maybe twenty years ago. We're soaking up the sun.
0:02:51One of the problems, as I mentioned before, is that our technology typically relies on having a lot of language resources, and these are difficult and expensive to get. Therefore it's harder to get them for low-resourced languages, and current practice still takes several months to years to bring up systems for a language, if you count the time for collecting data, processing it, transcribing it and things like that.
0:03:14So this is sort of a step back in time: if we go back to, say, the late eighties, early nineties, we had systems that were basically command and control, dictation, and we had some dialogue systems, usually pretty limited to a specific task like ATIS (Air Travel Information System), travel reservation, things like that. Some of you here probably know them well, and some of you are maybe too young and don't even know them, because the publications are not necessarily scanned in online, so you don't see them.
0:03:39But in fact we're now seeing a regrowth in the same activities, when you look at voice mail and dictation in your phones and the personal assistants that are finally coming out now. And so it's really exciting to see this sort of pick-up again of what we saw in the past.
0:03:59And then of course we have some new applications, or for some people new applications, that are growing as we've got better processing capability, both in terms of computers and of data out there. So we have speech analytics of call center data or meeting-type data, lectures, there's a bunch of things. We have speech-to-speech translation, and also indexing of audio and video documents, which is used for media mining tasks, and companies are very interested in that. And of course, with speech analytics, people are really interested in finding out what people want to buy and trying to sell them things and stuff like that.
0:04:35So let me back up a little bit and talk about something else: why is speech processing difficult? All of us speak easily, and as was mentioned, it's sort of natural, we learn language. But I think any of us who learned a foreign language, at least as an adult, understands that it's a little bit harder than it seems. I learned French when I was... after my PhD, I won't say my age. And it wasn't so easy to learn, it wasn't so natural, and my daughter, who grew up in a bilingual environment, speaks French and English fluently and she's better in other languages than I am. I was a good unilingual American speaker, with no other contact with other languages.
0:05:17We need context to understand what's being said, so you speak differently if you're speaking in public than if you're speaking to someone who you know. We all know this. (??) On the projector screen, speech is continuous, and so if I'm talking to you I might say 'it is not easy' or 'it's not easy', but if I'm talking to my mother, 'it's not easy', I reduce that: 'it's not easy'. Well, it's not so clear where the words are, and I think we all know, once again, that humans reduce the pronunciation in regions that are of low information: where it's not very important, you put the minimum effort into saying what you want to say. And of course there are other variability factors that I'll also mention: the speaker's characteristics, accent, the context we're in. Humans do very well in adapting to this. Machines don't, in general.
0:06:08So here, since I am taking a step back in time, I wanted to play a couple of very simple samples.
0:06:20Is there anyone in this room that doesn't know this type of sentence? You've all heard it. Good! Okay. That's TIMIT, and that's going back a really, really long time. I was involved in some of the selection for TIMIT, but not in these sentences. Those were selected to elicit pronunciation variants for different styles of speaking, and you can hear that in the sample here: even in a very, very simple read text we have different realisations of the word 'greasy'.
0:06:51So in the first case we have an S, in the second case we have
0:06:53a Z.
0:06:53And we can see that... I can't really point to the screen there, I don't have a pointer. Do you have a pointer? I think I refused it before. So in any case, you can see in blue there that the S is quite a bit longer, and you can see the voicing in the Z. And is everyone here familiar with spectrograms? I sort of assumed the worst. It's okay, good.
0:07:14So here's another example, of more conversational-type speech, and we'll see that people interrupt each other. You can hear some of the hesitations.
0:07:40So in this example, it's the office corpus: participants called each other and they're supposed to talk about some topic that they were given, so they have a mutual topic. You're supposed to talk about it, but not everybody does. They don't know each other very well, but even so they still interrupted each other, they did some turn taking, and you can hear the hmms and laughter, and there's someone else in another presentation somewhere, I don't know where it is.
0:08:08Now I'm gonna play an example from Mandarin, and I'm trusting that Billy Hartman, who is probably in ?? with his wife, gave me the correct translation, the correct text here, because I don't understand it. This one is even more spontaneous; it's an example taken from the CallHome Mandarin corpus, where we think it's a mother and a daughter who are talking to each other about the daughter's job interview.
0:08:32So for those who speak Mandarin.
0:08:48So if I understood correctly, and the translation is correct, they're basically talking about the interview, and the mother doesn't understand what the job interview is about. The mother says: Don't speak to me in another language, speak to me in Chinese. And the daughter says: You wouldn't have understood anyway, even if I spoke in Chinese. And I've had some similar situations speaking with my mother. That's what I do.
0:09:13So now I'm gonna switch gears a little bit and talk about the Quaero program, which is one of the two topics I want to mainly focus on here, talking about speech recognition in different languages. This is a large research and innovation project in France which was funded by OSEO, the French innovation agency. It was initiated in 2004 but didn't start until 2008, and then ran for almost six years until the end of 2013, so it finished relatively recently. It was really fun,
0:09:45but when we started putting it together, the web was a lot different than it is now. As we also heard, I think it was this morning, there was no YouTube, no Facebook, no Twitter, no Google Books, no iPhones; all that didn't exist. So life was boring, what did we do with our free time, right? Instead of spending your time on the ??. I think it's hard to put yourself in the position of young people who don't know life without all of this, and my daughter grew up with all of it, so it's very hard for them to relate to what the situation really was. But in any case,
0:10:15to get back to the processing of this data: we have tons and tons of data. I read that there's roughly 100 hours of video uploaded to YouTube every minute. That's a huge amount of data, and in 61 languages. So if we are treating about 7 of them, we are not so bad; maybe we cover the languages of the people making the videos there. But we don't know how to organise this data, we don't know how to access this data. And so Quaero was aiming at this: how can we organize the data, how can we access it, how can we index it, how can we build applications that make use of today's technology and do something interesting with it? I'm not gonna talk about all that; if you're interested I suggest you go to the Quaero website, where you can find some demos and links and things like that. I'm gonna focus on the work that we did in speech processing, and at LIMSI we worked mostly on speech and text processing, including text processing applied to speech data, so named entities in text and speech, and translation, both of text and speech.
0:11:17So here I'm showing the speech processing technologies that we worked on in the project. The first box we have is audio and speaker segmentation: chopping up the signal and trying to decide speech and non-speech regions, and dividing it into segments corresponding to different speakers, detecting speaker changes. Then we may or may not know the identity of the language being spoken, so we have a language identification box if we don't know it.
0:11:47Most of the time we want to transcribe the speech data, because speech is ubiquitous, there's speech all over the place, and it has a high amount of information content, and so we believe it is the most useful. But then, we work in speech and not in image; image people might tell us that image is more useful for indexing this type of data. One advantage we have with speech relative to image, as I just mentioned, is that speech has an underlying written representation that we all pretty much agree upon, more or less, being able to decide what the words are. We might differ a little bit, but we pretty much agree upon it. With an image that's not the case: if you give an image to two different people, someone will tell you it's a blue house, someone will tell you it's trees in a park with a little blue cabin in it, something like that. You get a very different description based on what people are interested in and their manner of expressing things. For speech, in general, we're a little bit more normalized; we pretty much agree on what would be there.
0:12:42Then you might wanna do other types of processing, such as speaker diarization. This morning doctor ?? spoke about the Twitter statistics during the presidential elections, and that was something we actually worked on in Quaero: to look at a corpus of recordings, which might be hundreds or thousands of hours, and look at speaker times within it: how many speakers are speaking, when, and how much time is allocated to each. And that's actually something that has a potential use, at least in France, where they control that during the election period all the parties get the same amount of speaking time. So you want very accurate measures of who is speaking when, so that everybody gets a fair game during the elections.
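As a concrete illustration of that measurement, here is a minimal sketch, assuming you already have diarization output as (speaker, start, end) segments; the data layout and the function name are assumptions for the example, not a format used in Quaero.

```python
from collections import defaultdict

def speaking_time(diarized_segments):
    """Total speaking time per speaker, from diarization output (sketch).

    diarized_segments: iterable of (speaker_label, start_seconds, end_seconds).
    Returns a dict mapping each speaker label to total speaking time in seconds.
    """
    totals = defaultdict(float)
    for speaker, start, end in diarized_segments:
        totals[speaker] += end - start
    return dict(totals)

# Tiny usage example with made-up segments:
print(speaking_time([("spk1", 0.0, 12.5), ("spk2", 12.5, 20.0), ("spk1", 20.0, 25.0)]))
# -> {'spk1': 17.5, 'spk2': 7.5}
```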
0:13:23Other things that we worked on were adding metadata to the transcriptions: you might add punctuation or markers to make it more readable, you might want to transform numbers from words into digit sequences like in newspaper text, 'in 1997'. And you might want to identify entities or speakers or topics that can be useful for automatic processing, so you could put tags in where the named entities are.
0:13:50And then finally, the other box there is speech translation, typically based on the speech transcription, but we're also trying to work on having a tighter link between the speech and the translation portions, so that you don't just transcribe and then translate, but have a tighter relation between the two.
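To tie those boxes together, here is a minimal sketch of the kind of processing chain just described. Every component here (segmenter, language_id, recognizer, diarizer, translator) is a placeholder you would plug a real tool into; none of these names come from an actual toolkit or from the Quaero systems.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    start: float
    end: float
    speaker: Optional[str] = None
    language: Optional[str] = None
    transcript: Optional[str] = None
    translation: Optional[str] = None

def process(audio, segmenter, language_id, recognizer, diarizer, translator,
            target_language="English"):
    """Partition the audio, identify the language, transcribe, attach speaker
    labels, then translate the transcripts (sketch of the chain above)."""
    segments = segmenter(audio)                 # speech/non-speech + speaker change detection
    for seg in segments:
        seg.language = language_id(audio, seg)  # only needed if the language is unknown
        seg.transcript = recognizer(audio, seg, seg.language)
    diarizer(segments)                          # fill in seg.speaker across the recording
    for seg in segments:
        seg.translation = translator(seg.transcript, seg.language, target_language)
    return segments
```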
0:14:11Let me talk a little bit now about speech recognition. Everybody, I think, knows what's in the box; the main point is just that we have three important models: the language model, the pronunciation model and the acoustic model. And these are all typically estimated on very large corpora, which is where we're getting into problems with the low-resource languages.
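As a reminder of how those three models fit together, the standard decision rule can be written roughly as follows; this is a generic textbook form, not the specifics of any one system:

\[
\hat{W} \;=\; \arg\max_{W} \, P(W)\,P(X \mid W)
\;\approx\; \arg\max_{W} \, P(W) \sum_{Q \in \mathcal{Q}(W)} P(Q \mid W)\, P(X \mid Q)
\]

where \(P(W)\) is the language model, \(P(Q \mid W)\) the pronunciation model giving phone sequences \(Q\) for the word sequence \(W\), and \(P(X \mid Q)\) the acoustic model of the observed signal features \(X\).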
0:14:32And I want to give a couple of illustrations of why, at least I believe, and I've spent effort on this, the pronunciation model is really important: that we have the right pronunciations in the dictionary. So take these two examples: on the left we have two versions of 'coupon', and on the right we have two versions of 'interest'. In the case of 'coupon', in one version we have the 'y' glide inserted there, and our models for speech recognition are typically modeling phones in context. So we can see that if we have a transcription with just 'k u p' for it, we're gonna have this glide there, and it's not gonna be a very good match to the one we have in the second case. That's a really big difference, and also the U becomes almost a fronted EU, which is technically not distinctive in English. And the same thing for 'interest': we have 'interest' or 'interest'; in one case you have an N, and in the other case you have the TR cluster. These are very different, and you can imagine that, since our acoustic models are based on alignment of these transcriptions with the audio signal, if we have more accurate pronunciations we're going to have better acoustic models at the end, and that's our goal.
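To make that concrete, here is a minimal sketch (the phone strings are illustrative ARPAbet-style transcriptions, not LIMSI's actual phone set) of a pronunciation dictionary that lists both variants, so that forced alignment or decoding can pick whichever one matches the audio:

```python
# Minimal sketch of a pronunciation lexicon with variants.
LEXICON = {
    "coupon":   ["k uw p aa n",       # plain /u/ version
                 "k y uw p aa n"],    # variant with the inserted glide ("kyu-pon")
    "interest": ["ih n t r ah s t",   # two-syllable, reduced variant with the TR cluster
                 "ih n t er ah s t"]  # three-syllable variant
}

def variants(word):
    """Return all pronunciation variants for a word (empty list if unknown)."""
    return LEXICON.get(word.lower(), [])

if __name__ == "__main__":
    for w in ("coupon", "interest"):
        for pron in variants(w):
            print(f"{w}\t{pron}")
```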
0:15:49So now I want to speak a little bit about what's called lightly supervised learning; there are many terms being used for it now: unsupervised training, semi-supervised training, lightly supervised training. Basically one goal is, and Ann mentioned something about this yesterday, that maybe machines can just learn on their own. So here we have a machine: he's reading the newspaper, he's watching the TV, and he's learning. Okay, that's great, that's something we would like to happen. But we still believe that we need to put some guidance there, and so this is a researcher here trying to give some information and supervision to the machine that's learning.
0:16:28When we look at traditional acoustic modeling, we typically use between several hundred and several thousand hours of carefully annotated data, and once again, as I said before, this is expensive, so people are trying to look into ways to reduce the amount of supervision in the training process. I believe some people in this room are doing it: automating the process of collecting the data, automating the iterative learning of the systems by themselves, even including the evaluation, so having some data to evaluate on that is not necessarily carefully annotated. Most of the time it is, but there's been some work trying to use unannotated data to improve the systems, which I think is really exciting.
0:17:11So we talk about reduced supervision and unsupervised training, which has a lot of different names. The basic idea is to use some existing speech recognizer to transcribe some data, you assume that this transcription is true, then you build new models estimated on this transcription, and you reiterate. There's been a lot of work on it for about fifteen years now, and many different variants have been explored: whether to filter the data, whether to use confidence factors, do you train only on things that are good, do you take things in the middle range; there are many things you can read about. Something that's pretty exciting that we see in the Babel work is that even when we apply this to systems starting with a very high word error rate, it still seems to converge, and that's really nice.
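A minimal sketch of that loop might look like the following: decode, keep the hypotheses whose confidence passes a threshold, retrain, and repeat. The functions `decode` and `train_acoustic_model`, and the simple confidence-threshold filtering policy, are stand-ins for whatever recognizer and toolkit you actually use; none of this is specific to LIMSI's systems.

```python
def semi_supervised_training(seed_model, untranscribed_audio,
                             decode, train_acoustic_model,
                             iterations=4, min_confidence=0.7):
    """Iterative lightly/semi-supervised acoustic model training (sketch).

    decode(model, utt) is assumed to return (hypothesis_text, confidence).
    train_acoustic_model(pairs) is assumed to return a new model trained on
    (audio, transcript) pairs. Both are placeholders for a real ASR toolkit.
    """
    model = seed_model
    for _ in range(iterations):
        training_pairs = []
        for utt in untranscribed_audio:
            hyp, conf = decode(model, utt)            # automatic transcript
            if conf >= min_confidence:                # filter by confidence
                training_pairs.append((utt, hyp))     # treat the hypothesis as truth
        model = train_acoustic_model(training_pairs)  # re-estimate, then iterate
    return model
```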
0:17:54The first things I'll talk about are for the broadcast news case, where we have a lot of data we can use as supervision. By this I mean we're using language models that are trained on many millions of words of text, and this is giving some information to the systems. It's not completely unsupervised, which is why you see these different names for what's being done by different researchers; it's all about the same thing, but it's called by different names.
0:18:16And so here I wanted to illustrate this with a case study for Hungarian that we did in the Quaero program; it was presented at last year's Interspeech by A. Roy, so maybe some of you saw it. We started off with seed models, at this point up here at about eighty percent word error rate: seed models that come from other languages, five languages we took them from, so we did what most people would call cross-language transfer. These models came from, if I have it correctly, English, French, Russian, Italian and German, and we tried to choose the best match between the phone set of Hungarian and one of these languages.
0:18:54And then we used this model to transcribe about 40 hours of data, which is this point here. The size of the circle shows roughly how much data is used: this is forty hours, then we double again here and go to about eighty hours; this axis is the word error rate and this one is the iteration number. So as we go we increase the amount of data and increase the model size; we have more parameters in the models. The second point here is using the same amount of data but more context, so we built a bigger model: we once again took this model, re-decoded all the data, the forty hours, and built another model, and now we went down to about sixty percent, so we're still kind of flying high. We doubled the data again, to probably about a hundred and fifty hours, something like that, and got down to about fifty percent. These are all using the same language model; that wasn't changed in this study. And then finally, here, we used about three hundred hours of training data and we're down to about thirty or thirty-five percent. Of course, these were done with just standard PLP plus F0 features, and as everybody knows, pretty much everybody is now using features generated by an MLP.
0:20:07So we took our English MLP and generated features on the Hungarian data, so a cross-lingual transfer of the MLP models there, and we see a small gain, a little bit, with the amount of data fixed. Then we took the transcripts that were generated by this system here and we trained an MLP for the Hungarian language, and there we now get about a two or three percent absolute gain, and we're down to a word error rate of about twenty-five percent. That isn't wonderful, it's still relatively high, but it's good enough for some applications such as media monitoring and things like that.
0:20:46And so this was done with completely untranscribed data, and we did it with a bunch of languages. So now let me show you some results for, I think, about nineteen languages we did in Quaero (we did more, we did twenty-three, but this is only for nineteen of them). And if you look here and check, these were trained in a standard supervised manner, with somewhere between a hundred and five hundred hours of data depending upon the language.
0:21:13And the ones with the blue shading were trained in an unsupervised manner. Once again we have the word error rate on the left, and this is the average error rate across three to four hours of data per show, per language, sorry. And so we can see that while, in general, the error rates are a little bit lower for the supervised training, these aren't so bad; some of them are really in about the same range. You have to take the results with a little grain of salt, because some of these languages here might be a little bit less well trained, a little bit less advanced, than the lower-scoring languages; they might do a little bit better if we worked more on them.
0:21:56But this isn't the full story, so now I'm going to complicate the figure. In green you have the word error rate on the lowest file, that is, the audio file that had the lowest word error rate per language. Okay, so these files are easy, they're probably news-like files, and the error rate gets very low; even for Portuguese we're down around three percent for one of these segments. And then in yellow we have the worst-scoring one, and these worst-scoring files were typically more interactive, spontaneous speech: talk shows, debates, noisy recordings made outside; a lot of variability factors come in. So even though this blue curve is kind of nice, we really see we have a lot of work to do if we want to be able to process all the data up here.
0:22:44So now I'm going to switch gears, stop talking about Quaero, and talk a little bit about Babel, where it's a lot harder to have supervision from the language model, because in Babel you are working on languages that have very little data, or data that is hard to get; typically they have little data, though not all of them are really in that situation.
0:23:02And so this is a sentence I took from Mary Harper's slides that she presented at ??, and the idea being investigated is to apply different techniques from linguistics, machine learning and speech processing to be able to do speech recognition for keyword search. I highly recommend, for people that are not familiar with Mary's talks, that you see them. I know the ASRU one is online on Superlectures; the ?? one I don't know, so people here probably know better than me whether ?? is there. But they're really interesting talks, and if you're interested in this topic I suggest you go and look at them.
0:23:42So, keyword spotting. Yesterday Ann spoke about how children can do keyword spotting very young, so I wanna do a first test with you. By basic keyword spotting, what I mean is that you're going to localise in the audio signal the points where you have detected your keyword. So these two you detected right; here you missed it: it's the same word, the keyword occurred, but you didn't get it; and here you detected a keyword but it didn't occur, so that's a false alarm. So here you have the miss, the false alarm and the correct detections.
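Just to pin down what counts as a hit, a miss and a false alarm in that picture, here is a small sketch that matches hypothesized keyword detections against reference occurrences within a time tolerance; the half-second tolerance and the data layout are arbitrary illustrative choices, not the official scoring rule.

```python
def score_keyword(ref_times, hyp_times, tolerance=0.5):
    """Count hits, misses and false alarms for one keyword (sketch).

    ref_times: centre times (seconds) where the keyword really occurs.
    hyp_times: centre times where the system claims to have found it.
    A hypothesis within `tolerance` seconds of an unmatched reference is a hit.
    """
    unmatched_refs = list(ref_times)
    hits, false_alarms = 0, 0
    for t in sorted(hyp_times):
        match = next((r for r in unmatched_refs if abs(r - t) <= tolerance), None)
        if match is not None:
            hits += 1
            unmatched_refs.remove(match)
        else:
            false_alarms += 1
    misses = len(unmatched_refs)
    return hits, misses, false_alarms

# Example: two correct detections, one miss, one false alarm.
print(score_keyword(ref_times=[3.2, 10.5, 21.0], hyp_times=[3.3, 10.4, 15.0]))
# -> (2, 1, 1)
```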
0:24:15So now let me play you a couple of samples, and this is actually a test of two things at the same time. One is language ID, so I'm gonna play samples in different languages (there are two times six different languages), and there's a common word in all of these samples. And so I'd like people to let me know if you can detect this word, to see if we as adults can do what children can do.
0:24:58Do you want to hear it again? And can we make it a little louder? Is it possible to be a little bit louder on the audio, because I can't control it here. I don't think it goes any louder. I have it on the loudest.
0:25:31Okay, so I'll show you the languages first. Did anyone get the languages? There's probably a speaker of each language here, so you probably recognised your own language. The languages were: Tagalog, Arabic, French, Dutch, Haitian and Lithuanian.
0:26:08Shall I play it again? It's okay? Alright, so. Here's the second set of languages that we have there; the last one is Tamil. I'm not really sure about the end, whether 'taxes' was there in different places; Google Translate told us it was, but there might be some native speakers here that can tell us whether it is or not. To me it sounded like 'income taxes' and 'sales tax', but I don't really know. Google told us that it was 'to income from taxes and sale of taxes', or something like that. So anyway, did everyone catch the word 'taxes', or only some of you? 'Taxes' is one of those words that seems to be relatively common, and in many languages it's the same thing.
0:26:57Before talking about keyword spotting (I'm not gonna talk about it too much, actually), I wanted to show some results on conversational telephone speech. So we'll talk about a 'term' error rate here rather than word error rate, because for Mandarin we measure the character error rate rather than the word error rate: for English and Arabic we're measuring the word error rate, and for Mandarin the character error rate.
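For reference, the error rate in all these plots is the usual edit-distance measure, computed over words for English and Arabic and over characters for Mandarin:

\[
\mathrm{WER} = \frac{S + D + I}{N}
\]

where \(S\), \(D\) and \(I\) are the numbers of substituted, deleted and inserted tokens in the best alignment of the hypothesis against the reference, and \(N\) is the number of tokens in the reference (words or characters, depending on the language).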
0:27:17And these results are, I believe, for the NIST transcription task. The English systems are trained on about two thousand hours of data with annotations; the Arabic and Mandarin systems were probably trained on about two hundred or three hundred hours of data, which is quite a bit less. And we can see that the English system gets pretty good, down to about eighteen percent word error rate, while the Arabic is really quite high, about forty-five percent, maybe in part due to the different dialects and also maybe in part due to pronunciation modeling, because it's very difficult in Arabic if you don't have the diacritised form.
0:27:58At LIMSI we also work on some other languages, including French, Spanish, Russian and Italian, and these are just some results to show you that we're in roughly the same ballpark of error rates for these systems, once again for conversational speech, and these are trained on about a hundred to two hundred hours of data. Now let's go to Babel, which is much more challenging compared to what we see here, which is already harder than what we had for the broadcast-type data.
0:28:22And before that I just want to say a few words about what we mean by a low-resource language. In general, these days it means it has a low presence on the Internet. That's probably not a definition that ethnologists or linguists would agree upon, but I think in the technology community we're gonna say: if you cannot get any data, it's a low-resource language. It's got limited text resources, at least in electronic form; there is little, or some, but not too much audio data; you may or may not find some pronunciation dictionaries; and it can be difficult to find reliable knowledge about the language. If you google different things to find some characteristics about the language, you get three different people telling you three different things, and you don't really know what to believe.
0:29:08And one point I'd like to make is that this is true for what we're calling these low-resource languages, but it is also often true for different types of applications that we've dealt with in the past, even in well-resourced languages: you might not have any data on the type of task you're addressing.
0:29:27program
0:29:27and I'm roughly trying to give an idea of the characteristics of the language I'm
0:29:31sure that these
0:29:32are not really hundred percent correct.
0:29:34I tried to classify the characteristics into general classes and give it something we can
0:29:40easily understand
0:29:41and so for example we see the list of languages we ?? assume is make
0:29:46any better
0:29:47relatively closely related
0:29:48and
0:29:50Cantonese allow
0:29:52and Vietnamese
0:29:53that are used
0:29:55different scripts that's Bengali and Assamese share the same written script.
0:29:59We also have Pashto, which uses the Arabic script, where we have the problem of diacritization. And then we have Turkish, Tagalog, Vietnamese and Zulu; Zulu was actually very challenging because there we had clicks we needed to deal with. So they use different scripts, and some of the languages have tones: in this case we had four that had tone. We were also trying to classify the morphology into being easy, medium and hard (okay, I'm sure this is not very reliable), but basically three of them we consider to have a difficult morphology, and that was Pashto, Turkish and Zulu; for the others it's low.
0:30:41The next column is the number of dialects, and this is not the number of dialects in the language, it's the number of dialects in the corpus collected in the context of Babel. So in some cases we only had one, as in Lao and Zulu, but in other cases we had as many as five for Cantonese and as many as seven for Turkish. And then, once again, whether the G2P is easy or difficult: some of them are easy, some of them seem to be hard, in particular Pashto; and for Cantonese it's basically dictionary lookup over a limited character set.
0:31:16So here, in the last column, I'm showing the word error rates for the Babel languages, and it's shown in a different style. If you look at the top of the blue bar, that's the word error rate of the worst language; so in this case, in fact for both of them, the top of the blue is about fifty-some percent and sixty-some percent, that's Pashto. And the top of the orange is just showing you the range of the word error rates across the different languages. Actually, I said that backwards: this is the best language and this is the worst language. The top here is Pashto, which is about seventy percent in one case and fifty-five percent in the other, and this is the best, which I believe is Vietnamese and Cantonese. Sorry if I confused you there. And I'm wrong again with that too; I mixed up the keyword spotting numbers. So, I should've read my notes: the lowest word error rate was for Haitian and Tagalog, and the highest was for Pashto.
0:32:23And in this case we had what's called in our community (you'll see it in other papers) the Full LP, which means you have somewhere between sixty and ninety hours of annotated data for training, and there's the LLP, the limited language pack, which is only ten hours of annotated data per language, but you can use the additional data in an unsupervised or semi-supervised manner.
0:32:46So some of the research directions, which you've probably seen a fair number of talks about here, are looking into language-independent methods to develop speech-to-text and keyword spotting for these languages, and looking into multilingual acoustic modeling (yesterday there was a talk by the Cambridge people and there was also a talk from MIT), trying to improve model accuracy under these limited training conditions, using unsupervised or semi-supervised techniques for the conversational data, where we don't have much information coming from the language model (it's a very weak language model that we have), and trying to explore multilingual and unsupervised MLP training. Both of those have been pretty successful, whereas multilingual acoustic modeling using standard ?? HMMs has been a little bit less successful. One other thing we're seeing interest in is using graphemic models, because these can avoid the problem of having to do grapheme-to-phoneme conversion, and they reduce the problem of pronunciation modeling to something closer to the text normalization you have to do anyway for language modeling.
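To illustrate why graphemic lexicons collapse into text normalization, here is a minimal sketch assuming only a naive lower-casing normalization; real systems add language-specific handling of digraphs, diacritics, tone marks and script issues, so treat this purely as an illustration.

```python
import unicodedata

def graphemic_pronunciation(word):
    """Map a word to a 'pronunciation' that is simply its grapheme sequence.

    Sketch only: lower-case, keep alphabetic characters, and emit one unit per
    remaining character. Real graphemic lexicons need language-specific rules.
    """
    word = unicodedata.normalize("NFC", word.lower())
    graphemes = [c for c in word if c.isalpha()]
    return " ".join(graphemes)

for w in ["Coupon", "intérêt"]:
    print(w, "->", graphemic_pronunciation(w))
# Coupon -> c o u p o n
# intérêt -> i n t é r ê t
```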
0:33:51So now I wanted to talk just briefly about something we tried at LIMSI that didn't work. One of the languages is Haitian, so this is great, you know: we work in French, we've developed a decent French system, so why not try using French models to help our Haitian system? So the first thing we did was to try to run our French system on the Haitian data; it was a disaster, it was really bad. Then we took the French acoustic models with the language model for the Haitian data, but that also wasn't very good. Then we said, okay, let's try adding varying amounts of French data to the Haitian system. So this is the Haitian baseline: we have about seventy-some percent word error rate, seventy-two (??). If we add ten hours of French we get worse, about seventy-four or seventy-five. We add twenty hours, it gets worse again. We add fifty hours, it gets worse again. We said oops! This is not working, stop.
0:34:43This was work that we never really got back to; we wanted to look a little more into trying to understand better why this was happening. We don't know if it's due to the fact that the recording conditions were very different, and we don't know if there were really phonetic or phonological differences between the languages. Then we had another bright idea: okay, let's not use standard French data; we also have some accented French data from Africa, some data from North Africa, and I don't remember where the other was from. So we said let's try doing that. Same results: we took the ten hours of data we had and it basically degraded the same way. So we were kind of disappointed by the results and then dropped working on it for a while.
0:35:22We hope to get back to some of this again. There was a paper from KIT talking about using multilingual and bilingual models for recognition of non-native speech, and that actually was getting some gain, so I thought that was a positive result in contrast to our negative result here.
0:35:41One of the things we also tried was some joint models for Bengali and Assamese, because, being naive and not speaking these languages, we decided this was something we could try: put them together and see if we can get some gain. In one condition we got a tiny little gain from the language model trained on the pooled set of data, but really tiny, and the acoustic model once again didn't help us. And I heard that yesterday somebody commented on this, saying that they really are quite different languages and we shouldn't assume, just because we don't understand them, that they are very close. But we did have Bengali speakers in our lab and they told us they were pretty close, so it wasn't based on nothing.
0:36:20So let me just give a couple of results on keyword spotting, just to give you sort of an idea of what type of things we're talking about and what the results are. On the left part of the graph I give results from 2006, from the spoken term detection task that was run by NIST, which was done on broadcast news and conversational data. You can see that the measure used here is the MTWV, the Maximum Term-Weighted Value; I don't wanna go into it, but basically it's a measure of false alarms and misses, and you can put a penalty on the false alarms. The higher the number the better; on the other slide we wanted a lower number because it was word error rate, and on these ones we want a high number.
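For those who do want the definition, the term-weighted value used in the NIST spoken term detection evaluations is, roughly, the following; the MTWV is its maximum over the decision threshold \(\theta\), and \(\beta\) encodes the relative cost of false alarms:

\[
\mathrm{TWV}(\theta) = 1 - P_{\mathrm{miss}}(\theta) - \beta\, P_{\mathrm{FA}}(\theta),
\qquad
\mathrm{MTWV} = \max_{\theta}\, \mathrm{TWV}(\theta)
\]

where \(P_{\mathrm{miss}}\) and \(P_{\mathrm{FA}}\) are the miss and false-alarm probabilities averaged over the query terms.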
0:37:05And so we can see that for the broadcast news data it was about eighty-two to eighty-five percent, and for the CTS data it's pretty close, up around eighty. But if we look at the Babel languages now, we are down to between about forty-five and seventy-two percent. So once again the worst language is here, around forty-five percent, for full training, that is sixty to ninety hours of supervised training, and the best one goes up to about seventy-two percent. Now I'll look at my notes so I get these right: the worst language was Pashto and the best languages were Cantonese and Vietnamese. And this is now the limited condition, and you can see that you take a really big hit for the worst language here, but in fact on the best ones we're not doing so much worse. These systems were trained on the ten hours, and then the additional data was used in an unsupervised manner.
0:37:56And then there are a bunch of bells and whistles, a bunch of techniques used to get these keyword spotting results, that I didn't and won't talk about, but there are a lot of talks on it here that you can go to; I think there are two sessions tomorrow and maybe another poster. And once again there are Mary Harper's talks if you're interested in finding out more.
0:38:16So, some findings from Babel. You've seen that unsupervised training is helping at least a little bit, even though we have very poor language models. The multilingual acoustic models don't seem to be very successful, but there is some hope from research that's going on. The multilingual MLPs are a bit more successful, and there are quite a few papers talking about that. Something that we've used at LIMSI for a while, but was also shown in the Babel program, is that pitch features are useful even for non-tonal languages. In the past we used pitch for work on tonal languages and didn't use it the rest of the time; now I think a lot of people are just systematically using it in their systems. Graphemic models are once again becoming popular, and they give results very close to phonemic ones. And then for keyword spotting there are a bunch of important things: score normalization is extremely important (there was a talk at the last ASRU meeting), and dealing with out-of-vocabulary keywords, because when you get a keyword you don't necessarily know all those words, particularly when you have ten hours of data and the transcripts of that, so you've got a very small vocabulary. You have no idea what type of query a person will give, and you need to do something, do tricks, to be able to recognize and find these keywords in the audio. What's typically being investigated now is subword units and proxy-type things, and I'm sure you'll find papers on that here.
0:39:39So let me switch gears now, in my last fifteen, ten, minutes, okay, to talk about some linguistic studies. The idea is to use speech technologies as tools to study language variation and to do error analysis; there are two recent workshops on this that I listed on the slides. I'm gonna take a case study from Luxembourgish, the language of Luxembourg. This is work done closely with Martine Adda-Decker from MC, who is Luxembourgish, for those who don't know her. She says that Luxembourg is a really true multilingual environment, sort of like Singapore, and in fact it seems a lot like Singapore: the capital city has the same name as the country for both of them. Well, there's a bit of a difference: it's a little bit warmer here. But Luxembourg is about three times the size of Singapore and Singapore has about ten times as many people, so it's not quite the same.
0:40:34So the basic question we're asking for Luxembourgish is: given that it has a lot of contact with English, French and German, which language is the closest? There were a couple of papers that Martine is first author of, at different workshops; the most recent one was at the last SLTU. This is a plot showing the number of shared words between Luxembourgish and French, English and German: the bottom curve is English, the middle one is German and the top one is French. Along the x-axis is the size of the word list sorted by frequency, and on the y-axis is the number of shared words. You can see that at the low end we've got the function words, as we expect, those being the most frequent in the languages; then you get more general content words, and higher up you get technical terms and proper names. And you can see that in general there's more sharing with French than with German or English, at least at the lexical level. You have, once again, the highest amount of sharing when you get to technical terms, because these are shared across languages more generally.
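The computation behind such a curve is simple to reproduce; here is a minimal sketch, assuming you already have a frequency-sorted Luxembourgish word list and a vocabulary for the other language (the function name, inputs and cutoffs are assumptions for the example, not the setup used in the papers).

```python
def shared_word_curve(lux_words_by_freq, other_vocab, sizes=(1000, 10000, 100000)):
    """Count shared word forms between Luxembourgish and another language (sketch).

    lux_words_by_freq: Luxembourgish words sorted by decreasing frequency.
    other_vocab: collection of word forms of the other language (French, German, ...).
    Returns, for each cutoff N, how many of the top-N Luxembourgish words also
    appear in the other language's vocabulary.
    """
    other = {w.lower() for w in other_vocab}
    return {n: sum(1 for w in lux_words_by_freq[:n] if w.lower() in other)
            for n in sizes}
```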
0:41:44So what we tried to do, the question is: given that we have this similarity to some extent at the lexical level, is there this type of similarity at the phonological level? And so what we did is we took acoustic models from English, French and German, and we tried to do an equivalence between their IPA symbols and those of Luxembourgish. Martine defined the set of phones for Luxembourgish, and then we made a hacked-up pronunciation dictionary that would allow a language change to happen after any phoneme. This can get pretty big, you have a lot of pronunciations, because if you had three phones you'd be able to decide at each point to go to any of the other languages; you can see the illustration here, with the path you can go anywhere. And then we also trained a multilingual model on all three together: we took a subset of the English, French and German data and made what we called a pooled model.
0:42:36And so in the first experiment we tried to align the audio data with these models in parallel, so that the system could choose which acoustic model it likes best: English, French, German or pooled. And then we did a second experiment, where we trained a Luxembourgish model in an unsupervised manner, just like I showed for Hungarian, and said: now let's use that, and we replaced the pooled model with the Luxembourgish model. And of course our expectation is that once we put the Luxembourgish model in there it should get the data, so the alignment should go to that model; that's what we expect.
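A minimal sketch of the idea behind tallying which language "wins" is below. It simplifies the real setup, where the choice is made inside a single alignment pass with all models in parallel; here, `align_score(model, segment)` is a placeholder for a forced-alignment likelihood, and the per-segment argmax is only an approximation of that experiment.

```python
from collections import Counter

def model_preferences(segments, models, align_score):
    """Tally which language's acoustic model best fits each segment (sketch).

    `models` maps a language name to an acoustic model; `align_score` stands in
    for a real alignment likelihood function. Returns counts per language.
    """
    counts = Counter()
    for seg in segments:
        best_language = max(models, key=lambda lang: align_score(models[lang], seg))
        counts[best_language] += 1
    return counts
```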
0:43:12And so here's what we got. On the left is experiment 1, the one where we have the pooled model; on the right we have the Luxembourgish model. The top is German, then we have French, English, and pooled or Luxembourgish. And we were really disappointed: the first thing we see is that, first of all, Luxembourgish doesn't take everything, and second, we have pretty much the same distribution, there's very little change. So we said, okay, let's try to look at this a little bit more. Martine said, let's look at this more carefully, because she knows the language. And so we looked at some diphthongs which only exist in Luxembourgish (we had this ?? there), and we were trying to choose something close when we took English and French, and now we see the effect that we want: originally they went to English, which has diphthongs, or more diphthongs, and now they go to Luxembourgish. So we are happy, we've got some of the results we wanted. We should do some more work, looking at more things, but we are happy with this result.
0:44:07The second thing I wanted to mention is a bit about language change. This was an associated phonetic corpus-based study that was also presented last year at Interspeech, by Maria Candeias. We were looking at three different phenomena that seem to be growing in society. You have consonant cluster reduction, so 'explique', 'exclaim': you have 'eXCLaim' and you get rid of the ??; there are too many things to pronounce. There's the palatalization and affrication of dental stops, which is a marker of social status in the immigrant population; in fact, when you hear the 'cha' or 'ja' they sound very normal to me, because we have them in English and I'm used to it. And then the third one is fricative epithesis, which is at the end of a word, where you have this ?? type of sound. Sorry.
0:44:58And I'll play you an example.
0:45:03That was something that I remember very distinctly when I first came to France: I heard it all the time, and women did this; it was a characteristic of women's speech that at the end there's this 'eesh'. It's very common, but in fact it is now growing even in male speech. And these examples were taken from broadcast data, so this is people talking on the radio and the television, so you can imagine that if they're doing it, it's something that is really now accepted by the community; it really is a growing trend. So Maria was looking at these over the last decade, and what we did was the same type of thing: we took a dictionary and we allowed, after E's, this 'eesh'-type sound; we allowed different phonemes to go there, and then looked at the alignments and how many counts we got of the different occurrences. And here we're just showing that between 2003 and 2007 this is becoming longer, and it has also increased in frequency by about twenty percent.
0:46:05So now, the last thing I wanted to talk about was human performance. We all know that humans do better than machines on transcription tasks, and machines have trouble dealing with variability that humans deal with much better. So here is a plot, based on some work of doctor ?? and his colleagues, where we took sample stimuli from what the recognizer got wrong. So everything you see is at 100% word error rate for the recognizer; these were very confusable little function words, like 'a' and 'an' in English, things like that. And we played these stimuli to listeners: 14 native French subjects and 70 English subjects.
0:46:48And everyone who listened understood the stimuli
0:46:50And here you can see that if we give the humans just a local context, a trigram context, which is what many recognisers have, they make thirty percent errors on this, while the system was a hundred percent wrong. If we increase that context to five-grams, so more words on each side, they now go down to about twenty percent. So this is nice, it's going in the right direction; the context seems to be helping a little bit. And if we go up to seven- or nine-grams we do a little bit better, but we still have about a fifteen percent error rate by humans on this task. So our feeling is that these are intrinsically ambiguous given even a small context; we need a larger one. And just to have some control, we also put in some samples where the recognizer was correct, so zero percent word error rate for the recognizer, and you see the humans make very few errors also, which comforts us that it's not an experimental problem that we have higher rates for humans.
0:47:41So I just wanna play one more example, of a human misunderstanding. This comes from a French talk show; I think there are enough French people here who will follow it.
0:47:55And the other person...
0:48:01So the error that happens is that one speaker said 'là aussi', which is very close to 'là aussi en', which is very close to 'là aussi en'; I pronounce it poorly. And in fact what was really interesting about this is that the correction came about twenty words later than when the person actually said 'là aussi en'. So the two that were talking each had their own mindset and they weren't really listening to the other one completely, and this is, once again, a broadcast talk show. I can play the longer sentence for people later if you're interested.
0:48:35And so, my last slide: as a community we are processing more languages and a wider variety of data. We are able to get by with less supervision, at least for some of the training data. We're seeing some successful applications with this imperfect technology. Something we would like to extend is using the technology for other purposes. We still have little semantic and world knowledge in our models. And we still have a lot of progress to make, because those word error rates are still flying high and there are a lot of tasks out there, so maybe we need to do some deep thinking about how to deal with this.
0:49:13So that's all.
0:49:15Thank you.
0:49:20We have time for some questions?
0:49:35No questions.
0:49:48Hi Lori.
0:49:49Hi Malcolm. In the semi-supervised learning cases, do you have any sense of when things fail? Why things converge or diverge?
0:50:00We had some problems with some languages... this is on broadcast data? We had some problems if you had poor text normalisation, or if you didn't do good filtering to make sure that the text data were really from the language that you were targeting; then it can fail, it just doesn't converge. That's one case, and in fact we had two languages where the problem was like that; basically the word segmentation wasn't good. I think if you have too much garbage in your language model, you're going to be giving it poor information. What amazes me (and we haven't done too much of that work ourselves at LIMSI yet) is that it still seems to work to some degree for the Babel data, where we're flying with these word error rates and we have very little language model data, but probably what we have is correct, because we're using manual transcripts for it, versus the case where you're downloading data from the web but you don't really know what you're getting. So if you put garbage in, you get garbage out; that's why we need a human to supervise what's going on, at least to some extent.
0:50:58So was it quantified to some extent?
0:51:02I don't really have enough information. I know that for one of the languages that we tried, basically you'd get some improvement, but you'd stagnate, maybe at the level of the second or third iteration, and fail to improve further. It didn't happen too often, and it's something that I don't really have a good feeling for. Something I didn't talk about is text normalisation, which really is an important part of our work; it is something that is sort of considered, I think, front-end work, and people don't talk about it too much.
0:51:32Any more questions?
0:51:35Well, if not, I would like to invite the organiser, our chairman, to convey our appreciation to Lori. Let's thank her again.