0:00:15 | Thank you very much, Isabel, the feelings are shared. I'd like to thank the organisation |
---|
0:00:21 | here for inviting me |
---|
0:00:22 | to be a keynote speaker. It's really an honor. It's also a big challenge, so |
---|
0:00:28 | I hope I will |
---|
0:00:30 | get some messages across, well to ... at least some of you. |
---|
0:00:35 | As Isabel said I've been working in speech and speech processing for many years now |
---|
0:00:39 | and today I'll |
---|
0:00:40 | focus mostly on the task of speech recognition. But a little bit of context that |
---|
0:00:44 | I'll talk about first. |
---|
0:00:46 | So. |
---|
0:00:48 | At LIMSI - being in Europe of course - we try to work on speech |
---|
0:00:53 | recognition in a multilingual |
---|
0:00:55 | context, processing at least a fair number of the European languages. |
---|
0:00:59 | This isn't really new. We are seeing sort of a regrowth in wanting to do |
---|
0:01:03 | speech recognition in |
---|
0:01:04 | different languages, but if you go back a long time ago there was some |
---|
0:01:08 | research there. We just didn't hear about it as much. |
---|
0:01:10 | And that's probably because there weren't too many common corpora and benchmark tests to be |
---|
0:01:16 | able to compare results on, and so people tended to report on those; your papers were |
---|
0:01:20 | accepted more easily if |
---|
0:01:21 | you used common data. Which is still the case. |
---|
0:01:23 | And it's logical, you want to compare results to other people's |
---|
0:01:27 | results, but now there's more and more data out there, there's more test sets out |
---|
0:01:31 | there and so we |
---|
0:01:32 | can do more comparisons and we're seeing more languages covered. So I think that's really |
---|
0:01:36 | nice. |
---|
0:01:37 | So I'll speak about some of our research results in the Quaero and Babel programs. |
---|
0:01:44 | Sure. |
---|
0:01:46 | Is that better? Sorry, okay. The mic was popping a bit this morning so I wanted |
---|
0:01:52 | to not be too close. |
---|
0:01:54 | So, I'll speak about highlights and research results from Quaero and Babel. |
---|
0:01:57 | And then I want to touch upon some activities |
---|
0:01:59 | I did with some colleagues at LIMSI and at the Laboratory of Phonetics in Paris |
---|
0:02:03 | trying to use speech technologies in other |
---|
0:02:06 | applications, to carry out linguistic and corpus-based studies. |
---|
0:02:10 | And I'll mention briefly a couple of perceptual experiments that we've done, |
---|
0:02:14 | and then finally some concluding remarks. |
---|
0:02:18 | So I guess we probably all agree in this community that we've seen a lot |
---|
0:02:22 | of progress over the last |
---|
0:02:23 | decade or two decades. |
---|
0:02:25 | We're actually seeing some |
---|
0:02:27 | technologies that are using |
---|
0:02:29 | speech or are working on it. I think that's kind of fun, it's really nice. |
---|
0:02:32 | But we see it for a few languages |
---|
0:02:34 | and as we heard yesterday from Haizhou, who mentioned that about 1 % of the |
---|
0:02:40 | world's languages are |
---|
0:02:41 | actually seen in our proceedings, so we have something about them. |
---|
0:02:44 | That's pretty low, but it's |
---|
0:02:46 | up there from one or two that we did maybe twenty years ago. We're soaking |
---|
0:02:49 | up the sun. |
---|
0:02:51 | One of the problems I mentioned before is that our |
---|
0:02:54 | technology typically relies on having a lot of language resources and these are difficult and |
---|
0:02:59 | expensive to get, and therefore it's harder to get them for less-resourced languages, |
---|
0:03:03 | and current practice still |
---|
0:03:05 | takes several months to years to bring up systems for a language, if you count |
---|
0:03:09 | the time for collecting |
---|
0:03:10 | data and processing it and transcribing it and things like that. |
---|
0:03:14 | So this is sort of a step back in time and we'll see if we |
---|
0:03:17 | go back to say the late eighties, |
---|
0:03:18 | early nineties, we had systems that were basically command-and-control, dictation, and we had some |
---|
0:03:24 | dialogue systems, usually |
---|
0:03:25 | pretty limited to a specific task like ATIS (Air Travel Information System), |
---|
0:03:28 | travel reservation, things like that. |
---|
0:03:30 | Some of you here probably know well and some of you are maybe too young |
---|
0:03:33 | and don't even know it, because the |
---|
0:03:35 | publications are not necessarily scanned and put online, so you don't see them. |
---|
0:03:39 | But in fact we're now seeing a regrowth in the same activities. When you look |
---|
0:03:43 | at voice mail and |
---|
0:03:44 | dictation machines in your phones, and |
---|
0:03:51 | the personal assistants that we're seeing finally coming out now. |
---|
0:03:54 | And so that's really exciting to see this sort of a pick up again that |
---|
0:03:57 | we saw in the past. |
---|
0:03:59 | And then of course we have some new applications, or for some of us not-so-new applications, |
---|
0:04:02 | that are growing as we've got better processing capability both in terms of computers and |
---|
0:04:10 | data out there. And so we have speech analytics of call center data or |
---|
0:04:14 | meeting-type data, |
---|
0:04:14 | lectures, there's a bunch of things. |
---|
0:04:16 | We have speech-to-speech translation, and also indexing of audio and video |
---|
0:04:21 | documents, which is used for media mining |
---|
0:04:24 | tasks, and companies are very interested in that. |
---|
0:04:27 | And of course for the speech analytics people that are really interested in finding out |
---|
0:04:30 | what people |
---|
0:04:31 | wanna buy and trying to sell them things and stuff like that. |
---|
0:04:35 | So let me back up a little bit and talk about something. |
---|
0:04:40 | Why is speech |
---|
0:04:41 | processing difficult? All of us speak easily, and |
---|
0:04:46 | it was sort of mentioned that this is natural, we learn language, but |
---|
0:04:50 | I think any of us who learned a foreign language at least as an adult |
---|
0:04:54 | understands that it's a little bit |
---|
0:04:55 | harder than it seems, and so I learned French when I was ... |
---|
0:04:58 | ... after my PhD. I won't say my age. |
---|
0:05:02 | And it wasn't so easy to learn, it wasn't so natural and my daughter who |
---|
0:05:06 | grew up in a bilingual environment speaks |
---|
0:05:07 | French and English fluently and she's better in the other languages than I am. |
---|
0:05:10 | So I was a good monolingual American speaker, |
---|
0:05:15 | with no contact with other languages. |
---|
0:05:17 | We need context to understand what's being said, so you speak differently if you're speaking |
---|
0:05:22 | in public than |
---|
0:05:24 | if you're speaking to someone who you know. We all know this. |
---|
0:05:26 | ?? speech is continuous, and so if I'm |
---|
0:05:30 | talking to you I might say "it is not easy" or "it's not easy", |
---|
0:05:34 | but if I'm talking to my mother, "it's not easy". |
---|
0:05:38 | I reduce it to "it's not easy". Well, then it's not so clear where the words are, |
---|
0:05:43 | and so I think that we all know once again that humans |
---|
0:05:47 | reduce the pronunciation in regions of low information. Where it's not very important, |
---|
0:05:52 | you're |
---|
0:05:53 | putting the minimum effort into saying what you want to say. |
---|
0:05:56 | And of course there are other variability factors, |
---|
0:05:58 | such as the speaker's characteristics, accent, |
---|
0:06:01 | the context we are in. Humans do very well at adapting to this. |
---|
0:06:05 | Machines don't, in general. |
---|
0:06:08 | So here, since I am taking a step back |
---|
0:06:10 | in time I wanted to play a couple of very simple samples. |
---|
0:06:20 | Is there anyone in this room that doesn't know this type of sentence? |
---|
0:06:23 | You all heard it. |
---|
0:06:24 | Good! Okay. That's TIMIT, that's going back a really, really long time. |
---|
0:06:29 | I was involved in |
---|
0:06:32 | some of the selection for TIMIT, but not of these sentences. |
---|
0:06:34 | Those were selected to elicit pronunciation variants for |
---|
0:06:37 | different styles of speaking |
---|
0:06:39 | and you can hear that in the sample here |
---|
0:06:41 | even in a very, very simple read text |
---|
0:06:44 | we have different realisations of the word greasy. |
---|
0:06:51 | So in the first case we have an S, in the second case we have |
---|
0:06:53 | a Z. |
---|
0:06:53 | And we can see that ... |
---|
0:06:55 | I can't really point to the screen there. I don't have a pointer. Do you |
---|
0:06:58 | have a pointer? |
---|
0:06:59 | I think I've refused it before. |
---|
0:07:02 | So in any case you can see in blue there that S is quite a |
---|
0:07:04 | bit longer, and you can see the voicing in the Z. |
---|
0:07:06 | And is everyone here familiar with spectrograms? I sort of assumed the worst. It's okay, |
---|
0:07:12 | good. |
---|
0:07:14 | So here's another example of more conversational type speech. |
---|
0:07:18 | And we'll see that people interrupt each other. You can hear some of the hesitations. |
---|
0:07:40 | So this example is from a corpus where participants called each other |
---|
0:07:46 | and they're supposed to talk about some topic that they were given and they have |
---|
0:07:49 | a mutual topic. |
---|
0:07:50 | They're supposed to talk about it, but not everybody does. |
---|
0:07:53 | But even in this ... |
---|
0:07:55 | They don't know each other very well but they still interrupted each other. They did |
---|
0:07:58 | some turn taking |
---|
0:07:58 | and you can hear the |
---|
0:08:00 | hmms and laughter, and there's someone else in another |
---|
0:08:05 | presentation. I don't know where it is. |
---|
0:08:08 | Now I'm going to play an example from Mandarin, and I'm trusting |
---|
0:08:11 | that Billy Hartman, who is probably in ?? with his wife, |
---|
0:08:15 | gave me the correct translation, |
---|
0:08:17 | the correct text here, because I don't understand it. This is even more spontaneous, so it's |
---|
0:08:22 | an example |
---|
0:08:23 | taken from the CallHome Mandarin corpus, where we think it's a mother and a daughter |
---|
0:08:27 | who are talking to |
---|
0:08:29 | each other about the daughter's job interview. |
---|
0:08:32 | So for those who speak Mandarin. |
---|
0:08:48 | So if I understood correctly and the translation is correct |
---|
0:08:52 | they're basically talking about it, and the mother doesn't understand what the job interview is about, |
---|
0:08:57 | and the mother says: |
---|
0:08:58 | Don't speak to me in another |
---|
0:09:01 | language, speak to me in Chinese. And the daughter says: You wouldn't have understood anyway, |
---|
0:09:04 | even if I spoke in Chinese |
---|
0:09:05 | And so I've had some similar situations speaking with my mother |
---|
0:09:10 | That's what I do. |
---|
0:09:13 | So now I'm gonna switch gears a little bit and talk about the |
---|
0:09:17 | Quaero program which is one of the two topics I want to mainly focus on |
---|
0:09:22 | here, to talk about speech |
---|
0:09:23 | recognition in different languages. |
---|
0:09:25 | This is a large project |
---|
0:09:27 | in France, a research and innovation project |
---|
0:09:30 | which was funded by OSEO, the French innovation agency. |
---|
0:09:34 | It was initiated in 2004 |
---|
0:09:37 | but didn't start until 2008 and then ran for almost six years until the end |
---|
0:09:41 | of 2013 |
---|
0:09:41 | so it's relatively recent that it finished. It was really fun |
---|
0:09:45 | but when we started putting it together |
---|
0:09:47 | the web was a lot smaller than it is now. So as we also heard, |
---|
0:09:50 | I think it was this morning, there was no YouTube, |
---|
0:09:52 | no Facebook, no Twitter, no Google Books, iPhones. All that didn't exist. |
---|
0:09:57 | So life was boring, what did we do with our free time, right? |
---|
0:10:01 | Instead of spending your time on the ??. |
---|
0:10:03 | I think it's hard to be in the position of young people who don't know |
---|
0:10:07 | life without all of this |
---|
0:10:08 | and my daughter grew up with all of it. |
---|
0:10:10 | And so it's very hard to relate to what this situation really is |
---|
0:10:13 | but in any case |
---|
0:10:15 | to get back to sort of processing of this data |
---|
0:10:17 | we have tons and tons of data. I read that there's roughly 100 hours of |
---|
0:10:24 | video uploaded to YouTube every minute. |
---|
0:10:27 | And that's a huge amount of data, in 61 languages. So if we are treating |
---|
0:10:31 | about 7 of them, |
---|
0:10:32 | we are not so bad. Maybe we cover the languages most of the videos are in. |
---|
0:10:35 | But we don't know how to organise this data, we don't know how to access |
---|
0:10:39 | this data. |
---|
0:10:39 | And so Quaero was trying to aim at this. How can we organize the data, |
---|
0:10:43 | how can we access it, how can we index it |
---|
0:10:45 | how can we build applications that can make use of today's technology and do something |
---|
0:10:49 | interesting with it. I'm not gonna talk about all that. If you're interested I suggest |
---|
0:10:54 | you go to the Quaero website and |
---|
0:10:57 | you can find some demos and links and things like that |
---|
0:10:59 | there. I'm gonna focus on the work that we did in speech processing |
---|
0:11:03 | and at LIMSI we worked mostly on speech and text processing, |
---|
0:11:08 | including this applied to speech data, so named entity search in text and speech, |
---|
0:11:13 | and translation, |
---|
0:11:14 | both of text and speech. |
---|
0:11:17 | So here I'm |
---|
0:11:19 | showing the speech processing technologies that we |
---|
0:11:23 | worked on in the project. The first box we have is audio and speaker segmentation, |
---|
0:11:28 | that is, chopping up the |
---|
0:11:29 | signal and deciding on speech and nonspeech regions, |
---|
0:11:32 | and |
---|
0:11:33 | dividing into segments corresponding to different speakers |
---|
0:11:37 | detecting speaker changes |
---|
0:11:39 | then we may or may not know |
---|
0:11:40 | the identity of the language being spoken |
---|
0:11:43 | so we have a box of language identification if we don't know it. |
---|
0:11:47 | Most of the time we want to transcribe the speech data because |
---|
0:11:51 | speech is ubiquitous |
---|
0:11:52 | there's speech all over the place and it has a high amount of information content |
---|
0:11:55 | and so we believe it is |
---|
0:11:57 | the most useful. We work on speech and not on images. Image people |
---|
0:12:00 | might tell us that |
---|
0:12:01 | image is more useful for indexing this type of data. |
---|
0:12:06 | One advantage we have for speech relative to images, that I've just mentioned, |
---|
0:12:10 | is that speech has an underlying written representation that we've all pretty much agreed |
---|
0:12:17 | upon, |
---|
0:12:17 | more or less. We're able to decide where the words are. |
---|
0:12:19 | We might differ a little bit, but we pretty much agree upon it. With images that |
---|
0:12:22 | is not the case: |
---|
0:12:23 | if you give an image to two different people |
---|
0:12:25 | someone will tell you it's a blue house, someone will tell you it's trees in |
---|
0:12:28 | the park |
---|
0:12:29 | with a little blue cabin in it. Something like that. You get a |
---|
0:12:31 | very different description based on what people are interested in |
---|
0:12:35 | and their manner of expressing things. For speech, in general, |
---|
0:12:38 | we're a little bit more normalized we |
---|
0:12:40 | pretty much agree on what would be there. |
---|
0:12:42 | Then you might wanna do |
---|
0:12:44 | other type of processing such as speaker diarization. |
---|
0:12:47 | This morning doctor ?? spoke about the Twitter statistics during the presidential elections and |
---|
0:12:53 | that was something we actually worked on in Quaero. |
---|
0:12:56 | Which was to try and look at a corpus of recordings |
---|
0:12:59 | and look at speaker times within this corpus of recordings, you might have hundreds or |
---|
0:13:03 | thousands of hours of recordings and look at how many |
---|
0:13:05 | speakers are speaking when and how much time is allocated |
---|
0:13:08 | and that's actually something that has a potential use, at least in France, |
---|
0:13:11 | where they check that during the election period all the parties get the same amount of |
---|
0:13:15 | speaking time. |
---|
0:13:16 | So you want very accurate measures of who is speaking when, |
---|
0:13:20 | so that everybody gets a fair deal |
---|
0:13:21 | during the elections. |
---|
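To make the speaker-time idea concrete, here is a minimal sketch of totalling speaking time per speaker from diarization output; the segment layout and labels are invented for illustration, not the actual Quaero tooling.

```python
from collections import defaultdict

# Each diarization segment: (speaker_label, start_seconds, end_seconds).
# The segments and labels here are made up for illustration.
segments = [
    ("spk_A", 0.0, 12.4),
    ("spk_B", 12.4, 30.1),
    ("spk_A", 30.1, 41.7),
]

def speaking_time(segments):
    """Sum the duration of each speaker's segments across a recording."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)

print(speaking_time(segments))  # e.g. {'spk_A': ~24.0, 'spk_B': ~17.7}
```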
0:13:23 | Other things that we worked on were adding the metadata to the |
---|
0:13:27 | transcriptions, you might add punctuation or markers to |
---|
0:13:30 | make it more readable; you might want to transform numbers from |
---|
0:13:34 | words into digit sequences, like "1997" in newspaper text. |
---|
0:13:38 | And you might want to identify |
---|
0:13:41 | entities or speakers or topics that can be useful for automatic processing, so you could |
---|
0:13:47 | put tags in |
---|
0:13:48 | where the same entities are. |
---|
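As a toy illustration of one such normalization step, here is a sketch that rewrites simple spelled-out years as digits; the vocabulary and the single rule are invented, and real systems use much richer grammars.

```python
# Toy number normalization: rewrite simple spelled-out years like
# "nineteen ninety seven" as "1997". This vocabulary and rule set
# are just for illustration.
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}

def normalize_year(tokens):
    # Pattern: "nineteen" <tens> [<unit>]  ->  19XY
    if len(tokens) >= 2 and tokens[0] == "nineteen" and tokens[1] in TENS:
        value = 1900 + TENS[tokens[1]]
        if len(tokens) == 3 and tokens[2] in UNITS:
            value += UNITS[tokens[2]]
        return str(value)
    return " ".join(tokens)

print(normalize_year(["nineteen", "ninety", "seven"]))  # 1997
```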
0:13:50 | And then finally the other box there is speech |
---|
0:13:52 | translation typically based on the speech transcription |
---|
0:13:57 | but we're also trying to work on having a tighter link between the speech and |
---|
0:14:01 | the translation |
---|
0:14:01 | portions, so you don't just transcribe and then |
---|
0:14:05 | translate. |
---|
0:14:06 | But we're trying to have a tighter |
---|
0:14:08 | relation between the two. |
---|
0:14:11 | Let me talk a little bit now about speech recognition. |
---|
0:14:14 | Everybody, I think, knows this box, so basically |
---|
0:14:18 | the main point is just that we have three important models: the language model, |
---|
0:14:23 | the pronunciation model and the acoustic model. |
---|
0:14:25 | And these are all typically estimated on very large corpora. |
---|
0:14:28 | This is where we're getting into problems with the low resource languages. |
---|
0:14:32 | And I want to give a couple of illustrations of |
---|
0:14:35 | why, at least I believe (and I've spent effort on this), the pronunciation model |
---|
0:14:39 | is really important, that we have the right pronunciations in the dictionary. |
---|
0:14:42 | So we take these two examples, on the left we have two versions of coupon. |
---|
0:14:51 | And on the right we have |
---|
0:14:53 | two versions of interest. |
---|
0:14:56 | So in the case of "coupon", in one case we have the Y sound |
---|
0:15:00 | inserted there |
---|
0:15:01 | and our models for speech recognition are typically |
---|
0:15:04 | modeling phones in their context. |
---|
0:15:07 | And so we can see that if we have a transcription with just k. u. p. ... |
---|
0:15:11 | for it, |
---|
0:15:12 | we're going to have this Y there, and it's not going to be a |
---|
0:15:14 | very good match to the one that we have in the second case. |
---|
0:15:17 | That's really a very big difference, and also the U becomes almost a fronted EU, |
---|
0:15:22 | which is technically not |
---|
0:15:23 | distinguishable in English. |
---|
0:15:25 | And the same thing for "interest". We have "interest" or "int'rest". |
---|
0:15:29 | Well, in one case you have the N and in the other case you have the TR cluster. |
---|
0:15:32 | These are very different and you can imagine that if we ... |
---|
0:15:35 | since our |
---|
0:15:37 | acoustic models are based on alignment of these transcriptions with the audio signal |
---|
0:15:41 | if we have more accurate pronunciations we're going to |
---|
0:15:44 | have better acoustic models at the end and that's what our goal is. |
---|
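To make this concrete, a minimal sketch of a pronunciation dictionary with variants for the two words discussed; the phone symbols are informal stand-ins, not a real phone set.

```python
# Simplified pronunciation dictionary with variants, as discussed above.
# Phone symbols are informal stand-ins, not a real phone set.
lexicon = {
    "coupon":   [["k", "u", "p", "aa", "n"],        # "koo-pon"
                 ["k", "y", "u", "p", "aa", "n"]],  # "kyoo-pon", inserted Y
    "interest": [["ih", "n", "t", "r", "ax", "s", "t"],        # "in-trest"
                 ["ih", "n", "t", "ax", "r", "ax", "s", "t"]], # "in-ter-est"
}

# During forced alignment, the trainer picks, for each spoken word, the
# variant that best matches the audio, which gives cleaner acoustic models.
for word, variants in lexicon.items():
    for pron in variants:
        print(word, " ".join(pron))
```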
0:15:49 | So now I want to speak a little bit about |
---|
0:15:52 | what's called lightly supervised learning, there's many terms |
---|
0:15:55 | being used for it now. |
---|
0:15:56 | Unsupervised training, semi-supervised training, lightly supervised training |
---|
0:16:01 | and so |
---|
0:16:02 | basically one goal is that ... and Ann mentioned something |
---|
0:16:05 | about this yesterday, maybe machines can just learn on their own. |
---|
0:16:08 | So here we have a machine |
---|
0:16:09 | he's reading the newspaper, he's looking at the TV and he's learning. |
---|
0:16:14 | Okay that's great. |
---|
0:16:15 | That's something we would like to happen. |
---|
0:16:17 | But we still believe that we need to put some guidance there and so this |
---|
0:16:20 | is |
---|
0:16:21 | a researcher here trying to |
---|
0:16:23 | give some information and supervision to the machine that's learning. |
---|
0:16:28 | When we look at traditional acoustic modeling, we typically use from several hundred to |
---|
0:16:33 | several thousand hours of carefully annotated data, and once again, as I said before, |
---|
0:16:37 | this is expensive, |
---|
0:16:38 | and so people are trying to look into |
---|
0:16:40 | ways to reduce this, reducing the amount of supervision in the training process. |
---|
0:16:45 | And so |
---|
0:16:46 | I believe that some people in this room are doing it |
---|
0:16:50 | to automate the process of collecting the data. To automate the iterative learning of the |
---|
0:16:54 | systems by themselves |
---|
0:16:55 | even including the evaluation so having some |
---|
0:16:59 | data to evaluate on that is not necessarily carefully annotated; most of the |
---|
0:17:03 | time it is, |
---|
0:17:04 | but there's been some work trying to use |
---|
0:17:06 | unannotated data to improve the system |
---|
0:17:08 | which I think is really exciting. |
---|
0:17:11 | So we talk about reduced supervision and |
---|
0:17:13 | unsupervised training; there are a lot of different names in use. |
---|
0:17:16 | The basic idea is to use some existing speech recognizer, |
---|
0:17:19 | you transcribe some data, you assume that this transcription is true. |
---|
0:17:24 | Then you build new models estimated with this transcription, and you reiterate. There's been |
---|
0:17:28 | a |
---|
0:17:28 | lot of work on it for about fifteen years now |
---|
0:17:31 | and many different variants have been explored: whether to filter the data, whether to |
---|
0:17:36 | use confidence factors, do you |
---|
0:17:37 | train on things that are only good, do you take things in the |
---|
0:17:39 | middle range; there are many things you can read about. |
---|
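A minimal sketch of that self-training loop; `recognize`, `confidence` and `train` are caller-supplied stand-ins for a real recognizer, confidence estimator and trainer, so nothing here is the actual Quaero or Babel pipeline.

```python
def self_training(seed_model, audio, recognize, confidence, train,
                  iterations=4, threshold=0.7):
    """Sketch of unsupervised (self-training) acoustic model training.

    recognize/confidence/train are caller-supplied functions standing in
    for a real recognizer, confidence estimator and trainer.
    """
    model = seed_model
    for it in range(iterations):
        selected = []
        for utt in audio:
            hyp = recognize(model, utt)              # automatic transcript
            if confidence(model, utt, hyp) >= threshold:
                selected.append((utt, hyp))          # assume hyp is "true"
        model = train(selected)                      # re-estimate the models
        print(f"iteration {it}: kept {len(selected)} utterances")
    return model
```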
0:17:42 | Something that's pretty exciting that we see in the Babel work |
---|
0:17:46 | is that even now if we apply these to systems starting |
---|
0:17:49 | with very high word error rate, it still seems to be converging |
---|
0:17:52 | and that's really nice. |
---|
0:17:54 | The first things I'll talk about are going to be for broadcast |
---|
0:17:57 | news data, where we have a lot of |
---|
0:17:59 | data we can use as supervision. And by this I mean we're using language |
---|
0:18:02 | models that are trained on many |
---|
0:18:04 | millions of words of text and this is giving some information to the systems. It's |
---|
0:18:08 | not completely |
---|
0:18:08 | unsupervised, which is why you see these different names for |
---|
0:18:11 | what's being done by different researchers. |
---|
0:18:13 | It's all about the same but it's called by different names |
---|
0:18:16 | and so here I wanted to illustrate |
---|
0:18:19 | this with the case study for Hungarian that we did in |
---|
0:18:22 | the Quaero program and it was presented at last year's Interspeech, so maybe some of |
---|
0:18:26 | you saw it, by A. Roy. |
---|
0:18:29 | And we started off with |
---|
0:18:31 | having |
---|
0:18:33 | seed models, at this point up here at about eighty percent; the seed models come |
---|
0:18:36 | from other |
---|
0:18:37 | languages, five languages we took them from, so we did what most people would |
---|
0:18:41 | call cross-language transfer. |
---|
0:18:42 | These models came from if I |
---|
0:18:45 | have it correctly: English, French, Russian, Italian and German |
---|
0:18:48 | and we tried to just choose the best match between |
---|
0:18:50 | the phone set of Hungarian and the phone sets of these languages. |
---|
0:18:54 | And then we |
---|
0:18:56 | use this model here to transcribe about 40 hours of data which is this point |
---|
0:19:00 | here |
---|
0:19:01 | and this size of the circle |
---|
0:19:03 | is showing you |
---|
0:19:05 | roughly how much data is used |
---|
0:19:06 | so this is forty hours; then we double again here |
---|
0:19:09 | and go to about eighty hours. And this is the word error rate |
---|
0:19:12 | and this is the iteration number |
---|
0:19:14 | and so as we go we increase the amount of data, increase the model |
---|
0:19:20 | size; we have more parameters as the models grow, |
---|
0:19:23 | so the second point here is using |
---|
0:19:26 | the same amount of data, but using more context |
---|
0:19:28 | so we built a bigger model, so we once again took this model, we redecoded |
---|
0:19:32 | all the data the forty hours |
---|
0:19:33 | and built another model and so now we went down to about sixty percent, so |
---|
0:19:37 | we're still kind of high. |
---|
0:19:38 | we doubled the data again and we're probably about a hundred fifty |
---|
0:19:42 | hours, something like that. Then we got down to about |
---|
0:19:45 | fifty percent. These are all using the same language model, |
---|
0:19:48 | so that wasn't changed in this study, |
---|
0:19:50 | and then finally here we use about three hundred hours of |
---|
0:19:53 | training data and we're down to about thirty or thirty-five percent, |
---|
0:19:56 | and of course everybody knows that |
---|
0:19:59 | these were done with just standard PLP plus F0 features, |
---|
0:20:03 | and |
---|
0:20:04 | now pretty much everybody's using features generated by an MLP, |
---|
0:20:07 | so we took |
---|
0:20:08 | our English MLP, |
---|
0:20:09 | we generated features on the Hungarian data, so this is cross- |
---|
0:20:13 | lingual transfer of the |
---|
0:20:16 | MLP there, |
---|
0:20:17 | and we see a small gain there, |
---|
0:20:20 | with the amount of data fixed. |
---|
0:20:22 | and then we took the transcripts that were generated by this system |
---|
0:20:26 | here |
---|
0:20:26 | and |
---|
0:20:28 | we built an MLP |
---|
0:20:29 | training an MLP for the Hungarian language, and there we also now get about a two |
---|
0:20:33 | or three percent |
---|
0:20:34 | absolute gain and we're down to a word error rate of about twenty five percent |
---|
0:20:38 | which isn't wonderful |
---|
0:20:39 | it's still relatively high, but it's good enough for some applications such as media |
---|
0:20:43 | monitoring and |
---|
0:20:44 | things like that. |
---|
0:20:46 | And so this was done with completely untranscribed data, and we did it for a bunch of |
---|
0:20:49 | languages |
---|
0:20:50 | so now let me show you some results for the ... |
---|
0:20:54 | I think it's about nineteen languages we did in Quaero, we did more |
---|
0:20:56 | we did twenty-three, but this is only for nineteen of them. |
---|
0:21:00 | And if we look here, going up to Czech, |
---|
0:21:03 | these were trained in a standard supervised manner |
---|
0:21:05 | with somewhere between a hundred and five hundred hours |
---|
0:21:09 | of data depending upon the language. |
---|
0:21:13 | And the ones with the blue shading were trained in an unsupervised manner; |
---|
0:21:17 | once again we have the word error rate on the left and this is the |
---|
0:21:20 | average error rate across |
---|
0:21:22 | three to four hours of data per show, |
---|
0:21:25 | per language, sorry. |
---|
0:21:27 | And so we can see that while in general |
---|
0:21:30 | the error rates are a little bit lower for |
---|
0:21:33 | the supervised training, these aren't so bad, some of them are really in about the same |
---|
0:21:39 | range, and |
---|
0:21:40 | you have to take the results with a bit of a grain of salt, because |
---|
0:21:43 | some of these languages here |
---|
0:21:45 | might be a little bit less well trained or a little bit less well advanced |
---|
0:21:49 | than the lower scoring languages. These might be |
---|
0:21:52 | doing a little bit better if we worked more on them. |
---|
0:21:56 | But this isn't the full story so now I'm going to complicate the figure |
---|
0:22:00 | and in green you have |
---|
0:22:02 | the word error rate on the lowest file, that is, the audio file that had the |
---|
0:22:07 | lowest word error |
---|
0:22:08 | rate per language so these are in green. |
---|
0:22:11 | Okay, so these files are easy, they're probably news-like files, |
---|
0:22:15 | okay. |
---|
0:22:16 | And it gets very low, even Portuguese we're down around three percent for one of |
---|
0:22:19 | these segments |
---|
0:22:20 | and then in yellow we have the worst-scoring ones, |
---|
0:22:24 | and these worst-scoring files are |
---|
0:22:26 | typically more interactive, spontaneous speech: talk shows, debates, noisy recordings done off-site; |
---|
0:22:33 | there are a lot of variability factors that come in. |
---|
0:22:35 | So even though this blue curve is kinda nice we really see we have a |
---|
0:22:38 | lot of work to do |
---|
0:22:40 | if we want to be able to process all the data up here. |
---|
0:22:44 | So now I'm going to switch gears, not talking any more about Quaero, and |
---|
0:22:48 | talk a little bit about Babel. |
---|
0:22:50 | There it's a lot harder to have supervision from the |
---|
0:22:53 | language model, because in Babel you are working on languages |
---|
0:22:56 | that have data that is hard to get, or typically have little data, |
---|
0:22:59 | but not all of them |
---|
0:23:00 | are really in that situation. |
---|
0:23:02 | And so this is |
---|
0:23:06 | a sentence I took from Mary Harper's slides |
---|
0:23:09 | that she presented at ??, and so the idea |
---|
0:23:13 | that's being investigated is to apply |
---|
0:23:15 | different techniques from linguistics, |
---|
0:23:18 | machine learning and speech processing methods |
---|
0:23:20 | to be able to do speech recognition for keyword search. And I highly recommend, for |
---|
0:23:25 | people that are |
---|
0:23:25 | not familiar with Mary's talks, that you see them. |
---|
0:23:28 | I know that the ASRU one is online on Superlectures, |
---|
0:23:31 | and the ?? one I don't know, so people here probably know better than |
---|
0:23:34 | me if it is there. |
---|
0:23:35 | But they're really interesting talks, |
---|
0:23:38 | and if you're interested in this topic I suggest you |
---|
0:23:41 | go there. |
---|
0:23:42 | So, keyword spotting. Yesterday Ann said that children can do keyword spotting very young, |
---|
0:23:49 | and so I want to do a first test for you. By basic keyword spotting |
---|
0:23:54 | what I mean is that |
---|
0:23:55 | you're going to localise in the audio signal some points where you have |
---|
0:24:00 | detected your keyword. |
---|
0:24:02 | So these two you detected right, |
---|
0:24:03 | here |
---|
0:24:04 | you missed it, the same keyword occurred but you |
---|
0:24:08 | didn't get it, |
---|
0:24:09 | and here you detected a keyword that wasn't there, |
---|
0:24:11 | so that's a false alarm. So here you have the misses, the false alarms and the correct detections. |
---|
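A toy sketch of that scoring logic, matching detections to reference occurrences by time; the 0.5-second tolerance and the example times are invented.

```python
# Toy scoring of keyword detections against a reference, matching each
# detection to a reference occurrence within a time tolerance (seconds).
def score_detections(reference, detections, tolerance=0.5):
    matched = set()
    hits, false_alarms = 0, 0
    for t in detections:
        # Find an unmatched reference occurrence close enough in time.
        best = next((r for r in reference
                     if r not in matched and abs(r - t) <= tolerance), None)
        if best is not None:
            matched.add(best)
            hits += 1
        else:
            false_alarms += 1
    misses = len(reference) - len(matched)
    return hits, misses, false_alarms

print(score_detections(reference=[3.2, 17.8, 42.0], detections=[3.4, 41.7, 55.0]))
# (2, 1, 1): two hits, one miss, one false alarm
```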
0:24:15 | So now let me play you a couple of samples |
---|
0:24:18 | and this is actually a test of two things at the same time. |
---|
0:24:21 | One is language ID, so I'm going to play samples in |
---|
0:24:24 | different languages, and there are two times six different languages, |
---|
0:24:27 | and there's a common word in all of these |
---|
0:24:30 | samples. |
---|
0:24:30 | And so I'd like people to let me know if you can detect |
---|
0:24:33 | this word, to see if we as adults can do what children |
---|
0:24:37 | can do. |
---|
0:24:58 | Do you want to hear it again? |
---|
0:25:00 | And can we make it a little louder? |
---|
0:25:03 | Is it possible to be a little bit louder on the audio because I can't |
---|
0:25:06 | control it here. |
---|
0:25:09 | I don't think it goes any louder. |
---|
0:25:11 | I have it on the loudest. |
---|
0:25:31 | Okay, so I'll show you the languages first. Did anyone get the languages? There's probably a |
---|
0:25:35 | speaker of each |
---|
0:25:36 | language here, so you probably recognised your own language. |
---|
0:25:40 | So the languages were: Tagalog, Arabic, French, Dutch, Haitian and Lithuanian. |
---|
0:26:08 | Shall I play it again? |
---|
0:26:09 | It's okay? |
---|
0:26:11 | Alright, so. |
---|
0:26:12 | So here's this second set of languages that we have there, the last one is |
---|
0:26:16 | Tamil. I'm not really |
---|
0:26:17 | sure about the end, whether there was "taxes" in different places. Google Translate told us there |
---|
0:26:23 | was. |
---|
0:26:23 | But there might be some native speakers here that can |
---|
0:26:26 | tell us if that's right or not. To me it sounded like income taxes and |
---|
0:26:29 | sales tax. |
---|
0:26:30 | But I don't |
---|
0:26:33 | really know. Google told us that it was "to income from |
---|
0:26:35 | taxes and sale of taxes", or something like that. So anyway, |
---|
0:26:41 | basically, did everyone |
---|
0:26:43 | catch the word "taxes", or only some of you? |
---|
0:26:46 | Taxes is one of those words that seems to be relatively |
---|
0:26:50 | common and |
---|
0:26:51 | in many languages anyway it's the same thing. |
---|
0:26:57 | Before talking about keyword spotting, and I'm not going to talk about it too much actually, |
---|
0:27:00 | I wanted to |
---|
0:27:01 | show some results on conversational telephone speech. So we'll talk about term error rate here |
---|
0:27:06 | rather than word error rate, because for Mandarin we |
---|
0:27:09 | measure the character error rate rather than the word error rate. So for English |
---|
0:27:12 | and Arabic we're measuring word error rate and for Mandarin it's characters. |
---|
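For readers less familiar with the metrics, a minimal sketch: word and character error rates are both edit distances, computed over word or character sequences respectively; this is the standard definition, not a project-specific tool.

```python
# Minimal edit-distance-based error rate. Pass lists of words for WER
# or lists of characters for CER: errors = substitutions + insertions
# + deletions, divided by the reference length.
def error_rate(reference, hypothesis):
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n] / m

print(error_rate("it is not easy".split(), "it's not easy".split()))  # WER
print(error_rate(list("ABCD"), list("ABD")))                          # CER 0.25
```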
0:27:17 | And these results are, I believe, from the NIST evaluations |
---|
0:27:21 | for the |
---|
0:27:23 | transcription task, |
---|
0:27:23 | and English systems are trained on about |
---|
0:27:25 | two thousand hours of |
---|
0:27:28 | data with annotations. |
---|
0:27:30 | The Arabic and Mandarin systems were probably |
---|
0:27:32 | trained on about two hundred or three hundred hours of data. |
---|
0:27:34 | It's quite a bit less |
---|
0:27:35 | and we can see that the English system gets pretty good. We're down to about |
---|
0:27:39 | eighteen percent |
---|
0:27:40 | word error rate. The Arabic is really quite high, |
---|
0:27:43 | about forty five percent. Maybe in part due to different dialects |
---|
0:27:47 | and also maybe in part due to pronunciation modeling because |
---|
0:27:50 | it's very |
---|
0:27:51 | difficult in Arabic if you don't have the diacriticised form. |
---|
0:27:58 | At LIMSI we also work on some other languages, |
---|
0:28:00 | including French, Spanish, Russian, Italian and these are |
---|
0:28:03 | just some results to show you that we're sort of in the same ballpark |
---|
0:28:06 | of error rates |
---|
0:28:07 | for these systems, for once again conversational speech |
---|
0:28:10 | and these are trained on about a hundred to two hundred hours of data. |
---|
0:28:14 | Now let's go to Babel, which is very challenging compared to what we |
---|
0:28:17 | see here which is |
---|
0:28:18 | already harder than what we had for the broadcast-type data. |
---|
0:28:22 | And before that I just want to say a few words about what we mean by |
---|
0:28:26 | a low-resource language. So in general, |
---|
0:28:28 | these days it means it has a low presence on the Internet. |
---|
0:28:31 | That's probably not a definition ethnologists or linguists would agree |
---|
0:28:34 | upon, but I think in the technology community we're going to say: |
---|
0:28:37 | if you cannot get any data, it's a low-resource language. |
---|
0:28:40 | It's got limited text resources |
---|
0:28:42 | well at least in electronic form |
---|
0:28:45 | there is |
---|
0:28:46 | little or |
---|
0:28:47 | some, but not too much, audio data, |
---|
0:28:49 | you may or may not find some pronunciation dictionaries and it can be difficult to |
---|
0:28:54 | find |
---|
0:28:54 | reliable knowledge about the language: if you google different things to find some |
---|
0:29:02 | characteristics of the language, you get three different people telling you three different |
---|
0:29:05 | things and you don't really know what to believe. |
---|
0:29:08 | And one point I'd like to make is that this is true for what we're |
---|
0:29:12 | calling these low resource languages |
---|
0:29:13 | but it is also true many times for different types of applications that in the past |
---|
0:29:16 | we dealt with, |
---|
0:29:17 | even in well-resourced languages. You might not have any data for the type of |
---|
0:29:21 | task you're addressing. |
---|
0:29:22 | So here's an overview of the Babel languages for the first two years of the |
---|
0:29:27 | program |
---|
0:29:27 | and I'm roughly trying to give an idea of the characteristics of the languages, though I'm |
---|
0:29:31 | sure that these |
---|
0:29:32 | are not really a hundred percent correct. |
---|
0:29:34 | I tried to classify the characteristics into general classes, to give us something we can |
---|
0:29:40 | easily understand |
---|
0:29:41 | and so for example, in the list of languages we have Bengali and Assamese, |
---|
0:29:46 | which are |
---|
0:29:47 | relatively closely related |
---|
0:29:48 | and |
---|
0:29:50 | share the same written script, |
---|
0:29:52 | and Cantonese, Lao |
---|
0:29:53 | and Vietnamese, |
---|
0:29:55 | which use different scripts. |
---|
0:29:59 | We also have the Pashto |
---|
0:30:02 | which uses the Arabic script, the one where we have the problem of diacritization in |
---|
0:30:05 | it. |
---|
0:30:06 | And then we have |
---|
0:30:08 | Turkish, Tagalog, Vietnamese and Zulu, |
---|
0:30:12 | the last of which was actually very challenging because there we had clicks we needed to deal with. |
---|
0:30:16 | So they use different scripts, |
---|
0:30:18 | some of the languages have tones, and in this case we had four that had |
---|
0:30:22 | tone, |
---|
0:30:22 | we were trying to classify the morphology into being easy, hard and medium, |
---|
0:30:27 | okay, this is, |
---|
0:30:28 | I'm sure, not very reliable, but basically three of them we consider to |
---|
0:30:33 | have a difficult |
---|
0:30:33 | morphology so that was the Pashto, the Turkish and the Zulu. |
---|
0:30:39 | And the others are easier. |
---|
0:30:41 | The next column is the number of dialects; this is not |
---|
0:30:44 | the number of dialects in the language, this is the number of dialects in the |
---|
0:30:48 | corpus collected |
---|
0:30:48 | in the context of Babel. |
---|
0:30:50 | So in some cases we only had one as in Lao and Zulu, but in |
---|
0:30:53 | other cases we had, for Cantonese, as many as |
---|
0:30:55 | five, in Turkish as many as seven. |
---|
0:30:58 | And then once again whether or not |
---|
0:31:00 | the G2P |
---|
0:31:02 | is easy or difficult |
---|
0:31:04 | and so some of them are easy, some of them seem to be hard. |
---|
0:31:07 | In particular the Pashto |
---|
0:31:09 | and for Cantonese it's basically dictionary lookup, with a |
---|
0:31:14 | limited character set. |
---|
0:31:16 | So here, in the |
---|
0:31:18 | last column I'm showing the word error rates for |
---|
0:31:21 | the Babel languages, and it's shown in a different style. |
---|
0:31:24 | If you look at the top of the blue bar |
---|
0:31:26 | that's the |
---|
0:31:28 | word error rate |
---|
0:31:29 | of the |
---|
0:31:30 | worst language. So in this case, in fact for both of them, |
---|
0:31:34 | with the top of the blue |
---|
0:31:36 | this language here is about |
---|
0:31:38 | fifty-some percent and sixty-some percent, that's Pashto, |
---|
0:31:41 | and the top of the orange is just showing you the range of the |
---|
0:31:45 | word error rates across the different languages. |
---|
0:31:50 | I said this word error rate backwards: this is the best language |
---|
0:31:54 | and this is the worst language. The top here is Pashto, |
---|
0:31:57 | which is about seventy percent in one case and |
---|
0:31:59 | fifty five percent for another |
---|
0:32:02 | and this is the best which I believe is Vietnamese and Cantonese. |
---|
0:32:06 | Sorry, if I confused you there. |
---|
0:32:12 | And I'm wrong again with that too, I mixed it up with the keyword spotting. |
---|
0:32:15 | So, I should've read my notes: |
---|
0:32:16 | the lowest word error rate was for Haitian and Tagalog |
---|
0:32:21 | and the highest was for Pashto. |
---|
0:32:23 | And in this case we had what's called in our community, you can see it |
---|
0:32:27 | in other papers, the Full LP, |
---|
0:32:29 | which means you have somewhere between sixty and ninety hours of annotated data for training |
---|
0:32:33 | and |
---|
0:32:34 | there's the LLP, which is the low-resource condition, |
---|
0:32:37 | with only ten hours of annotated data per language, but you can use the |
---|
0:32:41 | additional data here |
---|
0:32:41 | in unsupervised or semi-supervised manner. |
---|
0:32:46 | So some of the research directions that you've probably seen a fair amount of talks |
---|
0:32:50 | about here |
---|
0:32:51 | are looking into language-independent methods |
---|
0:32:54 | to develop |
---|
0:32:55 | speech-to-text and keyword spotting for these languages, and looking into multilingual acoustic modeling. |
---|
0:33:00 | Yesterday there was some talk by the Cambridge people and there was also talk from |
---|
0:33:04 | MIT |
---|
0:33:05 | trying to improve model accuracy with these limited training conditions |
---|
0:33:10 | using unsupervised or semi-supervised |
---|
0:33:13 | techniques for the conversational data |
---|
0:33:15 | where we don't have too much |
---|
0:33:17 | information coming from the language model. |
---|
0:33:19 | It's a very weak language model that we have, |
---|
0:33:22 | and trying to explore multilingual and |
---|
0:33:24 | unsupervised MLP training. And both of those have been pretty successful |
---|
0:33:28 | whereas multilingual acoustic modeling using standard ?? HMMs is a little bit less |
---|
0:33:32 | successful. |
---|
0:33:32 | And one other thing that we're seeing |
---|
0:33:35 | interest in is graphemic models, because these could sort of avoid the problem |
---|
0:33:40 | of |
---|
0:33:42 | having to do grapheme-to-phoneme conversion, |
---|
0:33:42 | and it reduces the problem of pronunciation modeling to |
---|
0:33:45 | something closer to the text normalization you have to do anyway |
---|
0:33:48 | for language modeling. |
---|
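A minimal sketch of the graphemic idea: use each word's (normalized) letters as its "pronunciation", so no grapheme-to-phoneme rules are needed; the normalization choices here (lowercasing, stripping accents) are just one illustrative option.

```python
import unicodedata

# Graphemic lexicon: use each word's letters as its "phone" sequence,
# so no grapheme-to-phoneme rules are needed. The normalization below
# (lowercasing, stripping accents) is just one illustrative choice.
def graphemic_pronunciation(word):
    word = unicodedata.normalize("NFD", word.lower())
    return [c for c in word if unicodedata.category(c).startswith("L")]

for w in ["Quaero", "Babel", "Üzüm"]:
    print(w, " ".join(graphemic_pronunciation(w)))
# Quaero -> q u a e r o ; Babel -> b a b e l ; Üzüm -> u z u m
```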
0:33:51 | So now I wanted to talk just |
---|
0:33:53 | briefly about something that didn't work that we tried at LIMSI. So one of the |
---|
0:33:56 | languages is Haitian, so this is great, you know: |
---|
0:33:58 | we work in French, we developed a decent French system, |
---|
0:34:01 | so why not try using French models to help our Haitian system |
---|
0:34:05 | and so the first thing we did was to try to run our French system |
---|
0:34:09 | on Haitian data, it was a disaster |
---|
0:34:11 | it was really bad |
---|
0:34:12 | then we took the French models, |
---|
0:34:15 | acoustic models, with the language model for the Haitian data, but that also |
---|
0:34:18 | wasn't very good |
---|
0:34:20 | then we said okay let's try adding varying amounts of |
---|
0:34:24 | French data to |
---|
0:34:25 | Haitian system. So this is the Haitian baseline, so we have about |
---|
0:34:28 | seventy and some percent word error rate so seventy two ?? much yourself |
---|
0:34:33 | If we had ten hours of French we get worse, we got about seventy four |
---|
0:34:36 | or seventy five. |
---|
0:34:37 | We add twenty hours, it got worse again. |
---|
0:34:39 | We add fifty hours, it got worse again; we said oops! This is not working, |
---|
0:34:43 | stop, |
---|
0:34:43 | this was |
---|
0:34:45 | work that we never really got back to. We wanted to look a little more |
---|
0:34:48 | in trying to understand better why |
---|
0:34:49 | this was happening, we don't know if it's due to the fact that the recording |
---|
0:34:52 | conditions were very different, we |
---|
0:34:53 | don't know if there were really phonetic or phonological differences between the languages. |
---|
0:34:58 | And then we had another bright idea, let's just say: |
---|
0:35:01 | okay, let's not use standard French data; we also have some accented French data from |
---|
0:35:05 | Africa |
---|
0:35:06 | we have some data from North Africa, from |
---|
0:35:10 | I don't remember where the other was from |
---|
0:35:12 | and so we said let's try doing that. |
---|
0:35:14 | Same results. We took the ten hours of data we had, and basically |
---|
0:35:17 | it degraded the same way. |
---|
0:35:18 | So we were kinda disappointed by the results and then |
---|
0:35:21 | dropped working on it for a while. |
---|
0:35:22 | We hoped to get back to some of this again. There |
---|
0:35:24 | was a paper from KIT that was talking about using |
---|
0:35:30 | multilingual and bilingual models |
---|
0:35:32 | for recognition of non-native speech and that actually was getting some gain, so I thought |
---|
0:35:36 | that was |
---|
0:35:37 | a positive result, |
---|
0:35:39 | as opposed to our negative result here. |
---|
0:35:41 | Let me mention |
---|
0:35:44 | one more thing. |
---|
0:35:46 | One of the things we also tried was joint models for Bengali and Assamese, |
---|
0:35:50 | because, being naive and not speaking |
---|
0:35:52 | these languages, we decided this was something we could try: |
---|
0:35:54 | and put them together and see if we can get some gain. |
---|
0:35:56 | In one condition we got a tiny little gain, from the language model trained on both sets of |
---|
0:36:01 | data, but really tiny |
---|
0:36:03 | and the acoustic model once again didn't help us. |
---|
0:36:06 | And I heard that yesterday somebody commented on it |
---|
0:36:08 | saying that they really are quite different languages and we shouldn't be |
---|
0:36:11 | assuming, just because we don't understand them, that they are very close. |
---|
0:36:14 | But we did have Bengali speakers in our lab and they told us they were |
---|
0:36:18 | pretty close, |
---|
0:36:18 | so it wasn't based on nothing. |
---|
0:36:20 | So let me just give a couple of results on keyword spotting just to give |
---|
0:36:24 | you sort of an idea of |
---|
0:36:26 | what type of things we're talking about and what the results are. |
---|
0:36:29 | On the left part of the graph I give results |
---|
0:36:33 | from |
---|
0:36:35 | 2006, the spoken term detection task that was run by NIST, and it |
---|
0:36:39 | was done on easier |
---|
0:36:41 | cases than this one. This is on broadcast news and conversational data, and you can |
---|
0:36:45 | see that the measure that is used |
---|
0:36:47 | here is MTWV, the Maximum Term-Weighted Value, and |
---|
0:36:52 | I don't wanna go into it |
---|
0:36:53 | but basically it's a measure combining false alarms and misses, and you can put a |
---|
0:36:57 | penalty on it. |
---|
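For reference, the usual formulation from the NIST spoken term detection evaluations, where the penalty appears as a constant beta (commonly 999.9) and theta is the detection threshold; MTWV is the best TWV over all thresholds:

```latex
% Term-Weighted Value at decision threshold theta, and its maximum:
\[
\mathrm{TWV}(\theta) = 1 - \bigl[\,P_{\mathrm{miss}}(\theta) + \beta\,P_{\mathrm{FA}}(\theta)\bigr],
\qquad
\mathrm{MTWV} = \max_{\theta}\ \mathrm{TWV}(\theta)
\]
```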
0:36:58 | The higher the number the better; on the other slide |
---|
0:37:00 | we wanted a lower number because it was word error rate, |
---|
0:37:02 | and on these ones we want a high number. |
---|
0:37:05 | And so we can see that for the broadcast news data it was about eighty |
---|
0:37:08 | two or eighty five percent and for |
---|
0:37:10 | the CTS data it's pretty close up around eighty |
---|
0:37:13 | but if we look at the Babel languages now |
---|
0:37:16 | we are down to between forty-five and seventy or so. So once again now the |
---|
0:37:20 | worst language is here, which is around forty-five percent, for |
---|
0:37:24 | the full training condition, that is, sixty to ninety hours of supervised training, |
---|
0:37:27 | and the best one goes up to about seventy-two percent. |
---|
0:37:31 | Now I'll look at my notes so I get these right: the worst language was |
---|
0:37:35 | Pashto |
---|
0:37:36 | and the best languages were Cantonese and Vietnamese. |
---|
0:37:39 | And this is now the limited condition and you can see that you take a |
---|
0:37:43 | really big hit |
---|
0:37:44 | for the worst language here |
---|
0:37:47 | but in fact on the best ones, we're not doing so much worse. So these |
---|
0:37:51 | systems were trained on the ten |
---|
0:37:52 | hours, and then the additional data was used in an unsupervised manner, |
---|
0:37:56 | and then there's a bunch of bells and whistles and a bunch of techniques used |
---|
0:37:59 | to get this |
---|
0:37:59 | keyword spotting performance, which I didn't talk about and won't talk about, |
---|
0:38:02 | but there's a lot of talks on it that you'll see here that |
---|
0:38:05 | you can go to and I think there are two sessions tomorrow |
---|
0:38:07 | and maybe another poster. |
---|
0:38:09 | Once again, there are the talks from Mary Harper if you're interested in finding out more. |
---|
0:38:16 | So, some findings from Babel. You've seen that unsupervised training |
---|
0:38:22 | is helping a little bit at least even though we have very poor language models. |
---|
0:38:26 | The multilingual acoustic models don't seem to be very successful |
---|
0:38:29 | but there is |
---|
0:38:31 | some hope from research that's going on. |
---|
0:38:34 | The multilingual MLPs are a bit more successful, and there are quite a few papers talking about |
---|
0:38:38 | that. |
---|
0:38:39 | Something that |
---|
0:38:40 | we've used at LIMSI for a while, but was also shown in the Babel |
---|
0:38:44 | program, is that pitch features are useful even for non-tonal languages. |
---|
0:38:48 | In the past we used pitch for work on tonal languages, and we |
---|
0:38:51 | didn't use it all the time. |
---|
0:38:53 | And now I think a lot of people are just systematically using it in their |
---|
0:38:56 | systems. |
---|
0:38:56 | Graphemic models are once again becoming |
---|
0:38:59 | popular and they give results very close to phonemic ones |
---|
0:39:03 | and then for keyword spotting there's a bunch of important things, |
---|
0:39:06 | score normalization |
---|
0:39:09 | is extremely important; there was a talk at |
---|
0:39:11 | the last ASRU meeting |
---|
0:39:13 | and dealing with out-of-vocabulary keywords: basically, when you get a keyword you don't necessarily |
---|
0:39:17 | know all those words, particularly when you have ten hours of data and the transcripts of |
---|
0:39:21 | that: |
---|
0:39:21 | you've got a very small vocabulary. You have no idea what type of query a |
---|
0:39:25 | person will give, and you need to do something, do tricks, |
---|
0:39:28 | to be able to recognize and find these keywords in the audio |
---|
0:39:32 | and |
---|
0:39:33 | typically what's being investigated now is subword units |
---|
0:39:35 | and proxy-type things, and I'm sure you'll find papers on that here. |
---|
0:39:39 | So let me switch gears now in my last fifteen minutes, |
---|
0:39:44 | ten minutes. Okay, to talk about some linguistic |
---|
0:39:46 | studies and the idea is to use speech technologies |
---|
0:39:49 | as tools to study language variation, to do error analysis, |
---|
0:39:53 | there are two recent workshops that I listed on the slides. |
---|
0:39:58 | And I'm going to take a case study from Luxembourgish, the language of Luxembourg. |
---|
0:40:02 | This is work done closely with Martine Adda-Decker, who is Luxembourgish, for those |
---|
0:40:06 | who don't know her. |
---|
0:40:08 | She says that Luxembourg is really a true multilingual environment, sort of like Singapore, |
---|
0:40:13 | and in fact it seems a lot like Singapore. |
---|
0:40:15 | The capital city has the same name as the country for both of these. |
---|
0:40:19 | Well, they're a little bit different: |
---|
0:40:22 | it's a little bit warmer here. |
---|
0:40:24 | But Luxembourg is about three times the size of Singapore |
---|
0:40:29 | and Singapore has about ten times the amount of people. |
---|
0:40:32 | So it's |
---|
0:40:33 | not quite the same. |
---|
0:40:34 | So the basic question we're asking for Luxembourgish is: given that you've got a lot of |
---|
0:40:39 | contact with English, French and German, which language is the closest? |
---|
0:40:44 | And there were a couple of papers that Martine is first author |
---|
0:40:48 | of, |
---|
0:40:53 | at different workshops; the most recent one was at the last SLTU. |
---|
0:40:53 | This is a plot showing the number of shared words between Luxembourgish, French, English and |
---|
0:40:59 | German |
---|
0:41:00 | and so the bottom curve is English, |
---|
0:41:02 | the middle one is German and the top one is French |
---|
0:41:05 | and |
---|
0:41:06 | along the x-axis is the size of the word list sorted by frequency |
---|
0:41:10 | and on the y-axis is the number of shared words and so you can see |
---|
0:41:13 | that at the low end we've got the |
---|
0:41:15 | function words, as we expect, those being the most frequent in the languages, |
---|
0:41:18 | then you get more general content words. |
---|
0:41:22 | And higher up you get technical terms and proper names. |
---|
0:41:27 | And you can see that in general there's more sharing with French |
---|
0:41:31 | than with German or English at least at the lexical level. |
---|
0:41:35 | And you have |
---|
0:41:36 | once again the highest amount of sharing when you get |
---|
0:41:39 | technical terms and it's because these are shared across languages more generally. |
---|
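A sketch of how such a curve can be computed, assuming frequency-sorted word lists per language; the toy word lists below are invented.

```python
# Sketch of the shared-word curve: for increasing prefix sizes of a
# frequency-sorted Luxembourgish word list, count how many of those
# words also appear in another language's vocabulary. In practice the
# lists come from corpora; everything below is illustrative.
def shared_word_counts(lux_words_by_freq, other_vocab, sizes):
    other = set(other_vocab)
    counts = []
    for n in sizes:
        prefix = lux_words_by_freq[:n]
        counts.append(sum(1 for w in prefix if w in other))
    return counts

lux = ["de", "an", "ech", "mat", "restaurant"]        # toy, frequency-sorted
french = ["de", "la", "an", "avec", "restaurant"]     # toy vocabulary
print(shared_word_counts(lux, french, sizes=[2, 5]))  # [2, 3]
```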
0:41:44 | So what we tried to look at is the question: given that we have this |
---|
0:41:47 | similarity |
---|
0:41:53 | to some extent at the lexical level, is there this type of similarity at the |
---|
0:41:54 | phonological level? |
---|
0:41:54 | And so what we did we took acoustic models from English, French and German. |
---|
0:41:58 | We tried to do an equivalence between their |
---|
0:42:01 | IPA symbols |
---|
0:42:02 | and those in Luxembourgish. So Martine defined the set of |
---|
0:42:05 | phones for |
---|
0:42:06 | Luxembourgish |
---|
0:42:07 | and then we |
---|
0:42:08 | made a hacked-up pronunciation dictionary that would allow |
---|
0:42:12 | a language change to happen after any phoneme. |
---|
0:42:15 | This can get pretty big, you have a lot of pronunciations, because |
---|
0:42:18 | if you have three letters you're going to be able, at each point, to go |
---|
0:42:22 | to |
---|
0:42:22 | the other ones. You can see the illustration here with the path: you can go anywhere. |
---|
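A toy sketch of that dictionary expansion: every phone position may be realized by any language's model of that phone, so variants multiply quickly; the phone symbols and word are invented.

```python
from itertools import product

# Toy expansion of one word's pronunciation so each phone can come from
# any of the languages' acoustic models; a 3-phone word with 3 language
# choices per phone yields 27 tagged variants. All symbols are invented.
languages = ["EN", "FR", "DE"]
base_pron = ["k", "a", "t"]  # hypothetical Luxembourgish pronunciation

variants = [
    [f"{phone}_{lang}" for phone, lang in zip(base_pron, choice)]
    for choice in product(languages, repeat=len(base_pron))
]
print(len(variants))   # 27
print(variants[0])     # ['k_EN', 'a_EN', 't_EN']
```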
0:42:25 | And then we also trained a multilingual model |
---|
0:42:29 | on the three sets of data |
---|
0:42:30 | together: we took a subset of the English, French and German data and built |
---|
0:42:33 | what we called a pooled model. |
---|
0:42:36 | And so in the first experiment we tried to align |
---|
0:42:41 | the audio data with |
---|
0:42:43 | these three models in parallel, so that the system could choose which acoustic model it likes |
---|
0:42:47 | best: |
---|
0:42:48 | the English, French, German or pooled, |
---|
0:42:50 | and then we did a second experiment: we trained a |
---|
0:42:54 | Luxembourgish model in an unsupervised manner, just like I showed for Hungarian, and we said now |
---|
0:42:59 | let's use that, and we replaced the pooled model with the Luxembourgish model. |
---|
0:43:03 | And so of course our expectation is that once we put the Luxembourgish model in there |
---|
0:43:07 | it should get |
---|
0:43:08 | the data, so the alignments should go to that model; that's what we expect. |
---|
0:43:12 | And |
---|
0:43:12 | so here's what we got |
---|
0:43:14 | so on the left is experiment 1, the one where we have the pooled model. |
---|
0:43:16 | On the right we have the Luxembourgish model, and |
---|
0:43:20 | the top is |
---|
0:43:20 | German, then we have French, English, and pooled or Luxembourgish, and we were really |
---|
0:43:25 | disappointed: the first thing we see is, first of all, that Luxembourgish doesn't take everything, |
---|
0:43:29 | and second, we have pretty much the same distribution, there's very little change. |
---|
0:43:33 | So we said okay let's try and look at this a little bit more. Martine |
---|
0:43:36 | said let's look at this |
---|
0:43:37 | more carefully because she knows the language. |
---|
0:43:39 | And so we looked at
---|
0:43:42 | some diphthongs |
---|
0:43:43 | which only exist in Luxembourgish, and so we had this ?? card base. The system was trying
---|
0:43:47 | to choose something close when we took
---|
0:43:48 | English and French, and now we see the effect that we want:
---|
0:43:51 | originally they went to English, which has diphthongs, or more diphthongs,
---|
0:43:55 | and now they go to Luxembourgish. So we are happy, we've got some of the results we
---|
0:44:00 | wanted.
---|
0:44:01 | We should do some more work and look at more things, but we are happy with this result.
---|
0:44:07 | The second thing I wanted to mention is a bit about language change, and
---|
0:44:10 | this was an
---|
0:44:11 | associated corpus-based phonetic study that was also presented last year at Interspeech by Maria
---|
0:44:16 | Candeias.
---|
0:44:17 | And we were looking at three different phenomena that seem to be growing
---|
0:44:21 | in the society. First you have consonant cluster reduction, so for explique,
---|
0:44:26 | exclaim, you have eXCLaim:
---|
0:44:29 | you get rid of the ??.
---|
0:44:30 | There are too many things to pronounce.
---|
0:44:33 | Then the palatalization and affrication of dental stops, which is a marker of
---|
0:44:37 | social status in the immigrant population.
---|
0:44:40 | And in fact, for me, when you hear the cha or ja, they sound
---|
0:44:44 | very normal, because we have them
---|
0:44:46 | in English and I'm used to them; we do that in English. And then
---|
0:44:50 | the third one is
---|
0:44:52 | fricative epithesis,
---|
0:44:53 | which is at the end of a word: you have this
---|
0:44:55 | ?? type of sound. Sorry.
---|
0:44:58 | And I'll play you an example |
---|
0:45:03 | And that was something that I remember very distinctly when I first came to France:
---|
0:45:06 | I heard it all the
---|
0:45:07 | time, and women did this. It was a characteristic of women's speech that at the
---|
0:45:12 | end there's this eesh.
---|
0:45:13 | And it's very common |
---|
0:45:14 | but in fact it is now
---|
0:45:16 | growing more
---|
0:45:17 | even in male speech.
---|
0:45:20 | But these were examples that were taken from broadcast |
---|
0:45:24 | data, so this is people talking on the radio and the television |
---|
0:45:27 | so you can imagine that if they're doing it, it's something that is really
---|
0:45:29 | now accepted by the community. It's really a growing trend, and so
---|
0:45:33 | Maria was looking at these over the last decade, and so what we did was the
---|
0:45:36 | same type of thing. We took
---|
0:45:38 | a dictionary and we allowed, after Es, this eesh-
---|
0:45:41 | type sound. We allowed different phonemes to go there
---|
0:45:44 | and then looked at the alignments and how many counts we got of the different
---|
0:45:49 | occurrences.
---|
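A minimal sketch of the variant-counting idea; the lexicon format, the phone label "S" for the eesh-like fricative, and the aligner output below are all hypothetical stand-ins for the actual pipeline.

```python
from collections import Counter

def add_epithesis_variants(lexicon):
    """For each word, also allow a variant with a trailing fricative phone."""
    return {word: prons + [p + ["S"] for p in prons]
            for word, prons in lexicon.items()}

lexicon = add_epithesis_variants({"oui": [["w", "i"]],
                                  "merci": [["m", "E", "R", "s", "i"]]})

# Hypothetical forced-alignment output: the variant chosen for each token.
alignment = [("oui", ["w", "i"]), ("oui", ["w", "i", "S"]),
             ("merci", ["m", "E", "R", "s", "i", "S"])]

# Count how often the aligner picked the epithetic variant.
counts = Counter("epithesis" if phones[-1] == "S" else "plain"
                 for _, phones in alignment)
print(counts)  # Counter({'epithesis': 2, 'plain': 1})
```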
0:45:50 | And so here we're just showing that between 2003 |
---|
0:45:55 | and 2007 |
---|
0:45:57 | this sound is becoming longer
---|
0:45:59 | and it has also increased in frequency by about twenty percent.
---|
0:46:05 | So now, the last thing I wanted to talk about
---|
0:46:07 | was human performance and we all know that humans do better |
---|
0:46:10 | than machines on transcription tasks,
---|
0:46:12 | and machines have trouble dealing with variability that humans do much better with. |
---|
0:46:18 | So here is a plot; this is based on some work of Dr. ??
---|
0:46:23 | and his colleagues. |
---|
0:46:24 | Here
---|
0:46:26 | we took |
---|
0:46:27 | sample stimuli from what the recognizer got wrong. So everything you see is at
---|
0:46:32 | 100% word error rate by the recognizer: very confusable little function words,
---|
0:46:36 | like a
---|
0:46:38 | and an in English,
---|
0:46:41 | things like that.
---|
0:46:42 | And we played the stimuli to listeners: 14 native
---|
0:46:44 | French subjects and 70 English subjects.
---|
0:46:48 | And everyone who listened understood the stimuli |
---|
0:46:50 | and so here you can see that if we give the humans just a local context, a
---|
0:46:53 | three-
---|
0:46:53 | gram context, which is what many recognizers have,
---|
0:46:57 | they make thirty percent errors on this,
---|
0:46:59 | but the system was a hundred percent wrong. |
---|
0:47:01 | If we up that context to five-grams, so we've got one more word on each side,
---|
0:47:06 | they now go down to about twenty percent. So this is nice, going in the right
---|
0:47:09 | direction; the context is
---|
0:47:10 | helping us, it seems, a little bit.
---|
0:47:12 | And if we go up to seven- or nine-grams
---|
0:47:14 | we do a little bit better, but we still have about a fifteen percent error rate
---|
0:47:18 | by humans on this
---|
0:47:19 | task. So our feeling is that these are intrinsically
---|
0:47:21 | ambiguous given even a small context; we need a larger one.
---|
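A minimal sketch of how such context-limited stimuli can be cut out of a transcript; the sentence and error position are invented, whereas the real stimuli came from actual recognizer errors.

```python
def ngram_window(words, center, n):
    """Symmetric n-gram context around words[center] (n odd):
    3-gram = one word each side, 5-gram = two, and so on."""
    half = (n - 1) // 2
    lo = max(0, center - half)
    return words[lo:center + half + 1]

transcript = "and then a man and a woman came in".split()
center = 4  # the confusable function word "and"
for n in (3, 5, 7, 9):
    print(n, "->", " ".join(ngram_window(transcript, center, n)))
```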
0:47:26 | And just to have some control we also put in some samples where the recognizer |
---|
0:47:30 | was correct |
---|
0:47:30 | and here there's now zero word error rate for the recognizer, and you see the humans
---|
0:47:34 | make very few errors also,
---|
0:47:36 | which reassures us that it's not an experimental problem that we have higher rates for humans.
---|
0:47:41 | So I just wanna play one more example |
---|
0:47:43 | of a human misunderstanding.
---|
0:47:45 | It comes from a French talk show; I think there are enough French people here who will
---|
0:47:49 | follow it.
---|
0:47:55 | And the |
---|
0:47:56 | other person |
---|
0:48:01 | So the error that happens is that one speaker said là aussi,
---|
0:48:05 | which is very close to là aussi en.
---|
0:48:10 | I pronounce it poorly. |
---|
0:48:14 | And in fact what was really interesting about this |
---|
0:48:18 | is that the correction came about twenty words
---|
0:48:21 | after the person actually said là aussi en.
---|
0:48:24 | And so the two people who were talking each had their own mindset
---|
0:48:27 | and weren't really listening to each other completely, and this is, once again,
---|
0:48:30 | a broadcast talk show. |
---|
0:48:31 | I can play the longer sentence for people later if you're interested. |
---|
0:48:35 | And so my last slide |
---|
0:48:37 | is that as a community we are processing more languages and a wider variety of data.
---|
0:48:42 | We are able to get by with less supervision for at least some of the
---|
0:48:47 | training data.
---|
0:48:48 | We're seeing some successful applications |
---|
0:48:50 | with this imperfect technology. |
---|
0:48:53 | Something we would
---|
0:48:55 | like to extend is using the technology for other
---|
0:48:58 | purposes. We still have little semantic and world knowledge in our models.
---|
0:49:02 | And we still have a lot of progress to make, because those word error rates
---|
0:49:05 | are still very high and there are a lot of tasks out there,
---|
0:49:07 | and so maybe we need to do some deep thinking
---|
0:49:11 | about how to deal with this.
---|
0:49:13 | So that's all. |
---|
0:49:15 | Thank you. |
---|
0:49:20 | We have time for some questions? |
---|
0:49:35 | No questions. |
---|
0:49:48 | Hi Lori. |
---|
0:49:49 | Hi Malcolm. In the semi-supervised learning cases, do you have any sense
---|
0:49:53 | of when things fail?
---|
0:49:54 | Why things converge or diverge?
---|
0:50:00 | We had some problems with some languages ... this is on broadcast data? |
---|
0:50:04 | We had some problems if you had |
---|
0:50:06 | poor text normalisation or if you didn't do good filtering to make sure that the |
---|
0:50:09 | text data were really |
---|
0:50:10 | from the language that you were targeting |
---|
0:50:12 | it can fail; it just doesn't converge. So that's one case, and in fact we
---|
0:50:16 | had two languages where the
---|
0:50:17 | problem was like that: basically, the word segmentation wasn't good.
---|
0:50:20 | I think if you have |
---|
0:50:24 | too much garbage in your |
---|
0:50:26 | language model, you're going to be giving it
---|
0:50:28 | poor information. What amazes me, and we haven't
---|
0:50:31 | done too much of this work ourselves at LIMSI yet,
---|
0:50:35 | is that it still seems to work to some degree for the Babel data,
---|
0:50:39 | where the word error rates are very high and we have very little
---|
0:50:42 | language model data, but probably what we have is
---|
0:50:45 | correct, because we're using manual transcripts for it,
---|
0:50:48 | unlike the case where you're downloading data from the web but you don't really know what
---|
0:50:50 | you are getting.
---|
0:50:51 | And so if you put garbage in, you're getting garbage out. That's why we need |
---|
0:50:54 | a human to supervise what's
---|
0:50:55 | going on at least to some extent. |
---|
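As an illustration of the kind of filtering meant here, a toy character-set heuristic for keeping only lines that look like the target language; a real pipeline would use proper language identification, and the alphabet and threshold below are assumptions.

```python
# Toy "garbage in, garbage out" guard for downloaded language-model text:
# keep a line only if its characters mostly belong to the target alphabet.

TARGET_CHARS = set("abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü '-")  # e.g. French

def looks_like_target(line, threshold=0.9):
    """Keep lines whose non-digit characters mostly fit the target alphabet."""
    chars = [c for c in line.lower() if not c.isdigit()]
    if not chars:
        return False
    in_set = sum(c in TARGET_CHARS for c in chars)
    return in_set / len(chars) >= threshold

corpus = ["c'est vraiment très intéressant",
          "Это предложение не на французском"]
print([line for line in corpus if looks_like_target(line)])
```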
0:50:58 | So was it quantified to some extent?
---|
0:51:02 | I don't |
---|
0:51:03 | really have enough information, but I know that for one of the languages that we tried,
---|
0:51:07 | basically you'd get
---|
0:51:08 | some improvement, but you'd stagnate, maybe at the level of
---|
0:51:11 | the second or third iteration, and fail to improve further.
---|
0:51:15 | It didn't happen too often. |
---|
0:51:17 | And it's something that |
---|
0:51:20 | I don't really have a good feeling for. Something I didn't talk about was text
---|
0:51:24 | normalisation, which really is an important part of our work. It is something sort of
---|
0:51:27 | considered, I think, front-end work,
---|
0:51:28 | and people don't talk about it too much.
---|
0:51:32 | Any more questions? |
---|
0:51:35 | Well if not, I would like to invite the organiser, our chairman,
---|
0:51:41 | to express
---|
0:51:43 | our appreciation to Lori.
---|
0:51:46 | Let's thank her again. |
---|