So, good morning everyone. I'm going to give a sort of retrospective, covering some of the work we did before, and I hope it will all link up and be easy to follow. Basically I'll be talking about semi-supervised and unsupervised acoustic model training with limited linguistic resources. As most of you know, this draws on a lot of work from, actually, the last decade or so of research. I'm going to talk about some experience we've had at LIMSI with lightly supervised and unsupervised training, and give a couple of case studies; the first case study will actually be on English. Then I'll talk very briefly about different types of lexical units for modeling, such as graphemic units versus phonemic units in Babel. And, as already mentioned, I added a slide at the end about acoustic model interpolation, because we're talking about how to deal with all this heterogeneous data. I'll finish with some comments.
So, over the last decade or two we've seen a lot of advances in speech processing technologies. A lot of these technologies are getting out there from industrial companies and are becoming kind of commonplace for a lot of people, so people expect that this stuff really works. I think it's great that we're seeing it really get out there, but people's expectations are really high, and we still have the problem that our systems are pretty much developed for a given task and a given language. We still have a lot of work to do before we get good performance on other tasks and languages: as a community we only cover a few tens, maybe fifty or so, languages, and many times language variants are actually even considered different languages, because it's just easier to develop a separate system for each variant. We still rely on language resources a lot, although over the last decade or two we've been reducing the reliance on human intervention, so that we can build systems with a little bit less human work.
So I guess this is something everybody knows, or maybe not everybody, if there are some people here who don't work on speech recognition: our holy grail is technology that works on anything, that is independent of the speaker and of the task; noise is no problem, changing your microphone is no problem. And some would say, maybe fortunately for us researchers, this remains a dream. But we do have a lot lower error rates than we had a decade or two ago. We can process many more types of data, with different speaking styles and different conditions; originally the work we were doing always required read speech, and who really needs to recognise something that was read from a text? It doesn't seem that logical now when we look back at it. We cover more languages, and we have a fair amount of work on enriching the output to make the transcripts more usable, either by humans or by machines, which is not exactly the same thing: you might want quite different information if you're doing downstream processing by machines versus producing something for a human to read.
So what's a low-resource language? I don't really have an answer, but I think many of us in this community typically mean that there aren't many e-resources, so we don't find information online, because that's what we're using now to develop systems. If you speak to linguists I think you may get very different answers, and I don't really want to get into that.
But basically, we need to be able to find data if we want to develop systems. The languages I'm going to talk about are low-resource in the sense that ELRA and the LDC don't have resources that they distribute. Google probably has the data; can we get it online, and can we develop systems with data that we find online? I'm not really going to talk about the Babel-type languages or other rare languages where you don't even have writing conventions, where you don't necessarily have any information about the language except maybe from some linguists who have spoken to some speakers or visited. I guess you'll hear a bit more in that direction later on.
And of course there's the framework, a bit outside what I'll cover, of trying to do speech translation for languages with essentially no resources: little or essentially no available audio data, probably nothing in terms of dictionaries, not even necessarily word lists, and in general very limited knowledge about the language. But you can also consider that many types of data for well-resourced languages, or for language variants, are almost low-resource, because we just don't have much available data for them.
So let me take a little step back in time, to the late 1990s and early 2000s. One of the questions you get all the time from funding agencies is: how much data do you need? We tried to answer this, and I don't think anybody knows; the honest answer is that it depends where you want to be and what you want to do. The funding agencies were, at least at the time, always complaining that data collection is costly: why are you always asking to fund data collection? It's a recurrent question.
So these are curves we did back in 2000 showing, with supervised training on broadcast news data in English, how the word error rate behaves as a function of the amount of transcribed audio. (Let's see if I have a pointer... the red one... no? Anyway.) The really high point on the left comes from bootstrapping the system with a very small amount of audio, with a well-trained language model; with roughly ten minutes to an hour and a half of data the word error rate is around thirty-three percent, and as we add more data it goes down. Once we get to fifty or a hundred hours the curve sort of starts to plateau, so we're getting diminishing returns from additional data. And that's one thing we can say: once you have, you know, a hundred hours of data, you basically don't want to spend a lot of money on additional data, because you're just not getting much return.
Once again, this is on broadcast news data with a reasonably well-trained language model, so we see this asymptotic behavior of the error rate. Something we've observed in the community at large is that when you start a new task you get rapid progress, which is really fun because the error rates are dropping twenty or thirty percent a year and everyone is happy. But once you get to some reasonable level, we're getting about six percent per year; we did some counting, and if you look over, say, ten or fifteen years of progress, the average improvement seems to be about six percent per year. So the message was: additional data should cost less, and we need to learn how to use it with less supervision. That's roughly what we were saying back in 2000, and I think it's still quite relevant today.
So you can think about different levels of supervision. Back then people were saying we should use phonetic or phone-level transcriptions for training our phone models, which seems logical: it gives you more information than using words. And people did that. Our experience at LIMSI, when we ran some tests using TIMIT-type data, Switchboard, and BREF (a read-speech corpus in French), was that humans prefer the human phonetic transcriptions and segmentations, but the systems prefer the automatic ones. Basically, if you use the word-level transcription with a dictionary that covers a reasonable number of pronunciation variants, the systems were better than when they were trained on the manual phonetic transcriptions. Maybe that would not be true nowadays, I don't know, we haven't redone it, but it satisfied us enough to go ahead with the standard approach of aligning word-level transcriptions with the audio. Then the next step is to say: okay, we can have a large amount of carefully annotated data, where large means around a hundred hours or more.
Or we can have a little annotated data but a lot of unlabeled data with approximate transcriptions (I'll give some results on this); or no annotated data at all, but some sort of related texts or related information that we can find; or a small amount of annotated data that we use to bootstrap our systems. That last one is sort of semi-supervised, and it's what we heard a little about yesterday and what people have been doing: you transcribe raw data automatically, you say this is ground truth, and you do your standard training to build the models. This works.
There are lots of variants that have been published: you can filter, you can use confidence measures, you can use consensus networks, you can do ROVER, you can use lattices; lots of different variants, but the basic loop is the same (see the sketch below).
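Just to make that loop concrete, here is a minimal self-training sketch. The decode and train_acoustic_model callables are hypothetical stand-ins for a real decoder and trainer, not any particular toolkit's API.

```python
# Minimal self-training sketch: transcribe raw audio with the current models,
# treat the hypotheses as ground truth, retrain, and repeat.
# decode() and train_acoustic_model() are hypothetical callables supplied by
# the caller (they stand in for a real decoder and trainer).

def self_train(seed_model, raw_segments, language_model,
               decode, train_acoustic_model,
               n_iterations=3, min_confidence=None):
    """Iteratively transcribe raw audio and retrain on the hypotheses."""
    model = seed_model
    for _ in range(n_iterations):
        training_set = []
        for segment in raw_segments:
            # decode() is assumed to return a hypothesis plus a confidence score
            hypothesis, confidence = decode(model, segment, language_model)
            if min_confidence is None or confidence >= min_confidence:
                # the automatic hypothesis is treated as ground truth
                training_set.append((segment, hypothesis))
        model = train_acoustic_model(training_set)
    return model
```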
I listed some of the early work; in my recollection it was people involved in the EARS project and people involved in projects in Europe, and if I forgot people I'm sorry, I don't mean to, but those are the early adopters of this type of activity that come to mind.
So, going back to supervised training, and I think most people in this room know this so I won't spend a long time on it: you need to normalize the transcriptions, which you do for the language model anyway, so that's not so bad; you need to create a word list; you need to come up with phonemic transcriptions; and in the old days, when we spotted errors in the transcripts, we actually spent time correcting them, because we only had thirty or fifty hours and we thought it would gain us something. I think young people today wouldn't even think about doing this, but we spent a lot of time on it. And then you do your standard training.
so
this is this showing the results are using what we called semi supervised training
so you had a language model that was
trained on a certain amount of hours amazing the justice right okay so they the
manual word error rate was eighteen percent if we had a fully train system
we used closed captions as a language model one am showing here be done different
variance
so it's a sort of an approximate transcription that we had
and we took
in these numbers we started we
every now i think
ten hours of original data
and we then transcribe varying amounts of
unlabeled data so this is the raw unlabeled data
We can use that raw data unfiltered, which is close to unsupervised training, or we can do something semi-supervised, where we only keep the portions where there is not too much disagreement between the transcript we generate and the caption. So we filtered at a sort of phrase or segment level: we kept and trained on only the segments where the word error rate of the automatic transcription measured against the captions was less than some threshold; I don't remember exactly what it was, probably less than twenty or thirty percent. You can see that we get pretty close: within ten percent absolute of the system trained on the manual transcriptions, with both approaches. In fact, what we mostly do now is not bother filtering; it's easy, you just train on everything, and it seems to give about the same type of results.
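As an illustration of that filtering criterion, here is a small sketch; the caption is treated as an approximate reference, and the thirty percent threshold is only an example since I do not recall the exact value.

```python
# Sketch of the segment filtering: keep only segments where the automatic
# transcript agrees well enough with the closed caption.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def keep_segment(caption, hypothesis, max_wer=0.30):
    """Filter on the WER of the hypothesis against the approximate caption."""
    ref, hyp = caption.split(), hypothesis.split()
    wer = edit_distance(ref, hyp) / max(len(ref), 1)
    return wer <= max_wer
```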
There's a measure that was introduced by BBN called the word error rate recovery: basically you look at how much you gain from unsupervised training compared with how much you would gain from supervised training, relative to your initial starting point. What we get here is about eighty-five to ninety percent, so we're recovering most of what we could have gotten had we done supervised training.
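In other words, something like the following, where the numbers are only illustrative:

```python
# Word error rate recovery: how much of the gap between the bootstrap system
# and the fully supervised system is closed by the unsupervised training.
# The example values below are made up for illustration.

def wer_recovery(wer_seed, wer_unsup, wer_sup):
    """Fraction of the supervised gain recovered by unsupervised training."""
    return (wer_seed - wer_unsup) / (wer_seed - wer_sup)

# e.g. seed 33%, unsupervised 20%, supervised 18% -> about 0.87 (87% recovery)
print(wer_recovery(0.33, 0.20, 0.18))
```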
One problem with this work is that there is already some knowledge in the system: we had prior knowledge from the dictionary, and we had a pretty good language model that was close to the data; it wasn't exactly the same data, but it was close. We discussed this, I think at an EARS meeting or maybe a conference, mostly with Rich Schwartz, and said: well, let's take it to an extreme, let's see if we can use one hour of training data, or ten minutes of training data. We were crazy enough to do it, and at the time it was a lot of computation, because every time you evaluate with a different language model or a different amount of data you have to decode the raw data again, potentially multiple times. These days it would be very easy to redo these experiments, but at the time it took a while.
So here we see that if we start with a ten-minute bootstrapped system, we get a word error rate of sixty-five percent; we actually did this not thinking it would work, but it was okay. We then just took some data, three to four hours of automatically transcribed audio; in fact we threw away the original ten minutes, because it was just more complicated to merge them in when building the models. With those three to four hours of automatic transcripts we go down to about fifty-four percent, and we keep going down; we stopped at around a hundred and forty hours, where we got thirty-seven point four percent. If you use the same language model with the full training data, supervised, you get to about thirty percent. So we're getting a pretty good part of the way to where we'd like to be, and we were happy with that.
At about this time a student came to work with us at LIMSI. We couldn't really apply this method to his work, because we didn't have enough audio data, but we did try to look at questions we'd been asking for a long time: how much data do you need to train models, what improvement in performance can you expect when you have limited resources, what's more important, audio data or text data, and how can you improve the language models when you have very little data. This was around 2004, if I remember correctly, and we had what we considered reasonably small amounts of data: thirty-seven hours of audio, which was not bad but not a lot, and about five million words of text, and we knew essentially nothing about the language.
One of the first things he did was to look at the influence of the audio transcripts versus additional text data on the out-of-vocabulary rate. On the left we have the OOV rate: the curves correspond to two hours of transcripts, ten hours, and the full thirty-five hours (about seven thousand, twenty thousand, and fifty thousand distinct words respectively), and they show how the OOV rate goes down as you add more text data. On the bottom x-axis we add in different amounts of text data out of the roughly five million words of text sources. If you start with just two hours of transcripts, adding ten thousand words of text doesn't really lower the OOV rate very much; adding a hundred thousand lowers it a little bit more, et cetera. If you have ten hours of transcripts there isn't much of an effect at all. So you can see that the effect of adding text data is smaller than the effect of adding audio transcripts. There is a caveat, because there is probably some mismatch: we know the audio transcripts are the same type of data we're trying to recognize, while the text data is related but not really the same.
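For reference, the OOV rate here is just the fraction of held-out tokens not covered by the vocabulary; a toy sketch, where all the data stand-ins are invented:

```python
# Toy illustration of the OOV measurement: build a vocabulary from the acoustic
# transcripts plus some amount of extra text, then measure the out-of-vocabulary
# rate on a held-out set.

def oov_rate(dev_tokens, vocabulary):
    """Fraction of dev tokens not covered by the vocabulary."""
    misses = sum(1 for tok in dev_tokens if tok not in vocabulary)
    return misses / max(len(dev_tokens), 1)

def vocab_from(*token_sources):
    vocab = set()
    for tokens in token_sources:
        vocab.update(tokens)
    return vocab

# transcripts_2h, web_text_100k, dev_tokens would come from the corpora:
# vocab = vocab_from(transcripts_2h, web_text_100k)
# print(oov_rate(dev_tokens, vocab))
```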
Then here's another curve where we look at the amount of audio data versus text data for the language model; it's a little bit complicated. The top curves use just two hours of audio data for the acoustic model, the green ones use ten hours, and the red thirty-five. You can see that even as you add more text data, you're not really improving the word error rate much. So we asked: is the gain coming from the acoustic data or from the transcripts? We know the text data is less close to the task. The purple and blue curves add the transcripts from the ten hours and the thirty-five hours, and we can see that if you only have two hours of audio data the system is just not going to do very well, you need more; once you get to ten hours, the additional improvement is smaller. This is interesting because it is close to what is currently being used in the Babel project, where we're actually working with ten hours for some of the conditions.
Let me spend a few minutes on some other work he did on the language modeling side, on Amharic word decomposition. Amharic has a very rich morphology, with lots of word forms and therefore high out-of-vocabulary rates. A related problem for language modeling is that such vocabularies are not very well modeled, so it's interesting to use word decompounding. When you look at the literature you see mixed results across languages: sometimes you get a nice gain, and sometimes, even if the OOV rate goes down, you don't get a gain in word error rate. One idea we had in this work was to try to avoid, when generating the decomposition, creating easily confusable units. That's what I'm going to give a couple of ideas about. To do this we built matched conditions: we retrained language models and acoustic models for all the conditions.
We used the Morfessor algorithm, which was relatively recent at the time; basic Morfessor splits the words. Then there is a method usually referred to as Harris-style segmentation, where you basically look at the number of distinct letters that can follow a given prefix string: that gives you an idea of the perplexity at that position. If a lot of different letters can follow, you're likely at a unit boundary, and if not, you're likely to be within the same unit. We also tried using distinctive-feature properties to bring some speech information into the decomposition, and we looked at phonemic confusion constraints generated from phone alignments: basically, if splitting a word would create units that differ only by easily confusable phonemes, the split is not allowed; if the units are not easily confusable, the split is okay. The idea of these constraints is relatively language-independent, but of course you do need to know the phonemes of the language, or at least have some phoneme set.
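Here is a toy version of the Harris-style successor-variety idea, with the confusion constraint reduced to a hypothetical callback; the real systems worked on phonemized forms and used confusion statistics from phone alignments, so this is only a simplification for illustration.

```python
# Toy successor-variety (Harris-style) word splitting with an optional
# confusability veto. The confusable() hook is hypothetical.

from collections import defaultdict

def successor_counts(vocabulary):
    """For every prefix seen in the vocabulary, count the distinct next letters."""
    successors = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word)):
            successors[word[:i]].add(word[i])
    return {prefix: len(nxt) for prefix, nxt in successors.items()}

def split_word(word, succ, threshold=3, min_len=2, confusable=None):
    """Split after prefixes with high successor variety, unless the split
    is vetoed by the (hypothetical) confusability check."""
    pieces, start = [], 0
    for i in range(min_len, len(word) - min_len + 1):
        if i - start < min_len:
            continue
        if succ.get(word[:i], 0) >= threshold:
            if confusable is None or not confusable(word, i):
                pieces.append(word[start:i])
                start = i
    pieces.append(word[start:])
    return pieces
```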
This figure looks at the number of tokens you get after the splitting, for the different configurations, and at their lengths; length here is measured in phonemes, which is why you see two, four, six, eight, et cetera. The main point is that the baseline, the full words, is in black, and once you start splitting, the units get shorter, as you'd expect; that's the goal. If you use the confusion constraints, which are the green and purple curves, the distributions are in general a little less shifted to the left, so we're creating slightly fewer very short units, and that was what we were trying to do.
Then here's a table that we probably don't want to go into in too much detail. The baseline system was at twenty-two point six percent, and all the numbers are relatively close, but if you split without constraints the error rate in general gets worse (those are the black rows), and using the distinctive features doesn't really help either. The only configurations that do slightly better than the baseline are the ones that use the phoneme confusion constraint. So the message is that it's really important to avoid adding confusions and new homophones to your system. That's the typical take-away from this.
The other thing to note: we got a fifty percent reduction in the OOV rate, which was good, except that we were introducing errors and confusions on the little affixes, and those were compensating for what we recovered. We did some studies and looked at the previously out-of-vocabulary words: about half of them were correctly recognized using this method, but we traded that gain for newly introduced errors on other parts of the words.
Just one more slide to sum up, not all of this work, but it was more logical to put it here in the talk. We've used unsupervised decomposition, usually based on Morfessor or some modification of it, for Finnish, Hungarian, German, and Russian. For the first three languages we got reasonable gains of between one and three percent, and we could reduce our vocabulary sizes from seven hundred thousand to two million words down to around three hundred thousand, which is a bit more manageable for the system and probably allows more reliable estimates. For some of them we also did acoustic model retraining; we could not for German. For Finnish we tried both with and without acoustic model retraining, and we got roughly the same three percent gain with the morphologically decomposed system whether or not we retrained the acoustic models, which was interesting for us. Morfessor worked well for Finnish, I think in part because the authors are Finnish, so the output was maybe designed with it in mind. We also tried this on Russian CTS, conversational telephone speech for those who don't know the acronym, where we only had at the time about ten hours of training data. We got a reduction in the OOV rate and we were able to use a smaller vocabulary, but we didn't get a gain in word error rate. Once again, that was very preliminary work.
So now I'm going to shift gears. Where am I on time? Fourteen minutes gone; that's okay, but I'll go a bit faster.
So, a few minutes about Finnish, which was one of the first languages for which we didn't have any transcribed audio. We found some online data with approximate transcripts, which was originally intended for people learning Finnish. There was no transcribed development data either, so how were we going to do this? For many companies it's easy to hire someone to transcribe some data; for us it's not: it takes time to find the person, and for government research labs the hiring is complicated. So can we get ahead by doing something simpler? What we did was to use these approximate transcriptions not just for the unsupervised training but also for development. And once again, as I said before, we used morphological decomposition for Finnish.
Here's a curve showing the estimated word error rate as we increase the amount of unsupervised training data: two hours, five hours, and then it sort of stabilises once we get to around ten to fifteen hours. So we get an estimate; it's approximate, but it's going in the right direction. About two months later we had somebody come in and transcribe data for us, two or three hours, not a lot, and it still took a while, first to find the person and then for them to do it. You can see that the word error rate measured on the human transcripts follows exactly the same curve. Our estimated error rates are lower, we're underestimating, because what we did was select regions, as is done for the unsupervised training, where there was a good match between the approximate transcriptions and what the system produced, and we measured on those. But that's okay, because it allowed us to develop without having to wait for transcribed data to become available.
So the message is that the unsupervised acoustic model training worked reasonably well using these approximate transcripts. Since then it has also worked for the language models: we could improve our language models using these sorts of approximate transcriptions. We then added cross-lingual MLP features to the system, trying both French and English, and got about ten percent improvement, and, as I said before, we used the morphological decomposition.
Now I'm going to talk about another language which is also considered somewhat low-resource: Latvian. This was work done with a colleague whose native language was Russian, so Latvian was sort of an interesting language for him, and basically we knew next to nothing about it when we started. The LDC and ELRA don't distribute corpora for it, but you can find text and audio on the net, so it was something we could reasonably do. It's a Baltic language without that many speakers, about one and a half million. It's a complicated language morphologically, I forget half of the details, but the pronunciation is reasonably straightforward.
So here is an overview of the language model data. We found a fair amount: one point six million words of in-domain data and a hundred and forty-two million words of newspaper text, where in-domain means it comes from sites of radio and TV stations, and the rest is just newspapers. We used about a five-hundred-thousand-word vocabulary, just keeping words that occurred more than three times. The text processing is more or less standard; however, this really is important: if you don't do the text processing carefully, you have problems when you try to do the unsupervised training, at least that seems to be our experience. The language models were pretty much standard; we threw in some neural network language models at the end, which, given yesterday's talk, is an interesting line in the table.
This figure shows the word error rate as a function of the training iteration; those are the curves here. The circles show roughly the amount of audio data used in an unsupervised manner for the systems, and we roughly double it at each step. At this level here we added in the MLP features from Russian. Our initial seed models came from a mix of three languages: English, French, and Russian. The audio data went from about sixty hours at the first stage to about seven or eight hundred hours of raw data at the last stage, of which we only use about half when we build the models. And of course something that's important is to increase the number of contexts and the number of states you model at the same time: it doesn't suffice to just add more data and keep the model topology fixed, you don't get much gain from that.
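So the recipe couples the two; something like the following schedule, where all the numbers and helper functions are invented just to illustrate the idea of growing the model together with the data:

```python
# Illustrative schedule only: at each unsupervised iteration, roughly double
# the audio and also grow the number of context-dependent tied states.
schedule = [
    {"hours": 60,  "tied_states": 3000},
    {"hours": 120, "tied_states": 6000},
    {"hours": 250, "tied_states": 12000},
    {"hours": 500, "tied_states": 24000},
]

# model = seed_model
# for step in schedule:
#     hyps = decode(model, select_audio(step["hours"]), language_model)
#     model = train_acoustic_model(hyps, n_tied_states=step["tied_states"])
```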
Afterwards we did some additional tuning of parameters, two-pass decoding, and a four-gram LM, and you can see the improvement over the original system. Here CI is case-insensitive and CS is case-sensitive, not context-independent and context-dependent: we're looking at the word error rate taking case into account, because if people are going to read the output it really should have the correct case, and even for search engines it's sometimes important, because you want to know whether something is a proper name or not.
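The scoring difference is just a normalization step before computing the error rate; a minimal sketch, assuming you already have some wer() scoring function (a hypothetical hook here):

```python
# Case-insensitive vs case-sensitive scoring: the CI number is obtained by
# lowercasing both reference and hypothesis before scoring.

def case_insensitive_wer(reference, hypothesis, wer):
    return wer(reference.lower(), hypothesis.lower())

def case_sensitive_wer(reference, hypothesis, wer):
    # here a case difference ("paris" vs "Paris") counts as a substitution
    return wer(reference, hypothesis)
```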
And for people who are fans of neural net language models: we got about one and a half to two percent gain by adding them. This is on dev data, and when we validated we got pretty much the same results. So we were happy with that: it's completely unsupervised, and we developed the system in less than a month. At the end we then tried roughly the same thing for Hungarian.
For Hungarian we used seed data from five languages; we had less audio data, so we only went up to about three hundred hours. We originally used an MLP trained on English, and then we used the transcripts at this stage to train an MLP for the target language on the unsupervised transcriptions, which gained us another two point eight percent or so.
So, to give an overview, here are some results from the program, which some of you are aware of, I'm sure, and some of you less so. The systems on the left are trained on supervised data, which varies from fifty to a hundred and fifty or two hundred hours depending on the language. The ones on the right are all trained unsupervised. The lower line is the average error rate across the test data, about three hours per language. You can see the error rates going up: the ones on the right are in general a little bit higher, while the ones on the left are pretty good. Bulgarian and a couple of others are a little bit higher here, and then there's Luxembourgish, which I'll come back to in a few minutes.
If you look at the lowest word error rates, on some of the segments we had from TV and radio they're pretty low; in fact some of the unsupervised systems are even below the supervised ones. And finally, the worst-case word error rates are still pretty high, so we still have a fair amount of work to do. These data are mixed news and conversations, and some languages are more interactive than others, things like that.
I'm going to skip the next slide, which has too much stuff on it; it was to show the amount of data we used, so if people are interested, come find me later. I want to say two or three words about dictionaries. One of the things we always say is that producing pronunciation dictionaries is very costly, and so there has been, more recently, a growing interest in using graphemic-type units rather than phonetic units in systems.
The first work I found on this was Kanthak and Ney; maybe people are aware of earlier work doing this that I'm not aware of.
It avoids the production of the pronunciation dictionary: basically, the G2P problem becomes a text normalisation problem. You still have numbers and things like that, so you have to convert dates and times and all those types of things into words, and then the units come directly from the letters.
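In other words, something as simple as the following, where normalize_text() is a hypothetical placeholder for that normalisation step:

```python
# Sketch of building a graphemic lexicon: after text normalization (numbers,
# dates, etc. expanded into words), the "pronunciation" of a word is simply
# its sequence of letters.

def graphemic_lexicon(word_list):
    """Map each word to a pronunciation made of its graphemes."""
    lexicon = {}
    for word in word_list:
        lexicon[word] = list(word.lower())   # e.g. "salam" -> ['s','a','l','a','m']
    return lexicon

# words = set(normalize_text(raw_text).split())
# lexicon = graphemic_lexicon(words)
```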
We did this at LIMSI for Turkish, Tagalog, and Pashto within the Babel program, and, like previous studies, we got roughly comparable results in general, but for some languages we actually do better with the graphemic systems than with the phonemic systems. In fact, I should mention that back in the GALE days there was already work using graphemic systems.
Here are some results for Pashto in the Babel program. We had a two-pass system using the BBN voice activity detection and the BUT features (thank you), and we built both graphemic and phonemic systems. You can see that they're about the same, with the phonemic about one point higher than the graphemic, but if we do a two-pass system where we cross-adapt the phonemic and graphemic systems, we actually get a reasonable gain from that. We believe that one of the problems in Pashto was actually poor pronunciation generation, so the pronunciations are bad, or there's a lot of variability that isn't covered, and that's where the graphemic systems can actually outperform the phonemic ones.
So let me now talk about Luxembourgish, just because it's fun. This is work done with Martine, who is from Luxembourg, for those of you who know her. It's a little country, with not too many people, but it's a really multilingual environment: when people go to school the first language of instruction is German, then I believe French, and English after that, but at home they speak their local language, Luxembourgish. And apparently, even though it's a tiny country (you could close your eyes and miss it; okay, I'm exaggerating a little bit), you even have multiple dialects in different regions.
The first studies we did were just segmentation experiments to see which languages are favoured by Luxembourgish data. We had basically no transcribed data at the time, so we transcribed ten or fifteen minutes of data ourselves. Then we did some approximate phone mappings: some Luxembourgish sounds map pretty well onto French, English, and German, while others don't really exist in English, say, but can be covered from German and French, in which case for English we just use the closest thing. That gives us a mapping with a comparable set of phonemes for each language. Then we built models for the three languages in parallel, took the superset of the models, and aligned the Luxembourgish data with it, using the transcriptions of the small amount we had, to see which language's models were preferred.
I don't know that much about the language myself, but you can have French words inserted in the middle of a sentence, so apart from the languages being related there's also mixing. So we allowed the language to change at word boundaries: within a word you had to use the phones from a given language's seed models, but at word boundaries the language could change. Basically we found, as you might expect since Luxembourgish is a Germanic language, that in general the segmentation preferred German; second was English, which is the next closest; but about ten percent of the alignments went to French. For English it was typically diphthongs that got selected; we don't really know exactly why.
Based on that, a couple of years later, we got some transcribed broadcast news data in Luxembourgish. Our seed models are context-independent, they're tiny, they're not going to perform well, and we just decoded the two or three hours of training data. You can see that the word error rates are quite high, as you'd expect, but in the right range for the amount of data and for the fact that the models are context-independent, and the German models were preferred. We also built models where we pooled the data from the languages together, and those did a little less well than the German ones. However, before we knew this (because we didn't have the data when we started), we had used the pooled models to do the automatic decoding. Once again we applied our standard techniques, and you can see that we go from about thirty-five to about twenty-nine percent word error rate by doubling the data, increasing the context, adding MLP features, et cetera, and we were able to model more context.
But the error rate is still kind of high compared to some of the other languages, so Martine looked at a classification of the errors, and you can see there are a lot of confusions between homophones. Some of this data is pretty interactive, it's not the same broadcast news type of data, and there are human production errors: people make false starts and repetitions, or mispronounce a word. And then a large percentage, we estimated somewhere between fifteen and twenty percent, are writing variants, because Luxembourgish is mostly a spoken language: are these really errors or not? Here's an example of some of the writing variants of the word for Saturday; I'm not going to try to say them all, that's probably not how you pronounce them. All of these written forms are allowable: depending on the regional variant you can say it one way or another, you can change the vowel, and all of these are accepted in the written form; we find them in the texts and in what people say.
So even though, I don't know, people may not really consider this a low-resource language, there's not much data, almost none available: the language is used in speaking but not really in writing. How am I doing for time? Okay, good.
I'm going to speak for one minute about Korean. We're trying to do the same thing for Korean, but once again we don't have transcribed dev data. We were trying to do a study where we look at the size of the language model used for decoding in the unsupervised training: a hundred and twenty thousand words, two hundred thousand, two million, or a two-thousand-character language model, just for the decoding. We also looked at using phone units versus half-syllable acoustic units. The only data we had was from the LDC, a roughly ten-hour dataset. If we do standard training on it, holding out the last two files for dev because we didn't have anything else, we get a word error rate of about thirty percent and a character error rate of about twenty percent; on this data that's probably optimistic, because the dev data, being just the last two files, is very close to the training data. So what we're doing here is increasing the amount of data we use from the web and looking at the influence on the word error rate and the character error rate when using the different-sized language models. You can see that for the two hundred thousand words and the two million words it's about the same; sorry, these results are all decoded with the same language model in the end, it's only what we use for the unsupervised training that's changing. We see that the results are basically the same as we add the same data, but the character language model, where we skip the word segmentation step, is doing slightly better in terms of character error rate than the others. We don't know if it's real, we need to look into it some more, since this is really recent work. And for people who think it's easy to get transcribers: we've been looking for a month for someone in France who has the right to work and who can transcribe Korean for us; we finally found someone, and they can't start working until February. So yes, it may seem like an easy thing to do, but depending on your constraints for hiring it's not so easy. We're going to follow up on this, and we hope to have some clearer results in the end.
Two words about acoustic model interpolation, because we talked earlier about having all this heterogeneous data: how do we combine data from different sources? Rich made the statement that you want to use all of it, you don't want to throw it away, but then you need some kind of data weighting. One way of doing it is to simply duplicate some of the data and remove some of the rest. As an alternative, a colleague at LIMSI has been working on acoustic model interpolation, and had a paper at Interspeech, I think, looking at whether you can train models on different subsets and then interpolate them. We used this on European Portuguese: the baseline pooled model gave thirty-one point seven percent, and the interpolation gave about the same result, but it's easier to deal with, because you can train your models on smaller sets and then interpolate them, the same idea as what has been done for language modeling for years.
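In a GMM-HMM setting you can do the interpolation per state by combining the mixtures; a minimal sketch, with illustrative structures and names (gmm_broadcast, gmm_telephone) rather than any real toolkit's API:

```python
# Acoustic model interpolation sketch: train separate models on different data
# subsets, then combine the Gaussian mixtures for each state with interpolation
# weights, the same way language models are interpolated.

def interpolate_gmms(gmms, lambdas):
    """Each GMM is a list of (weight, mean, variance) components.
    Returns the union of components with weights scaled by the lambdas."""
    assert abs(sum(lambdas) - 1.0) < 1e-6
    combined = []
    for gmm, lam in zip(gmms, lambdas):
        for weight, mean, var in gmm:
            combined.append((lam * weight, mean, var))
    return combined

# e.g. per-state combination of two source-specific models (hypothetical data):
# state_gmm = interpolate_gmms([gmm_broadcast, gmm_telephone], [0.7, 0.3])
```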
We also looked at what we gain with different variants of English; this is some work that was published back in 2010. Basically we gained a little bit for some of the variants and we didn't degrade for any of them with respect to the original pooled model, whereas with MAP adaptation we actually did a little worse on one or two of the variants, I don't remember which ones.
So let me finish up. I guess the take-home message is that unsupervised acoustic model training is successful. It's been applied a lot to broadcast-news-type data and, more recently in the Babel project, to a wider range of data, and I think it's really exciting that we can do that. We still have to find the data, but it's really nice that we can do this type of thing, even if the error rates are still kind of high; we're going in the right direction in general. I'm sure Rich or other people can say more about this during the meeting. Something that's interesting, and this will make some people from yesterday happy, is that the MLPs seem more robust to the fact that the transcriptions are imperfect: they take less of a hit than the HMMs, which is an interesting observation. So we can use this untranscribed, automatically transcribed data to produce references for training the MLPs, which is really nice. The hope is that this type of approach will allow us to extend to different types of tasks more easily, so you don't have to spend the time collecting data and transcribing what you collect, although we still have to collect it.
I didn't speak about multilingual acoustic modeling, which is something people have shown can help for bootstrapping. Should we just take models from another language, or is it better to use multilingual models, can we do better? I think it depends on what you have at hand. It would be nice if it worked, but everything we've tried in Babel has made things worse, so I'm a little bit disappointed with what we've been doing there.
Then of course there's something I didn't talk about: what do you do with languages that have no standard written form? I touched upon it with Luxembourgish, for example. I don't really know; we're trying to do some work on this, others are working on it too, and there's even a paper of ours from around 2005, I think, on trying to automatically discover lexical units. But one of the main problems is knowing whether the units you find are meaningful. I've said a bunch of times, I hear myself saying it, that our systems are sort of phonetic parrots: they're going to learn what people say, including things like "like" and "you know", and a word like "like" is meaningful in some cases and not meaningful in others. So how do you deal with that, how do you know what's useful?
But I think it's really exciting, and it's been fun. I hope that those of you who have worked on unsupervised training will continue, and that those of you who haven't will give it a try. So, thank you for giving me the opportunity to speak. These are all the people I've worked with closely on this work, and there are probably other people I've forgotten; sorry. So thank you.
Thanks, Lori. We have some time for questions.

You showed a nice gradual improvement as more data is added. Do you have any idea of what is being improved? I mean, which words are getting better and which ones stay bad?

Probably not, no, we haven't looked at that; that's an interesting thing to look at. Something that we would like to do, but haven't done yet, is to not just incrementally increase the amount of data but to also change the datasets, to just use different random portions, which I think would give better coverage, because once your models like something, they're going to continue liking it. But looking at the words would be interesting.
So the question is: if you talk to a machine-learning person who works on semi-supervised learning, they get really nervous when you say self-training or self-supervision, because if you're starting with something that isn't working that well, it can go unstable in the opposite direction. So there's this sensitivity: if you're starting with a baseline recognizer trained on a small amount of data that works reasonably well, you can improve, but if you're starting with something that works really badly, it can get worse. And I noticed that a lot of the results in this talk were on broadcast news, where you're starting with a reasonably well-performing system. Were all these results, for all the languages, like that, or did you start from zero, with zero transcribed data in the language?
Okay, so for all of these languages here on the right we had zero in-language transcribed data. We started with context-independent seed models taken from other languages, reduced as much as possible, and the word error rate is roughly sixty to eighty percent when you start on your data. So we are starting really high. But our language models, even though they're trained on newspaper text, newswire text, and things like that, are pretty representative of the task, so there can be very strong constraints in there. Which is why I find the Babel work really exciting (we haven't done the unsupervised part ourselves there): you don't have this strong constraint coming from a language model, all you have is a small amount of transcriptions, about ten hours, so that's the only information coming into the system from the text side. And that information coming from the text is, I personally believe, why this works.
One more thing to note: if you don't normalise the text correctly... We had situations where people said, I don't know how to pronounce numbers, I'm just going to keep them as digits; then convergence is a lot harder, it doesn't work very well. If you say, I'm going to throw away the numbers, which is what some of the people doing the language modeling did, it also doesn't work so well. So you really need to have something that represents the language pretty well, at least that's how it seems from our experience. We also had some languages where, when you take texts that are online, there are sometimes texts in other languages mixed in, and you actually have to filter out the texts that are not in the language you're targeting.

Did you want to come up and comment? I think that can come up during the questions... okay, no? Alright, one last question then.
Depending on the application, at the end of the day you may want to have readable text, for instance for translation or broadcast news, and at that point getting things like names right is probably more important than a fraction of a percent of error rate. Do your systems do case and punctuation?

Let me just say that all of our systems do case and punctuation, but we're not measuring the word error rate on the punctuation; all the systems produce punctuated, case-sensitive output. The named entities are not specifically detected, but hopefully, if they are proper nouns, they will be uppercased if we did the language modeling right.
This is something we've actually looked at; that's why I had the slide with the case-sensitive and case-insensitive word error rates, where there's about a two percent difference. The punctuation is a lot harder to evaluate. In some work we did with a former colleague we tried to evaluate the punctuation, and it's very difficult: if you give the task to humans they don't agree on how to punctuate things. Maybe not for full stops, where I believe there's about eighty percent inter-annotator agreement, but if you go to commas it drops a lot. So it's very hard. What you really want is something that's acceptable; you don't really care about matching a single ground truth. I think it's sort of like translation: you don't care exactly what it is as long as it's reasonable, as long as the punctuation is reasonably correct. If multiple forms are possible, just as you can translate something in multiple ways, then getting any one of them that's correct should be fine. So I think punctuation falls in the same category: whether you put a comma after something or not is not very important as long as you can understand it correctly. I think it's a really hard problem to evaluate, in fact even harder than doing something that seems reasonable.
We add the punctuation and case in a post-processing step. Other sites have done it too; BBN has done punctuation, and the other sites as well, over the past ten years. But you're right.

Okay.