So, good morning everyone. I'm going to give a sort of retrospective, covering some of the work we did before, and I hope it will all link up and be easy to follow. Basically I'll be talking about semi-supervised and unsupervised acoustic model training with limited linguistic resources. As most of you know, this draws on a lot of work from, actually, the last decade or so of research. I'm going to talk about some experience we've had at LIMSI with lightly supervised and unsupervised training, and give a couple of case studies; the first case study will actually be on English. Then I'll talk very briefly about different types of lexical units for modeling, such as graphemic units versus phonemic units in Babel. And, as already mentioned, I added a slide at the end about acoustic model interpolation, because we're talking about how to deal with all this heterogeneous data. I'll finish with some comments.
So, over the last decade or two we've seen a lot of advances in speech processing technologies. A lot of these technologies are getting out there from industrial companies and are becoming kind of commonplace for a lot of people, so people expect that this stuff really works. I think it's great that we're seeing it really get out there, but people's expectations are really high, and we still have the problem that our systems are pretty much developed for a given task and a given language. We still have a lot of work to do before we get good performance on other tasks and languages: as a community we only cover a few tens, maybe fifty or so, languages, and many times language variants are actually even considered different languages, because it's just easier to develop a separate system for each variant. We still rely on language resources a lot, although over the last decade or two we've been reducing the reliance on human intervention, so that we can build systems with a little bit less human work.
So I guess this is something everybody knows, or maybe not everybody, if there are some people here who don't work on speech recognition: our holy grail is technology that works on anything, that is independent of the speaker and of the task; noise is no problem, changing your microphone is no problem. And some would say, maybe fortunately for us researchers, this remains a dream. But we do have a lot lower error rates than we had a decade or two ago. We can process many more types of data, with different speaking styles and different conditions; originally the work we were doing always required read speech, and who really needs to recognise something that was read from a text? It doesn't seem that logical now when we look back at it. We cover more languages, and we have a fair amount of work on enriching the output to make the transcripts more usable, either by humans or by machines, which is not exactly the same thing: you might want quite different information if you're doing downstream processing by machines versus producing something for a human to read.
So what's a low-resource language? I don't really have an answer, but I think many of us in this community typically mean that there aren't many e-resources, so we don't find information online, because that's what we're using now to develop systems. If you speak to linguists I think you may get very different answers, and I don't really want to get into that.
But basically, we need to be able to find data if we want to develop systems. The languages I'm going to talk about are low-resource in the sense that ELRA and the LDC don't have resources that they distribute. Google probably has the data; can we get it online, and can we develop systems with data that we find online? I'm not really going to talk about the Babel-type languages or other rare languages where you don't even have writing conventions, where you don't necessarily have any information about the language except maybe from some linguists who have spoken to some speakers or visited. I guess you'll hear a bit more in that direction later on.
And of course there's the framework, a bit outside what I'll cover, of trying to do speech translation for languages with essentially no resources: little or essentially no available audio data, probably nothing in terms of dictionaries, not even necessarily word lists, and in general very limited knowledge about the language. But you can also consider that many types of data for well-resourced languages, or for language variants, are almost low-resource, because we just don't have much available data for them.
So let me take a little step back in time, to the late 1990s and early 2000s. One of the questions you get all the time from funding agencies is: how much data do you need? We tried to answer this, and I don't think anybody knows; the honest answer is that it depends where you want to be and what you want to do. The funding agencies were, at least at the time, always complaining that data collection is costly: why are you always asking to fund data collection? It's a recurrent question.
So these are curves we did back in 2000 showing, with supervised training on broadcast news data in English, how the word error rate behaves as a function of the amount of transcribed audio. (Let's see if I have a pointer... the red one... no? Anyway.) The really high point on the left comes from bootstrapping the system with a very small amount of audio, with a well-trained language model; with roughly ten minutes to an hour and a half of data the word error rate is around thirty-three percent, and as we add more data it goes down. Once we get to fifty or a hundred hours the curve sort of starts to plateau, so we're getting diminishing returns from additional data. And that's one thing we can say: once you have, you know, a hundred hours of data, you basically don't want to spend a lot of money on additional data, because you're just not getting much return.
Once again, this is on broadcast news data with a reasonably well-trained language model, so we see this asymptotic behavior of the error rate. Something we've observed in the community at large is that when you start a new task you get rapid progress, which is really fun because the error rates are dropping twenty or thirty percent a year and everyone is happy. But once you get to some reasonable level, we're getting about six percent per year; we did some counting, and if you look over, say, ten or fifteen years of progress, the average improvement seems to be about six percent per year. So the message was: additional data should cost less, and we need to learn how to use it with less supervision. That's roughly what we were saying back in 2000, and I think it's still quite relevant today.
So you can think about different levels of supervision. Back then people were saying we should use phonetic or phone-level transcriptions for training our phone models, which seems logical: it gives you more information than using words. And people did that. Our experience at LIMSI, when we ran some tests using TIMIT-type data, Switchboard, and BREF (a read-speech corpus in French), was that humans prefer the human phonetic transcriptions and segmentations, but the systems prefer the automatic ones. Basically, if you use the word-level transcription with a dictionary that covers a reasonable number of pronunciation variants, the systems were better than when they were trained on the manual phonetic transcriptions. Maybe that would not be true nowadays, I don't know, we haven't redone it, but it satisfied us enough to go ahead with the standard approach of aligning word-level transcriptions with the audio. Then the next step is to say: okay, we can have a large amount of carefully annotated data, where large means around a hundred hours or more.
Or we can have a little annotated data but a lot of unlabeled data with approximate transcriptions (I'll give some results on this); or no annotated data at all, but some sort of related texts or related information that we can find; or a small amount of annotated data that we use to bootstrap our systems. That last one is sort of semi-supervised, and it's what we heard a little about yesterday and what people have been doing: you transcribe raw data automatically, you say this is ground truth, and you do your standard training to build the models. This works.
There are lots of variants that have been published: you can filter, you can use confidence measures, you can use consensus networks, you can do ROVER, you can use lattices; lots of different variants, but the basic loop is the same (see the sketch below).
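Just to make that loop concrete, here is a minimal self-training sketch. The decode and train_acoustic_model callables are hypothetical stand-ins for a real decoder and trainer, not any particular toolkit's API.

```python
# Minimal self-training sketch: transcribe raw audio with the current models,
# treat the hypotheses as ground truth, retrain, and repeat.
# decode() and train_acoustic_model() are hypothetical callables supplied by
# the caller (they stand in for a real decoder and trainer).

def self_train(seed_model, raw_segments, language_model,
               decode, train_acoustic_model,
               n_iterations=3, min_confidence=None):
    """Iteratively transcribe raw audio and retrain on the hypotheses."""
    model = seed_model
    for _ in range(n_iterations):
        training_set = []
        for segment in raw_segments:
            # decode() is assumed to return a hypothesis plus a confidence score
            hypothesis, confidence = decode(model, segment, language_model)
            if min_confidence is None or confidence >= min_confidence:
                # the automatic hypothesis is treated as ground truth
                training_set.append((segment, hypothesis))
        model = train_acoustic_model(training_set)
    return model
```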
I listed some of the early work; in my recollection it was people involved in the EARS project and people involved in projects in Europe, and if I forgot people I'm sorry, I don't mean to, but those are the early adopters of this type of activity that come to mind.
So, going back to supervised training, and I think most people in this room know this so I won't spend a long time on it: you need to normalize the transcriptions, which you do for the language model anyway, so that's not so bad; you need to create a word list; you need to come up with phonemic transcriptions; and in the old days, when we spotted errors in the transcripts, we actually spent time correcting them, because we only had thirty or fifty hours and we thought it would gain us something. I think young people today wouldn't even think about doing this, but we spent a lot of time on it. And then you do your standard training.
so
this is this showing the results are using what we called semi supervised training
so you had a language model that was
trained on a certain amount of hours amazing the justice right okay so they the
manual word error rate was eighteen percent if we had a fully train system
we used closed captions as a language model one am showing here be done different
variance
so it's a sort of an approximate transcription that we had
and we took
in these numbers we started we
every now i think
ten hours of original data
and we then transcribe varying amounts of
unlabeled data so this is the raw unlabeled data
We can use that raw data unfiltered, which is close to unsupervised training, or we can do something semi-supervised, where we only keep the portions where there is not too much disagreement between the transcript we generate and the caption. So we filtered at a sort of phrase or segment level: we kept and trained on only the segments where the word error rate of the automatic transcription measured against the captions was less than some threshold; I don't remember exactly what it was, probably less than twenty or thirty percent. You can see that we get pretty close: within ten percent absolute of the system trained on the manual transcriptions, with both approaches. In fact, what we mostly do now is not bother filtering; it's easy, you just train on everything, and it seems to give about the same type of results.
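As an illustration of that filtering criterion, here is a small sketch; the caption is treated as an approximate reference, and the thirty percent threshold is only an example since I do not recall the exact value.

```python
# Sketch of the segment filtering: keep only segments where the automatic
# transcript agrees well enough with the closed caption.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def keep_segment(caption, hypothesis, max_wer=0.30):
    """Filter on the WER of the hypothesis against the approximate caption."""
    ref, hyp = caption.split(), hypothesis.split()
    wer = edit_distance(ref, hyp) / max(len(ref), 1)
    return wer <= max_wer
```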
There's a measure that was introduced by BBN called the word error rate recovery: basically you look at how much you gain from unsupervised training compared with how much you would gain from supervised training, relative to your initial starting point. What we get here is about eighty-five to ninety percent, so we're recovering most of what we could have gotten had we done supervised training.
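In other words, something like the following, where the numbers are only illustrative:

```python
# Word error rate recovery: how much of the gap between the bootstrap system
# and the fully supervised system is closed by the unsupervised training.
# The example values below are made up for illustration.

def wer_recovery(wer_seed, wer_unsup, wer_sup):
    """Fraction of the supervised gain recovered by unsupervised training."""
    return (wer_seed - wer_unsup) / (wer_seed - wer_sup)

# e.g. seed 33%, unsupervised 20%, supervised 18% -> about 0.87 (87% recovery)
print(wer_recovery(0.33, 0.20, 0.18))
```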
One problem with this work is that there is already some knowledge in the system: we had prior knowledge from the dictionary, and we had a pretty good language model that was close to the data; it wasn't exactly the same data, but it was close. We discussed this, I think at an EARS meeting or maybe a conference, mostly with Rich Schwartz, and said: well, let's take it to an extreme, let's see if we can use one hour of training data, or ten minutes of training data. We were crazy enough to do it, and at the time it was a lot of computation, because every time you evaluate with a different language model or a different amount of data you have to decode the raw data again, potentially multiple times. These days it would be very easy to redo these experiments, but at the time it took a while.
So here we see that if we start with a ten-minute bootstrapped system, we get a word error rate of sixty-five percent; we actually did this not thinking it would work, but it was okay. We then just took some data, three to four hours of automatically transcribed audio; in fact we threw away the original ten minutes, because it was just more complicated to merge them in when building the models. With those three to four hours of automatic transcripts we go down to about fifty-four percent, and we keep going down; we stopped at around a hundred and forty hours, where we got thirty-seven point four percent. If you use the same language model with the full training data, supervised, you get to about thirty percent. So we're getting a pretty good part of the way to where we'd like to be, and we were happy with that.
At about this time a student came to work with us at LIMSI. We couldn't really apply this method to his work, because we didn't have enough audio data, but we did try to look at questions we'd been asking for a long time: how much data do you need to train models, what improvement in performance can you expect when you have limited resources, what's more important, audio data or text data, and how can you improve the language models when you have very little data. This was around 2004, if I remember correctly, and we had what we considered reasonably small amounts of data: thirty-seven hours of audio, which was not bad but not a lot, and about five million words of text, and we knew essentially nothing about the language.
One of the first things he did was to look at the influence of the audio transcripts versus additional text data on the out-of-vocabulary rate. On the left we have the OOV rate: the curves correspond to two hours of transcripts, ten hours, and the full thirty-five hours (about seven thousand, twenty thousand, and fifty thousand distinct words respectively), and they show how the OOV rate goes down as you add more text data. On the bottom x-axis we add in different amounts of text data out of the roughly five million words of text sources. If you start with just two hours of transcripts, adding ten thousand words of text doesn't really lower the OOV rate very much; adding a hundred thousand lowers it a little bit more, et cetera. If you have ten hours of transcripts there isn't much of an effect at all. So you can see that the effect of adding text data is smaller than the effect of adding audio transcripts. There is a caveat, because there is probably some mismatch: we know the audio transcripts are the same type of data we're trying to recognize, while the text data is related but not really the same.
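For reference, the OOV rate here is just the fraction of held-out tokens not covered by the vocabulary; a toy sketch, where all the data stand-ins are invented:

```python
# Toy illustration of the OOV measurement: build a vocabulary from the acoustic
# transcripts plus some amount of extra text, then measure the out-of-vocabulary
# rate on a held-out set.

def oov_rate(dev_tokens, vocabulary):
    """Fraction of dev tokens not covered by the vocabulary."""
    misses = sum(1 for tok in dev_tokens if tok not in vocabulary)
    return misses / max(len(dev_tokens), 1)

def vocab_from(*token_sources):
    vocab = set()
    for tokens in token_sources:
        vocab.update(tokens)
    return vocab

# transcripts_2h, web_text_100k, dev_tokens would come from the corpora:
# vocab = vocab_from(transcripts_2h, web_text_100k)
# print(oov_rate(dev_tokens, vocab))
```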
Then here's another curve where we look at the amount of audio data versus text data for the language model; it's a little bit complicated. The top curves use just two hours of audio data for the acoustic model, the green ones use ten hours, and the red thirty-five. You can see that even as you add more text data, you're not really improving the word error rate much. So we asked: is the gain coming from the acoustic data or from the transcripts? We know the text data is less close to the task. The purple and blue curves add the transcripts from the ten hours and the thirty-five hours, and we can see that if you only have two hours of audio data the system is just not going to do very well, you need more; once you get to ten hours, the additional improvement is smaller. This is interesting because it is close to what is currently being used in the Babel project, where we're actually working with ten hours for some of the conditions.
Let me spend a few minutes on some other work he did on the language modeling side, on Amharic word decomposition. Amharic has a very rich morphology, with lots of word forms and therefore high out-of-vocabulary rates. A related problem for language modeling is that such vocabularies are not very well modeled, so it's interesting to use word decompounding. When you look at the literature you see mixed results across languages: sometimes you get a nice gain, and sometimes, even if the OOV rate goes down, you don't get a gain in word error rate. One idea we had in this work was to try to avoid, when generating the decomposition, creating easily confusable units. That's what I'm going to give a couple of ideas about. To do this we built matched conditions: we retrained language models and acoustic models for all the conditions.
We used the Morfessor algorithm, which was relatively recent at the time; basic Morfessor splits the words. Then there is a method usually referred to as Harris-style segmentation, where you basically look at the number of distinct letters that can follow a given prefix string: that gives you an idea of the perplexity at that position. If a lot of different letters can follow, you're likely at a unit boundary, and if not, you're likely to be within the same unit. We also tried using distinctive-feature properties to bring some speech information into the decomposition, and we looked at phonemic confusion constraints generated from phone alignments: basically, if splitting a word would create units that differ only by easily confusable phonemes, the split is not allowed; if the units are not easily confusable, the split is okay. The idea of these constraints is relatively language-independent, but of course you do need to know the phonemes of the language, or at least have some phoneme set.
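Here is a toy version of the Harris-style successor-variety idea, with the confusion constraint reduced to a hypothetical callback; the real systems worked on phonemized forms and used confusion statistics from phone alignments, so this is only a simplification for illustration.

```python
# Toy successor-variety (Harris-style) word splitting with an optional
# confusability veto. The confusable() hook is hypothetical.

from collections import defaultdict

def successor_counts(vocabulary):
    """For every prefix seen in the vocabulary, count the distinct next letters."""
    successors = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word)):
            successors[word[:i]].add(word[i])
    return {prefix: len(nxt) for prefix, nxt in successors.items()}

def split_word(word, succ, threshold=3, min_len=2, confusable=None):
    """Split after prefixes with high successor variety, unless the split
    is vetoed by the (hypothetical) confusability check."""
    pieces, start = [], 0
    for i in range(min_len, len(word) - min_len + 1):
        if i - start < min_len:
            continue
        if succ.get(word[:i], 0) >= threshold:
            if confusable is None or not confusable(word, i):
                pieces.append(word[start:i])
                start = i
    pieces.append(word[start:])
    return pieces
```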
This figure looks at the number of tokens you get after the splitting, for the different configurations, and at their lengths; length here is measured in phonemes, which is why you see two, four, six, eight, et cetera. The main point is that the baseline, the full words, is in black, and once you start splitting, the units get shorter, as you'd expect; that's the goal. If you use the confusion constraints, which are the green and purple curves, the distributions are in general a little less shifted to the left, so we're creating slightly fewer very short units, and that was what we were trying to do.
Then here's a table that we probably don't want to go into in too much detail. The baseline system was at twenty-two point six percent, and all the numbers are relatively close, but if you split without constraints the error rate in general gets worse (those are the black rows), and using the distinctive features doesn't really help either. The only configurations that do slightly better than the baseline are the ones that use the phoneme confusion constraint. So the message is that it's really important to avoid adding confusions and new homophones to your system. That's the typical take-away from this.
The other thing to note: we got a fifty percent reduction in the OOV rate, which was good, except that we were introducing errors and confusions on the little affixes, and those were compensating for what we recovered. We did some studies and looked at the previously out-of-vocabulary words: about half of them were correctly recognized using this method, but we traded that gain for newly introduced errors on other parts of the words.
Just one more slide to sum up, not all of this work, but it was more logical to put it here in the talk. We've used unsupervised decomposition, usually based on Morfessor or some modification of it, for Finnish, Hungarian, German, and Russian. For the first three languages we got reasonable gains of between one and three percent, and we could reduce our vocabulary sizes from seven hundred thousand to two million words down to around three hundred thousand, which is a bit more manageable for the system and probably allows more reliable estimates. For some of them we also did acoustic model retraining; we could not for German. For Finnish we tried both with and without acoustic model retraining, and we got roughly the same three percent gain with the morphologically decomposed system whether or not we retrained the acoustic models, which was interesting for us. Morfessor worked well for Finnish, I think in part because the authors are Finnish, so the output was maybe designed with it in mind. We also tried this on Russian CTS, conversational telephone speech for those who don't know the acronym, where we only had at the time about ten hours of training data. We got a reduction in the OOV rate and we were able to use a smaller vocabulary, but we didn't get a gain in word error rate. Once again, that was very preliminary work.
So now I'm going to shift gears. Where am I on time? Fourteen minutes gone; that's okay, but I'll go a bit faster.
So, a few minutes about Finnish, which was one of the first languages for which we didn't have any transcribed audio. We found some online data with approximate transcripts, which was originally intended for people learning Finnish. There was no transcribed development data either, so how were we going to do this? For many companies it's easy to hire someone to transcribe some data; for us it's not: it takes time to find the person, and for government research labs the hiring is complicated. So can we get ahead by doing something simpler? What we did was to use these approximate transcriptions not just for the unsupervised training but also for development. And once again, as I said before, we used morphological decomposition for Finnish.
Here's a curve showing the estimated word error rate as we increase the amount of unsupervised training data: two hours, five hours, and then it sort of stabilises once we get to around ten to fifteen hours. So we get an estimate; it's approximate, but it's going in the right direction. About two months later we had somebody come in and transcribe data for us, two or three hours, not a lot, and it still took a while, first to find the person and then for them to do it. You can see that the word error rate measured on the human transcripts follows exactly the same curve. Our estimated error rates are lower, we're underestimating, because what we did was select regions, as is done for the unsupervised training, where there was a good match between the approximate transcriptions and what the system produced, and we measured on those. But that's okay, because it allowed us to develop without having to wait for transcribed data to become available.
So the message is that the unsupervised acoustic model training worked reasonably well using these approximate transcripts. Since then it has also worked for the language models: we could improve our language models using these sorts of approximate transcriptions. We then added cross-lingual MLP features to the system, trying both French and English, and got about ten percent improvement, and, as I said before, we used the morphological decomposition.
Now I'm going to talk about another language which is also considered somewhat low-resource: Latvian. This was work done with a colleague whose native language was Russian, so Latvian was sort of an interesting language for him, and basically we knew next to nothing about it when we started. The LDC and ELRA don't distribute corpora for it, but you can find text and audio on the net, so it was something we could reasonably do. It's a Baltic language without that many speakers, about one and a half million. It's a complicated language morphologically, I forget half of the details, but the pronunciation is reasonably straightforward.
So here is an overview of the language model data. We found a fair amount: one point six million words of in-domain data and a hundred and forty-two million words of newspaper text, where in-domain means it comes from sites of radio and TV stations, and the rest is just newspapers. We used about a five-hundred-thousand-word vocabulary, just keeping words that occurred more than three times. The text processing is more or less standard; however, this really is important: if you don't do the text processing carefully, you have problems when you try to do the unsupervised training, at least that seems to be our experience. The language models were pretty much standard; we threw in some neural network language models at the end, which, given yesterday's talk, is an interesting line in the table.
This figure shows the word error rate as a function of the training iteration; those are the curves here. The circles show roughly the amount of audio data used in an unsupervised manner for the systems, and we roughly double it at each step. At this level here we added in the MLP features from Russian. Our initial seed models came from a mix of three languages: English, French, and Russian. The audio data went from about sixty hours at the first stage to about seven or eight hundred hours of raw data at the last stage, of which we only use about half when we build the models. And of course something that's important is to increase the number of contexts and the number of states you model at the same time: it doesn't suffice to just add more data and keep the model topology fixed, you don't get much gain from that.
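So the recipe couples the two; something like the following schedule, where all the numbers and helper functions are invented just to illustrate the idea of growing the model together with the data:

```python
# Illustrative schedule only: at each unsupervised iteration, roughly double
# the audio and also grow the number of context-dependent tied states.
schedule = [
    {"hours": 60,  "tied_states": 3000},
    {"hours": 120, "tied_states": 6000},
    {"hours": 250, "tied_states": 12000},
    {"hours": 500, "tied_states": 24000},
]

# model = seed_model
# for step in schedule:
#     hyps = decode(model, select_audio(step["hours"]), language_model)
#     model = train_acoustic_model(hyps, n_tied_states=step["tied_states"])
```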
Afterwards we did some additional tuning of parameters, two-pass decoding, and a four-gram LM, and you can see the improvement over the original system. Here CI is case-insensitive and CS is case-sensitive, not context-independent and context-dependent: we're looking at the word error rate taking case into account, because if people are going to read the output it really should have the correct case, and even for search engines it's sometimes important, because you want to know whether something is a proper name or not.
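The scoring difference is just a normalization step before computing the error rate; a minimal sketch, assuming you already have some wer() scoring function (a hypothetical hook here):

```python
# Case-insensitive vs case-sensitive scoring: the CI number is obtained by
# lowercasing both reference and hypothesis before scoring.

def case_insensitive_wer(reference, hypothesis, wer):
    return wer(reference.lower(), hypothesis.lower())

def case_sensitive_wer(reference, hypothesis, wer):
    # here a case difference ("paris" vs "Paris") counts as a substitution
    return wer(reference, hypothesis)
```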
And for people who are fans of neural net language models: we got about one and a half to two percent gain by adding them. This is on dev data, and when we validated we got pretty much the same results. So we were happy with that: it's completely unsupervised, and we developed the system in less than a month. At the end we then tried roughly the same thing for Hungarian.
For Hungarian we used seed data from five languages; we had less audio data, so we only went up to about three hundred hours. We originally used an MLP trained on English, and then we used the transcripts at this stage to train an MLP for the target language on the unsupervised transcriptions, which gained us another two point eight percent or so.
So, to give an overview, here are some results from the program, which some of you are aware of, I'm sure, and some of you less so. The systems on the left are trained on supervised data, which varies from fifty to a hundred and fifty or two hundred hours depending on the language. The ones on the right are all trained unsupervised. The lower line is the average error rate across the test data, about three hours per language. You can see the error rates going up: the ones on the right are in general a little bit higher, while the ones on the left are pretty good. Bulgarian and a couple of others are a little bit higher here, and then there's Luxembourgish, which I'll come back to in a few minutes.
If you look at the lowest word error rates, on some of the segments we had from TV and radio they're pretty low; in fact some of the unsupervised systems are even below the supervised ones. And finally, the worst-case word error rates are still pretty high, so we still have a fair amount of work to do. These data are mixed news and conversations, and some languages are more interactive than others, things like that.
I'm going to skip the next slide, which has too much stuff on it; it was to show the amount of data we used, so if people are interested, come find me later. I want to say two or three words about dictionaries. One of the things we always say is that producing pronunciation dictionaries is very costly, and so there has been, more recently, a growing interest in using graphemic-type units rather than phonetic units in systems.
The first work I found on this was Kanthak and Ney; maybe people are aware of earlier work doing this that I'm not aware of.
It avoids the production of the pronunciation dictionary: basically, the G2P problem becomes a text normalisation problem. You still have numbers and things like that, so you have to convert dates and times and all those types of things into words, and then the units come directly from the letters.
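In other words, something as simple as the following, where normalize_text() is a hypothetical placeholder for that normalisation step:

```python
# Sketch of building a graphemic lexicon: after text normalization (numbers,
# dates, etc. expanded into words), the "pronunciation" of a word is simply
# its sequence of letters.

def graphemic_lexicon(word_list):
    """Map each word to a pronunciation made of its graphemes."""
    lexicon = {}
    for word in word_list:
        lexicon[word] = list(word.lower())   # e.g. "salam" -> ['s','a','l','a','m']
    return lexicon

# words = set(normalize_text(raw_text).split())
# lexicon = graphemic_lexicon(words)
```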
We did this at LIMSI for Turkish, Tagalog, and Pashto within the Babel program, and, like previous studies, we got roughly comparable results in general, but for some languages we actually do better with the graphemic systems than with the phonemic systems. In fact, I should mention that back in the GALE days there was already work using graphemic systems.
Here are some results for Pashto in the Babel program. We had a two-pass system using the BBN voice activity detection and the BUT features (thank you), and we built both graphemic and phonemic systems. You can see that they're about the same, with the phonemic about one point higher than the graphemic, but if we do a two-pass system where we cross-adapt the phonemic and graphemic systems, we actually get a reasonable gain from that. We believe that one of the problems in Pashto was actually poor pronunciation generation, so the pronunciations are bad, or there's a lot of variability that isn't covered, and that's where the graphemic systems can actually outperform the phonemic ones.
So let me now talk about Luxembourgish, just because it's fun. This is work done with Martine, who is from Luxembourg, for those of you who know her. It's a little country, with not too many people, but it's a really multilingual environment: when people go to school the first language of instruction is German, then I believe French, and English after that, but at home they speak their local language, Luxembourgish. And apparently, even though it's a tiny country (you could close your eyes and miss it; okay, I'm exaggerating a little bit), you even have multiple dialects in different regions.
The first studies we did were just segmentation experiments to see which languages are favoured by Luxembourgish data. We had basically no transcribed data at the time, so we transcribed ten or fifteen minutes of data ourselves. Then we did some approximate phone mappings: some Luxembourgish sounds map pretty well onto French, English, and German, while others don't really exist in English, say, but can be covered from German and French, in which case for English we just use the closest thing. That gives us a mapping with a comparable set of phonemes for each language. Then we built models for the three languages in parallel, took the superset of the models, and aligned the Luxembourgish data with it, using the transcriptions of the small amount we had, to see which language's models were preferred.
I don't know that much about the language myself, but you can have French words inserted in the middle of a sentence, so apart from the languages being related there's also mixing. So we allowed the language to change at word boundaries: within a word you had to use the phones from a given language's seed models, but at word boundaries the language could change. Basically we found, as you might expect since Luxembourgish is a Germanic language, that in general the segmentation preferred German; second was English, which is the next closest; but about ten percent of the alignments went to French. For English it was typically diphthongs that got selected; we don't really know exactly why.
Based on that, a couple of years later, we got some transcribed broadcast news data in Luxembourgish. Our seed models are context-independent, they're tiny, they're not going to perform well, and we just decoded the two or three hours of training data. You can see that the word error rates are quite high, as you'd expect, but in the right range for the amount of data and for the fact that the models are context-independent, and the German models were preferred. We also built models where we pooled the data from the languages together, and those did a little less well than the German ones. However, before we knew this (because we didn't have the data when we started), we had used the pooled models to do the automatic decoding. Once again we applied our standard techniques, and you can see that we go from about thirty-five to about twenty-nine percent word error rate by doubling the data, increasing the context, adding MLP features, et cetera, and we were able to model more context.
But the error rate is still kind of high compared to some of the other languages, so Martine looked at a classification of the errors, and you can see there are a lot of confusions between homophones. Some of this data is pretty interactive, it's not the same broadcast news type of data, and there are human production errors: people make false starts and repetitions, or mispronounce a word. And then a large percentage, we estimated somewhere between fifteen and twenty percent, are writing variants, because Luxembourgish is mostly a spoken language: are these really errors or not? Here's an example of some of the writing variants of the word for Saturday; I'm not going to try to say them all, that's probably not how you pronounce them. All of these written forms are allowable: depending on the regional variant you can say it one way or another, you can change the vowel, and all of these are accepted in the written form; we find them in the texts and in what people say.
So even though, I don't know, people may not really consider this a low-resource language, there's not much data, almost none available: the language is used in speaking but not really in writing. How am I doing for time? Okay, good.
I'm going to speak for one minute about Korean. We're trying to do the same thing for Korean, but once again we don't have transcribed dev data. We were trying to do a study where we look at the size of the language model used for decoding in the unsupervised training: a hundred and twenty thousand words, two hundred thousand, two million, or a two-thousand-character language model, just for the decoding. We also looked at using phone units versus half-syllable acoustic units. The only data we had was from the LDC, a roughly ten-hour dataset. If we do standard training on it, holding out the last two files for dev because we didn't have anything else, we get a word error rate of about thirty percent and a character error rate of about twenty percent; on this data that's probably optimistic, because the dev data, being just the last two files, is very close to the training data. So what we're doing here is increasing the amount of data we use from the web and looking at the influence on the word error rate and the character error rate when using the different-sized language models. You can see that for the two hundred thousand words and the two million words it's about the same; sorry, these results are all decoded with the same language model in the end, it's only what we use for the unsupervised training that's changing. We see that the results are basically the same as we add the same data, but the character language model, where we skip the word segmentation step, is doing slightly better in terms of character error rate than the others. We don't know if it's real, we need to look into it some more, since this is really recent work. And for people who think it's easy to get transcribers: we've been looking for a month for someone in France who has the right to work and who can transcribe Korean for us; we finally found someone, and they can't start working until February. So yes, it may seem like an easy thing to do, but depending on your constraints for hiring it's not so easy. We're going to follow up on this, and we hope to have some clearer results in the end.
Two words about acoustic model interpolation, because we talked earlier about having all this heterogeneous data: how do we combine data from different sources? Rich made the statement that you want to use all of it, you don't want to throw it away, but then you need some kind of data weighting. One way of doing it is to simply duplicate some of the data and remove some of the rest. As an alternative, a colleague at LIMSI has been working on acoustic model interpolation, and had a paper at Interspeech, I think, looking at whether you can train models on different subsets and then interpolate them. We used this on European Portuguese: the baseline pooled model gave thirty-one point seven percent, and the interpolation gave about the same result, but it's easier to deal with, because you can train your models on smaller sets and then interpolate them, the same idea as what has been done for language modeling for years.
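In a GMM-HMM setting you can do the interpolation per state by combining the mixtures; a minimal sketch, with illustrative structures and names (gmm_broadcast, gmm_telephone) rather than any real toolkit's API:

```python
# Acoustic model interpolation sketch: train separate models on different data
# subsets, then combine the Gaussian mixtures for each state with interpolation
# weights, the same way language models are interpolated.

def interpolate_gmms(gmms, lambdas):
    """Each GMM is a list of (weight, mean, variance) components.
    Returns the union of components with weights scaled by the lambdas."""
    assert abs(sum(lambdas) - 1.0) < 1e-6
    combined = []
    for gmm, lam in zip(gmms, lambdas):
        for weight, mean, var in gmm:
            combined.append((lam * weight, mean, var))
    return combined

# e.g. per-state combination of two source-specific models (hypothetical data):
# state_gmm = interpolate_gmms([gmm_broadcast, gmm_telephone], [0.7, 0.3])
```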
We also looked at what we gain with different variants of English; this is some work that was published back in 2010. Basically we gained a little bit for some of the variants and we didn't degrade for any of them with respect to the original pooled model, whereas with MAP adaptation we actually did a little worse on one or two of the variants, I don't remember which ones.
So let me finish up. I guess the take-home message is that unsupervised acoustic model training is successful. It's been applied a lot to broadcast-news-type data and, more recently in the Babel project, to a wider range of data, and I think it's really exciting that we can do that. We still have to find the data, but it's really nice that we can do this type of thing, even if the error rates are still kind of high; we're going in the right direction in general. I'm sure Rich or other people can say more about this during the meeting. Something that's interesting, and this will make some people from yesterday happy, is that the MLPs seem more robust to the fact that the transcriptions are imperfect: they take less of a hit than the HMMs, which is an interesting observation. So we can use this untranscribed, automatically transcribed data to produce references for training the MLPs, which is really nice. The hope is that this type of approach will allow us to extend to different types of tasks more easily, so you don't have to spend the time collecting data and transcribing what you collect, although we still have to collect it.
I didn't speak about multilingual acoustic modeling, which is something people have shown can help for bootstrapping. Should we just take models from another language, or is it better to use multilingual models, can we do better? I think it depends on what you have at hand. It would be nice if it worked, but everything we've tried in Babel has made things worse, so I'm a little bit disappointed with what we've been doing there.
Then of course there's something I didn't talk about: what do you do with languages that have no standard written form? I touched upon it with Luxembourgish, for example. I don't really know; we're trying to do some work on this, others are working on it too, and there's even a paper of ours from around 2005, I think, on trying to automatically discover lexical units. But one of the main problems is knowing whether the units you find are meaningful. I've said a bunch of times, I hear myself saying it, that our systems are sort of phonetic parrots: they're going to learn what people say, including things like "like" and "you know", and a word like "like" is meaningful in some cases and not meaningful in others. So how do you deal with that, how do you know what's useful?
But I think it's really exciting, and it's been fun. I hope that those of you who have worked on unsupervised training will continue, and that those of you who haven't will give it a try. So, thank you for giving me the opportunity to speak. These are all the people I've worked with closely on this work, and there are probably other people I've forgotten; sorry. So thank you.
Thanks, Lori. We have some time for questions.

You showed a nice gradual improvement as more data is added. Do you have any idea of what is being improved? I mean, which words are getting better and which ones stay bad?

Probably not, no, we haven't looked at that; that's an interesting thing to look at. Something that we would like to do, but haven't done yet, is to not just incrementally increase the amount of data but to also change the datasets, to just use different random portions, which I think would give better coverage, because once your models like something, they're going to continue liking it. But looking at the words would be interesting.
So the question is: if you talk to a machine-learning person who works on semi-supervised learning, they get really nervous when you say self-training or self-supervision, because if you're starting with something that isn't working that well, it can go unstable in the opposite direction. So there's this sensitivity: if you're starting with a baseline recognizer trained on a small amount of data that works reasonably well, you can improve, but if you're starting with something that works really badly, it can get worse. And I noticed that a lot of the results in this talk were on broadcast news, where you're starting with a reasonably well-performing system. Were all these results, for all the languages, like that, or did you start from zero, with zero transcribed data in the language?
Okay, so for all of these languages here on the right we had zero in-language transcribed data. We started with context-independent seed models taken from other languages, reduced as much as possible, and the word error rate is roughly sixty to eighty percent when you start on your data. So we are starting really high. But our language models, even though they're trained on newspaper text, newswire text, and things like that, are pretty representative of the task, so there can be very strong constraints in there. Which is why I find the Babel work really exciting (we haven't done the unsupervised part ourselves there): you don't have this strong constraint coming from a language model, all you have is a small amount of transcriptions, about ten hours, so that's the only information coming into the system from the text side. And that information coming from the text is, I personally believe, why this works.
One more thing to note: if you don't normalise the text correctly... We had situations where people said, I don't know how to pronounce numbers, I'm just going to keep them as digits; then convergence is a lot harder, it doesn't work very well. If you say, I'm going to throw away the numbers, which is what some of the people doing the language modeling did, it also doesn't work so well. So you really need to have something that represents the language pretty well, at least that's how it seems from our experience. We also had some languages where, when you take texts that are online, there are sometimes texts in other languages mixed in, and you actually have to filter out the texts that are not in the language you're targeting.

Did you want to come up and comment? I think that can come up during the questions... okay, no? Alright, one last question then.
Depending on the application, at the end of the day you may want to have readable text, for instance for translation or broadcast news, and at that point getting things like names right is probably more important than a fraction of a percent of error rate. Do your systems do case and punctuation?

Let me just say that all of our systems do case and punctuation, but we're not measuring the word error rate on the punctuation; all the systems produce punctuated, case-sensitive output. The named entities are not specifically detected, but hopefully, if they are proper nouns, they will be uppercased if we did the language modeling right.
This is something we've actually looked at; that's why I had the slide with the case-sensitive and case-insensitive word error rates, where there's about a two percent difference. The punctuation is a lot harder to evaluate. In some work we did with a former colleague we tried to evaluate the punctuation, and it's very difficult: if you give the task to humans they don't agree on how to punctuate things. Maybe not for full stops, where I believe there's about eighty percent inter-annotator agreement, but if you go to commas it drops a lot. So it's very hard. What you really want is something that's acceptable; you don't really care about matching a single ground truth. I think it's sort of like translation: you don't care exactly what it is as long as it's reasonable, as long as the punctuation is reasonably correct. If multiple forms are possible, just as you can translate something in multiple ways, then getting any one of them that's correct should be fine. So I think punctuation falls in the same category: whether you put a comma after something or not is not very important as long as you can understand it correctly. I think it's a really hard problem to evaluate, in fact even harder than doing something that seems reasonable.
We add the punctuation and case in a post-processing step. Other sites have done it too; BBN has done punctuation, and the other sites as well, over the past ten years. But you're right.

Okay.