Well, my paper was "Acoustic Unit Discovery and Pronunciation Generation from a Grapheme-Based Lexicon." This is work done with Anindya Roy, Lori Lamel, and Jean-Luc Gauvain.
The basic premise was: given a task where you just have parallel text and speech, but no pronunciation dictionary and no knowledge of a phonetic set of acoustic units, can you learn the acoustic units and then the pronunciations for each word using those discovered acoustic units?
We broke the task into two parts: the first stage was learning the acoustic units, and the second stage was learning the pronunciations from them, with the hope that doing each one individually would improve performance and doing them together would make things even better.
We start from the assumption that you can at least train a grapheme-based speech recognizer that will produce some reasonable performance. Once we train a context-dependent HMM-based recognizer, we cluster the HMMs using spectral clustering into some preset number of acoustic units. From there we have a direct mapping from a context-dependent trigrapheme down to one of these new acoustic units, so using the original grapheme-based lexicon you can automatically map each pronunciation to the new acoustic units.
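As a rough illustration of that clustering step, here is a minimal sketch, assuming we already have a pairwise similarity matrix between the context-dependent grapheme HMMs (the matrix below is a random placeholder; in the real system it would come from distances between the HMMs' parameters):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n_models = 200                          # number of context-dependent trigrapheme HMMs
sim = rng.random((n_models, n_models))
affinity = (sim + sim.T) / 2            # spectral clustering expects a symmetric affinity
np.fill_diagonal(affinity, 1.0)

n_units = 64                            # preset number of acoustic units
clusterer = SpectralClustering(n_clusters=n_units, affinity="precomputed", random_state=0)
unit_of_model = clusterer.fit_predict(affinity)   # acoustic-unit label for each HMM

# With this mapping, every pronunciation in the grapheme lexicon can be rewritten
# as a sequence of acoustic-unit labels.
print(unit_of_model[:10])
```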
Performing recognition with that system does produce some small gain, but it was our belief that using the acoustic units in that manner might not be the best way to do it; there might be better ways of getting better pronunciations.
So we took a machine-translation-based approach to transform those pronunciations into a new set. Using the newly discovered acoustic units we decode the training data to generate a set of pronunciation hypotheses for each word, and from there, using Moses, you can learn a set of phrase translations, basically rules to translate one sequence of units into another. Using that phrase translation table you can then transform the original lexicon into a new one.
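A toy sketch of what applying such a phrase table to a lexicon entry looks like; the rules and unit names below are made up, whereas in the paper the rules come from Moses trained on the decoded pronunciation hypotheses:

```python
def apply_rules(pron, rules):
    """Greedily rewrite a pronunciation (list of unit labels) left to right,
    preferring the longest matching source phrase at each position."""
    out, i = [], 0
    by_len = sorted(rules.items(), key=lambda kv: -len(kv[0]))
    while i < len(pron):
        for src, tgt in by_len:
            if tuple(pron[i:i + len(src)]) == src:
                out.extend(tgt)
                i += len(src)
                break
        else:                      # no rule matched: copy the unit unchanged
            out.append(pron[i])
            i += 1
    return out

rules = {("u3", "u7"): ("u12",), ("u5",): ("u5", "u9")}    # hypothetical phrase pairs
print(apply_rules(["u3", "u7", "u5", "u2"], rules))        # ['u12', 'u5', 'u9', 'u2']
```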
Unfortunately, using that table directly actually makes performance significantly worse, mainly because there's a lot of noise in the hypothesized pronunciations.
So we rescore the rules in the phrase table by applying each rule individually and then, through forced alignment, seeing how it affects the log-likelihood of the training data. We found that if we prune the phrase table, keeping only the rules that improve the likelihood of the data, and then transform the final lexicon, we end up with improved performance. Once we have the final lexicon we start over from the beginning and train the system up as we had before for recognition.
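A minimal sketch of that pruning loop, assuming some scoring function that returns the forced-alignment log-likelihood of the training data for a given lexicon (the one below is a dummy stand-in) and reusing the apply_rules helper from the sketch above:

```python
def prune_phrase_table(rules, lexicon, align_loglik):
    """Keep only the rules whose individual application raises the data likelihood."""
    baseline = align_loglik(lexicon)
    kept = {}
    for src, tgt in rules.items():
        candidate = {w: apply_rules(p, {src: tgt}) for w, p in lexicon.items()}
        if align_loglik(candidate) > baseline:
            kept[src] = tgt
    return kept

# Toy usage with a fake likelihood that simply prefers shorter pronunciations.
lexicon = {"cat": ["u3", "u7", "u2"], "dog": ["u5", "u1"]}
rules = {("u3", "u7"): ("u12",), ("u5",): ("u5", "u9")}
fake_loglik = lambda lex: -sum(len(p) for p in lex.values())
print(prune_phrase_table(rules, lexicon, fake_loglik))   # keeps only the shortening rule
```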
Okay, my paper was "A Hierarchical System for Word Discovery Exploiting DTW-Based Initialization." What we basically did was an unsupervised word discovery task on continuous speech; in our case we used a completely zero-resource setup, meaning we only have the audio data and no other information.
For the task we used a small-vocabulary setting, in our case just the TI-digits database with its eleven digits, and we used a hierarchical system for the word discovery. Hierarchical in this case means we have two levels. On the first level we have the acoustic unit discovery, trying to discover acoustic units as sequences of feature vectors; it's basically the same as what was presented earlier today. On the second level we do the discovery of words as sequences of those acoustic units, which basically amounts to learning the pronunciation lexicon.
For the first level, as I said, we do the acoustic unit discovery. It is similar to self-organizing units: what we basically do is segment our input, cluster all segments to get an initial transcription of the data, and then do iterative HMM training of acoustic models for the acoustic units. Finally we get a transcription of our audio data as a sequence of acoustic units.
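A rough sketch of that initialization idea with placeholder features and a fixed-length segmentation; the real system uses a proper segmentation and then iterative HMM training on top of these labels:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 39))      # e.g. MFCC+delta frames of one recording
boundaries = np.arange(0, 1000, 10)             # placeholder segmentation every 10 frames

segment_means = np.stack([features[b:b + 10].mean(axis=0) for b in boundaries])
n_units = 30
labels = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit_predict(segment_means)

# 'labels' is the initial acoustic-unit transcription; it would then seed HMM training,
# re-decoding, and re-training until the unit labels stabilize.
print(labels[:20])
```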
For the second level we do the word discovery, which means we are trying to discover words in an unsupervised way on our sequence of acoustic units. What we do there is use a probabilistic pronunciation lexicon, because obviously our acoustic unit sequence will be very noisy, so we cannot use a one-to-one mapping; we need some probabilistic mapping. To do this mapping we use an HMM for each word, where the HMM states have discrete emission distributions over the acoustic units.
Additionally, the transition probabilities of the HMM are governed by another distribution, so that you get a kind of length modeling. The parameters of these HMMs are learned in an unsupervised way: for each utterance we connect the word HMMs by a language model, in our case a unigram language model, and then simply estimate the parameters using an EM algorithm. We look at what sequence of HMMs we converge to, and finally we get a transcription of the audio into word units.
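A small numpy sketch of one such word HMM, a left-to-right chain whose states emit discrete acoustic-unit symbols; the parameters here are random placeholders, while in the paper they are estimated with EM from the unlabeled unit sequences:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_units = 4, 30
emis = rng.dirichlet(np.ones(n_units), size=n_states)     # P(unit | state)
trans = np.zeros((n_states, n_states))
for s in range(n_states):                                  # left-to-right: stay or advance
    if s + 1 < n_states:
        trans[s, s], trans[s, s + 1] = 0.5, 0.5
    else:
        trans[s, s] = 1.0

def forward_loglik(units, pi, trans, emis):
    """Log-likelihood of an acoustic-unit sequence under the word HMM (scaled forward)."""
    alpha = pi * emis[:, units[0]]
    logp = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for u in units[1:]:
        alpha = (alpha @ trans) * emis[:, u]
        logp += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return logp

pi = np.eye(n_states)[0]                                   # start in the first state
units = rng.integers(0, n_units, size=12)                  # a decoded unit sequence
print(forward_loglik(units, pi, trans, emis))
```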
Doing this with random initialization, so a completely unsupervised setup, we get a sixty-eight percent word accuracy. This was done speaker-independently, meaning we put all the data into one group for the learning and the segmentation, and still get that sixty-eight percent.
This was done unsupervised, but we can go a step further using light supervision. In our case we used the DTW algorithm of Jim Glass to do pattern discovery on a small subset of our input data: using nine percent of the input data we ran the DTW algorithm and discovered about four percent of the segments. We used these four percent of the segments to initialize our word HMMs, and we used light supervision by just labeling the discovered segments, which you can do by listening to them. When running the learning again we get an eighty-two percent word accuracy in the end, so using the light supervision obviously improves the results quite significantly.
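For reference, a bare-bones DTW distance of the flavor used in that pattern discovery step; the real system applies segmental DTW to untranscribed audio, and the random inputs here are only for illustration:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW with Euclidean local cost; returns the accumulated alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(2)
x = rng.standard_normal((40, 13))     # two feature sequences (e.g. MFCCs)
y = rng.standard_normal((55, 13))
print(dtw_distance(x, y))             # low values would indicate a recurring pattern
```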
What we finally did is an iterative speech recognizer training: going back to a standard speech recognizer with HMMs and GMMs, using the transcriptions we get from our unsupervised or lightly supervised learning, and iteratively retraining the recognizer. In the random-initialization case we then go from sixty-eight percent to eighty-four percent, and in the lightly supervised case we go from eighty-two percent to basically ninety-nine percent, which is close to the baseline when using supervised training on TI-digits.
Okay, in our work we are looking at the low-resource setting; we focus on the pronunciation dictionary, but we do not go all the way to the zero-resource approach, so we still assume that we have a very small initial dictionary to start with, so that we can bootstrap our system. Our task is this: given that small initial dictionary, we can simply train up a grapheme-to-phoneme conversion model and then generate multiple candidate pronunciations for the words appearing in the training data.
In our system we assume that our initial dictionary is small, maybe even noisy, but we assume that we have a large amount of audio data with word-level transcriptions. So even though we do not have the pronunciations for all the words, we want to learn the pronunciations for the words in the training data. Since the grapheme-to-phoneme model can be very noisy, because its training sample is very small, we generate all the possible pronunciations for the words in the training data, and then we fit the audio data to the model and use it to weight the pronunciations. Here we have another problem: because we keep multiple pronunciations for each word, for an utterance with, say, N words the number of possible pronunciation sequences can be exponential in the number of words, so learning them exactly can be a problem.
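A toy sketch of the weighting idea: each candidate pronunciation gets an acoustic score from the utterances in which the word occurs, and the scores are normalized into a probabilistic lexicon entry (the candidates and scores below are invented; in the real system they would come from alignment against the audio):

```python
import math

def weight_pronunciations(candidate_logliks):
    """Turn per-candidate acoustic log-likelihoods into pronunciation probabilities."""
    m = max(candidate_logliks.values())
    unnorm = {p: math.exp(ll - m) for p, ll in candidate_logliks.items()}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

# Hypothetical G2P candidates for one word, with made-up alignment scores.
scores = {"k ae t": -120.4, "k aa t": -131.9, "c ae t e": -140.2}
print(weight_pronunciations(scores))
```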
So we compared two different ways to train the model. One just uses the Viterbi approximation: at each iteration it takes the most likely pronunciation sequence for each training utterance and uses that single best path to collect the statistics. This is just an approximation, so we have to answer the question of whether it is a good one. We therefore also built another system that learns the model exactly, where we do not do any pruning; other people actually use models that, let's say, only consider an n-best list, which is also an approximation. So we wanted to know whether the Viterbi approximation performs well, that is, whether it is a good approximation.
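A tiny numeric illustration of the difference between the two: Viterbi training keeps only the best pronunciation sequence, while the exact computation marginalizes over all of them (the log-scores below are invented):

```python
import numpy as np
from scipy.special import logsumexp

seq_logliks = np.array([-250.1, -251.3, -258.7, -263.0])   # all pronunciation sequences

viterbi = seq_logliks.max()            # score of the single best sequence
exact = logsumexp(seq_logliks)         # marginal over every sequence
print(viterbi, exact)                  # the gap shows how much mass Viterbi ignores
```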
To make the exact computation work, since the number of pronunciation sequences for each utterance is very large, we use existing techniques, namely WFSTs: we represent the pronunciations and the word sequence of each utterance as WFSTs and then use the existing composition and determinization algorithms, which are all available in the OpenFst toolkit. We used them and found them to be very efficient tools for training the model.
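A very small sketch of that kind of machinery, assuming the pywrapfst Python bindings for OpenFst; the integer labels and sequences are arbitrary placeholders, and some binding details differ across OpenFst versions:

```python
import pywrapfst as fst

def linear_acceptor(labels):
    """Build a linear acceptor over non-zero integer labels."""
    f = fst.VectorFst()                  # older bindings expose this as fst.Fst()
    one = fst.Weight.one(f.weight_type())
    s = f.add_state()
    f.set_start(s)
    for lab in labels:
        nxt = f.add_state()
        f.add_arc(s, fst.Arc(lab, lab, one, nxt))
        s = nxt
    f.set_final(s, one)
    return f

a = linear_acceptor([3, 7, 2])           # e.g. a candidate pronunciation sequence
b = linear_acceptor([3, 7, 2])           # e.g. the reference side it is matched against
a.arcsort(sort_type="olabel")            # compose() needs sorted arcs on the shared side
b.arcsort(sort_type="ilabel")
composed = fst.compose(a, b)             # intersection of the two acceptors
det = fst.determinize(composed)          # determinization keeps the result compact
print(det.num_states())
```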
Given the two algorithms we can compare the two models, but there is one more thing I have to mention: because our grapheme-to-phoneme conversion model and our acoustic and pronunciation models depend on each other, we have to iterate, fixing one and re-estimating the other and then going back. After a few iterations the system converges.
We did some experiments on a large-vocabulary data set: the expert lexicon has thousands of words and the total training data is about three hundred hours. Our system starts with pronunciations for only about fifty percent of the words in the training data at initialization, and we compare against a system that simply uses the expert lexicon. There is about a one to one-and-a-half percent gap, but I think this shows that starting from a simple initial lexicon we can reach speech recognition performance nearly equivalent to the expert lexicon, which is encouraging. I think that's all; this is just an initial study of this line of work.
My paper is titled "Probabilistic Lexical Modeling and Unsupervised Adaptation for Unseen Languages." In this paper we propose a zero-resource ASR approach, specifically with zero linguistic resources and zero lexical resources from the target language, in the framework of KL-HMM based ASR using posterior features. We only use the list of possible words of the target language, without knowledge of the grapheme-to-phoneme relationship.
The system uses a neural network trained on language-independent, out-of-domain data to estimate posteriors, and uses graphemes as subword units, which avoids the need for a lexicon. I will focus on three points: what the KL-HMM approach is, what has been done with KL-HMM until now, and what we did differently in this paper.
To briefly explain the KL-HMM approach: the posterior probabilities estimated by a neural network are directly used as feature observations to train an HMM whose states are modeled by categorical distributions. The dimension of the categorical distributions is the same as the output of the MLP, and the parameters of the categorical distributions are estimated by minimizing the KL divergence between the state distributions and the feature vectors belonging to that state.
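A compact numpy sketch of that state model under one common choice of the divergence direction, KL(state distribution || posterior feature); with that cost, the minimizing categorical parameters are the normalized geometric mean of the posterior vectors assigned to the state (the posteriors below are random stand-ins for MLP outputs):

```python
import numpy as np

rng = np.random.default_rng(3)
posteriors = rng.dirichlet(np.ones(40), size=200)   # MLP outputs for frames of one state

def kl(y, z):
    """KL divergence between two categorical distributions."""
    return float(np.sum(y * np.log(y / z)))

def state_update(Z):
    """Categorical state parameters minimizing sum_t KL(y || z_t)."""
    log_mean = np.log(Z).mean(axis=0)
    y = np.exp(log_mean - log_mean.max())
    return y / y.sum()

y = state_update(posteriors)
cost = sum(kl(y, z) for z in posteriors)   # the local score used during training/decoding
print(y[:5], cost)
```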
In fact, the KL-HMM approach can be seen as a probabilistic lexical modeling approach, because the parameters of the model capture the probabilistic relationship between HMM states and MLP outputs. As in a normal HMM system, the states can represent phones or graphemes, context-independent or context-dependent subword units.
As for what has been found until now, to explain the benefits of KL-HMM: we have seen that the neural network can be trained on language-independent data and the KL-HMM parameters can be trained on a small amount of target-language data. In the KL-HMM framework, subword units like graphemes can also give performance similar to systems using phones as subword units. So the grapheme-based ASR approach using KL-HMM has two main benefits. First, it exploits both acoustic and lexical resources available in other languages, because it's reasonable to assume that some languages have lexical resources. Second, the parameters of the KL-HMM actually model the probabilistic relationship between graphemes and phones, so it implicitly learns the grapheme-to-phoneme relationship from the posteriors.
What we do in this work is the following. Normally the KL-HMM parameters are estimated using target-language data; in this work we instead initialize the KL-HMM parameters with knowledge-based parameters. We first define the grapheme set of the target language, we map each grapheme to one or more MLP outputs, that is, phones, and then the KL-HMM parameters are assigned using that knowledge. If there is untranscribed speech data from the target language, we also propose an approach to iteratively adapt the KL-HMM parameters in an unsupervised fashion: given the speech data we first decode the grapheme sequence, then we update the KL-HMM parameters using the decoded grapheme sequence, and the process can be iterated. In this paper the approach was evaluated on Greek, and we used five other European languages, not Greek, as out-of-domain resources.
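A schematic sketch of that adaptation loop; the decode_align function is a hypothetical stand-in for the actual KL-HMM decoder, which would return the state each frame aligns to:

```python
import numpy as np

def adapt_kl_hmm(state_params, posteriors, decode_align, n_iters=3):
    """state_params: dict state index -> categorical vector; posteriors: (T, D) MLP outputs.
    decode_align (hypothetical) returns a length-T integer array of decoded state indices."""
    for _ in range(n_iters):
        frame_states = decode_align(state_params, posteriors)
        for s in state_params:
            frames = posteriors[frame_states == s]
            if len(frames):
                log_mean = np.log(frames).mean(axis=0)   # same geometric-mean update as above
                y = np.exp(log_mean - log_mean.max())
                state_params[s] = y / y.sum()
    return state_params
```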
As for what was not done in this paper and what we plan to do: here we only adapted the KL-HMM parameters from untranscribed speech data, but in the future we also plan to do MLP retraining based on the inferred grapheme sequences. Also, we do not prune the utterances during unsupervised adaptation, so in the future we plan to try pruning or weighting the utterances based on some matching score.
So, does anybody have any questions?

This is kind of an open-ended question, and it's for everybody who talked about automatically learning subword units, so I guess at least the poster presenters and many of the speakers from today. The question is: very often it's hard to define the boundary between subword units, and the more conversational and natural the speech gets, the less well defined those boundaries are. So I was wondering, if you look at the types of boundaries that are being hypothesized, whether you found that this issue causes a problem for you, and whether you have any thoughts on how to deal with it.
I can at least address that. Even though my experiments were on read speech, I did find that a lot of times the pronunciations being learned were heavily reduced, much more reduced than the canonical pronunciations. I think that probably does cause a decrease in accuracy, because it increases the confusability among the pronunciations in the lexicon. I don't really have a good idea on how to fix that, but I think probably maintaining some measure that reduces or minimizes the amount of confusability in the word units you get, similar to the talk that we just saw, saying that it's important to still be able to discriminate between the words that you have in the lexicon.
So, I don't see him anywhere here; I will hunt you down if you've left the room. But I can agree with what William said. On the pronunciation mixture model stuff, we can actually see, when we throw out pronunciations and learn new ones, that we're learning variations that are addressing the reductions you see in conversational speech. I haven't looked closely enough at the units we're learning to know whether you see the same thing there as in the pronunciation stuff, so I would assume something like that is going on, but I can't say definitively.
Maybe I can say loudly what we discussed at the break. I wonder why we need boundaries between the speech sounds at all. We obviously do need a sequence of speech sounds, but it could be enough to have that sequence and accept the fact that speech sounds last a couple of hundred milliseconds each and really overlap, so the boundaries are somewhat arbitrary things, and I don't think we need them; that's my point, but correct me if I'm wrong.
Any other questions or comments? It's been a long day; maybe we should declare success and close the session. Alright, thanks everybody.