0:00:15 | i wish so where |
---|
0:00:17 | Well, my paper was 'Acoustic Unit Discovery and Pronunciation Generation from a Grapheme-Based Lexicon'. |
---|
0:00:23 | This is work done with Anindya Roy, Lori Lamel, and Jean-Luc Gauvain |
---|
0:00:28 | at LIMSI. |
---|
0:00:31 | So the basic premise was: |
---|
0:00:34 | given a task where you just have parallel text and speech, but no pronunciation |
---|
0:00:39 | dictionary or knowledge of the phone set or acoustic units, |
---|
0:00:42 | can you learn the acoustic units and then the pronunciations for words using those discovered |
---|
0:00:47 | acoustic units? |
---|
0:00:49 | So we broke the task into two parts, where the first stage was learning the |
---|
0:00:54 | acoustic units and the second stage was learning the pronunciations from them, |
---|
0:00:58 | with the hope that doing each one individually would improve performance, and then together |
---|
0:01:03 | they would make things even better. |
---|
0:01:06 | So we start from the assumption that you can at least train a grapheme-based |
---|
0:01:09 | speech recognizer that will produce some reasonable performance. |
---|
0:01:16 | Once we train a context-dependent HMM-based recognizer, we cluster the HMMs using |
---|
0:01:24 | spectral clustering |
---|
0:01:27 | into some preset number of acoustic units. |
---|
0:01:30 | And there we have a direct mapping from a context-dependent tri-grapheme down |
---|
0:01:38 | to one of these new acoustic units, so using the original grapheme-based lexicon you |
---|
0:01:44 | can automatically map each pronunciation to the new acoustic units. |
---|
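As a rough illustration of the clustering step just described, here is a minimal Python sketch. It assumes each tri-grapheme model is summarized by a single diagonal-covariance Gaussian and uses scikit-learn's spectral clustering as a stand-in for the actual algorithm; the affinity construction and all names here are illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def symmetric_kl_diag(m1, v1, m2, v2):
    """Symmetrized KL divergence between two diagonal-covariance Gaussians."""
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
    return kl12 + kl21

def cluster_trigrapheme_hmms(means, covs, n_units=100):
    """Map each tri-grapheme model to one of n_units discovered acoustic units."""
    n = len(means)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = symmetric_kl_diag(means[i], covs[i], means[j], covs[j])
            dist[i, j] = dist[j, i] = d
    # Turn distances into an affinity matrix for spectral clustering.
    affinity = np.exp(-dist / (dist.std() + 1e-8))
    labels = SpectralClustering(n_clusters=n_units,
                                affinity="precomputed").fit_predict(affinity)
    return labels  # labels[i] = acoustic unit index for tri-grapheme i
```

With such a labelling, rewriting the grapheme-based lexicon is a pure table lookup, which is why the mapping to the new units can be done automatically.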
0:01:49 | Performing recognition with that system does produce some small gain, but |
---|
0:01:55 | it was our belief that |
---|
0:01:57 | using those acoustic units in that manner might not be the best way to do it; |
---|
0:02:02 | there might be better ways of generating |
---|
0:02:04 | better pronunciations. |
---|
0:02:07 | So we took a machine-translation-based approach to transform those pronunciations into a new |
---|
0:02:13 | set. |
---|
0:02:13 | So using the newly discovered acoustic units, we decode the training data to generate |
---|
0:02:20 | a set of pronunciation hypotheses for each word. |
---|
0:02:24 | And from there, using Moses, you can learn a set of phrase translations, which are |
---|
0:02:30 | basically rules to translate one sequence of units into another. |
---|
0:02:36 | So using that phrase translation table, you can transform the original lexicon into a |
---|
0:02:40 | new one. |
---|
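To fix ideas, here is a hedged sketch of applying a Moses-style phrase table to lexicon entries. Real Moses phrase tables carry translation scores, which this toy greedy longest-match pass ignores; the unit names and rule are made up.

```python
def transform_pronunciation(units, phrase_table, max_len=3):
    """Greedily rewrite a unit sequence with longest-match phrase rules."""
    out, i = [], 0
    while i < len(units):
        for n in range(min(max_len, len(units) - i), 0, -1):
            src = tuple(units[i:i + n])
            if src in phrase_table:
                out.extend(phrase_table[src])
                i += n
                break
        else:
            out.append(units[i])  # no rule applies: copy the unit through
            i += 1
    return out

# Toy example: rewrite one lexicon entry with a single learned rule.
lexicon = {"cat": ["u12", "u7", "u33"]}
rules = {("u7", "u33"): ["u8"]}
new_lexicon = {w: transform_pronunciation(p, rules) for w, p in lexicon.items()}
```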
0:02:42 | Unfortunately, using that directly actually makes performance significantly worse, mainly because there's |
---|
0:02:49 | a lot of noise in the pronunciations that are hypothesized. |
---|
0:02:53 | So we rescore the rules in the phrase table by applying each rule individually |
---|
0:02:57 | and then, through forced alignment, seeing how it affects the log-likelihood of the training data. |
---|
0:03:05 | And we found that if we just prune the phrase table, keeping the rules that |
---|
0:03:08 | improve the likelihood of the data, and then transform the final lexicon, we end up |
---|
0:03:15 | with improved performance. Once we get the final lexicon, we start over from the beginning and |
---|
0:03:19 | train the system up again before performing recognition. |
---|
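The rule-rescoring loop might look like the following sketch. Here forced_alignment_loglik is a placeholder, an assumption rather than the paper's code, for running forced alignment over the training data with a candidate lexicon and returning its total log-likelihood; transform_pronunciation is the helper from the previous sketch.

```python
def prune_phrase_table(rules, lexicon, forced_alignment_loglik):
    """Keep only the rules that, applied alone, raise the data likelihood."""
    baseline = forced_alignment_loglik(lexicon)
    kept = {}
    for src, tgt in rules.items():
        # Apply just this one rule to every pronunciation in the lexicon.
        candidate = {w: transform_pronunciation(p, {src: tgt}, max_len=len(src))
                     for w, p in lexicon.items()}
        if forced_alignment_loglik(candidate) > baseline:
            kept[src] = tgt
    return kept
```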
0:03:27 | okay might be better |
---|
0:03:29 | to |
---|
0:03:32 | Okay, my paper was |
---|
0:03:34 | 'A Hierarchical System for Word Discovery Exploiting DTW-Based Initialization', and what we |
---|
0:03:40 | basically did was |
---|
0:03:42 | an unsupervised word discovery task from continuous speech. In our case we use a completely |
---|
0:03:48 | zero-resourced setup, meaning we only have the audio data and no other information. |
---|
0:03:54 | so |
---|
0:03:55 | For the task we used a small-vocabulary setup, in our case just the TI-digits |
---|
0:04:00 | database with eleven digits, and we used a hierarchical system for the word discovery. |
---|
0:04:06 | Hierarchical in this case means we have two levels: on the |
---|
0:04:10 | first level we have the acoustic unit discovery, so |
---|
0:04:15 | trying to discover acoustic units as sequences of feature vectors; it's basically the same |
---|
0:04:20 | as what was presented earlier today. |
---|
0:04:23 | And on the second level we do the discovery of words as |
---|
0:04:26 | sequences of the acoustic units, which is basically equivalent to learning the pronunciation lexicon. So |
---|
0:04:33 | for the first level, as I said, we're doing the acoustic unit discovery. It is |
---|
0:04:38 | similar to self-organizing units, and what we basically do is segment our input, |
---|
0:04:45 | cluster all segments to get an initial transcription for the data, and then do |
---|
0:04:49 | iterative HMM training of the acoustic models for the acoustic units. |
---|
0:04:54 | And finally we get a transcription of our audio data as a sequence of acoustic units. |
---|
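A heavily simplified sketch of that first-level loop follows, assuming the segments come from some presegmentation step and are summarized by their mean feature vector for the initial clustering (all of this is an illustrative stand-in for the actual system, with the iterative HMM refinement indicated only as commented pseudocode).

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_unit_labels(segments, n_units=50):
    """segments: list of (T_i, D) feature arrays from a presegmentation step."""
    centroids = np.stack([seg.mean(axis=0) for seg in segments])
    return KMeans(n_clusters=n_units, n_init=10).fit_predict(centroids)

# Iterative refinement, indicated only as pseudocode: train one HMM per unit
# on its segments, re-decode the audio into a new unit transcription, repeat.
# for it in range(n_iters):
#     models = {u: train_hmm([s for s, l in zip(segments, labels) if l == u])
#               for u in set(labels)}
#     labels, segments = decode(audio, models)
```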
0:05:01 | And on the second level we do the word discovery, which |
---|
0:05:04 | means that we are trying to discover words in an unsupervised way on our sequence |
---|
0:05:10 | of acoustic units. |
---|
0:05:12 | So what we do there is use a probabilistic pronunciation lexicon, because obviously our |
---|
0:05:18 | acoustic unit sequence will be very noisy, so we can't use a one-to-one mapping; we |
---|
0:05:23 | need some probabilistic mapping. And to do this mapping we use an HMM for each |
---|
0:05:29 | word, where the HMM |
---|
0:05:31 | states have discrete emission distributions over the acoustic units. |
---|
0:05:38 | so |
---|
0:05:41 | Additionally, the transition probabilities of the HMM are governed by a length distribution, so |
---|
0:05:47 | that you have a kind of length modeling. |
---|
0:05:49 | and |
---|
0:05:51 | The parameters of these HMMs are learned in an unsupervised way. We do this by, |
---|
0:05:57 | for example, when we want to |
---|
0:05:59 | decode an utterance, building a network of |
---|
0:06:04 | word HMMs connected by a language model, in our case a unigram language model, |
---|
0:06:08 | and then simply estimating the parameters using an EM algorithm. We look at what |
---|
0:06:15 | sequence of HMMs it converged to, and finally we get a |
---|
0:06:21 | transcription of the audio in terms of word units. |
---|
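For concreteness, here is a minimal scaled forward-algorithm sketch for scoring a discovered unit sequence under one word HMM with discrete emissions. In the actual system such word models are concatenated under a unigram language model and trained with EM, which this fragment does not show; the parameter names are illustrative.

```python
import numpy as np

def forward_loglik(obs, start, trans, emit):
    """obs: unit indices; start: (S,); trans: (S, S); emit: (S, U)."""
    alpha = start * emit[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]  # predict, then weight by emission
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()  # rescale to avoid underflow
    return loglik
```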
0:06:25 | Doing this with a |
---|
0:06:27 | random initialization, so a completely unsupervised setup, we get sixty- |
---|
0:06:32 | eight percent accuracy. |
---|
0:06:35 | and |
---|
0:06:37 | Well, this was done in a speaker-independent way, meaning we put all the data into one |
---|
0:06:44 | group and did the learning and the segmentation on that, |
---|
0:06:48 | getting that sixty-eight percent. |
---|
0:06:51 | This was done unsupervised, but we can go a step fur- |
---|
0:06:56 | ther using light supervision. In our case we used the DTW algorithm of Jim Glass |
---|
0:07:02 | to do pattern discovery on a small subset of our input data. |
---|
0:07:08 | Using nine percent of the input data, we ran the DTW algorithm and discovered patterns |
---|
0:07:14 | for four percent of the segments. |
---|
0:07:18 | So we used these four percent of discovered segments to initialize our |
---|
0:07:23 | word HMMs on those segments. |
---|
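As a point of reference, a bare-bones DTW distance between two feature sequences is sketched below. The cited pattern-discovery work uses segmental DTW over a large collection of speech, so this plain pairwise version is only to fix ideas.

```python
import numpy as np

def dtw_distance(x, y):
    """x: (N, D), y: (M, D) feature matrices; returns normalized alignment cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalized path cost
```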
0:07:26 | and |
---|
0:07:27 | We used light supervision by just labeling those discovered patterns; |
---|
0:07:32 | you can do this by listening to the |
---|
0:07:34 | patterns. |
---|
0:07:36 | and |
---|
0:07:37 | When running this and |
---|
0:07:41 | learning again, we get an eighty-two percent word accuracy at the end, so using |
---|
0:07:46 | the light supervision obviously improves the result quite significantly. |
---|
0:07:51 | And what we finally did is an iterative speech recognizer training: so just going |
---|
0:07:57 | back to a standard speech recognizer using HMMs and GMMs, using the transcriptions that |
---|
0:08:02 | we get from our unsupervised or lightly supervised learning, and doing an iterative training of the speech |
---|
0:08:08 | recognizer. |
---|
0:08:09 | So in the random case we then |
---|
0:08:11 | go from sixty-eight percent to |
---|
0:08:15 | eighty-four percent. |
---|
0:08:17 | And in the lightly supervised case we go from the |
---|
0:08:22 | eighty-two percent to basically ninety-nine percent, which is close to the baseline |
---|
0:08:26 | when using supervised training. See the paper for details. |
---|
0:08:36 | Okay, so for our work, we are working on the low-resource setting; we focus |
---|
0:08:42 | on the pronunciation dictionary. But we do not take the zero-resource |
---|
0:08:47 | approach; we still assume that we have a very small initial dictionary to |
---|
0:08:52 | start with, |
---|
0:08:53 | so that we can bootstrap our system. So our task is the following: given |
---|
0:08:58 | that we have a small initial dictionary to start with, we can |
---|
0:09:03 | simply train up a grapheme-to-phoneme conversion model, and then |
---|
0:09:10 | we can generate |
---|
0:09:11 | multiple pronunciations for the words not covered by it in our training data. |
---|
0:09:16 | So in our setting we assume that our seed lexicon is small, and maybe |
---|
0:09:23 | even noisy, but we assume that we have a large amount of word-level transcriptions, |
---|
0:09:28 | I mean, audio data with word-level transcriptions. |
---|
0:09:31 | So even though we do not have the pronunciations for all the words |
---|
0:09:35 | in the vocabulary, we want to learn the pronunciations for |
---|
0:09:41 | the words in our training data. And since the G2P model can be |
---|
0:09:46 | very noisy, because the training sample is very small, |
---|
0:09:50 | we can generate all the possible pronunciations for the words in the training data, |
---|
0:09:54 | and then we can actually use the audio data to fit the model and |
---|
0:10:00 | then to weigh the pronunciations. And here we have another problem: |
---|
0:10:07 | because if we keep multiple pronunciations for each word, then |
---|
0:10:12 | for an utterance with, say, N words, the number of possible pronunciation sequences can |
---|
0:10:17 | blow up exponentially with the number of words, so learning |
---|
0:10:22 | the model directly can be a problem. |
---|
0:10:24 | So we compared two different methods to train the model. One just uses a |
---|
0:10:30 | Viterbi approximation: at every step it takes the most likely pronunciation sequence for each |
---|
0:10:36 | training utterance and uses that single best sequence to |
---|
0:10:41 | update the statistics. |
---|
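A sketch of that Viterbi-style hard-assignment update follows, with best_pron_sequence as a placeholder (my assumption, not the paper's code) for decoding the 1-best pronunciation sequence of an utterance under the current model.

```python
from collections import Counter, defaultdict

def viterbi_update(utterances, best_pron_sequence):
    """One hard-assignment update of word-to-pronunciation probabilities."""
    counts = defaultdict(Counter)
    for words, audio in utterances:
        # One pronunciation per word, taken from the 1-best decode.
        for word, pron in best_pron_sequence(words, audio):
            counts[word][tuple(pron)] += 1
    # Renormalize the counts into pronunciation probabilities per word.
    return {w: {p: c / sum(cs.values()) for p, c in cs.items()}
            for w, cs in counts.items()}
```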
0:10:42 | So this is just an approximation, so we have to answer the question whether |
---|
0:10:47 | it is a good approximation. So we also built another system that |
---|
0:10:51 | is able to learn the |
---|
0:10:53 | model precisely, where we do not do any pruning. Other people actually use |
---|
0:10:58 | models that, let's say, only use an n-best list, so that can also |
---|
0:11:02 | be an approximation, |
---|
0:11:03 | and we have to know if it is a good approximation. |
---|
0:11:07 | In our work, to represent the |
---|
0:11:12 | large pronunciation-sequence space for each utterance, we use |
---|
0:11:21 | weighted finite-state transducer (WFST) techniques: we represent the pronunciation sequences |
---|
0:11:28 | of each utterance as a WFST, and then use the existing |
---|
0:11:33 | composition, minimization, and determinization algorithms; there are efficient implementations of these algorithms in |
---|
0:11:41 | OpenFst. |
---|
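A hedged sketch of this WFST view is below, using the pynini bindings to OpenFst (the choice of bindings is my assumption; the talk only mentions OpenFst-style composition and determinization). Candidate pronunciations, written here as plain symbol strings, become a small union acceptor per word, and an utterance is the concatenation of its word acceptors.

```python
import pynini

def word_fst(prons):
    """Union of a word's candidate pronunciations, given as symbol strings."""
    f = pynini.union(*[pynini.accep(p) for p in prons])
    return f.optimize()  # determinize/minimize via OpenFst under the hood

def utterance_fst(word_fsts):
    """Concatenate per-word acceptors into one utterance-level acceptor."""
    u = word_fsts[0]
    for f in word_fsts[1:]:
        u = u + f  # pynini overloads + as FST concatenation
    return u.optimize()

# Composing the utterance acceptor with an acceptor over a decoded string,
# pynini.compose(utt, pynini.accep(decoded)), is nonempty exactly when the
# decoding is consistent with some pronunciation sequence.
```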
0:11:42 | So we used them, and we found that they are very efficient tools for |
---|
0:11:47 | us to train the model. So given the two algorithms, we can |
---|
0:11:53 | compare the two ways of training the model. There's one more thing to mention: because |
---|
0:11:59 | our grapheme-to-phoneme conversion model and our pronunciation model, and also the one-best |
---|
0:12:03 | decoding, depend on each other, |
---|
0:12:05 | we can see that we can iterate: each time we |
---|
0:12:08 | update one, we can use it to update the other, and |
---|
0:12:15 | go back, |
---|
0:12:16 | so that after a few iterations the system can converge. |
---|
0:12:21 | So we did some experiments; we evaluated on a |
---|
0:12:26 | large data set that comes with an expert lexicon, |
---|
0:12:31 | and the total training data is about three hundred hours. |
---|
0:12:38 | Our system is lacking about fifty percent of the word pronunciations |
---|
0:12:43 | at the initialization, removed at random, so roughly fifty percent to start with, |
---|
0:12:49 | and we compare to using the expert lexicon, since obviously the conventional approach |
---|
0:12:55 | is to use the expert lexicon. |
---|
0:12:58 | There is about a one percent to one point five percent gap there, |
---|
0:13:02 | but I think it is promising that from such a simple setup we can get speech recognition |
---|
0:13:07 | nearly equivalent to the expert lexicon. I think this is just an initial study |
---|
0:13:15 | for future work. |
---|
0:13:27 | My paper is titled 'Probabilistic Lexical Modeling and Unsupervised Training for Zero-Resourced ASR'. |
---|
0:13:33 | In this paper we propose a zero-resourced ASR approach, specifically with zero linguistic and |
---|
0:13:40 | zero lexical resources in the target language, building the system in the framework of KL-HMM with posterior |
---|
0:13:45 | features. So we only use the list of probable words of the target language, |
---|
0:13:50 | without knowledge of the grapheme-to-phoneme relationship. |
---|
0:13:55 | The approach uses an acoustic model trained on language-independent out-of-domain data |
---|
0:14:02 | and uses graphemes as subword units, which avoids the need for a lexicon. |
---|
0:14:07 | So I will focus on three different points: first, what the KL-HMM approach is; |
---|
0:14:13 | second, what has been done until now with KL-HMM; and third, what we did differently in this paper. |
---|
0:14:18 | So, to briefly explain the KL-HMM approach: |
---|
0:14:23 | the posterior probabilities estimated by a neural network are directly used as feature observations to train |
---|
0:14:30 | an HMM, where the states are modeled by categorical distributions. |
---|
0:14:40 | So the dimension of the categorical distributions is the same as the output of the MLP, |
---|
0:14:46 | and the parameters of the categorical distributions are estimated by minimizing the KL divergence between |
---|
0:14:52 | the |
---|
0:14:54 | state distributions and the feature vectors belonging to that state. |
---|
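A small sketch of that state update follows. Which direction of the KL divergence is used is not stated in the talk; under KL(state || posterior), a common choice in the KL-HMM literature, the minimizing categorical distribution is the normalized geometric mean of the posteriors assigned to the state (with the reversed direction it would be the arithmetic mean). The names here are illustrative.

```python
import numpy as np

def kl_div(p, q, eps=1e-10):
    """Per-frame local score KL(p || q) between two categorical vectors."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def update_state_distribution(posteriors):
    """posteriors: (T, D) MLP posterior vectors aligned to one HMM state."""
    log_mean = np.log(posteriors + 1e-10).mean(axis=0)
    c = np.exp(log_mean)
    return c / c.sum()  # normalized geometric mean minimizes sum_t KL(c, y_t)
```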
0:14:58 | In reality, the KL-HMM approach can be seen as a probabilistic lexical modeling approach, because the parameters |
---|
0:15:06 | of the model capture the probabilistic relationship between HMM states and MLP outputs. |
---|
0:15:13 | And as in a normal HMM system, the states can represent phonemes or graphemes, context- |
---|
0:15:20 | independent units or context-dependent subword units. |
---|
0:15:24 | Now, about what we have found until now, to explain the benefits of KL-HMM: we |
---|
0:15:30 | have seen that the neural network can be trained on language-independent data, and the KL-HMM |
---|
0:15:36 | parameters can be trained on a small amount of target language data. |
---|
0:15:40 | In the framework of KL-HMM, subword units like graphemes can also give performance |
---|
0:15:46 | similar to systems using phones as subword units. So the grapheme-based ASR approach |
---|
0:15:52 | using KL-HMM has two benefits. |
---|
0:15:55 | The first is that it exploits both acoustic and lexical resources available in other languages, |
---|
0:16:01 | because it is reasonable to assume that some languages have lexical resources. |
---|
0:16:07 | And second, the parameters of the KL-HMM actually model the probabilistic relationship between graphemes and |
---|
0:16:14 | phonemes, so it implicitly learns the grapheme-to-phoneme relationship from the acoustic data. |
---|
0:16:22 | So, what we do in this work: normally, the KL-HMM parameters |
---|
0:16:28 | are estimated using target language data. |
---|
0:16:31 | In this work, we instead initialize the KL-HMM parameters with knowledge-based parameters. |
---|
0:16:37 | So we first define the grapheme set of the target language, we map each of the graphemes |
---|
0:16:43 | to one or more MLP outputs, that is, phonemes, |
---|
0:16:48 | and then the KL-HMM parameters are assigned using that knowledge. |
---|
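A sketch of that knowledge-based initialization: each target-language grapheme is mapped to one or more MLP output phones, and its state distribution puts smoothed, near-uniform mass on those phones. The grapheme-to-phone map in the example is made up, not the Greek data from the paper.

```python
import numpy as np

def init_kl_hmm_params(g2p_map, phone_index, n_phones, smooth=1e-3):
    """Build one categorical distribution per grapheme from a G2P map."""
    params = {}
    for grapheme, phones in g2p_map.items():
        c = np.full(n_phones, smooth)  # smoothing keeps all phones possible
        for ph in phones:
            c[phone_index[ph]] += 1.0
        params[grapheme] = c / c.sum()
    return params

# Hypothetical toy map and phone inventory, purely for illustration:
params = init_kl_hmm_params({"a": ["ah", "aa"], "b": ["b"]},
                            {"ah": 0, "aa": 1, "b": 2}, n_phones=3)
```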
0:16:52 | Given untranscribed speech data from the target language, we also proposed an approach to iteratively |
---|
0:16:59 | adapt the KL-HMM parameters in an unsupervised fashion. So first, given the speech data, we decode |
---|
0:17:07 | the grapheme sequence, |
---|
0:17:11 | and we update the KL-HMM parameters using the decoded grapheme sequence, and the process can |
---|
0:17:17 | be iterated. In this paper the approach was evaluated on Greek, and |
---|
0:17:23 | we used five other European languages, not Greek, as out-of-domain resources. |
---|
0:17:32 | As for what was not done in this paper and what we plan to do: |
---|
0:17:36 | in this paper we only adapted the KL-HMM parameters from untranscribed speech data, but |
---|
0:17:42 | in the future we also plan to do the MLP retraining based on |
---|
0:17:47 | the inferred grapheme sequences. |
---|
0:17:51 | Also, we do not prune the utterances during unsupervised adaptation, so in the future we plan to |
---|
0:17:58 | try pruning or weighting the utterances based on some |
---|
0:18:02 | confidence measure. |
---|
0:18:09 | So, does anybody have any questions? |
---|
0:18:26 | So this is kind of an open-ended question, and this is for |
---|
0:18:30 | everybody who talked about |
---|
0:18:34 | automatically learning subword units, |
---|
0:18:36 | so I guess at least the poster presenters and many of the speakers from |
---|
0:18:40 | today. The question is: very often it is hard to define the boundary between subword |
---|
0:18:47 | units. |
---|
0:18:50 | And the more conversational and natural the speech gets, the less well defined these boundaries |
---|
0:18:55 | are. |
---|
0:18:56 | And so I was wondering, |
---|
0:19:01 | if you look |
---|
0:19:02 | at the types of boundaries that are being hypothesized in your work, whether you found |
---|
0:19:06 | that this issue causes a problem for you, and if you have any thoughts |
---|
0:19:11 | on how to deal with it. |
---|
0:19:19 | I can at least answer that. So even though my experiments worked on |
---|
0:19:23 | read speech, I did find that a lot of the time the pronunciations that were |
---|
0:19:27 | being learned were heavily reduced, much more reduced than the canonical pronunciations. |
---|
0:19:36 | I think that probably causes a decrease in accuracy, because it increases the confusability among |
---|
0:19:41 | the pronunciations in the lexicon. |
---|
0:19:44 | I don't really have a |
---|
0:19:47 | good idea on how to fix that, but I think probably maintaining some measure |
---|
0:19:52 | of reducing or minimizing the amount of confusability in the word units that |
---|
0:20:00 | you get would help, |
---|
0:20:01 | similar to the talk that we just saw, |
---|
0:20:04 | saying that, you know, it's important to still be able to discriminate between |
---|
0:20:09 | the words that you have in the lexicon. |
---|
0:20:14 | So, I don't see him anywhere... is he here? |
---|
0:20:20 | I will hunt you down if you've left the room. |
---|
0:20:23 | But I can agree with what William said. On the pronunciation mixture model stuff, |
---|
0:20:27 | we can actually see, when we throw out pronunciations or learn new ones, |
---|
0:20:32 | that we're learning variations that are addressing, |
---|
0:20:40 | you know, the reductions that you see in conversational speech. |
---|
0:20:44 | So I haven't looked closely enough at the units we're learning |
---|
0:20:49 | to know whether you'd observe that there, but given the pronunciation results, I would assume |
---|
0:20:53 | something like that is going on; I just can't say definitively. |
---|
0:21:03 | question no i |
---|
0:21:12 | Maybe I can say loudly what we discussed at the break. I mean, I wonder: |
---|
0:21:16 | why do we need boundaries between the speech sounds? |
---|
0:21:21 | We obviously do need a sequence of the speech sounds, but it can be enough |
---|
0:21:25 | to have a sequence of the speech sounds and accept the fact that speech sounds |
---|
0:21:29 | last for about a couple of hundred milliseconds each, and they are really overlapping, |
---|
0:21:35 | so the boundaries are somewhat arbitrary |
---|
0:21:39 | things, and I don't think we need them; that's my point, but correct |
---|
0:21:43 | me if I'm wrong. |
---|
0:21:51 | Any other questions or comments? |
---|
0:21:59 | It's been a long day; |
---|
0:22:01 | maybe we should declare success and close the session. Alright, thanks everybody. |
---|