0:00:13 | okay |
---|
0:00:14 | welcome to the morning session on acoustic modeling |
---|
0:00:19 | we'll start off with a speech by geoff zweig |
---|
0:00:21 | but first let me introduce him |
---|
0:00:23 | actually i'm really happy to introduce someone i have known since he was this high |
---|
0:00:27 | but who has grown a lot since he was a graduate student |
---|
0:00:32 | anyway |
---|
0:00:35 | this will be followed by the poster session on acoustic models |
---|
0:00:40 | he was at berkeley where he really did an amazing job and was already interested in |
---|
0:00:46 | trying crazy different models which is something i've always liked |
---|
0:00:51 | he went on to I B M to work |
---|
0:00:54 | and he continued to work actually on graphical models |
---|
0:00:59 | not only working on the theory but also |
---|
0:01:01 | working on implementations |
---|
0:01:04 | and he got sucked into a lot of darpa meetings |
---|
0:01:08 | as have many of us |
---|
0:01:09 | and he moved on from there to microsoft where he's been since two thousand six |
---|
0:01:13 | so he's well known in the field now both for |
---|
0:01:16 | a principled |
---|
0:01:18 | developments and also for implementations that have been useful for the community |
---|
0:01:24 | so i'm happy now |
---|
0:01:26 | to have jeff up here to give a talk on this very interesting idea of segmental conditional random fields |
---|
0:01:41 | i thank you very much |
---|
0:01:46 | okay so i'd like to start today with a very high level description of what the theme of |
---|
0:01:55 | the talk is going to be |
---|
0:01:57 | and i tried to put a little bit of thought in advance into what would be a good sort |
---|
0:02:02 | of pictorial metaphor a pictorial representation of what the talk would be about and also something that is |
---|
0:02:11 | fitting to the beautiful location that we're in today |
---|
0:02:16 | when i did that i decided the best thing that i could come up with was this picture that you |
---|
0:02:21 | see here of a nineteenth century clipper ship |
---|
0:02:25 | and these are sort of very interesting things they were basically the space shuttle |
---|
0:02:30 | of their day they were designed to go absolutely as fast as possible making trips from say india to |
---|
0:02:37 | london or boston |
---|
0:02:40 | and when you look at the ship there you see that they put a huge amount of thought and engineering |
---|
0:02:46 | into its design |
---|
0:02:48 | and in particular if you look at those sails they didn't sorta just build a ship and then put one |
---|
0:02:54 | big pole with a sail up on top of it instead what they did was they tried in many ways |
---|
0:03:01 | to harness sort of every aspect every facet of the wind |
---|
0:03:05 | that they could that they could possibly do and so they have sails positioned in all different ways they have |
---|
0:03:11 | some rectangular sails they have some that are triangular sails they have the sort of the funny sail that you see |
---|
0:03:18 | there back at the end |
---|
0:03:20 | and the idea here is to really pull out absolutely all the energy that you can get from the wind |
---|
0:03:26 | and then drive this thing forward |
---|
0:03:29 | that relates to what i'm talking about today which is speech recognition systems that in a similar way harness together |
---|
0:03:37 | a large number of information sources to try to drive the speech recognizer forward in a faster and better |
---|
0:03:43 | way |
---|
0:03:44 | and this is going to lead to a discussion of log-linear models |
---|
0:03:48 | segmental models and then their synthesis |
---|
0:03:52 | in the form of segmental conditional random fields |
---|
0:03:57 | here's an outline of the talk |
---|
0:03:59 | i'll start with some motivation for the work |
---|
0:04:03 | i'll go into the mathematical details |
---|
0:04:05 | of segmental conditional random fields starting with hidden markov models |
---|
0:04:09 | and then progressing through a sequence of models that lead to the scrf |
---|
0:04:14 | i'll talk about a specific implementation that my colleague patrick nguyen and i put together this is the scarf |
---|
0:04:22 | toolkit i'll talk about the language modeling that's implemented there which is sort of interesting |
---|
0:04:28 | the inputs to the system and then the features that it generates from them |
---|
0:04:33 | i'll present some experimental results some research challenges and a few concluding remarks |
---|
0:04:41 | okay so the motivation of this work is that state-of-the-art speech recognizers really look at speech sort of frame by frame |
---|
0:04:51 | we go and extract our speech frames every ten milliseconds |
---|
0:04:55 | we extract a feature usually one kind of feature for example plps or mfccs |
---|
0:05:02 | and send those features into a time synchronous |
---|
0:05:06 | recognizer that processes them and outputs words |
---|
0:05:10 | i'm going to be the last person in the room to underestimate the power of that basic model and how |
---|
0:05:18 | well it can perform how good the performance is that you can get from working with that kind of model |
---|
0:05:24 | and doing a good job in terms of the basics of it and so a very good question to |
---|
0:05:29 | ask |
---|
0:05:30 | is how to improve that model in some way |
---|
0:05:35 | but that is not the question that i'm going to ask today |
---|
0:05:39 | instead i'm going to ask a different question i should say i will re-ask |
---|
0:05:45 | a question because this is something that a number of people have looked at in the past |
---|
0:05:51 | and this is whether or not we could do better with a more general model |
---|
0:05:55 | and in particular the questions i'd like to look into are whether we can move from a frame-wise analysis |
---|
0:06:02 | to a segmental analysis |
---|
0:06:05 | from the use of real-valued feature vectors |
---|
0:06:08 | such as mfccs and plps |
---|
0:06:11 | to more arbitrary feature functions |
---|
0:06:13 | and if we can design a system around the synthesis |
---|
0:06:19 | of some disparate information sources |
---|
0:06:22 | what's going to be new in this |
---|
0:06:24 | is doing it in the context of log-linear modeling |
---|
0:06:28 | and it's going to lead us to a model like the one that you see at the bottom of the |
---|
0:06:33 | picture here |
---|
0:06:35 | so in this model we have basically a two layer model i should say |
---|
0:06:40 | at the top layer we are going to end up with states these are going to be segmental states representing |
---|
0:06:47 | typically words |
---|
0:06:49 | and then at the bottom layer we'll have a sequence of observation streams we'll have many observation streams |
---|
0:06:55 | and these |
---|
0:06:58 | each provide some information there can be many different kinds of information sources for example the detection of a |
---|
0:07:06 | phoneme the detection of a syllable the detection of an energy burst a template match score |
---|
0:07:12 | all kinds of different information coming in at through these multiple observation streams |
---|
0:07:17 | and because they're general like detections |
---|
0:07:21 | they're not necessarily frame synchronous and you can have variable numbers |
---|
0:07:26 | in a fixed span of time across the different streams |
---|
0:07:30 | and we'll have a log-linear model that relates |
---|
0:07:33 | the states that we're hypothesizing to the observations that are hanging down below each state and |
---|
0:07:41 | blocked together underneath it |
---|
0:07:46 | okay so i'd like to move on |
---|
0:07:48 | and now discuss |
---|
0:07:50 | scrfs mathematically but starting first from hidden markov models |
---|
0:07:56 | so here's a depiction of a hidden markov model i think we're all familiar with this |
---|
0:08:01 | the key thing that we're getting here is an estimation of the probability of the state sequence |
---|
0:08:10 | given an observation sequence in this model states usually represent context-dependent phones or substates of context-dependent phones |
---|
0:08:20 | and the observations are most frequently spectral representations such as mfccs or plps |
---|
0:08:27 | the probability is given by the expression that you see there where we go frame by frame |
---|
0:08:32 | and multiply in transition probabilities the probability of a state at one time given the previous state |
---|
0:08:39 | and then observation probabilities the probability of an observation at a given time given that state |
---|
0:08:45 | and those observation probabilities are most frequently gaussians on mfcc or plp features |
---|
0:08:52 | whereas in hybrid systems you can also use neural net posteriors as input to the |
---|
0:08:59 | to the likelihood computation |
---|
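The expression being described here is presumably the standard HMM decomposition, which in the usual notation reads:

```latex
P(q_1^T \mid o_1^T) \;\propto\; \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, P(o_t \mid q_t)
```

where $q_t$ is the (context-dependent phone) state at frame $t$ and $o_t$ is the spectral observation at that frame.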
0:09:04 | okay so i think the first |
---|
0:09:06 | sort of |
---|
0:09:07 | big step away conceptually from the hidden markov model is maximum entropy markov models |
---|
0:09:15 | and these were first investigated by adwait ratnaparkhi in the mid nineties in the context of |
---|
0:09:20 | part-of-speech tagging |
---|
0:09:22 | for natural language processing |
---|
0:09:26 | and then generalized or formalise by mccallum and his colleagues in two thousand |
---|
0:09:32 | and then there was a seminal application of these to speech recognition by jeff kuo and yuqing gao |
---|
0:09:40 | in the mid two thousands |
---|
0:09:43 | the idea behind these models |
---|
0:09:45 | is to ask the question what if we don't condition the observation on the state but instead condition the state |
---|
0:09:52 | on the observation |
---|
0:09:54 | so if you look at the graph here what's happened is the arrow instead of going down is going up |
---|
0:09:59 | and we're conditioning a state at a given time J on the previous state and the current observation |
---|
0:10:06 | states are still context-dependent phone states as they were before |
---|
0:10:11 | but what we're gonna get out of this whole operation is the ability to have potentially much richer observations |
---|
0:10:19 | than mfccs down here |
---|
0:10:22 | the probability of the state sequence given the observations for a memm is given by this expression here |
---|
0:10:29 | where we go through time frame by time frame and compute the probability of the current state given the previous |
---|
0:10:35 | state |
---|
0:10:35 | and the current observation |
---|
0:10:39 | how do we do that |
---|
0:10:40 | the key to this is to use |
---|
0:10:43 | a |
---|
0:10:45 | small little maximum entropy model |
---|
0:10:48 | and apply it at every time frame |
---|
0:10:51 | so what this maximum entropy model does |
---|
0:10:54 | is primarily to |
---|
0:10:56 | compute some feature functions |
---|
0:11:00 | that relate the state at the |
---|
0:11:02 | previous time to the state at the current time |
---|
0:11:05 | and the observation at the current time |
---|
0:11:07 | those feature functions can be arbitrary functions they can return a real number or a binary number and they can |
---|
0:11:14 | do an arbitrary computation |
---|
0:11:17 | they get weighted by lambda |
---|
0:11:19 | those are the parameters of the model summed over all the different kinds of features that you have and then |
---|
0:11:24 | exponentiated |
---|
0:11:26 | it's normalized by the sum over all possible ways that you could assign values to the state there of the |
---|
0:11:33 | same sort of expression |
---|
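The per-frame maximum entropy model just described is, in the usual MEMM notation, roughly:

```latex
P(s_t \mid s_{t-1}, o_t) \;=\;
\frac{\exp\big(\sum_i \lambda_i\, f_i(s_{t-1}, s_t, o_t)\big)}
     {\sum_{s'} \exp\big(\sum_i \lambda_i\, f_i(s_{t-1}, s', o_t)\big)},
\qquad
P(s_1^T \mid o_1^T) \;=\; \prod_{t=1}^{T} P(s_t \mid s_{t-1}, o_t)
```

with arbitrary feature functions $f_i$ weighted by the learned parameters $\lambda_i$.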
0:11:36 | and this is doing two things again |
---|
0:11:38 | the first is gonna let us have arbitrary feature functions that we use |
---|
0:11:43 | rather than say gaussian mixture |
---|
0:11:45 | and it's inherently discriminative in that it has this normalisation factor here |
---|
0:11:53 | i'm gonna talk a lot about features and so i wanna make sure that we're on the same page in |
---|
0:11:58 | terms of what exactly i mean by features and feature functions |
---|
0:12:02 | features by the way are distinct from observations observations are things you actually see and then the features |
---|
0:12:09 | are numbers that you compute using those observations as input |
---|
0:12:16 | a nice way of thinking about the features is as a product of a state component and a linguistic component |
---|
0:12:24 | i'm sorry a state component and then an acoustic component |
---|
0:12:28 | and i've illustrated a few possible state functions and acoustic functions |
---|
0:12:34 | in this table and then the features the kind of features that you extract from that |
---|
0:12:40 | so one very simple |
---|
0:12:42 | function is to ask the question is the current state |
---|
0:12:47 | y what's the current phone or what's the current context-dependent phone what's the value of that and just to |
---|
0:12:53 | use a constant for the acoustic function |
---|
0:12:56 | and you multiply those together and you have a binary feature |
---|
0:12:59 | it's either |
---|
0:13:01 | the state is either this thing y or it's not zero or one |
---|
0:13:04 | and the weight that you learn on that is essentially a prior on that particular context-dependent state |
---|
0:13:12 | a full transition function would be the previous state was x |
---|
0:13:17 | and the current state is y the previous phone is such and so and the current phone is such and so |
---|
0:13:22 | we don't pay attention to the acoustics we just use one and that gives us a binary function that says |
---|
0:13:27 | what the transition is |
---|
0:13:29 | we get a little bit more interesting features when we start actually using the acoustic function |
---|
0:13:33 | so one example of that is to say the state function is the current state is such and so |
---|
0:13:41 | oh and by the way when i take my observation and plug it into my voicing detector that comes out |
---|
0:13:46 | either yes it's voiced or no it's not voiced and i get a binary feature when i multiply those two |
---|
0:13:51 | together |
---|
0:13:53 | yet another example is the state is such and so |
---|
0:13:56 | and i happen to have a |
---|
0:13:58 | a gaussian mixture model for every state and when i plug the observation into the gaussian mixture model for that |
---|
0:14:04 | state i get a score and i multiply the score by the fact that i'm seeing the state |
---|
0:14:10 | and that gives me a real-valued feature function |
---|
0:14:13 | and so forth and so you can get fairly sophisticated feature functions this one down here by the |
---|
0:14:19 | way is the one that kuo and gao used in their memm work where they looked at the rank |
---|
0:14:25 | of a gaussian mixture model |
---|
0:14:29 | the rank of the gaussian mixture model associated with a particular state compared to all the other states in the |
---|
0:14:35 | system |
---|
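To make the state-times-acoustic product concrete, here is a minimal Python sketch of the kinds of feature functions just described; the detector, state names, and data layout are illustrative assumptions rather than the actual implementation:

```python
# Minimal sketch of "state function x acoustic function" features as described
# above. Detector, state names, and data layout are illustrative assumptions.

def make_feature(state_fn, acoustic_fn):
    """A feature is the product of a state component and an acoustic component."""
    def feature(prev_state, cur_state, observation):
        return state_fn(prev_state, cur_state) * acoustic_fn(observation)
    return feature

def voicing_detector(obs):
    # hypothetical detector: 1.0 if the current frame looks voiced, else 0.0
    return 1.0 if obs.get("voiced") else 0.0

# "current state is y" x constant 1 -> binary feature whose learned weight
# acts as a prior on that context-dependent state
prior_feature = make_feature(lambda p, c: float(c == "y"), lambda obs: 1.0)

# "previous state is x and current state is y" x constant 1 -> transition feature
transition_feature = make_feature(lambda p, c: float((p, c) == ("x", "y")),
                                  lambda obs: 1.0)

# "current state is y" x voicing detector output -> binary acoustic feature
voiced_feature = make_feature(lambda p, c: float(c == "y"), voicing_detector)

# "current state is y" x GMM score for that state -> real-valued feature
def gmm_score_feature(gmm_for_state):
    # gmm_for_state is any object with a .score(observation) method
    return make_feature(lambda p, c: float(c == "y"),
                        lambda obs: gmm_for_state.score(obs))
```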
0:14:38 | let's move on to the conditional random field |
---|
0:14:40 | now |
---|
0:14:41 | it turns out that under certain pathological conditions if you use memms you can make a decision early on |
---|
0:14:50 | and the transition structure |
---|
0:14:52 | just so happens to be set up in a way such that you would ignore the observations for the |
---|
0:14:57 | rest of the utterance |
---|
0:14:59 | and you run into a problem i think these are pathological conditions but they can theoretically exist |
---|
0:15:06 | and that motivated the development of conditional random fields |
---|
0:15:10 | where rather than doing a bunch of the local normalizations making a bunch of local state wise decisions there's one |
---|
0:15:18 | global normalisation over all possible state sequences |
---|
0:15:22 | because there is a global normalisation it doesn't make sense to have arrows in the picture the arrows |
---|
0:15:29 | indicate where you're gonna do the local normalisation and we're not doing a local normalisation |
---|
0:15:34 | so the picture is this |
---|
0:15:36 | the states are as with the maximum entropy model and the observations are also as with the maximum entropy model |
---|
0:15:42 | i and the feature functions are as with the maximum entropy model the thing that's different is that when you |
---|
0:15:48 | compute the probability of the state given the observations |
---|
0:15:51 | you normalise |
---|
0:15:54 | not locally but once globally over all the possible ways that you can assign values |
---|
0:15:59 | to those state sequences |
---|
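The globally normalized expression being contrasted with the MEMM is, in standard CRF notation, along these lines:

```latex
P(s_1^T \mid o_1^T) \;=\;
\frac{\exp\big(\sum_{t}\sum_i \lambda_i\, f_i(s_{t-1}, s_t, o_t)\big)}
     {\sum_{s'_1 \ldots s'_T} \exp\big(\sum_{t}\sum_i \lambda_i\, f_i(s'_{t-1}, s'_t, o_t)\big)}
```

the same feature functions as before, but with one normalization over all state sequences rather than one per frame.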
0:16:05 | that brings me now to the segmental version of the crf which is the main point of the talk |
---|
0:16:11 | so the key difference between the segmental version of the crf and the previous version of the crf |
---|
0:16:17 | is that we're going to take the observations |
---|
0:16:21 | and we're now going to block them into groups that correspond to segments |
---|
0:16:25 | and we're actually gonna make those segments be words |
---|
0:16:28 | conceptually they could be any kind of segment they could be a phone segment or a syllable segment but for the rest |
---|
0:16:33 | of this talk i'm gonna refer to them as words |
---|
0:16:36 | and for each word we're gonna block together a bunch of observations and associate it concretely with that state |
---|
0:16:44 | those observations again can be more general than mfccs for example they could be phoneme detections or the detection of |
---|
0:16:51 | an articulatory feature |
---|
0:16:54 | there's some complexity that comes with this model because |
---|
0:16:58 | even when we do training where we know how many words there are we don't know what the segmentation is |
---|
0:17:03 | and so we'd have to consider all possible segmentations of the observations into the right number of words |
---|
0:17:10 | so in this picture here for example we have to consider segmenting seven observations not just as two |
---|
0:17:16 | two and three but maybe moving this guy over here and having three associated with the first word and only |
---|
0:17:22 | one associated with the second word |
---|
0:17:24 | and then three with the last |
---|
0:17:26 | when you do decoding you don't even know how many words there are in so you have to consider both |
---|
0:17:31 | all the possible number of segments and all the possible segmentations |
---|
0:17:36 | given that number of segments |
---|
0:17:39 | this leads to an expression for segmental crfs that you see here |
---|
0:17:43 | it's written in terms of the edges that exist in the top layer of the graph there |
---|
0:17:49 | each edge has a state to its left and a state to its right |
---|
0:17:54 | and it has a group of observations that are linked together underneath it o(e) |
---|
0:18:01 | and the segmentation is denoted by Q |
---|
0:18:04 | with that notation the probability of a state sequence given the observations is given by the expression you see |
---|
0:18:11 | there which is essentially the same as the expression for the regular crf |
---|
0:18:15 | except that now we have the sum over segmentations that are consistent with the number of segments that are hypothesized |
---|
0:18:24 | or known during training |
---|
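Written out, the segmental CRF expression being described is roughly the following, where each edge $e$ of a segmentation $q$ has a left state $s_{l(e)}$, a right state $s_{r(e)}$, and a block of observations $o(e)$ under it:

```latex
P(s \mid o) \;=\;
\frac{\sum_{q\,:\,|q| = |s|} \exp\big(\sum_{e \in q} \sum_i \lambda_i\, f_i(s_{l(e)}, s_{r(e)}, o(e))\big)}
     {\sum_{s'} \sum_{q'\,:\,|q'| = |s'|} \exp\big(\sum_{e \in q'} \sum_i \lambda_i\, f_i(s'_{l(e)}, s'_{r(e)}, o(e))\big)}
```

with the sums over segmentations constrained to be consistent with the hypothesized (or, in training, known) number of segments.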
0:18:29 | okay so that was |
---|
0:18:31 | that was a lot of work to go through to introduce segmental features do we really need to introduce segmental features |
---|
0:18:36 | at all do we get anything from that because after all with the crf the state sequence is |
---|
0:18:43 | conditioned on the observations we've got the observation sitting there in front of us |
---|
0:18:47 | isn't that enough is there anything else you need |
---|
0:18:50 | and i think the answer to that is clearly yes you do need to have boundaries or you get more |
---|
0:18:56 | if you talk about concrete boundaries |
---|
0:18:59 | segment boundaries here are a few examples of that |
---|
0:19:03 | suppose you wanna use template match scores |
---|
0:19:06 | as feature functions for example you have a segment and you ask the question what's the dtw distance between |
---|
0:19:13 | this segment and the closest example of the word that i'm hypothesizing in some database that i have |
---|
0:19:20 | to do that you need to know where you start the alignment where you end the alignment and you need |
---|
0:19:24 | the boundaries so you get something from that that you don't have when you just say here's a big blob of |
---|
0:19:29 | observations |
---|
0:19:31 | similarly word durations if you wanna talk about a word duration model you have to be precise about when the |
---|
0:19:36 | word starts and when the word ends so that the duration is defined |
---|
0:19:40 | turns out to be useful to have boundaries if you're incorporating scores from other models |
---|
0:19:45 | two examples of that are the hmm likelihoods and fisher kernel scores |
---|
0:19:50 | that layton and gales have used |
---|
0:19:52 | and the point process model scores |
---|
0:19:55 | that jansen and niyogi have proposed |
---|
0:19:59 | later in the talk i'll talk about detection subsequences |
---|
0:20:03 | as features and there again we need to know the boundaries |
---|
0:20:08 | okay so before proceeding i'd like to just emphasise that this is really building on a long tradition of work |
---|
0:20:15 | and i want to go over and call out some of the components of that tradition the first is log-linear |
---|
0:20:21 | models that use a frame level markov assumption |
---|
0:20:27 | and there i think the key work was done by jeff kuo and yuqing gao with the maximum entropy markov model |
---|
0:20:35 | that really was the first to propose and exercise |
---|
0:20:38 | the power of using general feature functions |
---|
0:20:44 | shortly thereafter |
---|
0:20:46 | or actually more or less simultaneously with that hidden crfs were proposed by gunawardana |
---|
0:20:52 | and his colleagues and then there was a very interesting paper by under asking one of his students at |
---|
0:20:58 | last year's asru |
---|
0:21:00 | where essentially an extra hidden variable is introduced into the crf |
---|
0:21:04 | to represent gaussian mixture components |
---|
0:21:06 | with the intention |
---|
0:21:08 | of simulating mmi training in a conventional system |
---|
0:21:15 | jeremy morris and eric fosler-lussier did some fascinating initial work on applying crfs in speech recognition |
---|
0:21:25 | they used features such as neural net attribute posteriors |
---|
0:21:30 | and in particular |
---|
0:21:31 | the detection of sonority voicing manner of articulation and so forth as feature functions that went into |
---|
0:21:40 | the model |
---|
0:21:41 | and they also proposed and experimented with the use of mlp phoneme posteriors as features |
---|
0:21:48 | and proposed the use of something called the crandem model |
---|
0:21:51 | which is essentially a hybrid crf hmm model where the crf phone posteriors are used as state likelihood functions rather |
---|
0:22:01 | than neural net posteriors in the standard hybrid system |
---|
0:22:05 | the second tradition i'd like to call out is actually the tradition of segmental log-linear models |
---|
0:22:11 | the first use of these was termed semi crfs by sarawagi and cohen in a development |
---|
0:22:19 | in natural language processing |
---|
0:22:22 | layton and gales proposed something termed the conditional augmented statistical model which is a segmental crf |
---|
0:22:29 | that uses hmm scores and fisher kernel scores |
---|
0:22:33 | zhang and gales proposed the use of structured svms |
---|
0:22:37 | which are essentially segmental crfs with large margin training |
---|
0:22:43 | lehr and shafran have an interesting transducer representation that uses perceptron training and similarly achieves joint acoustic language and |
---|
0:22:52 | duration model training |
---|
0:22:54 | and finally georg heigold |
---|
0:22:56 | and patrick nguyen and i have done a lot of work on flat direct models which are essentially whole sentence maximum |
---|
0:23:05 | entropy |
---|
0:23:06 | acoustic models maxent models at the segment level and you can think of these segmental models i'm talking about today |
---|
0:23:13 | as essentially stringing together a whole bunch of flat direct models one for each segment |
---|
0:23:20 | it's also important to realise that there's significant previous work on just classical segmental modelling and detector based asr |
---|
0:23:29 | the segmental modelling i think comes in sort of two main threads |
---|
0:23:33 | in one of these the likelihoods are based on framewise computations so you have a different number of scores that |
---|
0:23:39 | contribute to each segment |
---|
0:23:41 | and there's a long line of work that was done here by mari ostendorf and her students and a number |
---|
0:23:48 | of other researchers that you see here |
---|
0:23:50 | and then in a separate thread |
---|
0:23:52 | there's a development of using a fixed-length segment representation for each segment |
---|
0:23:58 | that mari ostendorf and colleagues |
---|
0:24:01 | looked at in the late nineties and then jim glass more recently has worked on and contributed using |
---|
0:24:08 | phone likelihoods in the computation in a way that i think is similar to the normalisation in the scrf |
---|
0:24:16 | framework |
---|
0:24:18 | i'm going to talk about using detections phone detections and multi-phone detections and so it's fitting i think |
---|
0:24:24 | to mention chin-hui lee and his colleagues and their proposal of detector based asr |
---|
0:24:30 | which combines detector information in a bottom-up way to do speech recognition |
---|
0:24:38 | okay so i'm gonna move on now to the scarf implementation a specific implementation of an scrf |
---|
0:24:44 | and what this is going to do is essentially extend that tradition that i've mentioned |
---|
0:24:48 | and it's going to extend it with the synthesis of detector based recognition i segmental modelling and log-linear modeling |
---|
0:24:58 | going to further |
---|
0:25:00 | develop some new features that weren't present before and in particular features termed existence expectation and levenshtein features |
---|
0:25:09 | and then i'm going to extend that tradition with an adaptation to large vocabulary speech recognition by fusing finite state language |
---|
0:25:18 | modeling into the segmental framework that i've been |
---|
0:25:21 | talking about |
---|
0:25:24 | okay so let's move on to a specific implementation |
---|
0:25:28 | so this is a toolkit that i developed with patrick nguyen |
---|
0:25:32 | it's available from the web page that you see there you can download it and play around with it |
---|
0:25:39 | and the features that i talk about next |
---|
0:25:42 | are |
---|
0:25:43 | specific |
---|
0:25:44 | to this implementation and they're sort of one way of realizing the general scrf framework and using it for |
---|
0:25:53 | speech recognition where you sort of have to dot all the i's and cross all the t's and make sure that |
---|
0:25:58 | everything works |
---|
0:26:02 | okay so i want to start by talking about how language models are implemented there because it's sort |
---|
0:26:08 | of a tricky issue |
---|
0:26:09 | when i see a model like this |
---|
0:26:12 | i think bigram language model i see two states |
---|
0:26:16 | they're next to each other they're connected by an arrow that's like the probability of one state given the preceding |
---|
0:26:21 | state and that looks a whole lot like a bigram language model so is that what we're talking about are we |
---|
0:26:26 | just talking about bigram language models here |
---|
0:26:29 | and the answer is no what we're going to do is we're actually going to be able to model long |
---|
0:26:35 | span language model context |
---|
0:26:37 | by making these states |
---|
0:26:39 | refer to states in an underlying finite state language model |
---|
0:26:44 | here's an example of that |
---|
0:26:46 | what you see on the left is a fragment from a finite state language model it's a trigram language model |
---|
0:26:52 | so it has bigram history states |
---|
0:26:54 | for example there's a bigram history state the dog and dog are and dog way |
---|
0:27:00 | and sometimes we don't have all the trigrams in the world so to |
---|
0:27:05 | decode an unseen trigram we need to be able to back off to a lower order history state so for |
---|
0:27:11 | example if we're in the history state the dog we might have to back off to the history state dog |
---|
0:27:18 | the one word history state and then we could decode a word that we haven't seen before in a trigram |
---|
0:27:22 | context like yep and then moved to the history state dog yep |
---|
0:27:28 | finally as a last resort you can back off to the null history state three down there at the bottom |
---|
0:27:34 | and just decode any word in the vocabulary |
---|
0:27:38 | okay so let's assume that we want to decode the sequence the dog yep |
---|
0:27:43 | how would that look |
---|
0:27:45 | we decode the first word the and we end up in state seven here having seen the history |
---|
0:27:52 | the |
---|
0:27:54 | then we decode the word dog |
---|
0:27:56 | that moves us around to state one we've seen the bigram now the dog |
---|
0:28:02 | now suppose we wanna decode yep |
---|
0:28:06 | to do that |
---|
0:28:08 | so right now we're in state one |
---|
0:28:10 | we've gotten as far as the dog that's got us to state one here |
---|
0:28:15 | and now suppose you want to decode yep we'd have to back off |
---|
0:28:19 | from state one to state two and then we could decode the word yep and end up in state six over |
---|
0:28:26 | here dog yep |
---|
0:28:28 | so what this means is that by the time we get around to decoding the word yep |
---|
0:28:34 | we know a lot more than that |
---|
0:28:36 | the last word was dog we actually know that the previous state was state one which corresponds to the two |
---|
0:28:42 | word history the dog and so this is not a bigram language model that we have here it actually reflects |
---|
0:28:48 | the semantics |
---|
0:28:50 | of the trigram language model that you see in that fragment on the left |
---|
0:28:57 | so there's two ways that we can use this one is to generate a basic language model score if we |
---|
0:29:03 | provide the system with the finite state language model then we can just look up the language model |
---|
0:29:08 | cost of transitioning between states and use that as one of the features in the system |
---|
0:29:13 | but more interestingly we can create a binary feature for each arc in the language model |
---|
0:29:21 | now these arcs in the language model are normally labeled with things like bigram probabilities trigram probabilities or back-off |
---|
0:29:30 | probabilities |
---|
0:29:31 | what we're gonna do is we're gonna create a binary feature that just says have i traversed |
---|
0:29:36 | this arc in transitioning from one state to the next |
---|
0:29:40 | so for example when we go from |
---|
0:29:42 | the dog to dog yep we traversed two arcs |
---|
0:29:46 | the arc from one to two and then the arc from two to six |
---|
0:29:49 | the weights |
---|
0:29:50 | the lambdas that we learn in association with those arcs |
---|
0:29:54 | are analogous to the back-off weights and the bigram weights of the normal language model but we're actually learning what |
---|
0:30:01 | those weights are |
---|
0:30:03 | what that means is that when we do training we end up with the discriminatively trained language model and actually |
---|
0:30:09 | a language model that we train in association with the acoustic model training at the same time jointly with the |
---|
0:30:16 | acoustic model training |
---|
0:30:18 | so i think that's sort of an interesting phenomenon |
---|
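As a rough Python sketch of the mechanics just described, decoding a word against the backoff language model and recording which arcs were traversed (so that each arc can become a binary feature) might look like this; the state numbers and word mirror the illustrative example above, and none of this is the SCARF toolkit code:

```python
# Minimal sketch of decoding a word against a backoff finite-state LM and
# recording which arcs were traversed, so each arc can become a binary feature.
# State numbers mirror the illustrative fragment above; this is not SCARF code.

def decode_word(state, word, word_arcs, backoff_arcs):
    """Return (next_state, traversed_arcs) when hypothesizing `word` from `state`."""
    traversed = []
    while (state, word) not in word_arcs:
        # unseen n-gram: traverse a backoff arc to a lower-order history state
        nxt = backoff_arcs[state]
        traversed.append(("<backoff>", state, nxt))
        state = nxt
    nxt = word_arcs[(state, word)]
    traversed.append((word, state, nxt))
    return nxt, traversed

# fragment: state 7 = history "the", 1 = "the dog", 2 = "dog", 6 = "dog yep"
word_arcs = {(7, "dog"): 1, (2, "yep"): 6}
backoff_arcs = {1: 2}

state, arcs = decode_word(1, "yep", word_arcs, backoff_arcs)
# arcs == [("<backoff>", 1, 2), ("yep", 2, 6)]; the lambda learned for each such
# binary arc feature plays the role of a discriminatively trained n-gram or
# backoff weight, learned jointly with the acoustic features
```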
0:30:23 | okay i'd like to talk about the inputs to the system now |
---|
0:30:28 | the first inputs are detector inputs so a detection is simply a unit and its midpoint |
---|
0:30:35 | an example of that is shown here what we have are phone detections this is from a voice mail system |
---|
0:30:41 | no |
---|
0:30:41 | actually from a voice search system and it looks like the person is asking for burgers except the person |
---|
0:30:48 | says we're |
---|
0:30:49 | bird |
---|
0:30:51 | burgers E |
---|
0:30:53 | and so the way to read this is that we detected the b at time frame seven ninety and the er at |
---|
0:30:58 | a later time frame and so forth and these correspond to the observations that are in red |
---|
0:31:05 | in the illustration here |
---|
0:31:07 | actually you can also provide dictionaries that specify the expected sequence of detections for each word for example |
---|
0:31:15 | if we're going to decode burgers we expect b er g er and so forth the pronunciation of the word |
---|
0:31:23 | second input is lattices |
---|
0:31:26 | that constrain the search space |
---|
0:31:28 | the easiest way of getting these lattices is to use a conventional hmm system |
---|
0:31:33 | and use it to just provide |
---|
0:31:35 | i constraints on the search space |
---|
0:31:37 | and the way to read this is |
---|
0:31:39 | that from time twelve twenty one to time twenty five sixty a reasonable hypothesis is workings |
---|
0:31:48 | and these times here give us segment boundaries hypothesized segment boundaries and the word gives us |
---|
0:31:56 | possible labelings of the state |
---|
0:31:59 | and we're gonna use those when we actually do the computations to constrain the set of possibilities we have to |
---|
0:32:05 | consider |
---|
0:32:07 | the second kind of lattice input is user-defined features |
---|
0:32:11 | if you happen to have a model that you think provides some measure of consistency between the word that you're |
---|
0:32:19 | hypothesizing and the observations you can plug it in as a user-defined feature like you see here |
---|
0:32:25 | this lattice has a single feature that's been added it's a dynamic time warping feature |
---|
0:32:30 | and the particular one i've got underlined in red here is indicating that the dtw feature value for hypothesizing |
---|
0:32:38 | the word fell |
---|
0:32:40 | between frames nineteen eleven and twenty two sixty is eight point two seven |
---|
0:32:45 | and that feature corresponds to one of the features in the log-linear models that exist on those vertical edges |
---|
0:32:54 | now multiple inputs |
---|
0:32:56 | are very much encouraged and what you see here is a fragment of a lattice file that |
---|
0:33:03 | kris demuynck put together |
---|
0:33:05 | and you can see it's got lots of different feature functions that he's defined |
---|
0:33:10 | and essentially these features are the things that to follow the metaphor that i started at the beginning |
---|
0:33:16 | are analogous to the sails in the metaphor that are providing the information and pushing the whole thing forward |
---|
0:33:22 | and that we want to get as many of those |
---|
0:33:24 | as possible |
---|
0:33:27 | okay |
---|
0:33:28 | let's talk about some features that are automatically defined from the inputs |
---|
0:33:34 | the user-defined features we've already discussed you don't have to worry about them once you put them in |
---|
0:33:38 | on a lattice |
---|
0:33:40 | if you provide detector sequences there are a set of features that can be automatically extracted and then the system will learn |
---|
0:33:46 | the weights of those features those are existence expectation and levenshtein features along with something called the baseline feature |
---|
0:33:56 | so the idea of an existence feature is to measure whether a particular unit |
---|
0:34:02 | exists within the span of the word |
---|
0:34:04 | that you're hypothesizing |
---|
0:34:06 | these are created for all word unit pairs |
---|
0:34:10 | and they have the advantage that you don't need any predefined pronunciation dictionary |
---|
0:34:15 | but they have the disadvantage that you don't get any generalization ability across words |
---|
0:34:21 | here's an example suppose we're hypothesizing the word accord |
---|
0:34:25 | and it spans the detections ih k ao r |
---|
0:34:29 | i would create a feature that says okay i'm hypothesizing accord |
---|
0:34:33 | and i detected a k in the span that would be an existence feature when you train the model presumably it would |
---|
0:34:39 | get a positive weight because presumably it's a good thing to detect a k if you're hypothesizing the word accord |
---|
0:34:47 | but |
---|
0:34:48 | there's no generalisation ability across words here so that's a completely different k than the k that you would have |
---|
0:34:54 | if you were hypothesizing accordion and there's no transfer of the weight or smoothing there |
---|
0:35:03 | the idea behind expectation features is to use a dictionary to avoid this and actually get generalization ability across words |
---|
0:35:11 | there's three different kinds of expectation features |
---|
0:35:15 | and i think i'll just go through them by example and describe the examples |
---|
0:35:20 | so let's take the first one suppose we're hypothesizing accord again and we detected ih k ao r |
---|
0:35:28 | we have a correct accept |
---|
0:35:30 | of the k because we expect to see it on the basis of the dictionary and we've actually detected it |
---|
0:35:37 | now that feature is very different from the other feature because we can learn that that's a good thing that |
---|
0:35:42 | detecting a k when you expect a k is good in the context of training on the word accord |
---|
0:35:48 | and then use that same feature weight when we detect a k in association with the word accordion or the word |
---|
0:35:56 | cat |
---|
0:36:00 | the second kind of expectation feature is a false reject of the unit |
---|
0:36:05 | and that is an example of that where we expect to see it but we don't actually detect it |
---|
0:36:05 | finally you can have a false accept of the unit where you don't expect to see it based on your |
---|
0:36:09 | dictionary pronunciation but it shows up there in the things that you've detected |
---|
0:36:14 | and the last |
---|
0:36:15 | example here illustrates that |
---|
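Here is a rough Python sketch of the existence and expectation features just described, for a single hypothesized word span; the function names and phone strings are illustrative assumptions, not the SCARF interface:

```python
# Rough sketch of existence and expectation features for one hypothesized word
# span; names and phone strings are illustrative, not the SCARF interface.

def existence_features(word, detected_units):
    # one binary feature per (word, unit) pair that fired inside the span:
    # no dictionary needed, but no generalization across words
    return {("exists", word, u): 1 for u in set(detected_units)}

def expectation_features(expected_units, detected_units):
    # dictionary-based, so a unit's weights transfer across all words using it
    feats = {}
    for u in set(expected_units) | set(detected_units):
        if u in expected_units and u in detected_units:
            feats[("correct_accept", u)] = 1
        elif u in expected_units:
            feats[("false_reject", u)] = 1
        else:
            feats[("false_accept", u)] = 1
    return feats

# hypothesizing "accord" (dictionary: ax k ao r d) over detections ih k ao r
print(existence_features("accord", ["ih", "k", "ao", "r"]))
print(expectation_features(["ax", "k", "ao", "r", "d"], ["ih", "k", "ao", "r"]))
# -> correct accepts for k ao r, false rejects for ax and d, a false accept for ih
```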
0:36:19 | levenshtein features are similar to expectation features but they |
---|
0:36:25 | use stronger ordering constraints |
---|
0:36:29 | the idea behind the levenshtein features is to take the dictionary pronunciation of a word |
---|
0:36:34 | and the units that you've detected |
---|
0:36:36 | in association with that word |
---|
0:36:39 | align them to each other get the edit distance |
---|
0:36:42 | and then create one feature for each kind of edit that you've had to make |
---|
0:36:46 | so to follow along in this example where we expect ax k ao r d and we see ih k ao r |
---|
0:36:51 | we have a substitution of the ax a match of the k a match of the ao and the r and a delete of |
---|
0:36:57 | the d |
---|
0:36:58 | and again presumably we can learn that matching a k is a good thing and that it has a positive weight |
---|
0:37:04 | by seeing one set of words in our training data and then use that |
---|
0:37:09 | to evaluate hypotheses of new words |
---|
0:37:13 | at test time where we haven't seen those particular words but they use these subword units |
---|
0:37:20 | whose feature values we've already learned |
---|
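And a rough sketch of the levenshtein features: align the dictionary pronunciation against the detected units with an edit-distance alignment and count one feature per edit type per unit; again the code and phone symbols are illustrative, not the toolkit's implementation:

```python
# Rough sketch of levenshtein features: align the dictionary pronunciation to
# the detected units with edit distance, then emit one count per edit type per
# unit. Illustrative only; not the SCARF implementation.

def levenshtein_features(expected, detected):
    n, m = len(expected), len(detected)
    # dp[i][j] = edit distance between expected[:i] and detected[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if expected[i - 1] == detected[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete expected[i-1]
                           dp[i][j - 1] + 1,         # insert detected[j-1]
                           dp[i - 1][j - 1] + cost)  # match / substitute
    feats, i, j = {}, n, m
    while i > 0 or j > 0:  # trace back one optimal alignment
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (expected[i - 1] != detected[j - 1]):
            kind = "match" if expected[i - 1] == detected[j - 1] else "substitute"
            feats[(kind, expected[i - 1])] = feats.get((kind, expected[i - 1]), 0) + 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            feats[("delete", expected[i - 1])] = feats.get(("delete", expected[i - 1]), 0) + 1
            i -= 1
        else:
            feats[("insert", detected[j - 1])] = feats.get(("insert", detected[j - 1]), 0) + 1
            j -= 1
    return feats

# expect ax k ao r d, detect ih k ao r -> one substitution, matches for k ao r, one delete
print(levenshtein_features(["ax", "k", "ao", "r", "d"], ["ih", "k", "ao", "r"]))
```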
0:37:25 | okay the baseline feature is kind of an important feature i wanna mention it here |
---|
0:37:29 | i think many people in the room have had the experience of taking a system having an interesting idea |
---|
0:37:37 | a very novel scientific thing to try out |
---|
0:37:41 | doing it adding it in and it gets worse |
---|
0:37:44 | and the idea behind the baseline feature is that we wanna think of it as sort of the hippocratic oath |
---|
0:37:50 | where we're gonna do no harm we're gonna have a system where you can add information to it |
---|
0:37:55 | and not go backward |
---|
0:37:58 | so we're gonna make it so that you can build on the best system that you have |
---|
0:38:02 | by treating the output of that system as a word detector stream the detection of words |
---|
0:38:08 | and then defining a feature this baseline feature that sorta stabilises the system |
---|
0:38:13 | the definition of the baseline feature is that if you look at a word arc |
---|
0:38:18 | that you're hypothesizing |
---|
0:38:20 | and you look at what words you've detected underneath it you get a plus one for the baseline feature |
---|
0:38:26 | if the hypothesized word covers exactly one baseline detection and the words are the same and otherwise you get a |
---|
0:38:33 | minus one for this feature |
---|
0:38:36 | here's an example of that |
---|
0:38:38 | in the lattice path the sample path that we're evaluating is random lee sort cardamom |
---|
0:38:46 | the baseline system output was randomly sort cards man detected at these vertical lines that you see here |
---|
0:38:54 | so when we compute the baseline feature we take the first arc random and we say how many words does |
---|
0:38:59 | it cover |
---|
0:39:00 | one that's good is it the same word no minus one |
---|
0:39:04 | then we take lee we say how many words does it cover none |
---|
0:39:08 | that's not good it gets a minus one then we take sort we say how many words does it cover |
---|
0:39:12 | one |
---|
0:39:13 | is it the same yes okay we get a plus one there and finally cardamom covers two words |
---|
0:39:19 | not one like it's supposed to so we get a minus one also |
---|
0:39:23 | it turns out if you think about this you can see that |
---|
0:39:26 | the way to optimize the baseline score is to output exactly as many words as the baseline system has output |
---|
0:39:33 | and to make their identities |
---|
0:39:35 | exactly the same as the baseline identities |
---|
0:39:38 | so if you give the baseline feature high enough weight the baseline output is guaranteed |
---|
0:39:43 | in practice of course you don't just set that weight by hand you add the feature to the system with all |
---|
0:39:48 | the other features and learn what its weight is |
---|
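A small sketch of the baseline feature computation, using made-up time spans for the randomly / sort / cards / man example above:

```python
# Small sketch of the baseline feature; time spans are made up for illustration.

def baseline_feature(hyp_word, start, end, baseline_detections):
    """+1 if the hypothesized arc covers exactly one baseline word detection and
    it is the same word, otherwise -1."""
    covered = [w for (w, t) in baseline_detections if start <= t < end]
    return 1 if covered == [hyp_word] else -1

# baseline output "randomly sort cards man", represented by word midpoints
baseline = [("randomly", 10), ("sort", 30), ("cards", 45), ("man", 60)]

# hypothesis path "random lee sort cardamom" with illustrative spans
print(baseline_feature("random",    0, 20, baseline))  # one word, different id -> -1
print(baseline_feature("lee",      20, 25, baseline))  # covers no word         -> -1
print(baseline_feature("sort",     25, 40, baseline))  # one word, same id      -> +1
print(baseline_feature("cardamom", 40, 70, baseline))  # covers two words       -> -1
```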
0:39:52 | okay i'd like to move on now to some experimental results |
---|
0:39:56 | and the first of these has to do with using multi-phone detectors detecting multi-phone units |
---|
0:40:03 | in the context of voice search there's nothing special about voice search here it just happens to be the application |
---|
0:40:09 | we were using |
---|
0:40:11 | the idea is to try to empirically find multi-phone units |
---|
0:40:16 | sequences of phones that tell us a lot about words |
---|
0:40:19 | then to train an hmm system |
---|
0:40:22 | whose units |
---|
0:40:23 | are these multi-phone units do a decoding with that hmm system and take its output as a sequence of multi-phone |
---|
0:40:30 | detections |
---|
0:40:31 | we're gonna put that detector stream then into the scrf |
---|
0:40:36 | the main question here is what are good phonetic subsequences to use |
---|
0:40:41 | and we're gonna start by using every subsequence that occurs in the dictionary as a candidate |
---|
0:40:47 | the expression for the mutual information between the unit u j and the word |
---|
0:40:53 | W is given by this big |
---|
0:40:55 | big mess that you see here |
---|
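The formula being referred to is presumably the usual mutual information between the binary unit-detection variable and the word identity:

```latex
I(U_j; W) \;=\; \sum_{u \in \{0,1\}} \sum_{w} P(U_j = u, W = w)\,
\log \frac{P(U_j = u, W = w)}{P(U_j = u)\, P(W = w)}
```

which is what drives the tradeoff discussed next: the detection carries the most information when the unit occurs in about half the words and can also be detected reliably.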
0:40:57 | and the important thing to take away is that there is a tradeoff it turns out that you want |
---|
0:41:02 | words that occur in about half i'm sorry you want units that occur in about half of the words so |
---|
0:41:08 | that when you get one of these binary detections you actually get a full bit of information |
---|
0:41:14 | and in that sense phones come close |
---|
0:41:17 | but you also need units that can be reliably detected because the best unit in the world isn't gonna do |
---|
0:41:24 | you any good if you can't actually detect it and from that point of view longer units are better |
---|
0:41:29 | turns out that if you do a phone decoding of the data you can then compile statistics and choose the |
---|
0:41:34 | units that are best |
---|
0:41:36 | and my colleague patrick nguyen and i followed a research stream along that and you can look at this paper |
---|
0:41:45 | for details |
---|
0:41:47 | if you do this and look at what are the most informative units in this particular voice search task you |
---|
0:41:52 | see something sort of interesting |
---|
0:41:54 | some of them are very short like an n or an r |
---|
0:41:57 | but then some of them are very long like california |
---|
0:42:01 | and so we get these units that sometimes are short and frequent and sometimes long california is still pretty |
---|
0:42:08 | frequent but it's less frequent |
---|
0:42:12 | okay so what happens if we use multi-phone units |
---|
0:42:14 | we started with the baseline system that was about thirty seven percent |
---|
0:42:19 | if we added phone detections that dropped by about a percent |
---|
0:42:24 | if we use multi-phone units instead of phone units |
---|
0:42:28 | that turns out to be better so that was gratifying that using these multi-phone units instead of the simple phone |
---|
0:42:34 | units actually made a difference |
---|
0:42:36 | and then if you use them both together it works |
---|
0:42:38 | a little bit better |
---|
0:42:40 | if you use multiple phone and multi-phone units the three best units that were detected it's a little bit |
---|
0:42:45 | better yet |
---|
0:42:47 | and finally when we did discriminative training |
---|
0:42:50 | that added a little bit more |
---|
0:42:53 | and so what you see here is it is actually possible to exploit some redundant information in this kind of |
---|
0:42:59 | a framework |
---|
0:43:02 | the next kind of features i want to talk about are template features and this is work that was done |
---|
0:43:08 | in the two thousand ten johns hopkins workshop |
---|
0:43:12 | on wall street journal by my colleagues kris demuynck and dirk van compernolle |
---|
0:43:18 | in order to understand that work i need to say just a little bit about |
---|
0:43:23 | how |
---|
0:43:24 | a baseline template system works |
---|
0:43:28 | that is about the baseline template system that's used at leuven university |
---|
0:43:34 | so the idea here is that you have a big speech database |
---|
0:43:37 | and you do forced alignment of all the utterances those utterances are rows in that top picture |
---|
0:43:43 | and for each phone you know where its boundaries are |
---|
0:43:47 | and that's what those square boxes are those are phone boundaries |
---|
0:43:50 | and you get a new utterance like the utterance that you see at the bottom |
---|
0:43:54 | and you try to explain it by going into this |
---|
0:43:57 | database that you have and pulling out phone templates |
---|
0:44:01 | and then doing an alignment of those phone templates to the new speech such that you cover the whole of |
---|
0:44:06 | the new utterance |
---|
0:44:08 | since the original templates come with phone labels you can then read off the phone sequence |
---|
0:44:17 | okay so suppose we have a system like that setup is it possible to use features |
---|
0:44:22 | that are created from templates in the scrf framework |
---|
0:44:26 | and it turns out that you can and there are some sort of interesting kinds of features |
---|
0:44:32 | that you can have |
---|
0:44:33 | so the idea is to create features |
---|
0:44:36 | based on the template matches that explain a hypothesis what you see at the upper left is a hypothesis of |
---|
0:44:42 | the word the |
---|
0:44:44 | and we've further aligned it so that we know where the first phone dh is and where the second phone iy |
---|
0:44:50 | is |
---|
0:44:51 | then we go into the database we find all the close matches to those phones |
---|
0:44:56 | so the number thirty five was a good match the number four hundred twenty three was a good match the |
---|
0:45:02 | number one thousand two no twelve thousand eleven was a good match and so forth |
---|
0:45:08 | so given all those good matches what are some features that we can get |
---|
0:45:11 | one of these features is a word id feature |
---|
0:45:14 | what's the fraction of the templates that you see stacked up here that actually came from the word that we're |
---|
0:45:20 | hypothesizing the |
---|
0:45:22 | another question is position consistency if the phone is word-initial like the dh |
---|
0:45:28 | what fraction of |
---|
0:45:30 | the templates were word-initial in the original data that's another interesting feature |
---|
0:45:36 | speaker id entropy are all the close matches just from one speaker that would be a bad thing because potentially |
---|
0:45:43 | it's a fluke |
---|
0:45:45 | the degree of warping if you look at how much you have to warp those examples to get them to |
---|
0:45:50 | fit what's the average warp scale those are all features that provide some information and that you can put |
---|
0:45:55 | into the system |
---|
0:45:56 | and kris demuynck wrote an icassp paper that describes this in detail |
---|
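As a rough illustration, two of the template metadata features just described, the word-id fraction and position consistency, could be computed along these lines; the fields on each matched template record are assumptions for the sketch:

```python
# Rough sketch of two template metadata features; the fields on each matched
# template record are assumed for illustration, not taken from the real system.

def word_id_feature(hyp_word, matched_templates):
    """Fraction of the close-matching templates that came from the hypothesized word."""
    if not matched_templates:
        return 0.0
    hits = sum(1 for t in matched_templates if t["source_word"] == hyp_word)
    return hits / len(matched_templates)

def position_consistency(hyp_phone_is_initial, matched_templates):
    """Fraction of matched templates whose phone position (word-initial or not)
    agrees with the hypothesized phone's position."""
    if not matched_templates:
        return 0.0
    agree = sum(1 for t in matched_templates
                if t["word_initial"] == hyp_phone_is_initial)
    return agree / len(matched_templates)

# matches for the first phone of a hypothesized "the"
matches = [{"source_word": "the", "word_initial": True},
           {"source_word": "this", "word_initial": True},
           {"source_word": "the", "word_initial": False}]
print(word_id_feature("the", matches))      # 2/3 of templates came from "the"
print(position_consistency(True, matches))  # 2/3 were word-initial as hypothesized
```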
0:46:03 | if we look at the results there we started from a baseline template system at eight point two percent |
---|
0:46:10 | adding the template metadata features provided an improvement |
---|
0:46:14 | to seven point six percent |
---|
0:46:16 | if we then add hmm scores we get to six point eight percent i have to say there that the hmm |
---|
0:46:22 | itself actually was seven point three so that's |
---|
0:46:25 | seven point three sort of the baseline |
---|
0:46:27 | and then adding phone detectors dropped it down finally to six point six percent and this is actually a very good |
---|
0:46:34 | number for the open vocab twenty k test set |
---|
0:46:39 | and again this is showing the effective use |
---|
0:46:41 | of multiple information sources |
---|
0:46:45 | okay the last |
---|
0:46:46 | experimental result i'd like to go over |
---|
0:46:50 | is a broadcast news system that we worked on also at the twenty ten clsp workshop |
---|
0:46:57 | i don't have time to go into detail on all the particular information sources that went into this |
---|
0:47:03 | i just want to call out a few things |
---|
0:47:05 | so I B M was kind enough to donate their attila system for use in creating a baseline system |
---|
0:47:12 | that constrained the search space |
---|
0:47:16 | we created a word detector system |
---|
0:47:19 | at microsoft research that created these word detections that you see here as detector streams |
---|
0:47:24 | there were a number of real-valued |
---|
0:47:28 | information sources here aren jansen had a point process model that he worked on justine how worked on a duration |
---|
0:47:35 | model les atlas and some of his students had some scores based on modulation features those provided real-valued |
---|
0:47:42 | feature scores such as you see here |
---|
0:47:45 | and then facial i had some deep neural net phone detector |
---|
0:47:49 | and samuel thomas looked at |
---|
0:47:52 | the use of mlp phoneme detections and those provided the discrete detection streams that you see at the very bottom |
---|
0:47:59 | there |
---|
0:48:00 | if we look at the results let's just move over to the test results the baseline system that we |
---|
0:48:07 | built had a fifteen point seven percent word error rate |
---|
0:48:11 | if we did that training with the scarf baseline feature there was a small improvement there i think that has |
---|
0:48:17 | to do with the dynamic range of the baseline feature plus minus one versus the dynamic range of the original |
---|
0:48:23 | likelihood |
---|
0:48:24 | adding the word detectors provided about a percent adding the other feature scores added a bit more |
---|
0:48:31 | and altogether we got about a nine point six percent |
---|
0:48:34 | relative improvement or about twenty seven percent of the gain possible given the lattices and again this indicates that |
---|
0:48:42 | you can take multiple kinds of information put it into a system like this |
---|
0:48:47 | and then move in the right direction |
---|
0:48:51 | okay i want to just quickly go over a couple of research challenges i won't spend much time here because |
---|
0:48:57 | research challenges are things that haven't been done and people are gonna do what they're gonna do anyway |
---|
0:49:03 | but i'll just mention a few things that seem like they might be interesting |
---|
0:49:06 | one of them would be to use an scrf |
---|
0:49:10 | to boost hmms |
---|
0:49:12 | and the motivation for this is that the use of the word detectors in the broadcast news system was |
---|
0:49:18 | actually very effective we tried combination with rover and that didn't really work but we were able |
---|
0:49:25 | to use it with this log-linear weighting |
---|
0:49:28 | so the question is can we use crfs |
---|
0:49:30 | scrfs in a more general boosting loop |
---|
0:49:33 | the idea would be to train the system |
---|
0:49:35 | take its output take the word-level output |
---|
0:49:38 | reweight the training data according to the boosting algorithm upweighting the regions where we have mistakes |
---|
0:49:45 | train a new system |
---|
0:49:47 | and then treat the output of that system as a new detector stream to add in to the overall |
---|
0:49:54 | group of systems |
---|
0:49:56 | the second question is the use of spectro-temporal receptive field models as detectors |
---|
0:50:03 | previously we've used hmm systems as detectors i think it would be interesting to try to train S T R |
---|
0:50:09 | F |
---|
0:50:10 | models |
---|
0:50:11 | two |
---|
0:50:13 | work as detectors and provide these |
---|
0:50:17 | detection streams |
---|
0:50:18 | one way of approaching that would be to take a bunch of examples of phones or multi-phone units in class |
---|
0:50:24 | examples and out-of-class examples for example |
---|
0:50:27 | and train a maximum entropy classifier to make the distinction |
---|
0:50:33 | and use the weight matrix of the maxent classifier essentially is a learned spectro-temporal receptive field |
---|
0:50:41 | the last |
---|
0:50:43 | idea that i'll throw out is to try to make much larger scale use of template |
---|
0:50:49 | information than we used so far we saw from the wall street journal results |
---|
0:50:55 | that there's promise there |
---|
0:50:58 | and i think |
---|
0:50:59 | maybe we could take that further for example in voice search systems we have an endless stream of data that |
---|
0:51:04 | comes in |
---|
0:51:06 | and we keep transcribing some so we get more and more examples |
---|
0:51:09 | of phones and words and sub-word units and so forth |
---|
0:51:12 | and could we take some of those same features that were described previously and use them on a |
---|
0:51:18 | much larger scale as they come in on an ongoing basis |
---|
0:51:24 | okay so i'd like to conclude here |
---|
0:51:26 | i've talked today about segmental log-linear model specifically segmental conditional random fields |
---|
0:51:33 | i think these are a flexible framework for testing novel scientific ideas |
---|
0:51:39 | in particular they allow you to integrate diverse information sources different types of information at different granularities at the |
---|
0:51:50 | word level at the phone level at the frame level |
---|
0:51:53 | information that comes in at variable quality levels some can be better than others |
---|
0:51:57 | potentially redundant |
---|
0:51:59 | information sources and generally speaking much more than what we're currently using |
---|
0:52:05 | and finally i think there's a lot of interesting research left to do in this area |
---|
0:52:10 | so thank you |
---|
0:52:18 | okay we have time for some questions |
---|
0:52:20 | we do have time |
---|
0:52:22 | and please if you want to step up to the mic can you actually put your mouth close to the microphone that's |
---|
0:52:26 | actually very helpful |
---|
0:52:41 | so in segmental models there's an issue of normalization |
---|
0:52:45 | because you're comparing hypotheses with different numbers of segments and so |
---|
0:52:49 | there's an issue of how you make sure that you know |
---|
0:52:53 | hypotheses with fewer segments aren't favored over |
---|
0:52:56 | ones with longer segments so i was wondering how |
---|
0:52:58 | you deal with that yeah good question and you deal with it because when you do training you have to normalize |
---|
0:53:06 | by considering all possible segmentations in the denominator |
---|
0:53:12 | so when you do training you know how many segments are you know how many words there are in the |
---|
0:53:18 | training hypothesis; that gives you a fixed number, like maybe ten |
---|
0:53:23 | and then you have this normaliser |
---|
0:53:25 | where you have to consider all possible segmentations |
---|
0:53:29 | and if the system has a strong bias |
---|
0:53:31 | say towards |
---|
0:53:33 | segmentations that only had one segment, because there were fewer scores |
---|
0:53:37 | that wouldn't work because |
---|
0:53:40 | your denominator would then assign high weight |
---|
0:53:44 | to the wrong segmentations; it wouldn't assign high weight to the thing in the numerator |
---|
0:53:50 | which has ten segments; it would assign high weight |
---|
0:53:53 | to the |
---|
0:53:56 | hypotheses that just had a single segment in the denominator, and the objective function would be bad |
---|
0:54:03 | and |
---|
0:54:05 | training would take care of that, because in maximizing the objective function, the conditional likelihood of the training data, it would have |
---|
0:54:12 | to assign parameter values |
---|
0:54:14 | so that it didn't have that particular bias |
---|
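In symbols, the point about the normalizer can be sketched roughly as follows (notation chosen here, not taken from the slides): the conditional likelihood of the training word sequence w given the observations o sums over segmentations q in both numerator and denominator, so parameters that give spuriously high scores to one-segment segmentations inflate the denominator and hurt the training objective.

```latex
P(\mathbf{w}\mid\mathbf{o}) =
\frac{\displaystyle\sum_{\mathbf{q}\,:\,|\mathbf{q}|=|\mathbf{w}|}
      \exp\Big(\sum_{e\in\mathbf{q}}\sum_{i}\lambda_i\, f_i(w_e, q_e, \mathbf{o})\Big)}
     {\displaystyle\sum_{\mathbf{w}'}\sum_{\mathbf{q}'\,:\,|\mathbf{q}'|=|\mathbf{w}'|}
      \exp\Big(\sum_{e\in\mathbf{q}'}\sum_{i}\lambda_i\, f_i(w'_e, q'_e, \mathbf{o})\Big)}
```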
0:54:21 | in one of those slides you were saying that you train a discriminative kind of language model implicitly by building |
---|
0:54:27 | the language model in |
---|
0:54:28 | yeah, so my question is, but then it's limiting, |
---|
0:54:31 | the thing is, in order to train them we need to have not |
---|
0:54:35 | only acoustically annotated data |
---|
0:54:37 | but also other data; usually for language modeling we have huge amounts of text data |
---|
0:54:43 | but for which we may not have corresponding acoustic |
---|
0:54:46 | features |
---|
0:54:48 | so how would you, if we were to train a big language model from just text, how do we incorporate |
---|
0:54:55 | it? yeah, |
---|
0:54:56 | so i think the way to do that |
---|
0:54:58 | is to |
---|
0:55:01 | annotate the lattice |
---|
0:55:03 | with the language model score |
---|
0:55:06 | that you get from this language model you train on lots and lots of data |
---|
0:55:11 | so that score's going to get into the system |
---|
0:55:14 | then have a second language model that you could think of as sort of a corrective language model |
---|
0:55:20 | that is trained only on the data for which you have acoustics |
---|
0:55:26 | and |
---|
0:55:28 | add those |
---|
0:55:29 | features in |
---|
0:55:31 | in addition to the language model score from the basic language model |
---|
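As a sketch of how those two scores could enter the model (hypothetical names, not the feature set of the actual system), each word arc would simply contribute two language-model feature values to the log-linear score:

```python
import math

def lm_features(word, history, big_lm_logprob, corrective_lm):
    """big_lm_logprob: log P(word | history) read off the annotated lattice arc.
    corrective_lm: hypothetical dict mapping (history, word) -> log prob,
    trained only on the data that has acoustics."""
    return {
        "big_lm": big_lm_logprob,
        "corrective_lm": corrective_lm.get((history, word), math.log(1e-6)),
    }

def arc_score(feats, weights):
    # Log-linear combination; the trained weights decide how much the
    # corrective LM adjusts the big text-only LM's opinion.
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())
```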
0:55:39 | about the decoding, |
---|
0:55:41 | is it just one-pass decoding? |
---|
0:55:44 | i mean, from what i understand, you take lattices and then constrain your search space |
---|
0:55:48 | but then what if i have a language model which is much more complicated than n-gram and i wish to |
---|
0:55:53 | do the decoding |
---|
0:55:54 | so is it possible to output a lattice-like structure, |
---|
0:55:57 | or what is the output of the decoder |
---|
0:56:00 | there's the question of in theory, and in the particular |
---|
0:56:05 | implementation that we've made. in the particular implementation that we've made, it takes lattices in and it produces one- |
---|
0:56:12 | best |
---|
0:56:13 | there's nothing about the theory or the framework that says you can't take lattices in and produce lattices out and |
---|
0:56:21 | i was just curious about the practical side, yeah, okay |
---|
0:56:29 | yeah, i just have one question |
---|
0:56:31 | i think it's a good idea to combine the different sources of information, but this can also be done |
---|
0:56:36 | in a much simpler model, right, without using the concept of the segments |
---|
0:56:41 | you introduce the segments here, so what is the real benefit to you |
---|
0:56:45 | of that? |
---|
0:56:47 | so i think the benefit is |
---|
0:56:52 | features that you can't express |
---|
0:56:55 | if you don't have the concept of the segment |
---|
0:56:58 | an example of a feature where you need segment boundaries probably the simplest example is say a word duration model |
---|
0:57:06 | you really need to talk about when the word starts |
---|
0:57:09 | and when the word ends |
---|
0:57:11 | another example where i think it's useful is in template matching if there's a hypothesis and you wanna have a |
---|
0:57:19 | feature of the form |
---|
0:57:23 | what is the dtw distance |
---|
0:57:26 | to the closest |
---|
0:57:27 | example in my training database |
---|
0:57:30 | of this word that i'm hypothesizing |
---|
0:57:34 | it helps if you have a boundary to start that dtw alignment and a boundary to end that dtw alignment |
---|
0:57:42 | so i think the answer to the question is that |
---|
0:57:44 | by reasoning explicitly about segmentations |
---|
0:57:49 | you can incorporate features |
---|
0:57:52 | that |
---|
0:57:53 | you can't incorporate if you reason only about frames |
---|
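A small sketch of the two segment-level features described in this answer, a word duration feature and a DTW distance to the closest stored example of the hypothesized word; `templates` is assumed to map words to lists of frame sequences, and the DTW is a plain quadratic dynamic program over Euclidean frame distances:

```python
import math

def dtw_distance(a, b):
    """a, b: lists of equal-dimension feature vectors (lists of floats)."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def segment_features(word, start_frame, end_frame, frames, templates):
    """Features for one hypothesized segment [start_frame, end_frame); both boundaries are needed."""
    segment = frames[start_frame:end_frame]
    dists = [dtw_distance(segment, t) for t in templates.get(word, [])]
    return {
        "duration": end_frame - start_frame,              # word duration feature
        "dtw_to_closest": min(dists) if dists else 0.0,   # template-matching feature
    }
```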
0:57:57 | but these features, aren't you incorporating them in a very heuristic way |
---|
0:58:01 | in some of the simpler models with the three levels |
---|
0:58:05 | with all the |
---|
0:58:07 | just combine them |
---|
0:58:09 | mapping |
---|
0:58:10 | the whole thing |
---|
0:58:11 | and then train |
---|
0:58:13 | can we do that |
---|
0:58:15 | i |
---|
0:58:20 | right, so my own personal philosophy is that if you care about features, if you care about |
---|
0:58:27 | information |
---|
0:58:29 | where the natural measure of that information is in terms of segments |
---|
0:58:34 | then you're better off |
---|
0:58:37 | explicitly reasoning in terms of those units, in terms of segments |
---|
0:58:41 | than somehow trying to |
---|
0:58:44 | implicitly or through the back door |
---|
0:58:47 | encode that information in some other way |
---|
0:58:59 | how many sorts of segments have you tried? have you tried syllables, for example? |
---|
0:59:04 | i would imagine, because many syllables are also monosyllabic words, that you might see some confusion |
---|
0:59:12 | in your word models |
---|
0:59:14 | i |
---|
0:59:15 | syllables |
---|
0:59:17 | right, i didn't mention this as a research direction, but one thing i'm really interested in is being |
---|
0:59:25 | able to do decoding from scratch with a segmental model like this |
---|
0:59:30 | i also didn't go into detail about the computational |
---|
0:59:34 | burden of using these models |
---|
0:59:38 | but it turns out that it's |
---|
0:59:40 | proportional to the size of your vocabulary |
---|
0:59:43 | so if you wanted to do bottom-up decoding from scratch without reference |
---|
0:59:49 | to some initial lattices or an external system |
---|
0:59:52 | you need to use subword units for example syllables which are on the order of some thousands |
---|
0:59:58 | or |
---|
1:00:00 | or even better phones, and for phones we've actually |
---|
1:00:03 | begun some initial experiments |
---|
1:00:06 | with doing bottom-up phone recognition, actually just at the segment level with the pure segmental model, where we just by brute |
---|
1:00:14 | force consider |
---|
1:00:16 | all possible segments and all possible phones |
---|
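A rough sketch of that brute-force idea (not the actual experiments): enumerate every possible segment up to a maximum length, score every phone label for it with a hypothetical `segment_score` function, and pick the best segmentation with a simple dynamic program over segment end times:

```python
def decode_phones(n_frames, phones, segment_score, max_len=30):
    """Return the best (phone, start, end) segmentation of an utterance of n_frames frames."""
    best = [(-float("inf"), None)] * (n_frames + 1)   # best[t] = (score up to frame t, backpointer)
    best[0] = (0.0, None)
    for end in range(1, n_frames + 1):
        for start in range(max(0, end - max_len), end):
            if best[start][0] == -float("inf"):
                continue
            for p in phones:                          # all phones for all possible segments
                s = best[start][0] + segment_score(p, start, end)
                if s > best[end][0]:
                    best[end] = (s, (start, p))
    # Trace back the winning segmentation.
    result, t = [], n_frames
    while t > 0 and best[t][1] is not None:
        start, p = best[t][1]
        result.append((p, start, t))
        t = start
    return list(reversed(result))
```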
1:00:22 | okay, let's thank the speaker |
---|