0:00:13 | This is work we did this summer at the Johns Hopkins workshop. |
0:00:21 | I want to acknowledge a student we had, from Stanford, who did most of the work; I am just giving the talk. |
0:00:31 | We are going to talk about duration models, and the particular take we have here is to look specifically at discrimination with duration models. |
0:00:43 | We will start with the driving motivation for this work, which is a look at what happens with duration in the context of discrimination, and from that intuition try to derive features that we can use to help speech recognition. |
0:01:04 | For that we need a mathematical framework, the segmental conditional random field, to integrate those features easily. |
0:01:15 | Then we will come back to duration features, specifically which features we actually added to segmental conditional random fields, and close with the results. |
0:01:32 | If you look at HMMs, the generative story of the model is well known: the distribution of the probability of staying in a particular state is exponential. |
0:01:47 | If you look at what happens in reality, that is not the case; real durations do not look exponential. So we know these models are wrong. |
0:01:56 | It is possible to fix the HMM, but the solutions tend to be a little bit awkward and difficult to use. |
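A quick sketch of the exponential-duration point (my own illustration, not code from the talk): an HMM state with self-loop probability p implies a geometric (discretized exponential) duration distribution, whose most likely duration is always a single frame, unlike real word-duration histograms, which peak at some modal length.

```python
def hmm_duration_pmf(p, max_d):
    """Duration distribution implied by an HMM state with self-loop
    probability p: P(stay exactly d frames) = (1 - p) * p**(d - 1),
    a geometric (discretized exponential) distribution."""
    return [(1 - p) * p ** (d - 1) for d in range(1, max_d + 1)]

pmf = hmm_duration_pmf(p=0.9, max_d=50)
# Probability decays monotonically from d = 1 onward, so the mode is
# always one frame -- which empirical word durations do not show.
```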
0:02:07 | For that, we introduce the segmental conditional random field. But first, what we need to establish is whether duration is actually a good indicator of whether a word is correctly or incorrectly recognized. |
0:02:27 | So what we did was look at the word durations produced by the decoder. Here we have the histogram of the word "two" against its duration: on the x axis you have the duration, and on the y axis you have the frequency with which the word was pronounced with that particular duration. |
0:02:53 | The question is whether that is a good indication of whether the word is correctly recognized or not. |
0:02:58 | So we separated out the correctly recognized instances, and then did the same for the instances that were misrecognized. |
0:03:10 | Interestingly, the instances that are misrecognized tend to be shorter. I will come back to why we think that is the case, but clearly those distributions are different, so they might be useful for us in the context of discrimination. |
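The correct/incorrect split could be computed along these lines; a minimal sketch assuming a hypothetical list of decoder outputs, not the workshop tooling:

```python
from collections import defaultdict

def duration_histograms(decoded):
    """Split per-word duration counts by whether the hypothesis was correct.
    `decoded` is an iterable of (word, duration_frames, is_correct) tuples,
    a hypothetical format used here only for illustration."""
    correct = defaultdict(lambda: defaultdict(int))
    incorrect = defaultdict(lambda: defaultdict(int))
    for word, dur, ok in decoded:
        (correct if ok else incorrect)[word][dur] += 1
    return correct, incorrect

hyps = [("two", 20, True), ("two", 22, True), ("two", 8, False), ("two", 21, True)]
cor, inc = duration_histograms(hyps)
# In the toy data, the misrecognized instance of "two" is the shortest one.
```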
0:03:30 | So how do we turn that intuition into something that can help the speech recognition engine? That is the purpose of segmental conditional random fields. |
0:03:44 | Here is a picture of the graphical model. On top you see the word hypothesis, and you go from word to word, from state to state, with a Markov assumption; that is basically an n-gram language model. |
0:04:02 | Below the words, you see that the observations are grouped into small blocks, and each block is associated with a word. |
0:04:11 | So unlike HMMs, which operate frame by frame and where a word is just a concatenation of frames, here we allow the use of multiple observations in a single block, a segment, to make the determination of whether a word is the correct one or not. |
0:04:36 | The way you do that is you take those observations and create a feature vector, and you derive a score that is a weighted sum of these feature vectors; that is the log-linear part. |
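The weighted-sum scoring can be sketched as follows; feature names and weights are invented for illustration, and in the real framework these scores are normalized across competing hypotheses to yield a conditional probability:

```python
def segment_score(features, weights):
    """SCRF-style segment score: a weighted sum of segment-level feature
    values.  Features missing from a hypothesis, which the framework
    permits, simply contribute nothing to the sum."""
    return sum(weights[name] * value
               for name, value in features.items()
               if name in weights)

weights = {"lm": 1.0, "phoneme_detect": 0.5, "duration": 2.0}
full = segment_score({"lm": -1.2, "phoneme_detect": -3.0, "duration": 0.4}, weights)
partial = segment_score({"lm": -1.2}, weights)  # duration feature absent
```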
0:04:51 | Basically, there are three things to know about this model. First of all, these are conditional models, which means that they are actually discriminative. |
0:05:02 | Secondly, they are log-linear models, which means that you can use multiple features of different types and interpolate them to make the determination of whether the word is correct or not. |
0:05:17 | Most importantly, they are segmental models, which means that by allowing yourself to group observations, you can build features that operate globally on this group of observations. |
0:05:33 | Here is an example; for more information, we have a poster this afternoon describing the multiple approaches that we integrated into the segmental conditional random field framework, and you can see what kinds of features we can attach to a word. |
0:05:51 | Among the features we developed, one of them is another system: the MSR detections come from a system built by Microsoft Research that you can combine with the different hypotheses. |
0:06:07 | At the bottom you see phoneme detections that are extracted from a neural network, a multilayer perceptron. |
0:06:17 | In the middle you see our features; the duration feature, for instance, is just a number that you associate with a word hypothesis. |
0:06:30 | You can also see that sometimes we allow features to be missing; that is something the framework allows us to do. |
0:06:38 | So if you have different hypotheses, we can look at their durations and assign a different duration score to each word hypothesis, depending on whether the duration is plausible. |
0:07:00 | This is basically what we want to do: penalize implausible durations, since we showed that short durations are improbable. |
0:07:11 | Here is a real example from our data. The true transcription, a fragment, was "in a place called ...", a place which is, I think, somewhere in India, and it is a very rare word. |
0:07:26 | So what happens is that, through the backoff weights, the language model prefers to insert very short words instead of the true hypothesis. These are typically function words, which are very frequent. |
0:07:40 | And because they do not fit, they tend to be shorter: a word like "my" is shortened; it has to be compressed to fit, because it covers only a section of the real hypothesis. |
0:07:58 | So this is our goal: we need to penalize the words shown in red; the words in blue are correct, so we want to penalize the additional words. |
0:08:12 | The way we are going to do that is to produce two features, two scores. If you remember, these are the histograms of duration frequency when the word is recognized correctly or incorrectly, for the word "two"; the blue one is the correct one and the red one is the incorrect one. |
0:08:38 | So if you have a word hypothesis of "two" that is, say, twenty frames long, we look up the probability in each histogram, and you see that the blue one is higher than the red one. |
0:08:54 | Ultimately, the model is going to learn that this difference should have a positive weight: it should help any hypothesis that has a positive difference and penalize any hypothesis that has a negative difference. So anything that is around ten frames would get a large penalty. |
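Reading the two scores off the histograms might look like this; the floor value stands in for whatever smoothing was actually used, which the talk does not specify, and the toy counts are invented:

```python
import math

def duration_feature(word, dur, correct_hists, incorrect_hists, floor=1e-6):
    """Log-probability difference of a hypothesized duration under the
    'correct' vs. 'incorrect' histogram for this word.  Positive values
    support the hypothesis; negative values penalize it."""
    def prob(hist):
        total = sum(hist.values())
        return max(hist.get(dur, 0) / total, floor) if total else floor
    return math.log(prob(correct_hists[word])) - math.log(prob(incorrect_hists[word]))

correct_hists = {"two": {20: 8, 21: 2}}     # toy counts for illustration
incorrect_hists = {"two": {8: 9, 20: 1}}
plausible = duration_feature("two", 20, correct_hists, incorrect_hists)   # > 0
implausible = duration_feature("two", 8, correct_hists, incorrect_hists)  # < 0
```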
0:09:20 | The first thing we are going to do is look only at the top hundred words. The reason is that to draw these histograms we need enough samples to estimate them reliably. |
0:09:34 | Luckily, given the skewness of the task, the top hundred words actually cover fifty percent of the probability mass and fifty percent of the error mass. So although they are relatively few words, a tiny percentage of the word types, they account for fifty percent of the word tokens. |
0:10:00 | The next feature we looked at was long and short spans. The intuition here is that you have this phenomenon where the language model substitutes lots of small words for a large word: it is trying to break up a large, infrequent word into lots of small words. |
0:10:21 | So we want to distinguish, for instance, the case of "called" versus "calling" from the case of a rare word versus a string of short words. These are different kinds of errors: the first one, "called" versus "calling", is just a substitution, while the other one is an error of a different type. |
0:10:36 | So instead of producing two features, we are going to distinguish all of these cases and produce six features: whenever there is no special spanning, we produce two features for that case, and whenever a word spans differently, we produce two more features, so that weights can be assigned differently for these different cases. |
0:11:02 | We decided that a long span was a word that spans multiple words, and a short span was a word that is spanned by one word. |
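My reading of the six-feature scheme, as a sketch rather than the exact workshop implementation: each hypothesis word falls into one of three alignment cases, and its pair of duration scores is routed to case-specific feature slots so the SCRF can learn a separate weight for each combination.

```python
CASES = ("same", "long_span", "short_span")  # assumed case names

def span_duration_features(case, correct_score, incorrect_score):
    """Route the two duration scores to the slots for this alignment case,
    yielding 3 cases x 2 scores = 6 features; the other slots stay zero."""
    feats = {}
    for c in CASES:
        feats["dur_correct_" + c] = correct_score if c == case else 0.0
        feats["dur_incorrect_" + c] = incorrect_score if c == case else 0.0
    return feats

f = span_duration_features("short_span", 0.7, 1.3)
```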
0:11:14 | The second prior was that, as has been reported multiple times in the literature, a word before a pause tends to be pronounced more slowly, so it will have a longer duration, while in the middle of a sentence, or right after a pause, it will tend to have a normal duration, so to speak. |
0:11:37 | If you take the example sentence here, with "President Clinton" occurring before a pause and then "President Clinton said something", you can see that the second instance, the blue one, has a shorter duration. |
0:11:53 | So we can separate these and allow words that appear at the end of a sentence, or before a pause, to have a different duration model. |
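Splitting the duration models by pausal context can be sketched as a simple lookup; the pause and end-of-sentence token names here are assumptions, not from the talk:

```python
PAUSE_TOKENS = {"<pause>", "</s>"}  # assumed markers for pause / end of sentence

def pick_duration_model(next_token, pausal_model, normal_model):
    """Pre-pausal words are systematically longer, so words followed by a
    pause or the end of the sentence get their own duration model."""
    return pausal_model if next_token in PAUSE_TOKENS else normal_model

m1 = pick_duration_model("</s>", "pausal", "normal")
m2 = pick_duration_model("said", "pausal", "normal")
```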
0:12:08 | So we integrated this within the framework. We had a state-of-the-art IBM system as the baseline, on a broadcast news task; we combined it with the MSR system and got to fifteen point three, and then we added the duration features. |
0:12:34 | You can see that this does work: there are gains when you add the duration features, and when you add them with the different variants shown. |
0:12:48 | These individual duration features do not turn out to be as good as the other individual features we tried at the workshop, though. |
0:13:01 | So, in conclusion, I hope I have given an idea of how durations can be used for word discrimination, and conveyed the insight that misrecognized words tend to be shorter because they come from forcing by the language model, which tends to insert short function words. |
0:13:27 | We were able to turn this intuition into quantitative features that we could integrate into the segmental conditional random field framework, to penalize spurious word hypotheses individually through our duration scores. |
0:13:47 | We combined that with a state-of-the-art system and still saw a small improvement. |
0:13:57 | Okay. |
0:14:05 | [Audience question, partly unintelligible: the questioner suggests that other factors could affect the duration of a word, not only one cue but also the way it sounds.] |
0:14:24 | Yes, that is interesting; we have not looked at that yet. But one thing we did look at, which I think has also been reported in the literature and is interesting, is the duration of each phone within the word. You can see that they actually differ: depending on whether the stress is correct, you see differences in the duration. |
0:14:53 | Any other questions? |