0:00:18 | [inaudible] |
---|
0:00:29 | My presentation today is on improving the ROVER system. |
---|
0:00:36 | [inaudible] |
---|
0:00:41 | Here is a quick outline of my talk. |
---|
0:00:44 | First, we start by presenting the ROVER system, |
---|
0:00:49 | then we outline our proposed approach to improve the system, |
---|
0:00:54 | followed by some experimental results, |
---|
0:00:57 | and we conclude with a summary and future directions. |
---|
0:01:01 | So the motivation of our work: |
---|
0:01:03 | due to the widespread use of large-vocabulary continuous speech recognition systems |
---|
0:01:09 | and the abundance of applications using this type of system, |
---|
0:01:14 | there is an increasing requirement for higher accuracy and robustness |
---|
0:01:18 | in the speech decoders. |
---|
0:01:21 | Some of the common solutions to these problems are |
---|
0:01:25 | enhancing the feature extraction; |
---|
0:01:29 | we can also combine different speech features, |
---|
0:01:35 | and we can also combine the outputs of several decoders. |
---|
0:01:43 | In this work we focus on this last family of solutions. |
---|
0:01:48 | So for output combination, |
---|
0:01:51 | some of the common techniques in this type of approach are |
---|
0:01:58 | the Recognizer Output Voting Error Reduction (ROVER) system, |
---|
0:02:02 | the Confusion Network Combination (CNC), |
---|
0:02:05 | and the minimum time frame word error rate framework. |
---|
0:02:08 | The idea is to combine |
---|
0:02:09 | the different outputs coming from different speech decoders into one single composite output |
---|
0:02:16 | that will hopefully lead to a reduced word error rate. |
---|
0:02:21 | So we are going to be focusing on |
---|
0:02:23 | ROVER: |
---|
0:02:24 | we are trying to improve this ROVER system. |
---|
0:02:27 | ROVER was developed by Jonathan Fiscus in 1997 |
---|
0:02:32 | within NIST, |
---|
0:02:34 | and the goal is to produce a composite ASR output with a reduced word error rate. |
---|
0:02:38 | This technique is now |
---|
0:02:41 | regarded as a baseline: any new combination technique for decoder outputs |
---|
0:02:47 | is compared |
---|
0:02:48 | to the ROVER technique. |
---|
0:02:53 | The ROVER process is a two-step process. |
---|
0:02:56 | First, it starts by creating a composite |
---|
0:02:59 | word transition network from |
---|
0:03:01 | the outputs of the different speech decoders. |
---|
0:03:04 | Then this network is browsed by a voting algorithm |
---|
0:03:09 | that tries to select the best |
---|
0:03:12 | word at each slot of the word transition network. |
---|
0:03:16 | To do this, the original paper presented back in 1997 |
---|
0:03:21 | proposed three voting schemes: |
---|
0:03:25 | one of them uses only the frequency of occurrence at each |
---|
0:03:29 | slot of the network, |
---|
0:03:30 | and the two others |
---|
0:03:32 | also use the word confidence scores. |
---|
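For illustration, a minimal sketch of the frequency-of-occurrence vote over a composite word transition network (a toy version, not the NIST ROVER toolkit), assuming the network is stored as a list of slots, each slot holding one hypothesis per decoder, with "@" marking a null arc:

```python
from collections import Counter

NULL = "@"  # null-arc token used when a decoder proposes no word at a slot

def frequency_vote(wtn_slots):
    """Pick, at each slot of the composite word transition network,
    the hypothesis with the highest frequency of occurrence."""
    output = []
    for slot in wtn_slots:                      # slot: one hypothesis per decoder
        best, _ = Counter(slot).most_common(1)[0]
        if best != NULL:                        # null arcs do not emit a word
            output.append(best)
    return output

# toy example: three decoders, three slots
slots = [["the", "the", "the"], ["cat", "cat", "hat"], ["sat", NULL, "sat"]]
print(frequency_vote(slots))                    # -> ['the', 'cat', 'sat']
```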
0:03:36 | So basically this is the main scoring equation of the ROVER system: |
---|
0:03:41 | the score of a word at a slot |
---|
0:03:48 | trades off its frequency of occurrence |
---|
0:03:54 | against its confidence value. |
---|
0:04:06 | I will skip the details of the voting schemes; |
---|
0:04:08 | we don't need to worry about these for now. |
---|
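For reference, the slot-level scoring formula from the original ROVER paper (Fiscus, 1997) that this slide refers to is

$$\mathrm{Score}(w,i) \;=\; \alpha\,\frac{N(w,i)}{N_s} \;+\; (1-\alpha)\,C(w,i),$$

where $N(w,i)$ is the number of decoders proposing word $w$ at slot $i$, $N_s$ is the number of combined decoders, $C(w,i)$ is the confidence of $w$ at slot $i$, and $\alpha$ is the trade-off weight; $\alpha = 1$ gives the pure frequency-of-occurrence vote.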
0:04:10 | Some of the shortcomings of the ROVER system: this scoring mechanism, the voting mechanism, |
---|
0:04:16 | only works if the different hypotheses |
---|
0:04:19 | coming from each decoder are different from each other; |
---|
0:04:23 | otherwise, even if we combine them together, there is no |
---|
0:04:26 | gain, because we will end up with the same output. |
---|
0:04:30 | Also, the alignment into the composite word transition network |
---|
0:04:35 | does not guarantee the optimal |
---|
0:04:37 | result: |
---|
0:04:38 | if you combine outputs A and B, |
---|
0:04:41 | the result is different from combining B and A. |
---|
0:04:44 | This technique is also vulnerable because |
---|
0:04:47 | two of the voting mechanisms use the confidence values, |
---|
0:04:50 | which are still not reliable in speech recognition. |
---|
0:04:55 | Also, we ignore everything beyond the one-best sequence |
---|
0:04:59 | from each recognizer, |
---|
0:05:01 | and the combination is unable to recover the correct output |
---|
0:05:05 | when only one single ASR outputs the correct |
---|
0:05:09 | sequence of words. |
---|
0:05:11 | Several works |
---|
0:05:13 | have been done |
---|
0:05:17 | to try to fix these problems, |
---|
0:05:20 | especially using machine learning techniques |
---|
0:05:23 | in the voting mechanism, |
---|
0:05:25 | but still the performance of the system has reached a plateau, |
---|
0:05:30 | and it is very difficult |
---|
0:05:31 | to further reduce the word error rate. |
---|
0:05:34 | So our proposed approach here is to |
---|
0:05:39 | inject a contextual, word-level analysis |
---|
0:05:42 | before the voting mechanism, to try to filter out and remove the errors |
---|
0:05:47 | from the composite word transition network |
---|
0:05:49 | before applying the voting mechanism. |
---|
0:05:53 | Once the words are aligned in the composite network, we remove the errors and then we |
---|
0:05:58 | apply the usual ROVER voting. |
---|
0:06:04 | Let's start by presenting the error detection technique first. We have to define a few terms. |
---|
0:06:09 | The neighbourhood of a word is the set of words |
---|
0:06:11 | in its left and right context. |
---|
0:06:13 | The PMI (pointwise mutual information) of two words is nothing else than the log of the probability of these words occurring |
---|
0:06:18 | together, |
---|
0:06:19 | divided by the product |
---|
0:06:22 | of their individual probabilities. |
---|
0:06:23 | These probabilities we can estimate from |
---|
0:06:25 | a large corpus: |
---|
0:06:26 | the number of occurrences of the word divided by the total number of words. |
---|
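In standard notation, the pointwise mutual information described here and its corpus estimate are

$$\mathrm{PMI}(w_i, w_j) \;=\; \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}, \qquad P(w_i) \approx \frac{C(w_i)}{N}, \qquad P(w_i, w_j) \approx \frac{C(w_i, w_j)}{N},$$

where $C(\cdot)$ is a count from the reference corpus and $N$ is the corpus size.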
0:06:29 | then once we have a the in my in my four point |
---|
0:06:32 | uh |
---|
0:06:33 | why it information |
---|
0:06:35 | once we have the my we can |
---|
0:06:37 | we the same and the coherence of values |
---|
0:06:40 | and was uh |
---|
0:06:41 | how to money mean from the M my i i mean like |
---|
0:06:45 | or each |
---|
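A plausible formalization of the three aggregations the speaker mentions (the exact definitions are in the paper): for a word $w$ whose neighbourhood yields PMI scores $p_1, \dots, p_k$,

$$\mathrm{Sem}_{\mathrm{sum}}(w) = \sum_{m=1}^{k} p_m, \qquad \mathrm{Sem}_{\max}(w) = \max_{1 \le m \le k} p_m, \qquad \mathrm{Sem}_{\mathrm{hm}}(w) = \frac{k}{\sum_{m=1}^{k} 1/p_m}.$$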
0:06:47 | The algorithm of this error classifier, the error detection, is as follows. |
---|
0:06:51 | Given a sentence, we first compute |
---|
0:06:53 | the neighbourhood of each word, |
---|
0:06:55 | then we compute the PMI scores |
---|
0:06:57 | for all of the pairs of words in that sentence. |
---|
0:07:01 | Then we compute the semantic score, as we showed before, using either the harmonic mean, maximum or summation. |
---|
0:07:09 | Once we have computed the semantic scores, we compute the average of all these scores, |
---|
0:07:14 | and then we can flag whether a word is an error: |
---|
0:07:18 | if the semantic score of that word is less |
---|
0:07:22 | than this average by a given threshold, that means that |
---|
0:07:26 | the word is an error; otherwise it will be tagged as a correct output. |
---|
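A minimal sketch of this outlier-style error detector, assuming the summation aggregation, a symmetric two-word context window, and plain count dictionaries for the corpus statistics; the names, the window size and the smoothing of unseen pairs are illustrative, not taken from the paper:

```python
import math

def pmi(w1, w2, unigram, bigram, n_words):
    """Pointwise mutual information from corpus counts (crude add-half
    smoothing for unseen pairs; a real system would smooth properly)."""
    p_joint = bigram.get((w1, w2), 0.5) / n_words
    p1 = unigram.get(w1, 1) / n_words
    p2 = unigram.get(w2, 1) / n_words
    return math.log(p_joint / (p1 * p2))

def detect_errors(sentence, unigram, bigram, n_words, context=2, delta=0.0):
    """Flag words whose semantic score falls below the sentence
    average by more than the threshold delta (True = flagged)."""
    scores = []
    for i, w in enumerate(sentence):
        lo, hi = max(0, i - context), min(len(sentence), i + context + 1)
        neighbourhood = [sentence[j] for j in range(lo, hi) if j != i]
        # summation aggregation of the PMI scores; max or harmonic mean also possible
        scores.append(sum(pmi(w, v, unigram, bigram, n_words) for v in neighbourhood))
    avg = sum(scores) / len(scores)
    return [s < avg - delta for s in scores]
```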
0:07:32 | So the second part of the approach is integrating this error detection within the ROVER process. |
---|
0:07:38 | So we first |
---|
0:07:39 | build the composite word transition network; |
---|
0:07:41 | this is the word transition network |
---|
0:07:44 | built from the |
---|
0:07:46 | outputs of the decoders being combined. |
---|
0:07:51 | On this network |
---|
0:07:57 | we run the error classifier on each word; |
---|
0:08:04 | whenever a word is flagged as an error, |
---|
0:08:07 | it is removed |
---|
0:08:10 | and replaced by a null transition |
---|
0:08:11 | in the composite word transition network. |
---|
0:08:15 | So the algorithm works as follows: |
---|
0:08:17 | we go through each word |
---|
0:08:20 | of the network; |
---|
0:08:21 | whenever a word is flagged by the error classifier, we remove the word |
---|
0:08:25 | and replace it by the null transition, |
---|
0:08:27 | and then we apply the voting algorithm. |
---|
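Putting the two previous sketches together, a minimal illustration of the proposed filtering step; `NULL`, `frequency_vote` and `detect_errors` are the toy helpers defined earlier, not the paper's implementation:

```python
def crover(wtn_slots, unigram, bigram, n_words):
    """Run the PMI-based error classifier on each decoder's hypothesis,
    replace flagged words by the null transition, then vote as usual."""
    n_decoders = len(wtn_slots[0])
    filtered = [list(slot) for slot in wtn_slots]
    for d in range(n_decoders):
        # reconstruct this decoder's word sequence (skipping null arcs)
        sentence = [slot[d] for slot in wtn_slots if slot[d] != NULL]
        flags = iter(detect_errors(sentence, unigram, bigram, n_words))
        for slot in filtered:
            if slot[d] != NULL and next(flags):
                slot[d] = NULL          # flagged word becomes a null transition
    return frequency_vote(filtered)
```

Flagged words simply stop attracting votes; the voting step itself is left untouched, which is why the approach is compatible with any of the ROVER voting schemes.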
0:08:31 | So, some experimental results. |
---|
0:08:32 | For the experimental framework we used [unclear], the latest release, for |
---|
0:08:37 | the recorded speech, |
---|
0:08:40 | and the CMU open-source Sphinx-4, a Java-based |
---|
0:08:45 | speech decoder. |
---|
0:08:47 | We used [unclear] for the acoustic model. |
---|
0:08:49 | We set up three decoder configurations: the first decoder |
---|
0:08:54 | with its language model, |
---|
0:08:56 | and then Sphinx-4 with two different language models. |
---|
0:09:00 | The PMI counts, as shown before, require probabilities, so we had to use |
---|
0:09:04 | a huge corpus; we used |
---|
0:09:07 | an open corpus |
---|
0:09:09 | with seventeen million unigrams and three hundred and four |
---|
0:09:13 | million bigrams |
---|
0:09:14 | to get those frequencies. |
---|
0:09:16 | The evaluation measures are |
---|
0:09:18 | the word error rate, |
---|
0:09:20 | i.e. the number of deletions, substitutions |
---|
0:09:23 | and insertions divided by the number of |
---|
0:09:25 | words in the reference; |
---|
0:09:28 | precision and |
---|
0:09:29 | recall of the error detection, |
---|
0:09:31 | for the positive and negative classes; |
---|
0:09:32 | the F-measure, which is the harmonic mean of precision and recall; |
---|
0:09:36 | and the negative |
---|
0:09:44 | predictive value. |
---|
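In the usual notation, with $S$, $D$, $I$ the substitution, deletion and insertion counts, $N$ the number of reference words, $P$/$R$ the precision/recall of the error detector, and $TN$/$FN$ the true and false negatives:

$$\mathrm{WER} = \frac{S + D + I}{N}, \qquad F = \frac{2PR}{P + R}, \qquad \mathrm{NPV} = \frac{TN}{TN + FN}.$$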
0:09:47 | So let's first show the assessment of the error classifier by itself, before we |
---|
0:09:52 | integrate it within the ROVER system. |
---|
0:09:54 | Here we have, for each measure |
---|
0:09:56 | and each classifier setting, |
---|
0:09:58 | the results as we vary the threshold; |
---|
0:10:01 | this threshold |
---|
0:10:06 | controls how aggressive the filtering of the errors is. |
---|
0:10:10 | We also plot the different aggregations |
---|
0:10:15 | of the PMI for all the semantic |
---|
0:10:17 | scores, |
---|
0:10:18 | and we notice here that most of the aggregations give roughly the same |
---|
0:10:24 | results; |
---|
0:10:26 | only when the threshold changes do |
---|
0:10:28 | some of them |
---|
0:10:31 | give better results. |
---|
0:10:37 | So, |
---|
0:10:38 | on the next slide, |
---|
0:10:43 | we can see that |
---|
0:10:45 | the precision |
---|
0:10:47 | is relatively low, |
---|
0:10:48 | because we are tagging correct words as |
---|
0:10:51 | incorrect words. |
---|
0:10:52 | So now for the assessment of cROVER itself: we have done two experiments. We have applied the error |
---|
0:10:57 | detection on |
---|
0:10:59 | all the words, and then |
---|
0:11:01 | on all the words except the stop words (we removed the stop words); |
---|
0:11:05 | I will explain why later on. |
---|
0:11:07 | For the ROVER voting we kept the frequency-based setting, |
---|
0:11:11 | and we report experiments for |
---|
0:11:14 | both configurations. |
---|
0:11:16 | So we can see |
---|
0:11:17 | that we gain up to one point five percent |
---|
0:11:22 | reduction in |
---|
0:11:24 | the word error rate. |
---|
0:11:28 | We also notice that the results are better when we remove the stop words, |
---|
0:11:32 | because, by the definition of a stop word, |
---|
0:11:34 | it is a word that lacks semantic meaning, |
---|
0:11:36 | and if you look at the formula for the PMI, |
---|
0:11:40 | it tries to see |
---|
0:11:42 | whether a word is an outlier; |
---|
0:11:44 | with a stop word it is very difficult to tell whether it is an outlier in the sentence or |
---|
0:11:49 | not. |
---|
0:11:50 | [inaudible] |
---|
0:11:54 | Now for the three-decoder combination: |
---|
0:12:07 | here we have |
---|
0:12:10 | only one error classifier |
---|
0:12:16 | applied to filter |
---|
0:12:18 | the ASR |
---|
0:12:20 | outputs, |
---|
0:12:26 | and when we do that we see that |
---|
0:12:28 | we still get an improvement. |
---|
0:12:35 | To summarize, we have proposed in this paper an approach to improve the ROVER system, which we |
---|
0:12:41 | call cROVER. |
---|
0:12:43 | We inject |
---|
0:12:46 | a contextual word analysis |
---|
0:12:47 | through the use of an error classifier, |
---|
0:12:52 | a PMI-based error classifier, |
---|
0:12:53 | and we achieved up to one point five percent |
---|
0:12:57 | word error rate reduction. |
---|
0:12:59 | For future directions, |
---|
0:13:00 | we can use other |
---|
0:13:02 | error classifiers, |
---|
0:13:04 | like the LSA, the latent semantic analysis, error classifier. |
---|
0:13:09 | We can also combine classifiers to compensate for the low precision rate. |
---|
0:13:18 | We also plan to |
---|
0:13:23 | investigate |
---|
0:13:26 | the additional complexity |
---|
0:13:27 | of cROVER and the scalability of this system. |
---|
0:13:31 | Thank you. |
---|
0:13:35 | Are there any questions? |
---|
0:13:40 | [inaudible] |
---|
0:13:43 | So, in your presentation, |
---|
0:13:45 | did you try to use the confidence scores computed on the words? |
---|
0:13:49 | Yeah, that's a good question. In this paper we only used one of |
---|
0:13:54 | the voting schemes, the frequency-based one, |
---|
0:13:57 | because most of the confidence values, if not all of them, coming |
---|
0:14:02 | from the speech decoders |
---|
0:14:03 | are useless: they are all the same value. |
---|
0:14:07 | So we don't use them. |
---|
0:14:09 | When you have a sentence, all the words have a confidence value of one, |
---|
0:14:13 | so basically we cannot use them and see the impact of using the confidence values |
---|
0:14:19 | with this approach. |
---|
0:14:24 | But this approach can be applied to |
---|
0:14:29 | any voting mechanism: we are not touching the voting part; we are trying |
---|
0:14:34 | to remove errors and then |
---|
0:14:36 | go back to the original ROVER, |
---|
0:14:38 | so it doesn't affect the voting. |
---|
0:14:46 | Yes, |
---|
0:14:47 | ROVER provides the three voting mechanisms and you can choose whichever you like. |
---|
0:14:52 | Since we don't have a good confidence measure, |
---|
0:14:54 | we don't use those; we use the first one, the frequency vote. |
---|
0:15:00 | Any questions? |
---|
0:15:04 | Any other questions? |
---|
0:15:05 | Well, maybe, can you please |
---|
0:15:07 | comment on the computational complexity of the cROVER system? |
---|
0:15:13 | Yeah, for cROVER, because we have to run |
---|
0:15:19 | this error detection |
---|
0:15:20 | classifier, |
---|
0:15:21 | we will indeed need more computation. |
---|
0:15:27 | Also, I think we have to have a huge corpus |
---|
0:15:30 | to be able to extract those probabilities. |
---|
0:15:33 | How this affects things in terms of time |
---|
0:15:36 | and in terms of CPU power, |
---|
0:15:37 | we still have to quantify with |
---|
0:15:42 | measurements. |
---|
0:15:45 | Mm-hm. |
---|
0:15:46 | Any other questions? |
---|
0:15:54 | [inaudible question from the audience] |
---|
0:16:03 | So, yes, we can still get better results, because |
---|
0:16:07 | what we are actually |
---|
0:16:08 | doing is applying the voting mechanism afterwards. |
---|
0:16:12 | So if we remove the errors, |
---|
0:16:13 | and we have correctly |
---|
0:16:14 | flagged them, |
---|
0:16:15 | then the voting mechanism |
---|
0:16:18 | will for sure |
---|
0:16:19 | improve the output. |
---|
0:16:24 | OK, if there are no more questions, let's thank the speaker. |
---|