0:00:15 | uh welcome to, uh, i guess |
---|
0:00:18 | good morning everyone |
---|
0:00:20 | and first a couple of practical announcements |
---|
0:00:24 | we have a change of rooms |
---|
0:00:26 | you know that club B was really small and we were afraid that people would not |
---|
0:00:31 | fit in |
---|
0:00:32 | so uh we moved everything from club B, and the expert sessions from club E, |
---|
0:00:38 | to the north hall |
---|
0:00:39 | that's actually the hall on the second floor |
---|
0:00:45 | and we should have more space there |
---|
0:00:49 | uh actually |
---|
0:00:50 | club B |
---|
0:00:51 | should be closed now |
---|
0:00:53 | signs |
---|
0:00:54 | will be there |
---|
0:00:55 | then uh for the internet, really sorry for the trouble |
---|
0:00:59 | that was caused by our wifi provider |
---|
0:01:03 | um |
---|
0:01:05 | the problems should be resolved now |
---|
0:01:08 | so it should be available again |
---|
0:01:10 | but please uh |
---|
0:01:13 | we have just |
---|
0:01:15 | five hundred twelve addresses available |
---|
0:01:17 | there is no |
---|
0:01:18 | way |
---|
0:01:19 | to get more |
---|
0:01:20 | so please disconnect when you |
---|
0:01:22 | do not need to be connected, and this especially concerns devices that just stay connected all the time |
---|
0:01:33 | then uh for the banquet tonight, you know |
---|
0:01:35 | we have an issue |
---|
0:01:37 | you need a ticket |
---|
0:01:38 | i'm sorry for that, but if you don't have it you will not be allowed to get on the |
---|
0:01:43 | bus |
---|
0:01:45 | there is a very limited number of tickets still available uh at the registration desk |
---|
0:01:51 | then the buses depart right at the |
---|
0:01:56 | quarter to seven, in front of door number ten |
---|
0:01:59 | and the transportation back from the banquet venue |
---|
0:02:01 | is not provided, so |
---|
0:02:03 | you may |
---|
0:02:05 | continue your evening there uh and make your own way back |
---|
0:02:12 | and uh i'm pretty much done, so uh there will be a short introduction |
---|
0:02:17 | of the speaker |
---|
0:02:27 | [inaudible] |
---|
0:03:09 | there is time for the second one |
---|
0:03:12 | so uh |
---|
0:03:14 | it's going to be given by |
---|
0:03:16 | nelson morgan |
---|
0:03:17 | from icsi berkeley |
---|
0:03:19 | and uh |
---|
0:03:21 | [name inaudible] |
---|
0:03:23 | will introduce the speaker and chair the session |
---|
0:03:31 | thank you very much for coming |
---|
0:03:35 | [inaudible] |
---|
0:03:36 | it is my great |
---|
0:03:38 | pleasure |
---|
0:03:40 | to introduce someone |
---|
0:03:45 | who probably |
---|
0:03:52 | for those of you |
---|
0:03:54 | who have been working in speech for a very long time |
---|
0:03:57 | has contributed |
---|
0:03:58 | a number of techniques |
---|
0:04:00 | i a |
---|
0:04:02 | and also educated a number of |
---|
0:04:05 | a number of you in the audience, at ICSI |
---|
0:04:08 | so |
---|
0:04:09 | for those people |
---|
0:04:12 | he doesn't need much of an introduction |
---|
0:04:15 | for those of you who don't know him |
---|
0:04:17 | you should know |
---|
0:04:19 | his work |
---|
0:04:20 | he is one of the |
---|
0:04:22 | signal processing |
---|
0:04:24 | pioneers |
---|
0:04:25 | and |
---|
0:04:26 | there is now a new edition |
---|
0:04:28 | of his book |
---|
0:04:31 | on the uh |
---|
0:04:33 | what else can i say, well i think that his talk will tell you more than i could |
---|
0:04:37 | so i will leave you, it will be better than |
---|
0:04:40 | looking at me |
---|
0:04:42 | [applause] |
---|
0:04:55 | well i thought it was time for a little bit of a reality check |
---|
0:04:59 | and uh speech recognition |
---|
0:05:01 | and |
---|
0:05:02 | it's been around for a long time as i think everybody here knows |
---|
0:05:06 | very long research history |
---|
0:05:08 | uh lots of publications for decades many projects |
---|
0:05:12 | and many sponsored projects |
---|
0:05:14 | systems have continually gotten better |
---|
0:05:17 | they've actually tended to converge so that there is |
---|
0:05:20 | in some sense a a standard |
---|
0:05:22 | automatic speech recognition system now |
---|
0:05:24 | uh it's made it to a lot of commercial products |
---|
0:05:28 | actually been used |
---|
0:05:29 | actually works from time to time |
---|
0:05:32 | and so in some sense |
---|
0:05:33 | it seems to have graduated |
---|
0:05:36 | but |
---|
0:05:39 | yet it fails where humans don't |
---|
0:05:41 | and by the way those of you who have your P H Ds |
---|
0:05:44 | know that your education hopefully was not done at that point |
---|
0:05:49 | and there's probably a lot more to do here |
---|
0:05:51 | uh some would argue |
---|
0:05:53 | that there is little basic science that's been developed in quite a bit of time |
---|
0:05:58 | lots of good engineering methods though |
---|
0:06:00 | but they often require a great amount of data |
---|
0:06:03 | uh as we learned yesterday there is a great deal of data |
---|
0:06:07 | but not all of it is |
---|
0:06:08 | available for use in the way that you like |
---|
0:06:10 | and and are many tasks where you don't have that much |
---|
0:06:13 | and each new task requires |
---|
0:06:15 | uh essentially the same amount of effort you sort of have to start over again |
---|
0:06:20 | so how do we get to this point |
---|
0:06:21 | this is not gonna be anything like a complete history but |
---|
0:06:25 | enough to make my point, hopefully |
---|
0:06:27 | so |
---|
0:06:28 | i'm gonna talk about the current status and the standard methods |
---|
0:06:31 | a very briefly |
---|
0:06:33 | uh talk about some of the alternatives the people have worked with over the years |
---|
0:06:37 | and where could we go from here |
---|
0:06:41 | so |
---|
0:06:41 | as i mentioned |
---|
0:06:42 | speech recognition research has been around for a very long time |
---|
0:06:46 | uh with significant papers for sixty years |
---|
0:06:50 | by the nineteen seventies |
---|
0:06:52 | in some sense the major advances in modeling had happened |
---|
0:06:56 | that is the basic |
---|
0:06:57 | mathematics behind hidden markov models |
---|
0:07:00 | was done by then |
---|
0:07:02 | there have been lots of improvements |
---|
0:07:03 | that happened uh for the next twenty years or so |
---|
0:07:06 | and also in the features |
---|
0:07:08 | which became |
---|
0:07:09 | more or less standard by nineteen ninety or so |
---|
0:07:12 | there were some really important methodology improvements by nineteen ninety; in earlier days |
---|
0:07:17 | people did many experiments but it was very hard to compare them |
---|
0:07:20 | and the notions of standard evaluations and standard datasets really took hold by nineteen ninety or so |
---|
0:07:27 | and over the all of these years |
---|
0:07:29 | uh especially the last twenty or thirty years there have been continuous improvements |
---|
0:07:33 | which were to some extent really closely related to Moore's law |
---|
0:07:36 | movements in the technology |
---|
0:07:38 | that is |
---|
0:07:39 | um more and more computational capability |
---|
0:07:42 | more and more storage capability |
---|
0:07:44 | allowing people to work with very large datasets |
---|
0:07:46 | and develop very large models to well represent those large datasets |
---|
0:07:51 | so on |
---|
0:07:53 | so |
---|
0:07:54 | there's an elephant in the room, which is that things |
---|
0:07:57 | are not entirely working still |
---|
0:08:00 | and these systems in fact have converged |
---|
0:08:02 | which was kind of a byproduct of all of these standard evaluations, which were |
---|
0:08:06 | very good in many ways |
---|
0:08:08 | but |
---|
0:08:09 | when people found out that the other group |
---|
0:08:11 | had something that they didn't, they would copy it, and very soon the systems would become very much the same |
---|
0:08:18 | so |
---|
0:08:19 | what are some of the remaining problems |
---|
0:08:22 | well |
---|
0:08:22 | systems still perform pretty poorly, despite a large amount of work on this |
---|
0:08:27 | in the presence of significant amounts of acoustic noise |
---|
0:08:30 | also reverberation |
---|
0:08:32 | which is natural for |
---|
0:08:34 | just about any situation |
---|
0:08:37 | uh unexpected speaking rate or accent |
---|
0:08:39 | that is, by unexpected i mean something that is not well represented in the training set |
---|
0:08:45 | uh and unfamiliar topics |
---|
0:08:47 | uh the language models bring us a lot of the performance that we have, and if you |
---|
0:08:51 | don't have a particular topic represented in the language model, it can do poorly |
---|
0:08:57 | and |
---|
0:08:57 | apart from the recognition performance per se, how many words you get right |
---|
0:09:01 | another thing that's important is knowing whether you're right or wrong |
---|
0:09:05 | and that's very important for practical applications |
---|
0:09:08 | and that still needs some work as well |
---|
0:09:12 | so it turns out that even some fairly simple speech recognition tasks can still fail under some of these conditions |
---|
0:09:17 | yielding some strange results |
---|
0:09:20 | [a video clip is played: a comedy sketch about voice recognition technology] |
---|
0:10:45 | so that was funny |
---|
0:10:47 | i hope you think it was funny but |
---|
0:10:49 | what |
---|
0:10:49 | hasn't worked in real life as opposed to just the jokes |
---|
0:10:53 | and what has |
---|
0:10:56 | so uh let me start off with |
---|
0:10:58 | uh |
---|
0:10:59 | some results from some of these standard evaluations i referred to |
---|
0:11:03 | this is a graph that people in speech have seen a million times |
---|
0:11:06 | uh |
---|
0:11:07 | as is this other one |
---|
0:11:09 | um |
---|
0:11:10 | for those of you who aren't familiar with this, the main thing to note is that uh W E |
---|
0:11:14 | R stands for word error rate |
---|
0:11:16 | a high word error rate is obviously bad, and this is time on the axis |
---|
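As a side note on the metric being discussed here, a minimal sketch (not from the talk) of how word error rate is typically computed, as a word-level edit distance divided by the reference length:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    computed by Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 100 percent when the hypothesis has many insertions, which is why some of the early curves on these graphs start so high.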
0:11:20 | and each of these lines represents a series of tests |
---|
0:11:23 | oh this is a kind of messy graph, so let's clean it up a little |
---|
0:11:26 | and |
---|
0:11:27 | uh this is uh a task done in the early nineties uh called ATIS |
---|
0:11:32 | and the main thing to see here, as with a lot of these, is that it starts off at a |
---|
0:11:35 | pretty high error rate, people work for a while |
---|
0:11:38 | and after a while it gets down to uh a pretty reasonable error rate |
---|
0:11:43 | let's go to another one, this was uh |
---|
0:11:45 | conversational telephone speech |
---|
0:11:47 | you have the same sort of effect, and do remember that this is um |
---|
0:11:52 | a logarithmic scale here |
---|
0:11:53 | so even though it looks like it hasn't come down very far, it really did come down pretty far, but after a |
---|
0:11:58 | while it sort of levels off |
---|
0:12:00 | uh more recently there's been a bunch of work on speech from meetings, which is also conversational |
---|
0:12:05 | these are from the uh individual head mounted microphones |
---|
0:12:09 | so we still didn't have huge effects of background noise or reverberation or anything |
---|
0:12:14 | and there wasn't actually a huge amount of progress after some of the initial uh initial work |
---|
0:12:20 | uh now these are |
---|
0:12:21 | the evaluations |
---|
0:12:23 | uh what about commercial products |
---|
0:12:25 | i think |
---|
0:12:26 | uh you know |
---|
0:12:27 | a lot of the information is proprietary |
---|
0:12:29 | but i think what we can say is that |
---|
0:12:31 | commercial products work some of the time for some people |
---|
0:12:34 | and they often don't work |
---|
0:12:35 | for others |
---|
0:12:37 | so what is the state |
---|
0:12:39 | well the recognition systems will either |
---|
0:12:42 | work really well for somebody |
---|
0:12:44 | or they'll be terribly brittle and unreliable |
---|
0:12:47 | uh i know that when my wife and i both tried uh dictation systems, they worked wonderfully for |
---|
0:12:52 | her and terribly for me; i think i swallow my words or something |
---|
0:12:57 | so here's an abbreviated review |
---|
0:12:59 | of what's standard |
---|
0:13:01 | by ninety ninety one |
---|
0:13:03 | we had |
---|
0:13:05 | uh feature extraction |
---|
0:13:06 | basically being based on frames every ten milliseconds or so |
---|
0:13:10 | computing |
---|
0:13:11 | something from a short-term spectrum |
---|
0:13:14 | uh i things called mel-frequency cepstral coefficients |
---|
0:13:17 | i'll |
---|
0:13:18 | mention a bit more about that in a second |
---|
0:13:20 | uh |
---|
0:13:21 | P L P is another common method developed by then |
---|
0:13:25 | delta cepstra |
---|
0:13:26 | uh |
---|
0:13:26 | uh essentially temporal derivatives of the cepstra |
---|
0:13:30 | and on the statistical side |
---|
0:13:32 | uh acoustic modeling hidden markov models were quite standard |
---|
0:13:36 | it typically by this point represented |
---|
0:13:38 | context-dependent phonemes or phoneme-like units |
---|
0:13:42 | uh the language models are pretty much by this time all statistical |
---|
0:13:46 | and they represented context-dependent words |
---|
0:13:50 | so all this we had by nineteen ninety-one |
---|
0:13:52 | now let's move to two thousand eleven |
---|
0:13:56 | there it is |
---|
0:13:58 | uh |
---|
0:13:59 | notice all the changes |
---|
0:14:02 | okay that's a little unfair |
---|
0:14:04 | uh people actually have done work in the last twenty years |
---|
0:14:07 | and this is |
---|
0:14:09 | a representation of a lot of it, i think |
---|
0:14:11 | and these have had big effects |
---|
0:14:13 | i don't mean to minimize |
---|
0:14:14 | them |
---|
0:14:15 | uh various kinds of normalisation, uh the mean and variance kind of normalisation |
---|
0:14:20 | uh an online version of that that we called RASTA |
---|
0:14:23 | uh vocal tract length normalisation which |
---|
0:14:26 | compresses or expands the spectrum and |
---|
0:14:29 | in such a way as to match the models better |
---|
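The mean and variance kind of normalisation mentioned above can be sketched as follows; this is a per-utterance version for illustration, not the online RASTA-style filtering also mentioned:

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalisation over an utterance.
    features: (num_frames, num_coeffs) array. Each coefficient track
    is shifted to zero mean and scaled to unit variance, which removes
    fixed channel effects from the cepstra."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, 1e-8)
```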
0:14:33 | um |
---|
0:14:34 | and uh then |
---|
0:14:35 | adaptation and feature transformation |
---|
0:14:38 | uh either adapting better to a test set that's somewhat different from the training set |
---|
0:14:42 | uh or uh |
---|
0:14:44 | various changes to make the features more discriminative |
---|
0:14:49 | discriminative training |
---|
0:14:51 | actually |
---|
0:14:52 | uh changing the statistical models |
---|
0:14:55 | in such a way as to make them more discriminant between different speech sounds |
---|
0:14:59 | we did have more and more data over the years, and that required |
---|
0:15:03 | lots of work to figure out how to handle that |
---|
0:15:05 | but aside from handling it was also taking advantage of lots of data |
---|
0:15:09 | which didn't come for free, so there was lots of engineering work there |
---|
0:15:14 | uh people found that |
---|
0:15:15 | combining systems helped and sometimes combining |
---|
0:15:18 | pieces of systems helped |
---|
0:15:20 | and that's been an important thing in improving uh performance |
---|
0:15:24 | and because |
---|
0:15:25 | uh speech recognition was starting to go into applications you had to be concerned about speed |
---|
0:15:30 | and there's been a lot of work on that |
---|
0:15:33 | well, a bit more on some of this uh |
---|
0:15:35 | the main point uh about mel cepstrum and plp i wanna make is that |
---|
0:15:40 | each of them uses this kind of warped frequency scale |
---|
0:15:43 | uh in which you have better resolution at low frequencies than at high frequencies |
---|
0:15:47 | "'cause" our perception of different uh |
---|
0:15:50 | speech sounds is very different at low frequencies versus high frequencies |
---|
0:15:53 | mel cepstrum and plp use different mechanisms |
---|
0:15:57 | for getting a smooth spectrum uh |
---|
0:16:00 | delta cepstrum uh |
---|
0:16:02 | uh as i said, is basically |
---|
0:16:05 | uh time derivatives uh of the cepstrum |
---|
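A sketch of the two ideas just described: a warped frequency scale with finer resolution at low frequencies, and delta cepstra as temporal derivatives. The mel formula below is the common 2595 * log10(1 + f/700) variant, and the regression-style delta window is an assumption for illustration (edge frames wrap for brevity):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common mel-scale warping: near-linear below roughly 1 kHz,
    logarithmic above, giving finer resolution at low frequencies."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def deltas(cepstra, window=2):
    """Delta cepstra: least-squares slope of each coefficient over a
    +/- `window` frame neighbourhood, i.e. a temporal derivative.
    cepstra: (num_frames, num_coeffs). Edge frames wrap via np.roll,
    a simplification for brevity."""
    num = np.zeros_like(cepstra, dtype=float)
    for k in range(1, window + 1):
        plus = np.roll(cepstra, -k, axis=0)
        minus = np.roll(cepstra, k, axis=0)
        num += k * (plus - minus)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    return num / denom
```

On a linearly increasing cepstral track, the delta comes out as the slope, which is the behaviour the temporal-derivative description implies.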
0:16:09 | um |
---|
0:16:10 | hidden markov model this is a graphical form of it |
---|
0:16:13 | and the main thing to see here: this is a |
---|
0:16:16 | a statistical dependency graph |
---|
0:16:18 | uh and |
---|
0:16:20 | say X three is only dependent on the current state |
---|
0:16:24 | each of these |
---|
0:16:25 | time steps |
---|
0:16:26 | uh |
---|
0:16:27 | are represented here |
---|
0:16:29 | and if you know Q three |
---|
0:16:31 | uh then Q two, Q one, X one, X two tell you nothing about X three |
---|
0:16:35 | so that's a very very strong statistical conditional independence model |
---|
0:16:40 | and that's pretty much what people have used in these |
---|
0:16:43 | our now standard systems |
---|
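That conditional independence assumption is exactly what makes likelihood computation tractable; a toy forward-algorithm sketch in log space, over a hypothetical model with made-up probabilities (not any system from the talk):

```python
import numpy as np

def forward_loglik(log_pi, log_A, log_B):
    """Log-likelihood of an observation sequence under an HMM.
    log_pi: (S,) initial state log-probs; log_A: (S, S) transition
    log-probs with A[i, j] = P(j | i); log_B: (T, S) per-frame emission
    log-likelihoods. The recursion is valid only because each
    observation depends solely on the current state, the conditional
    independence assumption described above."""
    alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        # log-sum-exp over predecessor states, kept stable by shifting
        m = alpha.max()
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + log_B[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

With identical emission likelihoods in every state, the answer reduces to the product of the per-frame likelihoods, a quick sanity check on the recursion.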
0:16:45 | this is my only equation |
---|
0:16:47 | and uh those of you in speech will go "oh yeah", in fact probably |
---|
0:16:50 | most people say oh yeah |
---|
0:16:52 | this |
---|
0:16:53 | is basically bayes' rule |
---|
0:16:55 | the idea is that |
---|
0:16:56 | in statistical system |
---|
0:16:58 | you want to pick the model |
---|
0:16:59 | that is most probable given the data |
---|
0:17:02 | and bayes' rule says you can expand it in this way |
---|
0:17:05 | and then you can get rid of the P of X because there's no dependence on the model |
---|
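The "only equation" being described is presumably the standard decoding rule:

```latex
\hat{M} = \arg\max_{M} P(M \mid X)
        = \arg\max_{M} \frac{P(X \mid M)\,P(M)}{P(X)}
        = \arg\max_{M} P(X \mid M)\,P(M)
```

where X is the acoustics and M the word-sequence model; P(X) can be dropped from the maximisation because, as stated above, it has no dependence on the model.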
0:17:12 | um |
---|
0:17:12 | so |
---|
0:17:13 | you realise these |
---|
0:17:14 | uh likelihoods |
---|
0:17:16 | the probability of the acoustics given the model, with mixtures of gaussians typically |
---|
0:17:21 | you typically have each gaussian just represented by means and variances; there's no covariance represented between the features |
---|
0:17:29 | and there's the weights of each of the gaussians |
---|
0:17:31 | the language priors |
---|
0:17:32 | P of M |
---|
0:17:34 | are uh |
---|
0:17:35 | implemented with a n-gram |
---|
0:17:37 | you do a bunch of counting, you do some smoothing |
---|
0:17:40 | and it's basically a probability of a word given some word history, such as the recent |
---|
0:17:46 | n minus one words |
---|
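The two pieces just described can be sketched like this: diagonal-covariance Gaussian mixtures for the acoustic likelihoods (no covariance between features, as stated above), and a count-based bigram language model. The add-one smoothing here is a crude stand-in for the smoothing real systems use:

```python
import numpy as np
from collections import Counter

def log_gmm_likelihood(x, weights, means, variances):
    """log p(x | state) under a diagonal-covariance Gaussian mixture.
    x: (D,); weights: (K,); means, variances: (K, D). Each Gaussian is
    represented only by means and variances, with mixture weights."""
    diff = x - means                                    # (K, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + diff ** 2 / variances, axis=1))
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())

def bigram_logprob(sentence, corpus, vocab_size):
    """Sum of log P(word | previous word) over a sentence, estimated
    from raw counts with add-one smoothing (real systems use fancier
    smoothing). corpus and sentence are lists of words."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    total = 0.0
    for prev, w in zip(sentence, sentence[1:]):
        total += np.log((bigrams[(prev, w)] + 1)
                        / (unigrams[prev] + vocab_size))
    return total
```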
0:17:49 | now |
---|
0:17:50 | the math is lovely, but in practice we actually raise each of these things to some kind of power |
---|
0:17:55 | this is to compensate for the fact that the models are wrong |
---|
0:17:58 | and that uh |
---|
0:18:00 | there really are other dependencies |
---|
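In the log domain, that "raising to a power" is the usual language-model scale factor, often joined by a word insertion penalty; the default values below are made-up illustrations, not numbers from any system in the talk:

```python
def combined_score(acoustic_logprob, lm_logprob, num_words,
                   lm_scale=12.0, word_penalty=0.0):
    """Score actually optimised in practice: the LM log-probability is
    scaled (equivalently, the probability is raised to a power) to
    compensate for the acoustic model's overconfident independence
    assumptions; a per-word penalty trades insertions for deletions.
    The scale and penalty values here are illustrative only."""
    return acoustic_logprob + lm_scale * lm_logprob + word_penalty * num_words
```

These knobs are typically tuned on held-out data rather than derived from the probability model, which is exactly the point being made: the math is lovely, but the deployed score is not a true posterior.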
0:18:04 | um |
---|
0:18:04 | this is a picture of the acoustic likelihood |
---|
0:18:07 | uh estimator |
---|
0:18:09 | there's a few steps in here each of these boxes can actually be fairly complicated but |
---|
0:18:14 | just generally speaking |
---|
0:18:15 | there's some kind of short-term spectral estimation |
---|
0:18:19 | there's this vocal tract length normalisation i mentioned, which compresses or expands the spectrum |
---|
0:18:24 | then some kind of smoothing, either by |
---|
0:18:26 | uh throwing away some of the upper cepstral coefficients, or by autoregressive modeling as is done in P L P |
---|
0:18:33 | there's various kinds of linear transformations for instance for dimensionality reduction |
---|
0:18:38 | uh and for discrimination better discrimination |
---|
0:18:41 | then there's the statistical engine |
---|
0:18:43 | that i mentioned before with this funny scaling um |
---|
0:18:46 | in the log domain or raising to a power |
---|
0:18:49 | in order to mix with the |
---|
0:18:50 | uh language model |
---|
0:18:52 | okay well that seems simple enough but |
---|
0:18:54 | actual systems that get the very best scores are a bit more complicated than this |
---|
0:18:58 | uh there's well |
---|
0:18:59 | first off there's the decoder and the language priors coming in |
---|
0:19:03 | um |
---|
0:19:05 | well, you might have two of these front ends |
---|
0:19:08 | and |
---|
0:19:09 | people found that this is very helpful for getting the best performance |
---|
0:19:13 | but you don't just put them in in a very simple way |
---|
0:19:17 | it's very often the case that you have all sorts of stages |
---|
0:19:20 | with uh |
---|
0:19:22 | C W here is crossword |
---|
0:19:24 | or non-crossword models, and you produce graphs or lattices and you combine them at different points and you |
---|
0:19:30 | cross-adapt |
---|
0:19:32 | well |
---|
0:19:33 | this kind of reminds me of some work |
---|
0:19:36 | by uh |
---|
0:19:37 | a berkeley grad from about a century ago named rube goldberg |
---|
0:19:41 | and this is the self-operating napkin |
---|
0:19:44 | the self-operating napkin is activated when the soup spoon A is raised to mouth |
---|
0:19:50 | uh pulling string B and thereby jerking ladle C |
---|
0:19:54 | which throws cracker D past parrot E |
---|
0:19:57 | uh parrot jumps after cracker, and perch F tilts |
---|
0:20:01 | which uh upsets seeds G into pail H |
---|
0:20:06 | the extra weight in the pail pulls the cord I, which opens and |
---|
0:20:10 | uh which lights the cigarette lighter J |
---|
0:20:13 | and this uh |
---|
0:20:14 | in turn lights the rocket, which pulls the sickle, which cuts the string |
---|
0:20:19 | which |
---|
0:20:20 | causing the pendulum to swing back and forth |
---|
0:20:22 | thereby wiping the chin |
---|
0:20:25 | uh for me this |
---|
0:20:26 | sums up my view of current speech recognition systems |
---|
0:20:32 | it's successful at wiping the chin sometimes |
---|
0:20:35 | so i wanna talk a little bit about alternatives |
---|
0:20:37 | and i wanna say at the outset |
---|
0:20:40 | that these are just some of the alternatives |
---|
0:20:42 | a conference like this has uh a lot of work |
---|
0:20:45 | happily |
---|
0:20:46 | uh in in many different directions |
---|
0:20:48 | these are the ones i wanted to give as examples |
---|
0:20:52 | but first i wanna say |
---|
0:20:54 | a little bit |
---|
0:20:55 | about |
---|
0:20:57 | what else is there |
---|
0:20:58 | besides the mainstream |
---|
0:21:02 | the great sage |
---|
0:21:04 | was tracked down by a seeker |
---|
0:21:06 | and the seeker asked the sage |
---|
0:21:09 | what is the secret to happiness |
---|
0:21:12 | the sage answered |
---|
0:21:13 | good judgement |
---|
0:21:16 | well, the seeker said, that's |
---|
0:21:17 | that's all very well |
---|
0:21:18 | master but |
---|
0:21:20 | how does one obtain good judgement |
---|
0:21:23 | and the master said |
---|
0:21:24 | from experience |
---|
0:21:27 | so the seeker said okay, experience |
---|
0:21:30 | but |
---|
0:21:31 | how does one obtain this experience |
---|
0:21:34 | and the master said |
---|
0:21:35 | bad judgement |
---|
0:21:39 | so |
---|
0:21:40 | here are some of the exercises that we and many other people have done in bad judgement |
---|
0:21:44 | we've pursued |
---|
0:21:46 | different signal representations |
---|
0:21:48 | uh some of them are related to perception |
---|
0:21:50 | to auditory models, for instance |
---|
0:21:53 | mean rate and synchrony, that was Seneff's model from some time ago |
---|
0:21:57 | uh and the ensemble interval histogram |
---|
0:22:00 | from uh Oded Ghitza |
---|
0:22:02 | each of these |
---|
0:22:03 | were |
---|
0:22:04 | related to models of neural firing |
---|
0:22:08 | uh how |
---|
0:22:09 | how fast they fire, how much they synchronise with one another |
---|
0:22:12 | what uh timing there was between the firings |
---|
0:22:15 | and they had some interesting performance in noise, uh but they |
---|
0:22:19 | have not been adopted in any serious way |
---|
0:22:22 | but |
---|
0:22:23 | there's interesting technology there an interesting scientific models |
---|
0:22:27 | then there's stuff that's more on the psychological side; the previous ones were sort of based on models of |
---|
0:22:32 | physiology |
---|
0:22:33 | uh then there are models uh |
---|
0:22:36 | really from the psychological side, and multi-band systems based on critical bands going all the way back to |
---|
0:22:42 | fletcher's work and the work of others |
---|
0:22:44 | uh and |
---|
0:22:46 | uh the idea here is that if you have a system that's just looking at part of the spectrum |
---|
0:22:50 | if the disturbance is in that part of the spectrum |
---|
0:22:53 | uh then you can deal with that separately |
---|
0:22:56 | those have had some successes |
---|
0:22:58 | and then something that uh |
---|
0:23:00 | you can observe both at the physiological and psychological level |
---|
0:23:04 | is the importance of different um modulations |
---|
0:23:08 | particularly temporal but also spectral modulations in the signal |
---|
0:23:13 | uh then on the production side there's been a bunch of work by people on |
---|
0:23:17 | uh given the fact that there is an underlying articulatory mechanism |
---|
0:23:22 | uh maybe you can represent things that way and it would be more succinct and |
---|
0:23:26 | a better |
---|
0:23:27 | representation of the signal |
---|
0:23:29 | to represent this over time there have been |
---|
0:23:31 | hidden dynamic |
---|
0:23:32 | uh models that attempt to do this, and |
---|
0:23:35 | trajectory models sometimes the trajectory models had nothing to do with the physiological models but |
---|
0:23:40 | uh sometimes they did |
---|
0:23:43 | and articulatory features, which you could think of as a quantized version of the articulator positions and so forth |
---|
0:23:51 | then another direction was artificial neural networks, which have been around for a very long time |
---|
0:23:57 | um |
---|
0:23:58 | actually before nineteen sixty one but |
---|
0:24:00 | i picked out this one discriminant analysis iterative design |
---|
0:24:04 | i picked that out 'cause a lot of people don't know about it; a lot of people think that |
---|
0:24:07 | multilayer networks began in the eighties |
---|
0:24:10 | but actually back in sixty-one they had a multilayer network that worked very well for some problems and was actually |
---|
0:24:15 | used industrially |
---|
0:24:16 | for decades after that |
---|
0:24:19 | um in which the first uh layer of units was a bunch of gaussians, and after that you had |
---|
0:24:24 | a linear perceptron |
---|
0:24:27 | a couple years later uh there was work at stanford |
---|
0:24:30 | in which they actually did apply some of this kind of stuff to speech; these were actually linear adaptive units |
---|
0:24:35 | actually called adalines |
---|
0:24:37 | uh bernie widrow sent me uh |
---|
0:24:39 | a technical report |
---|
0:24:40 | here's the cover, a real technical report from nineteen sixty-three |
---|
0:24:46 | here's a page from it that shows a |
---|
0:24:48 | uh a block diagram, which i blew up here for |
---|
0:24:51 | visibility |
---|
0:24:52 | and it starts off with some band filters, basically you're getting some power measures in each band |
---|
0:24:57 | and then here these adalines, which uh give you some sets of outputs |
---|
0:25:02 | which went to a typewriter |
---|
0:25:06 | um |
---|
0:25:07 | the nineteen eighties saw an explosion of interest in the neural network |
---|
0:25:11 | uh |
---|
0:25:11 | area |
---|
0:25:13 | uh part of this |
---|
0:25:14 | was sparked by |
---|
0:25:16 | a rediscovery, let's say, of error back-propagation |
---|
0:25:20 | just basically propagating the effect of errors from the output of the system |
---|
0:25:24 | back to the individual weights |
---|
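Back-propagation in its simplest form; a tiny one-hidden-layer network with made-up sizes, where the error at the output is propagated back to the individual weights as just described:

```python
import numpy as np

def init(rng, n_in=2, n_hidden=4, n_out=1):
    """Random weights for one hidden layer; sizes are illustrative."""
    return {"W1": rng.normal(0, 0.5, (n_in, n_hidden)),
            "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.5, (n_hidden, n_out)),
            "b2": np.zeros(n_out)}

def forward(p, X):
    h = np.tanh(X @ p["W1"] + p["b1"])                    # hidden activations
    y = 1.0 / (1.0 + np.exp(-(h @ p["W2"] + p["b2"])))    # sigmoid output
    return h, y

def loss(p, X, t):
    """Mean squared error between outputs and targets."""
    _, y = forward(p, X)
    return np.mean((y - t) ** 2)

def backprop(p, X, t):
    """Gradients of the loss, propagated from the output of the
    system back to each individual weight."""
    h, y = forward(p, X)
    n = X.shape[0]
    dy = (y - t) * y * (1 - y) * (2.0 / n)    # dLoss/d(pre-sigmoid)
    grads = {"W2": h.T @ dy, "b2": dy.sum(axis=0)}
    dh = (dy @ p["W2"].T) * (1 - h ** 2)      # back through tanh
    grads["W1"] = X.T @ dh
    grads["b1"] = dh.sum(axis=0)
    return grads
```

A finite-difference check against `loss` is a standard way to confirm the propagated gradients are correct.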
0:25:27 | uh in the late eighties uh number of us worked on hybrid hmm artificial neural network systems |
---|
0:25:34 | where the neural networks were used as probability estimators, to get the emission uh probabilities for the hmm |
---|
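In the hybrid approach, the network's state posteriors are typically turned into scaled likelihoods by dividing out the class priors (Bayes' rule again); a one-line sketch:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert network outputs P(state | x) into scaled emission
    likelihoods P(x | state) / P(x) = P(state | x) / P(state).
    The unknown P(x) is constant across states for a given frame,
    so dropping it does not change the decoder's argmax."""
    return log_posteriors - log_priors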
0:25:41 | um |
---|
0:25:41 | in the last decade or so uh quite a few people have taken off on the tandem idea |
---|
0:25:46 | which is a particular way of using artificial neural networks |
---|
0:25:50 | as feature extractors |
---|
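The tandem idea, roughly: take the network's per-frame posteriors, make them more Gaussian-shaped (log), decorrelate and reduce dimension, and feed the result to a conventional GMM-HMM as features. PCA here stands in for the KLT-style transform typically used; the details are illustrative, not a specific published recipe:

```python
import numpy as np

def tandem_features(posteriors, n_components):
    """posteriors: (T, num_phones) per-frame network outputs.
    Taking logs makes the distributions more Gaussian-like; projecting
    onto the leading eigenvectors of the covariance decorrelates the
    dimensions and reduces them to n_components."""
    logp = np.log(np.maximum(posteriors, 1e-10))
    centered = logp - logp.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues ascending
    top = eigvecs[:, ::-1][:, :n_components]   # leading components first
    return centered @ top
```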
0:25:52 | and i will just mention uh briefly |
---|
0:25:55 | uh a fairly recent development, deep networks |
---|
0:25:59 | and |
---|
0:26:00 | how uh |
---|
0:26:02 | how innovative it is, is a question |
---|
0:26:04 | but there's definitely some new things going on there which i think are interesting |
---|
0:26:09 | uh |
---|
0:26:10 | the obvious difference between this and the previous networks tends to be more layers, that is, deep |
---|
0:26:15 | there's also sometimes an unsupervised pre-training |
---|
0:26:20 | uh |
---|
0:26:21 | there's actually several papers at this conference there's also a special issue |
---|
0:26:24 | uh in uh november of the transactions |
---|
0:26:28 | um here's a couple of papers at this conference, i think there are a few others, as well as one from |
---|
0:26:32 | [inaudible] |
---|
0:26:34 | they had a lot of different numbers in the paper, but uh i picked one out |
---|
0:26:38 | and just |
---|
0:26:40 | indeed, most of the numbers had the same general trend |
---|
0:26:43 | mfcc |
---|
0:26:44 | bad |
---|
0:26:45 | deep mlp good |
---|
0:26:47 | uh and the old mlp somewhere in between |
---|
0:26:50 | these are error rates, so again uh low is good |
---|
0:26:54 | and uh there was a large vocabulary um |
---|
0:26:58 | voice search |
---|
0:26:59 | uh paper which uh |
---|
0:27:01 | is at the poster today |
---|
0:27:03 | uh it had a sixteen percent reduction, their metric was sentence error |
---|
0:27:08 | and they had a nice improvement compared to |
---|
0:27:10 | a system that used uh MPE, which is a very common discriminative training |
---|
0:27:15 | approach |
---|
0:27:20 | okay |
---|
0:27:20 | so those are some of the alternatives; again, i'm sure |
---|
0:27:24 | many people in this audience could think of many others |
---|
0:27:29 | where could we go from here |
---|
0:27:30 | or |
---|
0:27:31 | in my opinion, where should we go from here |
---|
0:27:35 | well |
---|
0:27:36 | better features and models |
---|
0:27:39 | um |
---|
0:27:40 | i've suggested |
---|
0:27:41 | better models of hearing in production |
---|
0:27:44 | uh could perhaps lead to better features |
---|
0:27:48 | uh better models of these features |
---|
0:27:50 | better acoustic models |
---|
0:27:53 | models of understanding better language models dialogue models pragmatics and so on |
---|
0:27:58 | all these are likely to be import |
---|
0:28:01 | the other thing which i'm gonna go into a bit especially at the end is understanding the errors |
---|
0:28:06 | understanding what the assumptions are |
---|
0:28:08 | that are going into our models |
---|
0:28:10 | and how to get past them
---|
0:28:15 | so we start with models of hearing |
---|
0:28:17 | so there are
---|
0:28:19 | useful approximations to the action of the periphery — that is,
---|
0:28:23 | from the ear
---|
0:28:25 | to the auditory nerve
---|
0:28:27 | and when i say useful approximations, i mean that there are a number of people who've worked
---|
0:28:33 | on
---|
0:28:34 | simplifying the models that were used earlier
---|
0:28:38 | and |
---|
0:28:39 | crafting them more towards
---|
0:28:41 | good engineering
---|
0:28:43 | tools
---|
0:28:44 | some of those are looking kind of promising
---|
0:28:47 | there's new information about the auditory cortex, which i'm gonna briefly refer to in the
---|
0:28:51 | next few slides
---|
0:28:53 | including some results with noise |
---|
0:28:56 | um |
---|
0:28:57 | it's good to learn from biological examples because, you know, humans are pretty good in many situations
---|
0:29:03 | at recognizing speech
---|
0:29:05 | but |
---|
0:29:06 | it's
---|
0:29:06 | probably good also not to be purist
---|
0:29:08 | and to mix
---|
0:29:09 | insights that you get from these things with good engineering approaches
---|
0:29:13 | and i think there are some
---|
0:29:15 | good possibilities there
---|
0:29:17 | this bottom bullet is just to note that
---|
0:29:20 | as with many things in this talk, i'm only talking about some of the field
---|
0:29:24 | and i'm mostly talking about single channel
---|
0:29:26 | but people have two ears, and they make pretty good use of them when they can
---|
0:29:31 | uh and that's |
---|
0:29:33 | something to keep in mind |
---|
0:29:34 | and of course you can go to many ears in some situations with microphone arrays, and that's a good thing
---|
0:29:39 | to
---|
0:29:39 | think about
---|
0:29:40 | that's not a topic i'm expanding on in this talk
---|
0:29:44 | and the same thing with visual information — visual information is used by people whenever they can
---|
0:29:49 | and i'm not gonna talk about that, but it's obviously important
---|
0:29:53 | okay, now i'm gonna talk about this cortical stuff
---|
0:29:58 | this slide is courtesy of shihab shamma — it's not just the slide but also the idea
---|
0:30:03 | and the idea comes from experiments that he and his guys
---|
0:30:09 | and gals
---|
0:30:10 | have
---|
0:30:11 | done with small mammals
---|
0:30:14 | that have
---|
0:30:16 | a pretty similar
---|
0:30:17 | early part of the cortex
---|
0:30:19 | — the primary auditory cortex —
---|
0:30:21 | to what people have
---|
0:30:23 | there's also been some other work with people
---|
0:30:25 | and
---|
0:30:27 | you can imagine
---|
0:30:28 | this as being the kind of spectrogram that's received at this primary auditory cortex
---|
0:30:35 | what they've observed is that there's a bunch of what are called spectro-temporal receptive fields, strfs
---|
0:30:41 | which are little filters
---|
0:30:42 | that process it in time and frequency
---|
0:30:46 | and you could think of them as processing temporal modulations, which are called rate, and spectral modulations, which are called scale
---|
0:30:53 | and you imagine there being a cube |
---|
0:30:55 | at each time point |
---|
0:30:57 | with auditory frequency |
---|
0:30:59 | and uh |
---|
0:31:00 | rate and scale |
---|
0:31:02 | and much as you would like to be able, in a regular spectrogram,
---|
0:31:07 | to
---|
0:31:08 | de-emphasize the areas where the signal-to-noise was poor
---|
0:31:11 | and emphasize areas where the signal-to-noise was good
---|
0:31:14 | you have perhaps an even greater chance
---|
0:31:16 | to do this kind of emphasis
---|
0:31:19 | if you're expanded out to this cube
---|
0:31:22 | that's the general idea
---|
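The rate/scale filters described above can be sketched as 2-D Gabor functions over a spectrogram patch. This is a minimal toy sketch: the patch size, step sizes, and envelope widths below are illustrative assumptions, not the actual filters used in the work being cited.

```python
import math

def gabor_strf(rate_hz, scale_cyc_per_oct, n_frames=40, n_bands=23,
               frame_step_s=0.010, band_step_oct=0.25):
    """Toy spectro-temporal receptive field: a 2-D Gabor over a
    (time, log-frequency) spectrogram patch.  rate_hz is the temporal
    modulation frequency ("rate"); scale_cyc_per_oct the spectral one ("scale")."""
    ts = [(i - n_frames // 2) * frame_step_s for i in range(n_frames)]
    fs = [(j - n_bands // 2) * band_step_oct for j in range(n_bands)]
    sig_t = max(abs(t) for t in ts) / 2.0   # envelope widths (arbitrary choice)
    sig_f = max(abs(f) for f in fs) / 2.0
    g = [[math.cos(2 * math.pi * (rate_hz * t + scale_cyc_per_oct * f))
          * math.exp(-0.5 * ((t / sig_t) ** 2 + (f / sig_f) ** 2))
          for f in fs] for t in ts]
    mean = sum(map(sum, g)) / (n_frames * n_bands)
    return [[v - mean for v in row] for row in g]   # zero-mean, band-pass

def strf_response(strf, patch):
    # Response of one filter to a spectrogram patch: a 2-D correlation.
    return sum(w * x for wr, xr in zip(strf, patch) for w, x in zip(wr, xr))
```

A bank of these at many (rate, scale) pairs, slid along time, fills out the cube of auditory frequency by rate by scale.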
0:31:23 | so you could end up with a lot of these different spectro-temporal receptive fields
---|
0:31:27 | you could implement them, and you could try to do something good with them — pick out a good set
---|
0:31:32 | an implementation that we and a number of people have been trying
---|
0:31:38 | is
---|
0:31:41 | what we would call a many-stream implementation
---|
0:31:43 | as opposed to multi-stream, which was what i showed before, where we'd have two or three streams — this just
---|
0:31:49 | refers to the quantity
---|
0:31:50 | but what's in each stream is
---|
0:31:53 | one of these representations — one of these spectro-temporal receptive fields implemented by a gabor filter
---|
0:31:57 | and by a multilayer perceptron
---|
0:31:59 | that's discriminatively trained to discriminate between different speech sounds
---|
0:32:04 | you get a whole lot of these — in some of the implementations we had three hundred
---|
0:32:07 | and then you have to figure out how to combine them or select them
---|
0:32:11 | hopefully again to de-emphasize the ones that are bad indicators of what was said
---|
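One common way to do that combination while de-emphasizing unreliable streams is inverse-entropy weighting. This sketch assumes each stream's mlp outputs per-frame phone posteriors; the specific weighting rule is just one illustrative choice, not necessarily the one used in the systems described here.

```python
import math

def inverse_entropy_combine(stream_posteriors):
    """Merge per-frame posteriors from many streams, down-weighting streams
    whose posteriors are high-entropy (i.e., uncertain).  Each element of
    stream_posteriors is a list over frames of probability lists."""
    eps = 1e-12
    n_frames = len(stream_posteriors[0])
    merged = []
    for i in range(n_frames):
        total_w = 0.0
        acc = [0.0] * len(stream_posteriors[0][i])
        for stream in stream_posteriors:
            p = stream[i]
            entropy = -sum(q * math.log(q + eps) for q in p)
            w = 1.0 / (entropy + eps)        # confident stream => large weight
            total_w += w
            acc = [a + w * q for a, q in zip(acc, p)]
        merged.append([a / total_w for a in acc])
    return merged

# Two toy streams over 3 frames: one confident, one nearly uniform.
confident = [[0.9, 0.05, 0.05]] * 3
uniform = [[1/3, 1/3, 1/3]] * 3
merged = inverse_entropy_combine([confident, uniform])
```

The merged posteriors lean toward the confident stream, which is the desired de-emphasis of streams that are bad indicators of what was said.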
0:32:19 | so |
---|
0:32:20 | another interesting sidelight of this kind of approach
---|
0:32:23 | is that it's a good fit to modern high-speed computing
---|
0:32:27 | as i think a lot of you know
---|
0:32:29 | the clock rates are no longer going up the way they used to on cpus
---|
0:32:33 | and so the way that manufacturers are trying to give us more performance is by having many more cores
---|
0:32:38 | the graphics processors are an extreme example of this
---|
0:32:41 | this kind of structure is a really good match to that
---|
0:32:44 | because it's what they call embarrassingly parallel
---|
0:32:48 | we found that this kind of approach does remove a significant number of errors, particularly in noise
---|
0:32:54 | but also, as it turns out, in the clean condition
---|
0:32:59 | it combines well with pure engineering — not auditory —
---|
0:33:02 | kinds of methods
---|
0:33:03 | such as wiener-filter-based methods
---|
0:33:06 | and we'd like to think that it could combine well with other auditory models, although we haven't really done that
---|
0:33:11 | work yet
---|
0:33:15 | statistical
---|
0:33:16 | acoustic models
---|
0:33:19 | we currently use these critical assumptions
---|
0:33:22 | and one of the things about using very different kinds of features is that this can really change their statistical properties
---|
0:33:27 | from the ones we have now
---|
0:33:29 | and so these assumptions
---|
0:33:31 | could be violated in yet a different way
---|
0:33:35 | there have been alternate models proposed that allow you to bypass these typical assumptions
---|
0:33:41 | but part of the problem is to figure out
---|
0:33:43 | which statistical dependencies to put in
---|
0:33:48 | models of language and understanding
---|
0:33:51 | i think it's probably pretty clear to those of you who know me that this isn't my research area
---|
0:33:55 | but it's of obvious importance
---|
0:33:57 | and |
---|
0:33:58 | one of the things that
---|
0:34:01 | has been frustrating to a lot of people — in fact i remember fred jelinek being visibly frustrated about this —
---|
0:34:07 | is that
---|
0:34:08 | it's very very tough to get much improvement
---|
0:34:10 | over simple n-grams, that is, the probability of a word given some number of previous words
---|
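An n-gram in that sense is tiny to write down — here is a maximum-likelihood bigram sketch. The corpus and sentence-boundary tokens are made up for illustration; real systems add heavy smoothing and backoff.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    # Maximum-likelihood bigram model: P(w | prev) = c(prev, w) / c(prev).
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            counts[prev][w] += 1
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

model = train_bigram(["the game is on", "the score is five to three"])
# model["the"] splits its probability mass between "game" and "score"
```
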
0:34:16 | but |
---|
0:34:16 | it can be very important
---|
0:34:18 | to
---|
0:34:19 | get further information
---|
0:34:21 | and we know this for sure for people
---|
0:34:25 | let me tell you a little story
---|
0:34:27 | uh one day |
---|
0:34:29 | i was walking out of icsi
---|
0:34:31 | and i had on one of these caps — this is a cap for the oakland athletics, the local
---|
0:34:37 | major league baseball club
---|
0:34:39 | i also had on a jacket
---|
0:34:41 | that had the same insignia on it
---|
0:34:44 | and i had a radio
---|
0:34:45 | held to my head as i was walking down the street
---|
0:34:49 | and a guy across the street |
---|
0:34:50 | — a moderately noisy street —
---|
0:34:51 | yelled
---|
0:34:52 | "score?"
---|
0:34:55 | and i said
---|
0:34:56 | "oakland, five to three"
---|
0:35:00 | anyway |
---|
0:35:01 | we'd like to be able to do that with a machine |
---|
0:35:06 | so where we go from here |
---|
0:35:09 | well |
---|
0:35:10 | researchers will continue to get good ideas
---|
0:35:15 | every time you get in the shower, maybe you have a good idea coming out
---|
0:35:20 | but |
---|
0:35:20 | what's the best methodology |
---|
0:35:22 | what's the best way to proceed along this path |
---|
0:35:25 | so maybe we can learn from some other disciplines |
---|
0:35:29 | and let me give
---|
0:35:30 | a kind of stretched analogy to
---|
0:35:33 | the search for a cure for cancer
---|
0:35:35 | and again i'm gonna tell you a little story
---|
0:35:38 | it's a personal one — it's about an uncle of mine named sidney farber
---|
0:35:43 | um |
---|
0:35:44 | now
---|
0:35:44 | my uncle sid, in the forties,
---|
0:35:46 | was
---|
0:35:48 | a pathologist
---|
0:35:49 | at harvard medical school
---|
0:35:52 | and at children's hospital boston
---|
0:35:55 | and
---|
0:35:56 | he
---|
0:35:57 | unfortunately got to see lots of little children
---|
0:36:00 | with leukemia
---|
0:36:01 | once they were diagnosed, they only had a few weeks
---|
0:36:05 | as a pathologist he mostly dealt with petri dishes and so forth — he wasn't really a clinician
---|
0:36:11 | but he got this thought
---|
0:36:13 | that maybe if you could come up with chemicals
---|
0:36:16 | that were more poisonous to the cancer cells than they were to the normal cells
---|
0:36:20 | maybe he could extend the lives of these kids
---|
0:36:23 | and he experimented with this in the petri dishes, of course, for the most part, for a while
---|
0:36:27 | and then he came up with something that he thought would work
---|
0:36:31 | and he tried it out
---|
0:36:32 | with everybody's permission
---|
0:36:34 | on some of these kids
---|
0:36:35 | and lo and behold
---|
0:36:36 | it actually did extend their lives for a while
---|
0:36:39 | this was
---|
0:36:40 | the first
---|
0:36:41 | known
---|
0:36:41 | case of chemotherapy
---|
0:36:45 | this was just
---|
0:36:46 | great, and it started a whole revolution — he ended up starting a big center, national cancer institute stuff
---|
0:36:52 | it's now the dana-farber i made reference to
---|
0:36:55 | and |
---|
0:36:57 | um |
---|
0:36:59 | the key point i wanna make about it |
---|
0:37:01 | is that |
---|
0:37:02 | there's this quandary |
---|
0:37:03 | between curing patients |
---|
0:37:05 | you have these patients coming through
---|
0:37:07 | who are in terrible straits |
---|
0:37:10 | but on the other hand |
---|
0:37:12 | you don't have any time |
---|
0:37:14 | to figure out what's really going on |
---|
0:37:17 | and there were
---|
0:37:18 | important early successes based on hunches that my uncle and many others had
---|
0:37:24 | and there wasn't time to learn the real causes of things
---|
0:37:27 | and by the way, there are stories like this
---|
0:37:29 | for surgical interventions and for radiation as well
---|
0:37:34 | uh |
---|
0:37:35 | so there were some successes
---|
0:37:37 | but they still
---|
0:37:38 | didn't find a general cure, and as you know, to this day there still is no general cure
---|
0:37:42 | for cancer
---|
0:37:43 | but things are a lot better — remissions are longer and so forth
---|
0:37:47 | and now there's
---|
0:37:48 | starting to be some understanding of the biological mechanisms, and one hopes that this will lead to
---|
0:37:54 | a solution
---|
0:37:56 | so there's this wonderful book i strongly recommend, the emperor of all maladies
---|
0:38:01 | about the history of cancer
---|
0:38:05 | and i'll just read this |
---|
0:38:06 | and seeing the physician busies himself in thinking of remedies
---|
0:38:09 | in such cases as we have considered of the cause
---|
0:38:12 | the cure must be imperfect, lame, and to no purpose
---|
0:38:15 | wherein the cause hath not first been searched
---|
0:38:18 | this again doesn't belie the fact that it can be very useful |
---|
0:38:22 | to go ahead and try to fix something along the way
---|
0:38:26 | but in the long term you need to understand what's going on |
---|
0:38:30 | so as opposed to just |
---|
0:38:32 | trying our bright ideas which we all do |
---|
0:38:35 | how about finding out what's wrong |
---|
0:38:39 | the statistical approach |
---|
0:38:40 | to speech recognition requires |
---|
0:38:42 | assumptions that i made reference to
---|
0:38:44 | they're known literally to be false
---|
0:38:47 | this may or may not be a problem |
---|
0:38:49 | maybe it's just handled by, say, raising these
---|
0:38:52 | likelihoods to a power
---|
0:38:55 | how can we learn |
---|
0:38:57 | so there's some work that's been started that i wanted to call your attention to
---|
0:39:01 | from steve wegmann and larry gillick
---|
0:39:03 | starting a couple years ago
---|
0:39:05 | where what they did was to consider each assumption separately
---|
0:39:09 | and then, rather than trying to fix the models,
---|
0:39:12 | modified the data
---|
0:39:13 | by some resampling,
---|
0:39:15 | some
---|
0:39:16 | bootstrapping kinds of approaches
---|
0:39:18 | to match the models
---|
0:39:20 | observe the improvement
---|
0:39:22 | and use that to inspire more bright ideas
---|
0:39:26 | at this point
---|
0:39:27 | they've really just focused on the diagnosis part and not on the
---|
0:39:30 | new bright ideas, frankly
---|
0:39:32 | so this is being pursued also at icsi in the ouch project, which is outing unfortunate characteristics of hmms
---|
0:39:39 | and
---|
0:39:40 | i'm gonna give you just a couple results from a more recent version — i should add, by the way, that
---|
0:39:45 | it's a different gillick — this is larry's son dan
---|
0:39:48 | who just finished his phd with us
---|
0:39:51 | um |
---|
0:39:52 | but |
---|
0:39:53 | first, this is a
---|
0:39:55 | very simplified system, so the error rate for wall street journal is pretty high here
---|
0:40:01 | and
---|
0:40:03 | the output
---|
0:40:04 | demonstrably does not really fit the gmm distribution that you got from the training set
---|
0:40:10 | and it definitely doesn't satisfy the independence assumptions, and you get this thirteen percent error rate
---|
0:40:16 | uh |
---|
0:40:17 | now if you use simulated data, really just generated from the models
---|
0:40:21 | you should do pretty well, and in fact you do — basically
---|
0:40:23 | virtually all of the errors go away
---|
0:40:27 | but here's the interesting one i think |
---|
0:40:29 | if you
---|
0:40:30 | use resampled data — so this is the actual speech data
---|
0:40:34 | but you're just resampling it in such a way
---|
0:40:37 | as to assure the statistical conditional independence
---|
0:40:41 | it also gets rid of nearly all of the errors
---|
0:40:45 | now their studies are a lot more detailed than this — there are a lot of
---|
0:40:48 | things that they're looking at
---|
0:40:50 | a lot of things they're trying out
---|
0:40:51 | but i think this gives the flavour |
---|
0:40:53 | of what they're doing |
---|
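The resampling idea can be sketched in a few lines: swap each frame of an utterance for a random training frame aligned to the same hmm state, which preserves each state's marginal statistics but destroys frame-to-frame dependence. The scalar "frames" and state names here are toy stand-ins, not the actual wegmann–gillick setup.

```python
import random

def resample_frames(utterance_states, frame_pool, rng=None):
    """Replace each frame of an utterance with one drawn at random from the
    pool of all training frames aligned to the same HMM state.  The per-state
    marginal distribution is preserved, but successive frames become
    conditionally independent given the state sequence -- exactly the
    assumption the HMM makes."""
    rng = rng or random.Random(0)
    return [rng.choice(frame_pool[s]) for s in utterance_states]

# Toy pool: scalar "frames" collected from training alignments of two states.
pool = {"s1": [0.10, 0.20, 0.15], "s2": [0.90, 1.00, 0.95]}
resampled = resample_frames(["s1", "s1", "s2"], pool)
```

Running recognition on such resampled data, as the study describes, isolates how much of the error comes from the independence assumption alone.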
0:40:58 | so |
---|
0:40:59 | in summary |
---|
0:41:02 | speech recognition is mature
---|
0:41:05 | in some sense it has an advanced degree
---|
0:41:07 | that's because it's been around a long time and there are commercial systems and so forth
---|
0:41:12 | and yet we still find it to be brittle
---|
0:41:15 | and we essentially have to start over again with each new task
---|
0:41:20 | the recent improvements
---|
0:41:21 | have been really quite incremental, and a lot of things have sort of levelled off
---|
0:41:26 | we need to rethink |
---|
0:41:28 | kind of like going back to school |
---|
0:41:30 | kind of like continuing education |
---|
0:41:33 | we may need more basic models
---|
0:41:35 | we may need more
---|
0:41:37 | basic features
---|
0:41:39 | we may need more study of errors
---|
0:41:43 | and |
---|
0:41:44 | the other thing i wanna briefly mention is that
---|
0:41:48 | we do live in an era where there is a huge amount of computation available
---|
0:41:52 | and even though the clock rates don't continue to go up as they have
---|
0:41:56 | due to many-core systems
---|
0:41:59 | and
---|
0:42:00 | cloud computing and so forth
---|
0:42:01 | there is gonna continue to be
---|
0:42:03 | an increased availability of lots of computation
---|
0:42:06 | and this |
---|
0:42:07 | should make it possible for us to consider |
---|
0:42:10 | huge numbers of models |
---|
0:42:12 | uh and methods |
---|
0:42:13 | that we wouldn't consider before |
---|
0:42:15 | for instance, on the front-end side
---|
0:42:17 | these auditory-based or cortical-based things can really blow up the computation
---|
0:42:22 | from the simple kind of stuff that you have with mfccs or plp
---|
0:42:28 | uh |
---|
0:42:28 | so |
---|
0:42:30 | it's good to do that — it's good to try things
---|
0:42:34 | that might take a lot of computation, even if they might not work in your iphone just now
---|
0:42:43 | you also have to know — and i'm sure you all do — that just having more computation is not a panacea
---|
0:42:48 | it doesn't actually solve things
---|
0:42:49 | but it can potentially
---|
0:42:51 | give you a lot more possibilities
---|
0:42:54 | that's pretty much what i want to say |
---|
0:42:56 | uh |
---|
0:42:57 | i do wanna acknowledge that the stuff i've talked about is not particularly from me — it's from many people, including
---|
0:43:03 | people outside our lab
---|
0:43:05 | but i do want to thank
---|
0:43:07 | the many current and former students and postdocs, visitors, and icsi staff
---|
0:43:12 | and particularly give a shout-out to
---|
0:43:14 | hynek hermansky, hervé bourlard, shihab shamma, steve wegmann, and jordan cohen
---|
0:43:18 | here's my shameless plug for a book
---|
0:43:21 | which was already mentioned
---|
0:43:23 | that is gonna be out this fall, thanks to tons of work from dan ellis
---|
0:43:27 | and other contributors i should say
---|
0:43:29 | like
---|
0:43:37 | simon king, for instance
---|
0:43:41 | thank you for your attention
---|
0:43:50 | (inaudible — the session chair opens the floor for questions, noting limited time)
---|
0:44:25 | (largely inaudible audience question about which part of the speech recognition problem is most worth pursuing)
---|
0:44:55 | um |
---|
0:44:57 | i i think that the right answer is |
---|
0:44:59 | i don't know |
---|
0:45:02 | because |
---|
0:45:03 | for instance |
---|
0:45:04 | what i used to say when people talked to me about this is that
---|
0:45:07 | okay, i think of
---|
0:45:08 | speech recognition as being in three pieces: there's
---|
0:45:12 | the representations that you have |
---|
0:45:14 | there are the statistical models and the search and so forth in the middle
---|
0:45:19 | and then there's |
---|
0:45:20 | all of the things that you could imagine doing with speech understanding and pragmatics
---|
0:45:24 | and so forth
---|
0:45:26 | and i used to think that, okay, the first one i know a little bit about
---|
0:45:30 | and i feel very strongly — and there are many results to back this up — that that's
---|
0:45:35 | very important for improving
---|
0:45:37 | the last one is not my area of expertise, but from what i've seen in others' work, and certainly in the human
---|
0:45:42 | case
---|
0:45:44 | i believe that's very important
---|
0:45:45 | so i sort of thought the middle part
---|
0:45:47 | you know, was working well
---|
0:45:50 | but then i saw
---|
0:45:51 | this study
---|
0:45:53 | and i'm not so sure
---|
0:45:54 | now i actually think that you should
---|
0:45:56 | pursue whatever it is that you
---|
0:45:59 | feel
---|
0:46:00 | — whatever you feel is of greatest interest
---|
0:46:01 | i actually think the key thing
---|
0:46:03 | is to have interesting friends
---|
0:46:07 | (inaudible — the next questioner gets a microphone)
---|
0:46:21 | is mine louder?
---|
0:46:24 | all of these techniques you described are, in essence, spectral analysis
---|
0:46:32 | approaches — pretty much everything, just about
---|
0:46:41 | spectral techniques like mfcc and plp are good at
---|
0:46:45 | representing some aspects of speech — the coarse things — and
---|
0:46:49 | you get most of it
---|
0:46:51 | but the big problems, it seems to me, are still interference
---|
0:46:55 | — interference from other sources
---|
0:46:57 | reverberation
---|
0:47:00 | spatial
---|
0:47:01 | hearing and so forth, where they're
---|
0:47:04 | not of much help
---|
0:47:06 | distinguishing multiple
---|
0:47:07 | sources
---|
0:47:08 | direction
---|
0:47:10 | the other dimension —
---|
0:47:13 | fine
---|
0:47:14 | temporal
---|
0:47:15 | information — is something that has been explored a lot
---|
0:47:17 | in the
---|
0:47:18 | psychological and
---|
0:47:20 | physiological literature
---|
0:47:23 | a few steps
---|
0:47:25 | like the ensemble interval histogram
---|
0:47:28 | (partially inaudible)
---|
0:47:39 | separating sources
---|
0:47:41 | at the same time
---|
0:47:42 | you didn't say much about that
---|
0:47:45 | that's a direction, of course
---|
0:47:48 | what do you think about that
---|
0:47:49 | direction, and can we get
---|
0:47:51 | people working in it
---|
0:47:52 | to pay more attention
---|
0:47:55 | to things beyond the spectrum
---|
0:47:58 | well, by spectral i guess you mean short-term spectral, right
---|
0:48:02 | and i may not have said this as clearly as i could, but i think the shamma
---|
0:48:07 | stuff that i was making reference to
---|
0:48:10 | certainly can be long-term — their spectro-temporal representation
---|
0:48:15 | what you feed
---|
0:48:17 | the different
---|
0:48:18 | cortical
---|
0:48:20 | filters
---|
0:48:21 | can be a very different kind of spectrogram, one that takes advantage of that sort of stuff, and i think
---|
0:48:26 | that's absolutely what we should do
---|
0:48:28 | and as for these disturbances — the multiple sources, the reverberation, et cetera
---|
0:48:33 | i agree that's
---|
0:48:34 | the biggest challenge that i see
---|
0:48:36 | if someone talks about the performance of humans versus
---|
0:48:40 | speech recognition systems — the current generation of systems — that's the easiest explanation of the difference
---|
0:48:46 | so
---|
0:48:48 | i completely agree
---|
0:48:49 | sorry — i'm
---|
0:48:50 | i'm not being a politician, i actually do agree
---|
0:48:55 | (largely inaudible audience question about the hmm resampling results, suggesting that acoustic modeling and the front end deserve more attention)
---|
0:49:35 | but you are certainly reinforcing my biases
---|
0:49:42 | um |
---|
0:49:43 | i'm mostly a front-end person these days — have been for a while — and i agree that there's a lot
---|
0:49:49 | to be done there
---|
0:49:50 | i didn't mean to say at all that the language modeling and so forth was
---|
0:49:54 | the bulk of it
---|
0:49:55 | even that study at the end was just saying, for a fairly simple case with essentially matched training and test
---|
0:50:01 | that
---|
0:50:03 | you could
---|
0:50:04 | jimmy with the data in such a way
---|
0:50:07 | as to match the model's assumptions, and you could do much better
---|
0:50:10 | but one of the things that we're gonna be trying to do in follow-ups to that study is looking
---|
0:50:15 | at mismatched conditions
---|
0:50:18 | cases with noise and reverberation and so forth
---|
0:50:21 | in which case i don't think the effect will be quite as big
---|
0:50:24 | and
---|
0:50:25 | you know, it's garbage in, garbage out — if basically you feed in representations
---|
0:50:30 | that are not
---|
0:50:31 | giving you the information you need, how are you gonna get it out the other end? so
---|
0:50:36 | i agree with you, but i was trying to be fair, not only to the people doing that work
---|
0:50:40 | but also because
---|
0:50:42 | i feel that if you cover the space
---|
0:50:45 | of all these different cases
---|
0:50:47 | there are many cases where these other areas are in fact very poor
---|
0:50:51 | and human beings — as with my baseball example — human beings do make use of higher-level information
---|
0:50:56 | often
---|
0:50:57 | in order to figure out what was said and what was important about it
---|
0:51:01 | which leads me to george's question |
---|
0:51:03 | as you were talking
---|
0:51:05 | i was
---|
0:51:05 | constantly struck by the
---|
0:51:07 | analogy
---|
0:51:09 | between speech recognition and — almost
---|
0:51:12 | you know
---|
0:51:12 | irresistibly —
---|
0:51:15 | things in optical character recognition
---|
0:51:18 | and so
---|
0:51:20 | almost every slide had irresistible analogies, from the current successes to future directions to problems that
---|
0:51:28 | are being experienced
---|
0:51:29 | and i'm just wondering, is there
---|
0:51:31 | cross-disciplinary knowledge that can be leveraged — is it being leveraged
---|
0:51:37 | to speech recognition? — except in the sense that some of these alternative
---|
0:51:42 | approaches
---|
0:51:43 | have tried looking at the spectrogram as an image
---|
0:51:47 | and so forth, some of the neural network techniques that were developed in optical character recognition
---|
0:51:54 | sort of came back the other way, but a lot of it's gone
---|
0:51:57 | the other way
---|
0:51:58 | but
---|
0:51:59 | you know, we tend to be a fairly fragmented community, and we don't listen to each other quite as much
---|
0:52:04 | as we should
---|
0:52:06 | (inaudible — the next questioner is called on)
---|
0:52:32 | i have some exposure — probably most people here have some exposure — to modern
---|
0:52:37 | speech recognition technology
---|
0:52:39 | in real applications
---|
0:52:43 | i've been exposed to google voice
---|
0:52:45 | perhaps many people have
---|
0:52:48 | and this is not a
---|
0:52:49 | plug for
---|
0:52:50 | google voice, but
---|
0:52:52 | i think
---|
0:52:52 | modern deployed speech recognition technology seems amazingly
---|
0:52:58 | good
---|
0:53:01 | considering that
---|
0:53:02 | the systems
---|
0:53:04 | have no
---|
0:53:06 | really great
---|
0:53:10 | semantic context
---|
0:53:13 | it's interesting to see what the systems can do on acoustics alone
---|
0:53:18 | it amazes me
---|
0:53:21 | and so
---|
0:53:23 | where i see the challenge — and i don't know how to do it —
---|
0:53:28 | where i see the challenge is
---|
0:53:31 | in
---|
0:53:34 | creating models of the semantic context that give the kind of support
---|
0:53:39 | to speech recognition that
---|
0:53:41 | we've seen from the
---|
0:53:44 | usual
---|
0:53:45 | language models
---|
0:53:46 | which
---|
0:53:47 | don't
---|
0:53:48 | model
---|
0:53:49 | that
---|
0:53:52 | okay, well
---|
0:53:53 | if that was a question
---|
0:53:56 | i know it wasn't
---|
0:53:57 | but i'll say something anyway, which is that
---|
0:54:00 | i am really taking the middle position
---|
0:54:03 | there are plenty of tasks
---|
0:54:04 | where in fact
---|
0:54:06 | recognition does fail, particularly in noise and reverberation and so on
---|
0:54:10 | google voice search is very impressive
---|
0:54:12 | but
---|
0:54:13 | you know, there are a lot of
---|
0:54:13 | cases where things do fail
---|
0:54:17 | and we can see significant improvements
---|
0:54:20 | in a number of tasks
---|
0:54:21 | by changing the front end, so i think there is something important there
---|
0:54:25 | but in your statement you weren't really attacking the front end — what you're saying is we have to pay
---|
0:54:29 | attention to the back end, and i completely agree
---|
0:54:34 | one more — then it's probably time
---|
0:54:36 | i want to change the subject a little bit. given that, can you say something about the
---|
0:54:41 | roles in this of, you know, academia and industry research
---|
0:54:43 | you've got a big view of both sides
---|
0:54:47 | what is good, what is bad in this for speech
---|
0:54:49 | right now, and
---|
0:54:52 | where should we go
---|
0:54:53 | my industry experience is actually pretty small
---|
0:54:58 | well, i think industry should fund academia
---|
0:55:06 | (inaudible)
---|
0:55:14 | thanks for the excellent talk
---|