0:00:13 | we'd like to introduce professor lin-shan lee |
---|
0:00:16 | he has been with national taiwan university since nineteen eighty two |
---|
0:00:22 | his early work focused on the broader area of spoken language systems, particularly focused on the chinese language, with a number of breakthroughs early on for that language |
---|
0:00:34 | his more recent work has been focused on the fundamentals of speech recognition and on network environment issues, like information retrieval and semantic analysis of spoken content |
---|
0:00:49 | he is an ieee fellow and an isca fellow |
---|
0:00:53 | he has served on numerous boards and received a number of awards, including recently the meritorious service award from the ieee signal processing society |
---|
0:01:05 | so please join me in welcoming professor lin-shan |
---|
0:01:24 | hello |
---|
0:01:24 | so you can hear me, right? good |
---|
0:01:28 | and thank you, larry |
---|
0:01:30 | it's my great pleasure today to be here presenting to you spoken content retrieval: lattices and beyond |
---|
0:01:37 | my name is lin-shan lee, from national taiwan university |
---|
0:01:43 | in this talk i'll first introduce the basics of the problem and some fundamentals |
---|
0:01:50 | and then i'll spend more time presenting some recent research examples, and show a demo before the conclusion |
---|
0:01:59 | so first let me introduce the problem |
---|
0:02:02 | we are all very familiar with text content retrieval, which is easy in the sense that |
---|
0:02:09 | for any user queries or user instructions, the system can respond very well, and the desired information can be obtained |
---|
0:02:19 | very efficiently, in real time; the retrieved documents are what the users like |
---|
0:02:25 | and this has even produced various successful industries |
---|
0:02:30 | now today we all know that all roles of text can be accomplished by voice |
---|
0:02:36 | on the content side, we do have spoken content of all sorts, all over the place |
---|
0:02:46 | on the query side, the voice query can be entered via handheld devices |
---|
0:02:52 | so it's time for us to consider spoken content retrieval |
---|
0:02:58 | now this is what we have today: everyone is familiar with it |
---|
0:03:02 | when we enter a text query, we get text content back |
---|
0:03:07 | now both the queries and the content can be in voice form in the future |
---|
0:03:13 | first |
---|
0:03:15 | we may use text queries to retrieve spoken content, or multimedia content including audio |
---|
0:03:24 | for this case, very often this is referred to as spoken content retrieval, or spoken document retrieval |
---|
0:03:33 | very often a more specific subset of that problem is referred to as spoken term detection |
---|
0:03:44 | in other words, to detect whether the query terms exist in the spoken content |
---|
0:03:50 | of course we can also retrieve text content using voice queries |
---|
0:03:57 | that case is usually referred to as voice search |
---|
0:04:04 | however, in this case, because the document to be retrieved is in text form, it would |
---|
0:04:11 | be out of the scope of this talk |
---|
0:04:13 | so i'm not going to spend more time talking about voice search |
---|
0:04:19 | of course we can also do the other side, that is, retrieve spoken content using spoken queries |
---|
0:04:27 | and sometimes this is referred to as query by example |
---|
0:04:32 | so this talk will focus on retrieval of spoken content primarily using text queries |
---|
0:04:40 | but sometimes we can also consider the case of spoken queries |
---|
0:04:46 | now as we all understand, if the spoken content and queries could be accurately recognized |
---|
0:04:52 | then this problem would be reduced to the well known text content retrieval problem, and there would be no problem |
---|
0:04:58 | of course, that never happens, because we know there are always recognition errors in most cases |
---|
0:05:04 | so that's a major problem |
---|
0:05:08 | now today we understand that there are many handheld devices with multimedia functionalities available commercially today |
---|
0:05:17 | and also there are unlimited quantities of multimedia content being spread over the internet |
---|
0:05:24 | so we should be able to retrieve not only the text content but also the multimedia and spoken content |
---|
0:05:32 | in other words, the wireless and multimedia technologies today are creating an environment for spoken content retrieval |
---|
0:05:42 | so let me repeat again that network access is primarily text based today, but almost all roles of text can be accomplished by voice |
---|
0:05:54 | so next let me mention very briefly some fundamentals |
---|
0:06:00 | first, as we understand, recognition always gives errors, for various reasons: for example spontaneous speech, oov words, or mismatched models and so on |
---|
0:06:12 | and that makes the problem difficult |
---|
0:06:15 | so a good approach may be to consider lattices with multiple alternatives rather than the one best output only |
---|
0:06:24 | in this case, of course, we have a higher probability of including the correct words |
---|
0:06:31 | but also in that case we have to include more noisy words, and that causes problems |
---|
0:06:36 | but on the other hand, even with lattices like this, we still have the problem that some correct words may not be included, because they are oov words and so on |
---|
0:06:47 | on the other hand, when we use lattices, that implies huge memory and computation requirements |
---|
0:06:53 | that's another major problem |
---|
0:06:57 | of course there exist other approaches to solve similar problems |
---|
0:07:02 | for example, people use confusion matrices to model recognition errors, and try to expand the query and document using confusion matrices |
---|
0:07:13 | people also use pronunciation modeling to try to expand the query in that way |
---|
0:07:19 | people also use, say, fuzzy matching; in other words, the matching between the query and the content does not have to be exact |
---|
0:07:29 | these are all very good approaches, however i won't have time to say more about them, since our focus is on lattices in this talk |
---|
0:07:39 | now the first question is, how can we index the lattices |
---|
0:07:44 | well, suppose we have a lattice like this |
---|
0:07:47 | and usually the most popular approach to index a lattice is to transform the lattice into a sausage-like structure like this |
---|
0:07:57 | in other words, a series of segments, and every segment includes a number of word hypotheses, each with a posterior probability |
---|
0:08:08 | in this way the position information for the words is readily available; in other words, word one is in the first segment and word eight is in the second segment, so word one can be followed by word eight, and that is a bigram, and so on |
---|
0:08:25 | in this way this is more compatible with existing text indexing techniques |
---|
0:08:31 | also in this way the required memory and computation can be reduced slightly |
---|
0:08:38 | in addition, we may notice that in this way we can add more possible paths |
---|
0:08:42 | for example, word three cannot be followed by word eight in the original lattice, but here this becomes possible |
---|
0:08:50 | also the noisy words can be discriminated by posterior probabilities, because we do have scores |
---|
0:08:58 | in either case, we notice that we can match the n-grams of this lattice with the query; for example, if we have a bigram of word three followed by another word, we can check whether this bigram exists in the query |
---|
0:09:14 | so we can combine all the possible matches and accumulate the scores and so on |
---|
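To make the segment matching just described concrete, here is a minimal sketch, not from the talk, which assumes each utterance's sausage is simply a list of segments mapping word hypotheses to posterior probabilities; unigram and bigram hits accumulate posterior mass toward a relevance score.

```python
# A minimal sketch (not from the talk) of matching a text query against a
# "sausage"-like index: a list of segments, each mapping word hypotheses to
# posterior probabilities. Unigram and bigram hits accumulate posterior mass.

def sausage_match_score(sausage, query_words, bigram_weight=0.5):
    """Return a relevance score for `query_words` against one utterance."""
    score = 0.0
    # unigram matches: sum the posterior of every query word in every segment
    for segment in sausage:
        for w in query_words:
            score += segment.get(w, 0.0)
    # bigram matches: consecutive query words in consecutive segments
    for i in range(len(sausage) - 1):
        for w1, w2 in zip(query_words, query_words[1:]):
            p = sausage[i].get(w1, 0.0) * sausage[i + 1].get(w2, 0.0)
            score += bigram_weight * p
    return score

# toy example: two segments with word posteriors
utterance = [{"word1": 0.6, "word3": 0.3}, {"word8": 0.7, "word5": 0.2}]
print(sausage_match_score(utterance, ["word1", "word8"]))
```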
0:09:22 | now there are many approaches proposed for this kind of indexing of the lattices; i just list a few examples here |
---|
0:09:30 | and i think that today the most popular ones may be the top two here: the position specific posterior lattices, or pspl, and the confusion networks, or cn |
---|
0:09:43 | also another very popular one would be the weighted finite state transducer, wfst |
---|
0:09:49 | now let me take one minute to explain the first two, the position specific posterior lattices, pspl, and the confusion networks |
---|
0:10:00 | suppose this is a lattice, and these are all the possible paths here, with their word sequences |
---|
0:10:09 | the pspl, or position specific posterior lattice, tries to locate every word in a segment based on the position of that word in a path |
---|
0:10:23 | for example, word ten here appears only as the fourth word in the paths, so it appears in the fourth column |
---|
0:10:32 | on the other hand, the cn, the confusion networks, try to cluster words together in a cluster based on, for example, time spans and word pronunciation |
---|
0:10:45 | so for example, word two, word five, and word ten may have very similar time spans and pronunciations, so they may be clustered together, and they may appear in the second cluster here |
---|
0:10:59 | so in this case you may note that different approaches give different indexing |
---|
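A hedged sketch of the PSPL idea described above, assuming the lattice has been expanded into (word sequence, path posterior) pairs; each word's posterior mass is accumulated in the bin given by its position in the path. The data layout is illustrative only.

```python
# Hedged sketch of building a position specific posterior lattice (PSPL)
# from an enumerated set of lattice paths. Each path is a (word_sequence,
# posterior) pair; a word's posterior mass is accumulated in the bin given
# by its position in the path.

from collections import defaultdict

def build_pspl(paths):
    """paths: iterable of (list_of_words, path_posterior)."""
    pspl = defaultdict(lambda: defaultdict(float))  # position -> word -> mass
    for words, posterior in paths:
        for pos, w in enumerate(words):
            pspl[pos][w] += posterior
    return pspl

paths = [
    (["w1", "w2", "w10", "w7"], 0.5),
    (["w1", "w5", "w10", "w7"], 0.3),
    (["w3", "w5", "w10", "w9"], 0.2),
]
pspl = build_pspl(paths)
# w10 appears only as the third word, so all its mass lands in position 2
print(dict(pspl[2]))
```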
0:11:05 | now a major problem here is oov words |
---|
0:11:08 | as you understand, oov words cannot be recognized and therefore never appear in the lattice |
---|
0:11:15 | that's important because very often the query includes oov words |
---|
0:11:23 | however, there are many approaches to handle this problem, and i think the most fundamental approach is to use subword units |
---|
0:11:33 | i mean, let's take this example |
---|
0:11:36 | suppose an oov keyword W is composed of these four subword units; every small w is a subword unit, for example a phoneme or a syllable or something |
---|
0:11:48 | these are also subword units in the lattice, and these are arcs |
---|
0:11:53 | and the word here, because this W is not in the vocabulary, is not recognized here |
---|
0:12:01 | however, if we look carefully, we notice that the word is actually here: it is hidden at the subword level |
---|
0:12:10 | so it can actually be matched at the subword level without being recognized in the lattice |
---|
0:12:16 | and that's the major approach: different ways can be developed to try to handle this using subword units |
---|
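As an illustration of subword-level matching for OOV queries, here is a small sketch under the assumption that we have a hypothetical subword lexicon for the query word and a recognized subword sequence for the utterance; the names and data are made up.

```python
# Hedged sketch of OOV handling at the subword level: an OOV query word is
# expanded into subword units (here a hypothetical phoneme dictionary) and
# matched as a subword n-gram against the subword sequence of the utterance.

def to_subwords(word, subword_lexicon):
    """Map a (possibly OOV) word to its subword units, e.g. phonemes."""
    return subword_lexicon[word]

def subword_match(query_word, utterance_subwords, subword_lexicon):
    """Return True if the query's subword sequence occurs in the utterance."""
    target = to_subwords(query_word, subword_lexicon)
    n = len(target)
    return any(utterance_subwords[i:i + n] == target
               for i in range(len(utterance_subwords) - n + 1))

# toy data: the word "W" is OOV for the recognizer, but its subword string
# still shows up in the recognized subword sequence of the utterance
lexicon = {"W": ["w1", "w2", "w3", "w4"]}
utterance = ["x", "w1", "w2", "w3", "w4", "y"]
print(subword_match("W", utterance, lexicon))  # True
```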
0:12:24 | one example is to construct the same pspl or cn based on subword units, for example |
---|
0:12:35 | now there are many different subword units that have been used in this approach, and usually we can categorize them into two classes |
---|
0:12:44 | the first one is linguistically motivated units, for example phonemes, syllables, characters, or morphemes and so on |
---|
0:12:52 | the other one is data-driven units; in other words, they are derived using some data-driven algorithms, and different algorithms may produce different units |
---|
0:13:03 | for example, some are called particles, some are called word fragments, and so on |
---|
0:13:11 | of course there are some other different approaches |
---|
0:13:14 | if we do have the query in voice form available, in that case we can also match the query in speech form with the content in speech form directly, without doing recognition |
---|
0:13:27 | in that case we can avoid the recognition error problem, and we can even do it in an unsupervised way |
---|
0:13:36 | in that case we don't even need a lattice |
---|
0:13:38 | and this can be performed by, say, frame based matching, for example like dtw, or segment based approaches, which match the segments, or model based approaches and so on |
---|
0:13:53 | but all these kinds of approaches do not use recognition and therefore do not have lattices |
---|
0:13:59 | so i won't spend more time talking about these approaches; i'll just focus on those with lattices |
---|
0:14:07 | okay, so with all these fundamentals, below i'll just describe to you some recent research examples |
---|
0:14:13 | i have to apologize, i can only cover a small number of examples; so many examples i just cannot cover at all |
---|
0:14:22 | below i'll assume the retrieval system looks something like this |
---|
0:14:27 | this is the spoken archive; after recognition based on some acoustic models we have the lattices here |
---|
0:14:33 | now the retrieval is applied on top of these lattices |
---|
0:14:38 | here by the search engine i mean indexing the lattices and searching over the index |
---|
0:14:44 | and by the retrieval model i mean anything in addition, for example the confusion matrices i mentioned, or the weighting, and whatever, all based on this |
---|
0:14:57 | let me use this graph to discuss the following |
---|
0:15:01 | the first thing we can think about doing is integration and weighting |
---|
0:15:06 | for example we can integrate different scores from recognition, from different recognition systems, from those based on different subword units, including some other information, and so on |
---|
0:15:20 | in addition, a good idea may be to try to train those model parameters, if we have some training data available |
---|
0:15:29 | what kind of training data is needed here? well, the kind of training data we need here is a set of queries and the associated relevant and irrelevant segments |
---|
0:15:39 | for example, when the user entered query Q one we get a list here, and then the first two are false, or irrelevant, and the next two are true, or relevant, and so on |
---|
0:15:50 | we need a set of this kind of data |
---|
0:15:53 | such data does not necessarily have to be annotated by people, because we can collect it from real click-through data |
---|
0:16:03 | for example, if the user entered a query Q one and got this list, and then skipped the first two items and directly clicked the next two |
---|
0:16:14 | we may assume that the first two are irrelevant, or false, and the next two are relevant |
---|
0:16:21 | in this way we can have click-through data |
---|
0:16:24 | when we have this data, then we can do something more; for example, we can use this training data to train the parameters here |
---|
0:16:34 | for example, we train different weighting parameters to weight different recognition outputs, different subword units, and different information, including word confidence or phone confusion matrices, and so on |
---|
0:16:50 | here let me show you very briefly two examples |
---|
0:16:56 | the first one is this |
---|
0:16:57 | in this example we actually use the two different indexing approaches we just mentioned, confusion networks and position specific posterior lattices |
---|
0:17:02 | in each case we use not only the word-based indexing but also those based on subword units, and in each case we can use unigram, bigram, and trigram matching |
---|
0:17:23 | so we have a total of eighteen different scores, and we try to add them together by some weighting |
---|
0:17:29 | to optimize some parameter describing the retrieval performance, which is called map |
---|
0:17:40 | the map i mention in this talk is the mean average precision, which is the area integrated under this recall-precision curve |
---|
0:17:50 | it is a performance measure frequently used for information retrieval; of course there are many other measures, but i just have time to use one of them here |
---|
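Since MAP is used throughout the rest of the talk, here is a small sketch of how mean average precision can be computed from ranked relevance judgments; it assumes every relevant item appears somewhere in the ranked list.

```python
# A small sketch of mean average precision (MAP), the retrieval measure used
# throughout the talk: for each query, average the precision at every rank
# where a relevant item appears, then average over queries.

def average_precision(ranked_relevance):
    """ranked_relevance: list of 0/1 flags in ranked order for one query."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(all_queries):
    return sum(average_precision(r) for r in all_queries) / len(all_queries)

# e.g. two queries; 1 marks a relevant utterance at that rank
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1]]))
```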
0:18:00 | now we can try to optimize this parameter using some extended version of the support vector machine |
---|
0:18:11 | here are the results |
---|
0:18:13 | here are the map results for the different scores used individually, and this is the result when we integrate them together |
---|
0:18:21 | you see we get a net gain of about seven to eight percent in map, which is not bad |
---|
0:18:29 | here is another example |
---|
0:18:31 | we think it is possible to have context dependent term weighting; in other words, the same term may have different weights depending on the context |
---|
0:18:42 | for example, if the query is "information theory", this word "information" is very important |
---|
0:18:49 | but if the query is "speech information retrieval", then this word "information" is not so important, because the important terms are "speech" and "retrieval" |
---|
0:19:00 | in this way different terms may have different weights in different contexts, and these weights can be trained |
---|
0:19:07 | and here are the results: using context-dependent weighting we actually get some gain on the map |
---|
0:19:15 | okay, these are about integration and weighting |
---|
0:19:18 | now what can we do next |
---|
0:19:20 | well, the first thing we think about is, how about the acoustic models? what can we do |
---|
0:19:26 | just as we have so many experts in acoustic modeling, and we can do discriminative training on the acoustic models, can we use this training data to re-estimate the acoustic models |
---|
0:19:37 | well, in the past, retrieval was considered on top of the recognition output; they are two cascaded independent stages |
---|
0:19:49 | and so the retrieval performance really relies on the recognition accuracy |
---|
0:19:54 | so why don't we consider these two stages together as a whole |
---|
0:19:59 | then the acoustic models can be re-estimated by optimizing the retrieval performance here |
---|
0:20:05 | in this way the acoustic models may be better matched to the respective dataset |
---|
0:20:11 | so in this way we learned from mpe and tried to define the objective function in this work |
---|
0:20:19 | and here are the results |
---|
0:20:22 | these are the results for different sets of acoustic models: these are for speaker independent models |
---|
0:20:30 | and these are adapted by global mllr, and these are adapted further by |
---|
0:20:37 | mllr plus map, that is, adaptation by maximum a posteriori probability |
---|
0:20:47 | and these numbers are map, the retrieval measure, not recognition accuracy |
---|
0:20:52 | as you notice, we do have some improvements, but relatively limited |
---|
0:20:57 | probably because we were not able to define a good enough objective function |
---|
0:21:04 | another possible reason may be that different queries really have quite different characteristics |
---|
0:21:12 | so when we put together many queries, these different queries really interfere with each other in the training data |
---|
0:21:20 | so we are thinking, why not use query specific acoustic models |
---|
0:21:26 | in other words, we re-estimate the acoustic models for every query |
---|
0:21:31 | then that means this has to be done in real time, online |
---|
0:21:36 | is it possible? we think yes |
---|
0:21:39 | because we can rely on the first several utterances that the user clicks through when browsing the retrieval results |
---|
0:21:47 | then all the utterances not yet browsed can be reranked by the new acoustic models |
---|
0:21:53 | that means the models actually can be updated and the lattices can be rescored very quickly |
---|
0:21:59 | why? because we have only a very limited amount of training data, so the re-estimation can be very fast |
---|
0:22:06 | so this is the scenario |
---|
0:22:10 | when the system gives the retrieval results here, and the user clicks, browses, and rates the first several |
---|
0:22:17 | indicating whether they are relevant or irrelevant |
---|
0:22:21 | then these results are actually fed back to re-estimate the acoustic models |
---|
0:22:27 | we get new models, and these are used to rescore the lattices |
---|
0:22:30 | and that is used to rerank the rest of the utterances |
---|
0:22:36 | so what are the results |
---|
0:22:38 | we can see, just with one iteration of model re-estimation, which makes real time adaptation possible |
---|
0:22:47 | we do have some improvements here |
---|
0:22:52 | now what else can we do |
---|
0:22:55 | well, how about acoustic features |
---|
0:22:58 | well yes, we can do something with acoustic features |
---|
0:23:03 | for example, if an utterance is known to be relevant or irrelevant |
---|
0:23:08 | then all the utterances similar to this one are more probable to be relevant or irrelevant as well |
---|
0:23:17 | so in this case, we have the same scenario: when the user sees the output and clicks the first several utterances |
---|
0:23:29 | we can use the first several utterances as references |
---|
0:23:35 | and those not yet browsed are compared with those clicked, based on the acoustic similarity, and then reranked |
---|
0:23:44 | in this way, let's see whether it is better or not |
---|
0:23:49 | and then we need to first define the similarity in terms of the acoustic features |
---|
0:23:54 | we first define, for every utterance, the hypothesized region: the segment of the feature vector sequence corresponding to the arc in the lattice matching the query Q with the highest score |
---|
0:24:11 | for example, for this utterance we see its feature vector sequence, and this is the corresponding arc for the query, and this is its hypothesized region |
---|
0:24:23 | and similarly there is another utterance, with its feature sequence here and its hypothesized region here |
---|
0:24:29 | then the similarity can be derived based on the dtw distance between these two regions |
---|
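A minimal sketch of the DTW distance mentioned here, computed between two hypothesized regions represented as plain lists of feature vectors; the Euclidean local distance and the toy data are assumptions.

```python
# Hedged sketch of the acoustic similarity used here: a plain DTW distance
# between the two hypothesized regions (feature-vector sequences), with
# Euclidean local distance.

import math

def dtw_distance(region_a, region_b):
    """Dynamic time warping distance between two sequences of vectors."""
    la, lb = len(region_a), len(region_b)
    inf = float("inf")
    cost = [[inf] * (lb + 1) for _ in range(la + 1)]
    cost[0][0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            d = math.dist(region_a[i - 1], region_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[la][lb]

a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b = [[0.1, 0.25], [0.45, 0.55]]
print(dtw_distance(a, b))
```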
0:24:36 | and in this way we can perform the scenario we just mentioned |
---|
0:24:41 | and here are the results, again for the three sets of acoustic models |
---|
0:24:46 | and we may notice that in this way, using acoustic similarities |
---|
0:24:50 | we get slightly better improvements |
---|
0:24:52 | as compared to directly re-estimating the acoustic models |
---|
0:24:58 | okay, so what else can we do |
---|
0:25:00 | well, we may consider a different approach |
---|
0:25:03 | in the above we always assume we need to rely on the users to give us some feedback information |
---|
0:25:12 | do we really need to rely on the users |
---|
0:25:14 | no, because we can always derive relevance information automatically |
---|
0:25:19 | we can assume the top N utterances in the first-pass retrieval results are relevant |
---|
0:25:25 | or actually pseudo-relevant |
---|
0:25:28 | and this is referred to as pseudo relevance |
---|
0:25:31 | and here you see the scenario |
---|
0:25:34 | when the user enters a query, the system gives the first pass retrieval results |
---|
0:25:40 | and these results are not shown to the user; instead we just assume that the top N utterances are relevant |
---|
0:25:48 | and all the rest are compared with these top N, to see whether they are similar or not |
---|
0:25:57 | and based on the similarity we rerank the results, and only the reranked results are shown |
---|
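To illustrate this pseudo relevance feedback scenario, here is a hedged sketch: the top N first-pass results serve as pseudo-relevant references, and every utterance's score is boosted by its maximum acoustic similarity to those references. The similarity function, the weight alpha, and the toy ids are assumptions, not details from the talk.

```python
# Hedged sketch of pseudo relevance feedback reranking: take the top-N
# first-pass results as pseudo-relevant references, then boost every
# utterance by its maximum similarity to those references. `similarity`
# could be, e.g., a negated/normalized DTW distance.

def prf_rerank(first_pass, similarity, top_n=3, alpha=0.5):
    """first_pass: list of (utterance_id, score), already sorted by score."""
    references = [uid for uid, _ in first_pass[:top_n]]
    reranked = []
    for uid, score in first_pass:
        sim = max(similarity(uid, ref) for ref in references)
        reranked.append((uid, score + alpha * sim))
    reranked.sort(key=lambda x: x[1], reverse=True)
    return reranked

# toy similarity on made-up ids: utterances sharing a prefix are "similar"
sim = lambda a, b: 1.0 if a.split("-")[0] == b.split("-")[0] else 0.0
results = [("dog-1", 0.9), ("cat-1", 0.8), ("dog-2", 0.5), ("cat-2", 0.4)]
print(prf_rerank(results, sim, top_n=2))
```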
0:26:11 | okay, here are the results |
---|
0:26:13 | you can see that with this pseudo relevance feedback, for the different acoustic models |
---|
0:26:19 | we really have slightly better improvements here |
---|
0:26:25 | now what else can we do |
---|
0:26:27 | well, we can further improve the above pseudo relevance feedback approach |
---|
0:26:32 | for example, we can use a graph based approach |
---|
0:26:36 | remember, above, in the pseudo relevance feedback approach, we assumed the top N utterances were taken as the references; we assumed they are relevant |
---|
0:26:50 | but in this way, of course, they are not reliable |
---|
0:26:53 | so why don't we simply consider all of the first pass retrieval results globally, using a graph |
---|
0:27:01 | in other words, we can construct a graph for all utterances in the first pass retrieval results |
---|
0:27:09 | and all the utterances are taken as nodes on the graph |
---|
0:27:13 | and then the edge weights are actually the acoustic similarities between utterances |
---|
0:27:20 | now we may assume that an utterance strongly connected to, or very similar to, utterances with high scores should also have a high score |
---|
0:27:32 | for example, if here X two and X three have high scores, then X one should too |
---|
0:27:38 | similarly, if X two and X three all have low scores, then X one should have a low score |
---|
0:27:44 | in that case the scores can propagate on the graph, and then spread among strongly connected nodes |
---|
0:27:52 | in this way all the scores can be corrected |
---|
0:27:56 | so we can then rerank all the utterances in the first pass retrieval results using these corrected scores |
---|
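A hedged sketch of the score propagation idea on the similarity graph: each node's score is repeatedly interpolated with the similarity-weighted average score of its neighbors, a random-walk style update similar in spirit to PageRank. The interpolation weight and iteration count are illustrative choices, not values from the talk.

```python
# Hedged sketch of graph-based reranking: scores propagate over a similarity
# graph so that an utterance strongly connected to high-scoring utterances
# gets its own score pulled up.

def propagate_scores(scores, sim, alpha=0.8, iterations=20):
    """scores: dict node -> first-pass score; sim: dict (i, j) -> edge weight."""
    nodes = list(scores)
    current = dict(scores)
    for _ in range(iterations):
        updated = {}
        for i in nodes:
            total = sum(sim.get((i, j), 0.0) for j in nodes if j != i)
            neighbor = sum(sim.get((i, j), 0.0) * current[j]
                           for j in nodes if j != i)
            neighbor = neighbor / total if total > 0 else 0.0
            # keep part of the original score, absorb part from the neighbors
            updated[i] = alpha * scores[i] + (1 - alpha) * neighbor
        current = updated
    return current

scores = {"x1": 0.2, "x2": 0.9, "x3": 0.8}
sim = {("x1", "x2"): 1.0, ("x2", "x1"): 1.0,
       ("x1", "x3"): 1.0, ("x3", "x1"): 1.0}
print(propagate_scores(scores, sim))  # x1 is pulled up by x2 and x3
```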
0:28:07 | and here are the results, again for the three sets of acoustic models |
---|
0:28:12 | and you may notice that the graph based approach now provides higher map results overall |
---|
0:28:21 | this is reasonable because the graph based approach really considers globally all of the first pass retrieval results, rather than relying on the top N utterances as references |
---|
0:28:35 | okay, what else can we do |
---|
0:28:36 | well, machine learning, of course; machine learning has been used and shown useful in some work |
---|
0:28:43 | so let me show one example of the use of support vector machines, in the scenario of the pseudo relevance feedback we just mentioned |
---|
0:28:52 | and here is the scenario again: the user enters the query Q here |
---|
0:28:59 | and these are the first pass retrieval results |
---|
0:29:03 | they are not shown to the user; instead we simply take the first pass retrieval results and consider that |
---|
0:29:11 | the top N utterances are assumed to be relevant and taken as positive examples |
---|
0:29:18 | and the bottom N are assumed to be irrelevant and taken as negative examples |
---|
0:29:22 | and then we simply extract some feature vectors from them |
---|
0:29:26 | and we try to train a support vector machine |
---|
0:29:32 | now for the rest of the utterances, we simply extract the feature parameters |
---|
0:29:41 | and rerank them with this support vector machine, and then only the reranked results are shown to the user |
---|
0:29:49 | so in this case, please note that we need to train an svm for every query, online |
---|
0:29:56 | is it possible? yes, because we only have a limited amount of training data, so it can be very fast |
---|
0:30:04 | now |
---|
0:30:07 | the first thing we need to do is to define how to extract the feature parameters to be used in training the svm |
---|
0:30:15 | well, again we can use the hypothesized region we just mentioned |
---|
0:30:20 | suppose this is an utterance and this is the corresponding lattice, and here is the query, and so this is its hypothesized region |
---|
0:30:28 | we can divide this region into the hmm states |
---|
0:30:33 | and the feature vectors in one state can be averaged into one vector, and then these vectors for different states |
---|
0:30:39 | can be concatenated into a supervector, and that's the feature vector for this utterance |
---|
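To make the per-query SVM step concrete, here is a hedged sketch using scikit-learn: the top-N first-pass results are taken as positive examples, the bottom-N as negatives, a small linear SVM is fit on the supervectors, and everything is reranked by the SVM margin. Feature extraction into supervectors is assumed to have been done already; all names and data are illustrative.

```python
# Hedged sketch of the per-query SVM step: pseudo-label top-N as positive and
# bottom-N as negative, fit an SVM on the state-averaged supervectors, and
# rerank everything by the SVM margin.

import numpy as np
from sklearn.svm import SVC

def svm_prf_rerank(ids, supervectors, first_pass_scores, n=2):
    """ids, supervectors, first_pass_scores are aligned; higher score = better."""
    order = np.argsort(first_pass_scores)[::-1]          # best first
    pos, neg = order[:n], order[-n:]                      # pseudo labels
    X = np.vstack([supervectors[i] for i in np.concatenate([pos, neg])])
    y = np.array([1] * len(pos) + [0] * len(neg))
    clf = SVC(kernel="linear").fit(X, y)
    margins = clf.decision_function(np.vstack(supervectors))
    return [ids[i] for i in np.argsort(margins)[::-1]]   # reranked list

# toy data: 6 utterances with 4-dimensional supervectors
rng = np.random.default_rng(0)
vecs = list(rng.normal(size=(6, 4)))
print(svm_prf_rerank([f"u{i}" for i in range(6)], vecs,
                     [0.9, 0.8, 0.6, 0.5, 0.3, 0.1], n=2))
```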
0:30:45 | okay |
---|
0:30:46 | in that way, what are the results |
---|
0:30:48 | again we can see the results for the different acoustic models |
---|
0:30:56 | you may notice that now the svm is much better than the reference |
---|
0:31:01 | which is much better than the previous results |
---|
0:31:08 | okay, and of course i have to mention that all results reported here are very preliminary; they are just obtained in preliminary experiments |
---|
0:31:18 | now what else can we do |
---|
0:31:21 | well, all the above discussions are primarily considering acoustic models and acoustic features |
---|
0:31:27 | how about linguistic information? yes |
---|
0:31:30 | for example, the most straightforward linguistic information we can use is context dependency, or context consistency |
---|
0:31:38 | in other words, the same term usually has very similar contexts |
---|
0:31:43 | while quite different contexts usually imply the terms are quite different |
---|
0:31:48 | so what can we do? we can do exactly the same as we did using the svm |
---|
0:31:54 | except now the feature vectors represent context information |
---|
0:31:59 | so we use exactly the same scenario: for the first pass retrieval results we use the top and bottom N to train the svm |
---|
0:32:07 | except now we use different feature vectors here |
---|
0:32:12 | suppose this is an utterance and this is the corresponding lattice, and here is the query |
---|
0:32:18 | we can construct a left context vector, whose dimensionality is the lexicon size |
---|
0:32:27 | only those words appearing in the left context have their posterior probabilities as the scores, while all the other words have zero there |
---|
0:32:37 | similarly we can have a right context vector and a whole segment context vector |
---|
0:32:42 | and then we can concatenate them together into a feature vector |
---|
0:32:46 | and this has a dimensionality of three times the lexicon size |
---|
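A small sketch of these context-consistency features, assuming the left, right, and whole-segment contexts around the matched query arc are given as word-to-posterior dictionaries; the three lexicon-sized vectors are concatenated into one feature vector. The lexicon and values are toy assumptions.

```python
# Hedged sketch of the context features: left, right, and whole-segment
# context vectors over the lexicon, filled with posterior probabilities of
# the words seen in the lattice around the query hit, then concatenated into
# one vector of 3 x lexicon-size dimensions.

def context_feature(lexicon, left, right, whole):
    """left/right/whole: dicts word -> posterior around the matched query arc."""
    index = {w: i for i, w in enumerate(lexicon)}
    vec = [0.0] * (3 * len(lexicon))
    for offset, context in enumerate((left, right, whole)):
        for word, posterior in context.items():
            vec[offset * len(lexicon) + index[word]] = posterior
    return vec

lexicon = ["speech", "information", "retrieval", "theory"]
feat = context_feature(
    lexicon,
    left={"speech": 0.8},
    right={"retrieval": 0.7},
    whole={"speech": 0.8, "retrieval": 0.7, "information": 0.9},
)
print(len(feat), feat)  # 12-dimensional: 3 x lexicon size
```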
0:32:53 | now we can use this to do the experiments and here are the results |
---|
0:32:58 | again for the three sets of acoustic models |
---|
0:33:00 | and you may notice the context information really helps |
---|
0:33:06 | so what else can we do |
---|
0:33:08 | well, certainly concept matching |
---|
0:33:11 | in other words, we wish to match the concepts rather than the literal terms |
---|
0:33:17 | in other words, we wish the system can return utterances or documents semantically related to the query, but not necessarily including the query |
---|
0:33:28 | for example, if the query is "white house of the united states" |
---|
0:33:32 | and an utterance includes "president obama" but not "white house" or "united states" |
---|
0:33:38 | we wish it can be returned as well |
---|
0:33:41 | well, many approaches have been proposed in this direction |
---|
0:33:45 | for example, we can cluster the documents into sets, so we know which sets of documents are talking about the same concepts |
---|
0:33:54 | we can use web data to expand the query or expand the documents |
---|
0:33:59 | we can also use latent topic models |
---|
0:34:04 | here let me show just one example of the latent topic approach |
---|
0:34:08 | this is very straightforward: we just use the very popular and widely used probabilistic latent semantic analysis, or plsa |
---|
0:34:18 | in which we simply assume a set of latent topics between a set of terms and a set of documents |
---|
0:34:27 | and the relationships can be modeled by probabilistic models trained with the em algorithm |
---|
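For concreteness, here is a hedged sketch of PLSA trained with EM on a toy term-document count matrix; the number of topics, the initialization, and the data are illustrative, and this is only the generic model, not the exact system used in the work mentioned.

```python
# Hedged sketch of PLSA trained with EM: latent topics z sit between terms w
# and documents d, with P(w|d) = sum_z P(w|z) P(z|d).

import numpy as np

def train_plsa(counts, n_topics=2, iterations=50, seed=0):
    """counts: (n_docs, n_words) term-frequency matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics));  p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(iterations):
        # E-step: responsibility of each topic for each (doc, word) pair
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (d, z, w)
        resp = joint / (joint.sum(1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        expected = counts[:, None, :] * resp                # (d, z, w)
        p_w_z = expected.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = expected.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# two "topics" hiding in four documents over four terms
counts = np.array([[5, 4, 0, 0], [4, 5, 1, 0], [0, 1, 5, 4], [0, 0, 4, 5]], float)
p_w_z, p_z_d = train_plsa(counts)
print(np.round(p_z_d, 2))   # topic mixture of each document
```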
0:34:33 | of course there are many, many other approaches, and they are complementary |
---|
0:34:39 | and here is an example of work we did |
---|
0:34:41 | for the spoken archive, we transformed it into lattices |
---|
0:34:47 | then for any given query we simply use the plsa model we just mentioned, based on the latent topics, to estimate the distance between the query and the lattices |
---|
0:34:58 | and that gives the results |
---|
0:35:00 | here are some preliminary results; these results are in terms of recall-precision curves |
---|
0:35:08 | and these three lower curves are the baselines of literal term matching, simply matching words |
---|
0:35:14 | the lowest one is on the one best results, and the two upper ones are based on lattices |
---|
0:35:26 | now the three curves here are concept matching using the plsa i just mentioned |
---|
0:35:34 | as you can see, concept matching certainly helps very much |
---|
0:35:39 | so what else can we do |
---|
0:35:41 | well, how about user-content interaction? isn't that important |
---|
0:35:47 | we know that user-content interaction is important even for text content |
---|
0:35:54 | in other words, when we retrieve text content, very often we also need a few iterations to get the desired information |
---|
0:36:02 | now for spoken content it is much more difficult, because spoken content is not easily summarized on screen |
---|
0:36:09 | it is just signals |
---|
0:36:11 | so it's difficult for the user to browse, to scan, and to select |
---|
0:36:17 | so when the system gives a whole bunch of retrieval results, we cannot listen to every one of them and then decide which one we like |
---|
0:36:26 | so that's a problem |
---|
0:36:29 | what we propose is, first, we can try to automatically select key terms and construct titles and summaries, to help browsing |
---|
0:36:39 | and then we try to do some semantic structuring, to have a better user interface |
---|
0:36:45 | and then we can try to have some dialogue, to help the interaction between the user and the system |
---|
0:36:53 | so let me very briefly go through some of these |
---|
0:36:56 | for example, key term extraction, which is very helpful in labeling the retrieval results and for the user to browse |
---|
0:37:05 | the key terms include at least two types: keywords and key phrases |
---|
0:37:10 | a key phrase includes several words together, so for key phrases we need to detect the boundaries |
---|
0:37:18 | and there are many approaches to do this; let me use one example: suppose "hidden markov model" is a key phrase |
---|
0:37:25 | we find that "hidden" is always followed by the same word "markov" |
---|
0:37:32 | "markov" is always followed by the same word "model" |
---|
0:37:39 | however "model" |
---|
0:37:41 | is followed by many different words |
---|
0:37:44 | and that means this is the boundary of the phrase |
---|
0:37:47 | in this way we know the phrase boundaries can be detected by context statistics |
---|
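A minimal sketch of the context-statistics idea just described for phrase boundary detection: count how many distinct words follow a candidate prefix in the transcriptions; a prefix whose successors branch widely likely ends at a boundary. The corpus and the way a threshold would be set are toy assumptions.

```python
# Hedged sketch of boundary detection by context statistics: a prefix that is
# always followed by the same word is inside a phrase; a prefix followed by
# many different words likely ends at a phrase boundary.

def right_branching(corpus_sentences, prefix):
    """Number of distinct words observed right after `prefix` (a word tuple)."""
    prefix = tuple(prefix)
    successors = set()
    n = len(prefix)
    for sent in corpus_sentences:
        for i in range(len(sent) - n):
            if tuple(sent[i:i + n]) == prefix:
                successors.add(sent[i + n])
    return len(successors)

corpus = [
    "the hidden markov model is popular".split(),
    "a hidden markov model can be trained".split(),
    "every hidden markov model has states".split(),
]
# "hidden" and "markov" are always followed by the same word,
# while "model" is followed by many different words: a phrase boundary.
for prefix in (("hidden",), ("hidden", "markov"), ("hidden", "markov", "model")):
    print(prefix, right_branching(corpus, prefix))
```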
0:37:54 | now with the key term candidates |
---|
0:37:58 | either words or phrases |
---|
0:38:01 | we then extract many features to identify whether a candidate is actually a key term or not |
---|
0:38:06 | for example, prosodic features, because very often the key terms are produced with longer duration, wider pitch range, and higher energy |
---|
0:38:16 | we can also use semantic features, for example from plsa, because key terms are usually focused on a smaller number of topics |
---|
0:38:25 | for example, this is a distribution of topic probabilities obtained from plsa given a candidate term |
---|
0:38:32 | now this one looks like a key term, because it is focused on only a smaller number of topics; the horizontal axis is the topics |
---|
0:38:40 | and this one doesn't look like a key term, because it is more uniformly used across many different topics |
---|
0:38:47 | of course lexical features are very important; these include term frequency and inverse document frequency, part of speech tags, and so on |
---|
0:38:56 | here are the results of extracting key terms using different sets of features |
---|
0:39:03 | here are the prosodic, lexical, and semantic features, and you notice that each single set of features is useful |
---|
0:39:15 | however, when we integrate them together we get the highest result |
---|
0:39:21 | now for summarization: a lot of people in this room are doing summarization, so i'll just go over this very quickly |
---|
0:39:29 | suppose this is a document which includes many utterances, and we try to recognize them into words; every circle is a word, either recognized correctly or incorrectly |
---|
0:39:43 | what we do is try to select a small number of utterances which are most representative, and avoid redundancy |
---|
0:39:51 | and they are used to form a summary; this is the so-called extractive summarization |
---|
0:39:54 | we can even replace these utterances with the original voice, so there are no recognition errors in the result |
---|
0:40:05 | and i'll just show one example here |
---|
0:40:08 | because we are selecting the most representative utterances |
---|
0:40:12 | it is reasonable to consider that utterances topically similar to the representative utterances should also be considered representative |
---|
0:40:23 | so we can do a similar graph based analysis; in other words, every utterance is represented as a node on the graph |
---|
0:40:33 | and then we let the scores for representativeness propagate on the graph |
---|
0:40:38 | in this way we can get better scores and select better utterances |
---|
0:40:43 | these are some results; i'll skip them |
---|
0:40:46 | title generation |
---|
0:40:47 | titles are very often useful: if we construct titles for the retrieved documents or segments |
---|
0:40:55 | it is useful for the browsing and selection of utterances |
---|
0:41:00 | but a title has to be very short yet readable, and tell you what it is about |
---|
0:41:05 | here's one approach |
---|
0:41:07 | we perform a viterbi search over the summary, based on the scores obtained from some trained models |
---|
0:41:16 | to select the terms, to order the terms, and to decide the length of the title |
---|
0:41:23 | in this way we can have some good titles |
---|
0:41:27 | semantic structuring: there can be different ways to do semantic structuring, and we don't know which is the best approach; here let me just use one example |
---|
0:41:36 | we can cluster the retrieved results into some kind of tree structure based on the semantic information, for example the latent topics |
---|
0:41:48 | in this way, every cluster can be labeled by a set of key terms, with such key terms indicating what they are talking about |
---|
0:41:57 | and every cluster can be further expanded into the next layer, and so on |
---|
0:42:05 | here is another example |
---|
0:42:08 | a key term graph |
---|
0:42:10 | in other words, every retrieved spoken document or segment can be labeled by a set of key terms |
---|
0:42:17 | and then the relationships between the key terms can be constructed and represented as a graph |
---|
0:42:24 | so we know what kind of information is around |
---|
0:42:32 | okay, now finally the dialogue |
---|
0:42:35 | if we have all this, including semantic structuring, key terms, and summaries, here on the system side |
---|
0:42:42 | and the user is here providing some queries, what can we do to help them have a better interaction |
---|
0:42:51 | a dialogue may be possible |
---|
0:42:53 | and many people here in this room are very experienced in doing spoken dialogues, so we wish to learn something from them |
---|
0:43:02 | for example, we may model this process as a markov decision process, or mdp |
---|
0:43:09 | in this way what we can do is, for example, we need to define some goals |
---|
0:43:16 | the goal may be a higher task success rate |
---|
0:43:20 | where the success indicates the user's information need is satisfied |
---|
0:43:26 | we can also define a goal to be a small number of dialogue turns |
---|
0:43:32 | or a small number of query terms entered |
---|
0:43:35 | in this way we can define a reward function or something similar, and then maximize the reward function with simulated users |
---|
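As a hedged illustration of the reward just described, here is a toy reward function that rewards task success and penalizes extra dialogue turns and extra query terms; the weights are assumptions, not values from the talk.

```python
# Hedged sketch of a reward shaped along the goals just named: reward task
# success, penalize extra dialogue turns and extra query terms entered.

def dialogue_reward(success, n_turns, n_query_terms,
                    success_bonus=20.0, turn_cost=1.0, term_cost=0.5):
    """Return the total reward for one finished retrieval dialogue."""
    reward = success_bonus if success else 0.0
    reward -= turn_cost * n_turns
    reward -= term_cost * n_query_terms
    return reward

# e.g. a successful session that needed 3 turns and 4 query terms in total
print(dialogue_reward(True, n_turns=3, n_query_terms=4))   # 15.0
```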
0:43:45 | and here is one example application scenario, for retrieving broadcast news |
---|
0:43:50 | here, in every step when the user enters a query, the system returns not only the retrieved results but also a list of key terms for the user to select |
---|
0:44:02 | if the user is not satisfied with the results here, then he |
---|
0:44:06 | looks through the key term list from the top and selects the first relevant one he sees |
---|
0:44:13 | and this key term list can be ranked by the mdp |
---|
0:44:18 | and here are some results; i'll skip them |
---|
0:44:21 | so above i have mentioned something about key terms, summaries, titles, semantic structuring, and dialogues |
---|
0:44:28 | so that's something about user-content interaction; of course a lot more work is needed before we can do something really useful |
---|
0:44:38 | okay, now let me take a few minutes to show the demo |
---|
0:44:52 | and this is on a course lecture |
---|
0:44:56 | okay, let me go through a couple of the slides first |
---|
0:45:04 | this is on course lectures |
---|
0:45:07 | and as we know there are many course lectures available over the internet |
---|
0:45:12 | however, it takes a very long time for a user to listen to a complete lecture, for example forty five minutes |
---|
0:45:19 | and therefore it is not easy for engineers or other learners to learn new knowledge from these course lectures as we wish |
---|
0:45:29 | we also understand there are lecture browsers available over the internet |
---|
0:45:33 | however, we have to bear in mind that the knowledge in course lectures is usually structured: one concept follows another |
---|
0:45:42 | so a retrieved segment may be very difficult to understand without enough background knowledge |
---|
0:45:52 | and also, given the retrieved segment, there is no information for the learner regarding what should be learned next |
---|
0:46:00 | so the proposed approach is to try to structure the course lectures by slides and key terms |
---|
0:46:06 | we divide the course lectures by slides and derive the core content of the slides |
---|
0:46:13 | we extract key terms and then construct a key term graph to represent the semantic relationships among the slides |
---|
0:46:20 | and also all slides are given their length and timing information in the course, summaries, key terms, |
---|
0:46:28 | and related key terms and related slides based on the key term graph |
---|
0:46:34 | all retrieved spoken segments include all this information for the slides, if you want it |
---|
0:46:41 | and this is a system for a course on digital speech processing offered by myself at national taiwan university, and therefore it is given in mandarin chinese |
---|
0:46:53 | however, all the terminologies are produced directly in english, so this is code-mixed data |
---|
0:47:00 | okay, so now let me go to the demo |
---|
0:47:08 | and this is the course, and the system is given the name of ntu virtual instructor |
---|
0:47:15 | it was recorded in two thousand six, a total of about forty five hours |
---|
0:47:20 | now suppose i heard something in a lecture about the backward algorithm, and i don't know what that is |
---|
0:47:27 | so i tried to retrieve it |
---|
0:47:30 | however, because i don't know exactly what it is, i just guess it is something like "backward algorithm" |
---|
0:47:35 | so i just enter that and then do the search |
---|
0:47:39 | here i'm searching through the internet, on the server at national taiwan university, so i rely on the internet here |
---|
0:47:50 | and we see that here i'm retrieving with the voice rather than the words |
---|
0:47:55 | so the recognized query words are actually wrong |
---|
0:48:00 | but here we retrieve a total of fifty six results in the course |
---|
0:48:04 | and here for example, in this result, the first one is an utterance a few seconds long |
---|
0:48:11 | it is in the slide number twelve of chapter four, whose title says this is basic problem three for hmm |
---|
0:48:20 | and here, the slide is labeled by these key terms, so now i know this is about |
---|
0:48:28 | the backward algorithm, and also baum-welch, the forward algorithm, and so on |
---|
0:48:35 | and note that because these utterances are represented in terms of lattices of subword units |
---|
0:48:42 | the subword unit sequence of this "backward algorithm" is very similar to this one, and that's why i can retrieve it |
---|
0:48:50 | and there are many other results and so on that go with that |
---|
0:48:55 | now i think i'd like to listen to this |
---|
0:48:59 | so i can click here to go to that slide |
---|
0:49:02 | this is the slide of chapter four |
---|
0:49:05 | and the title here, i note, is done by myself, so it is a human generated title; i don't need the automatically generated title |
---|
0:49:15 | because every slide has a title, and the title is basic problem three for hmm |
---|
0:49:20 | and here it says this slide has a total length of twenty two minutes and fifty seven seconds, so if i'd like to listen to this i need to have twenty two minutes |
---|
0:49:30 | and in addition, this shows the position of this slide in chapter four, out of the twenty two slides in total, and so on |
---|
0:49:39 | and very important here are the key terms |
---|
0:49:42 | those terms on the top in yellow are the key terms used in this slide |
---|
0:49:48 | and those below here are related key terms provided by the key term graph |
---|
0:49:54 | in other words, the key term graph is not easy to show completely here, so instead i just list the highly related key terms here, below every key term |
---|
0:50:03 | so for example, when i go through here, i see here is the key term backward algorithm |
---|
0:50:12 | and the backward algorithm is actually related to the forward algorithm and so on |
---|
0:50:16 | now if i don't understand this one, and i'd like to know a little more before i listen to this slide |
---|
0:50:24 | so i click here, and that tells me that |
---|
0:50:26 | this key term backward algorithm actually first appeared in another slide |
---|
0:50:32 | which appears earlier |
---|
0:50:33 | and not in the slides which are later, so probably there's no explanation of it after this |
---|
0:50:40 | and if you really don't know about the backward algorithm you should go there, so i can click this one and then i go to |
---|
0:50:47 | that slide |
---|
0:50:49 | okay, here is that other slide |
---|
0:50:52 | and this is the first time that the backward algorithm was mentioned |
---|
0:50:56 | and in that slide, these are the key terms in the slide; for example, in that slide we also have the backward algorithm |
---|
0:51:07 | and the forward algorithm |
---|
0:51:09 | and the forward algorithm is actually related to that slide, and that is related to this one, and so on |
---|
0:51:17 | now let me show a second example: suppose i'd like to enter another query, which is "frequency" |
---|
0:51:24 | now i do the search |
---|
0:51:26 | now in this course there are a total of sixty four results for "frequency" |
---|
0:51:31 | here the first utterance, of six seconds long, appears in this slide on filter bank processing, labeled by these key terms |
---|
0:51:42 | and the second is on pre-emphasis |
---|
0:51:46 | and so on; if i'm interested in this one i can press here to go to this slide |
---|
0:51:52 | this is the slide on pre-emphasis |
---|
0:51:54 | and i notice there is a summary of fifteen seconds, so i'd like to listen to this summary |
---|
0:52:00 | so okay, it is retrieving the summary from the server |
---|
0:52:07 | (the fifteen second summary of the slide is then played back in mandarin chinese) |
---|
0:52:30 | okay, this is the fifteen second summary; it's in mandarin chinese, i'm sorry, but the english subtitle |
---|
0:52:38 | is actually done manually, in order to show you what was said in that summary |
---|
0:52:46 | and okay, so this is the end of the demo |
---|
0:52:48 | so let me come back to the powerpoint |
---|
0:52:54 | so in conclusion |
---|
0:52:56 | i usually divide spoken language processing over the internet into three parts |
---|
0:53:02 | user interface |
---|
0:53:04 | content analysis, including such things as key term extraction or summarization and so on, and user-content interaction |
---|
0:53:14 | and we notice that the user interface has been very successful, although it is not very easy, usually because the users usually |
---|
0:53:22 | expect the technology to behave like a human |
---|
0:53:26 | content analysis and user-content interaction are not easy either; however, because the technology can handle massive quantities of content |
---|
0:53:36 | which a human being cannot |
---|
0:53:37 | the technology does have some advantages, i think |
---|
0:53:41 | now spoken content retrieval is the one which integrates the user interface with content analysis and user-content interaction, and therefore may offer some interesting applications in the future |
---|
0:53:56 | so eventually i'd like to say that i think this area is only in its infancy stage |
---|
0:54:03 | and there is plenty of space to be developed and plenty of opportunities to be investigated in the future |
---|
0:54:12 | and i notice that many groups have been doing some work in this area, and actually many people in this room have done some work in this area |
---|
0:54:22 | so i wish we can have more discussions and more work in the future |
---|
0:54:27 | and hopefully we can have something much better than what we have today, just as |
---|
0:54:33 | in speech recognition we are now having much better performance than several years ago; so we wish we can do something more |
---|
0:54:42 | okay this concludes my presentation thank you very much for your attention |
---|
0:54:58 | lin-shan, thank you very much for a very interesting talk |
---|
0:55:02 | one question i have for you is that a lot of people in the audience are working on a related, somewhat related, problem, which is voice search, and one of the issues that comes up in voice search is: sometimes you |
---|
0:55:17 | say something, it gets the speech recognition wrong, and then you repeat the query to get it right, and the |
---|
0:55:24 | sets of choices it may come up with may be sort of similar and overlapping |
---|
0:55:31 | it would seem to me that has some relation to relevance feedback, in the sense that the user is |
---|
0:55:37 | sort of giving an additional set of information about the previous query that was dictated; i'm just wondering |
---|
0:55:45 | if |
---|
0:55:46 | you, or the people you work with, have looked into this |
---|
0:55:50 | sort of problem, whether you have any opinions on whether you could get improvements by somehow |
---|
0:55:58 | taking the union of multiple queries |
---|
0:56:01 | in a voice search sort of task, to jointly improve results, using similar methods to what you talked about in your talk |
---|
0:56:10 | thank you very much; i think that is certainly a very good idea |
---|
0:56:14 | actually, as i mentioned in the beginning, in this talk we are not talking about voice search |
---|
0:56:19 | but as in your experience, for example, repeated queries may provide |
---|
0:56:26 | some good information or correlation about the intent of the user, so they are helpful; for example, in |
---|
0:56:33 | what i mentioned about the dialogues, we actually allow the user to enter a second query and so on, and the key |
---|
0:56:42 | is actually the interaction or the correlation between the first and the second queries, and so i think that's the |
---|
0:56:50 | only thing we have done |
---|
0:56:52 | up to this moment, but i think probably what you say implies much more we can |
---|
0:56:57 | do, and i think, as i mentioned, we just have |
---|
0:57:03 | too much work |
---|
0:57:04 | to be done, and so we can think about how to implement what you are thinking about in the future |
---|
0:57:13 | thank you very much, that was really a very interesting talk; i have a detailed technical question on your svm slide |
---|
0:57:24 | if you can go back to it |
---|
0:57:31 | yeah, here, for |
---|
0:57:32 | the svm, you take the positive examples, yes, in this slide; so my question is, when you train |
---|
0:57:40 | your svm, it seems like you're only taking the high confidence examples as the training examples |
---|
0:57:48 | so you're pulling the examples from where it is far from the margin |
---|
0:57:54 | and in the testing phase, if you have some difficult examples that are close to the hyperplane, then you may |
---|
0:58:02 | probably |
---|
0:58:04 | have a hard time, right |
---|
0:58:06 | yeah, certainly you're right |
---|
0:58:08 | but well, that's all we can do at this moment, because we know nothing about these results, right? we just |
---|
0:58:14 | have the first pass results here, and all we can do is to assume the top and |
---|
0:58:20 | bottom |
---|
0:58:21 | and then construct the svm; of course in the middle, close to the boundary, it's a problem |
---|
0:58:28 | however, the svm already provides some solution for that, with the large margin concept |
---|
0:58:34 | so it tries to provide some |
---|
0:58:38 | margin somehow, and also there is some |
---|
0:58:41 | allowance, right, so |
---|
0:58:44 | we just try to follow the idea from the svm and try that |
---|
0:58:50 | to see if we can do something there |
---|
0:58:52 | okay thank you |
---|
0:58:58 | i have a question. sure |
---|
0:59:00 | it was a great talk; there are a lot of parallels with the |
---|
0:59:05 | methods here; this is really extending what michael was saying, between |
---|
0:59:09 | text based web search and this problem |
---|
0:59:15 | you know, even beyond voice search |
---|
0:59:17 | some methods, i think, have been developed by this community that would probably benefit the web search community |
---|
0:59:24 | and vice versa; i wanted to |
---|
0:59:27 | ask you to comment on that, and your awareness of the literature there and the opportunities for this |
---|
0:59:34 | cross fertilisation; in web search the query rewriting |
---|
0:59:38 | is well established, and |
---|
0:59:41 | also the click feedback: web search gets a lot of benefit from distinguishing between clicks from users |
---|
0:59:49 | and good clicks |
---|
0:59:51 | because clicks tend to be noisy, and so there's been a lot of work in the web search community |
---|
0:59:56 | about modelling clicks and, you know, determining good clicks and using those more heavily for feedback |
---|
1:00:03 | also just basic things like editorial relevance judgments, where you use, you know, large groups of users |
---|
1:00:13 | to determine the relevance and use that as well; so i just wanted to ask you |
---|
1:00:19 | what your thoughts are on the |
---|
1:00:20 | opportunities for cross fertilisation between these two areas |
---|
1:00:24 | yeah, sure, there are a lot of possibilities to learn from the experience in voice search and web search to do this |
---|
1:00:31 | part of work |
---|
1:00:32 | we just don't have enough time to explore the possibilities; as you mentioned, the clicks may be divided into |
---|
1:00:41 | several categories and they can be learned, or something like that |
---|
1:00:46 | and i think that is what could be done in the future |
---|
1:00:49 | but on the other hand, we also learn a lot from other areas such as text retrieval |
---|
1:00:56 | for example |
---|
1:00:59 | relevance feedback, or pseudo relevance feedback, or ranking, or learning to rank, or something; many of these ideas are learned from that |
---|
1:01:10 | community, so certainly |
---|
1:01:13 | cross-area |
---|
1:01:14 | interaction is very helpful |
---|
1:01:17 | because these areas are actually |
---|
1:01:19 | interdisciplinary |
---|
1:01:21 | on the other hand, we really try to |
---|
1:01:24 | do something more from inside the speech area, for example the acoustic models |
---|
1:01:31 | for example the acoustic features |
---|
1:01:35 | and so on, and we try, for example, spoken dialogues |
---|
1:01:39 | and we try to follow all those good ideas and good experiences in speech and see what can be used |
---|
1:01:46 | in this problem |
---|
1:01:48 | and i think, as i mentioned, we just have plenty of space to be explored in the future |
---|
1:01:56 | thank you very much |
---|