Let me introduce Professor Lin-shan Lee. He has been with National Taiwan University since 1982. His early work focused on the broader area of spoken language systems, particularly for the Chinese language, with a number of breakthroughs early on for that language. His more recent work has focused on the fundamentals of speech recognition and on network-environment issues such as information retrieval and semantic analysis of spoken content. He is an IEEE Fellow and an ISCA Fellow. He has served on numerous boards and received a number of awards, including, recently, the Meritorious Service Award of the IEEE Signal Processing Society. So please let us welcome Professor Lin-shan Lee.
Hello. Can you hear me? Good. And thank you, Larry. It is my great pleasure to be here today to present to you "Spoken Content Retrieval: Lattices and Beyond". I am Lin-shan Lee, from National Taiwan University.
In this talk I will first introduce the basic problem and some fundamentals, and then I will spend more time on some recent research examples, and I will show a demo before the conclusion.
So first, the problem. We are all very familiar with text content retrieval, which is very successful: for any user queries or user instructions, the system responds very well, and the desired information can be obtained very efficiently, in real time, by referring to the text documents. Users like it, and it has even produced various successful industries.
Now, today we all know that all roles of text can be accomplished by voice. On the content side, we do have spoken content, or multimedia content of which speech is a part. On the query side, voice queries can and should be entered through handheld devices. So it is time for us to consider retrieving spoken content.
Now, this is what we have today: when we enter a text query, we get text content back. In the future, both the queries and the content can be in the form of voice.
First, we may use text queries to retrieve spoken content, or multimedia content including audio. This case is very often referred to as spoken document retrieval. Very often, a more specific subset of that problem is referred to as spoken term detection; in other words, to detect the query terms in the spoken content.
Of course, we can also retrieve text content using voice queries; that case is usually referred to as voice search. However, in that task the documents to be retrieved are in text form, and therefore it would be out of the scope of this talk, so I am not going to spend more time talking about voice search.
Of course, we can also do the other case, that is, to retrieve spoken content using spoken queries; sometimes this is referred to as query by example. So in this talk I will focus on the retrieval of spoken content, primarily using text queries, though sometimes we will also consider the case of spoken queries.
Now, as we all understand, if the spoken content could be accurately recognized, this problem would reduce to the well-known text content retrieval problem, and there would be no problem at all. Of course, that never happens, because recognition errors are inevitable in most cases. That is the major problem.
Today, there are many handheld devices with multimedia functionalities available commercially, and there are unlimited quantities of multimedia content, which includes speech, growing over the Internet. So we should be able to retrieve not only text content but also multimedia and spoken content. In other words, the wireless and multimedia technologies of today are creating an environment for spoken content retrieval. So let me repeat: network access is primarily text-based today, but almost all roles of text can be accomplished by voice.
So next let me mention, very briefly, some fundamentals. First, as we all understand, recognition always gives errors, for various reasons: for example spontaneous speech, OOV words, mismatched models, and so on. That makes the problem difficult. So a good approach may be to consider lattices with multiple alternatives rather than the one-best output only. In this case we have a higher probability of including the correct words, but we also include more noisy words, which causes problems. On the other hand, even if we use lattices, we still have the problem that some correct words may not be included, because they are OOV words and so on. Also, using lattices implies huge memory and computation requirements; that is another major problem.
Of course, there exist other approaches to similar problems. For example, people use confusion matrices to model the recognition errors and try to expand the query and the documents using those confusion matrices. People also use pronunciation modeling to expand the query in that way. People also use fuzzy matching; in other words, the matching between the query and the content does not have to be exact. These are all very good approaches; however, I will not have time to say more about them, since our focus is on lattices in this talk.
Now the first question is: how can we index the lattices? Well, suppose we have a lattice like this. Usually the most popular approach to index a lattice is to transform it into a sausage-like structure like this; in other words, a series of segments, where every segment includes a number of word hypotheses, each with a posterior probability. In this way the position information for the words is readily available: word one is in the first segment and word eight in the second segment, so word one can be followed by word eight, and that is a bigram, and so on. This makes it more compatible with existing text indexing techniques. Also, the required memory and computation can be reduced slightly. In addition, we may notice that in this way we add more possible paths: for example, word three cannot be followed by word eight in the original lattice, but here this becomes possible. Also, the noisy words can be discriminated by their posterior probabilities, because we do have scores. In either case, we notice that we can match the query against this structure: for example, if the bigram word three followed by word five exists in the query, that helps. So we can compile all the possible n-grams, and when matching we accumulate the scores, and so on.
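To make this concrete, here is a minimal sketch of n-gram matching over such a sausage-like structure; the toy words and posteriors are hypothetical, my own illustration rather than code from any of the systems discussed:

    # toy sausage: each segment maps word hypotheses to posterior probabilities
    sausage = [
        {"word1": 0.6, "word3": 0.3},   # segment 1
        {"word8": 0.5, "word5": 0.4},   # segment 2
        {"word2": 0.7},                 # segment 3
    ]

    def ngram_score(sausage, ngram):
        """Accumulate the score of an n-gram over all start positions:
        the product of the posteriors of its words in consecutive segments."""
        n, total = len(ngram), 0.0
        for start in range(len(sausage) - n + 1):
            p = 1.0
            for i, w in enumerate(ngram):
                p *= sausage[start + i].get(w, 0.0)
            total += p
        return total

    # the bigram (word3, word8) gets a nonzero score here even if that path
    # did not exist in the original lattice
    print(ngram_score(sausage, ["word3", "word8"]))  # 0.3 * 0.5 = 0.15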
Now, many approaches have been proposed for this kind of lattice indexing; I will just list a few examples here. I think the most popular ones today may be the position-specific posterior lattice, or PSPL, the confusion network, or CN, and another very popular one, the weighted finite-state transducer, or WFST.
Now let me take one minute to explain the first two, the position-specific posterior lattice (PSPL) and the confusion network. Suppose this is a lattice, and these are all the possible paths through it. The PSPL, or position-specific posterior lattice, tries to locate every word in a segment based on the position of that word in a path. For example, word ten here appears only as the fourth word in any path, so it appears in the fourth segment. The confusion network, on the other hand, tries to cluster words together based on, for example, their time spans and word pronunciations. So, for example, word five and word ten may have very similar time spans and pronunciations, so they may be clustered together, and they may appear in the second cluster here. So you may note that the different approaches give different indices.
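Here is a minimal sketch of the PSPL idea, assuming the lattice has already been expanded into paths with posteriors; the toy paths are my own illustration:

    from collections import defaultdict

    def build_pspl(paths):
        """paths: list of (word_sequence, path_posterior) pairs from a lattice."""
        table = defaultdict(float)          # (position, word) -> posterior mass
        for words, posterior in paths:
            for pos, w in enumerate(words):
                table[(pos, w)] += posterior
        return table

    paths = [(["w1", "w5", "w9", "w10"], 0.5),
             (["w2", "w5", "w9", "w10"], 0.3),
             (["w1", "w6", "w9", "w10"], 0.2)]
    pspl = build_pspl(paths)
    print(pspl[(3, "w10")])   # w10 occurs only as the fourth word: mass 1.0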
Now, a major problem here is OOV words. As you understand, an OOV word cannot be recognized and therefore never appears in the lattice. That is important, because very often the query includes OOV words.
There are many approaches to handle this problem; I think the most fundamental one is to use subword units. Let me take this example: suppose an OOV keyword W is composed of these four subword units, where every small w_i is a subword unit, for example a phoneme or a syllable or something like that. These are also subword units, these are nodes, and these are arcs. The word W is not in the vocabulary, so it is not recognized here. However, if we look carefully, we notice that the word is actually here: it is hinted at the subword level, so it can actually be matched at the subword level without ever being recognized in the lattice. That is the major approach, and different methods can be developed to handle OOV queries using subword units; one example is to construct the same PSPL or CN based on subword units, for example.
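A minimal sketch of this subword-level matching, with a hypothetical lexicon entry and a hypothetical subword sequence read off a lattice:

    def subword_match(query_units, lattice_units):
        """True if the query's subword sequence occurs contiguously in the
        subword sequence read off the lattice."""
        n = len(query_units)
        return any(lattice_units[i:i + n] == query_units
                   for i in range(len(lattice_units) - n + 1))

    lexicon = {"TURBO": ["t", "er", "b", "ow"]}        # OOV word, absent from the ASR vocabulary
    lattice_units = ["s", "t", "er", "b", "ow", "ah"]  # subword arcs decoded from the lattice
    print(subword_match(lexicon["TURBO"], lattice_units))  # True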
Many different subword units have been used in this approach, and usually we can categorize them into two classes. The first is linguistically motivated units, for example phonemes, syllables, characters, morphemes, and so on. The other is data-driven units; in other words, they are derived using some data-driven algorithms, and different algorithms may produce different units, called for example particles, word fragments, multigrams, and so on.
Of course, there are some other different approaches. If we do have the query available in voice form, we can also match the query in speech against the content in speech directly, without doing recognition. In that case we avoid the recognition error problem, and we can even do it in an unsupervised way; we do not even need a lattice. This can be performed with, say, frame-based matching, for example DTW, or segment-based approaches (just imagine matching the segments), or model-based approaches, and so on. But all of these approaches do not use recognition and therefore do not have lattices, so I will not spend more time on them; I will just focus on those with lattices.
Okay, so based on all the fundamentals I just described, let me describe some recent research examples. I have to apologize that I can only cover a small number of examples; there are many I just cannot cover. Below, I will assume the retrieval process looks something like this: this is the spoken archive; after recognition based on some acoustic models, we have the lattices here; the retrieval is then applied on top of these lattices. Here, by search engine I mean indexing the lattices and searching over the index, and by retrieval model I mean anything in addition, for example confusion matrices, score weighting, and whatever else. Based on this diagram, I will discuss the following.
The first thing we can think about doing is integration and weighting. For example, we can integrate different scores from recognition: from different recognition systems, from those based on different subword units, or including some other information, and so on. In addition, a good idea may be to train those model parameters, if we have some training data available.
What kind of training data is needed here? Well, the training data we need is a set of queries, each with its associated relevant and irrelevant segments. For example, when a user entered query Q1 we get a list like this, where the first two items are false, or irrelevant, and the next two are relevant, and so on; we need a set of this kind of data. Such data does not necessarily have to be annotated by people, because we can collect it from real click-through data. For example, if the user enters a query Q1 and gets this list, and then he skips the first two items and directly clicks the next two, we may assume that the first two are irrelevant, or false, and the next two are relevant. In this way we can obtain click-through data.
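A minimal sketch of deriving such labels from clicks, under exactly the assumption just stated (items skipped above a click are irrelevant, clicked items are relevant); the data layout is my own illustration:

    def labels_from_clicks(ranked_ids, clicked_ids):
        """Label items skipped above the last clicked position as irrelevant (0)
        and clicked items as relevant (1); items below the last click are left
        unlabeled. Assumes at least one click."""
        last_click = max(i for i, d in enumerate(ranked_ids) if d in clicked_ids)
        return {d: (1 if d in clicked_ids else 0)
                for d in ranked_ids[:last_click + 1]}

    print(labels_from_clicks(["d1", "d2", "d3", "d4"], {"d3", "d4"}))
    # {'d1': 0, 'd2': 0, 'd3': 1, 'd4': 1}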
When we have this data we can do something more; for example, we can use it to train the parameters here: we train different weighting parameters to weight different recognition outputs, different subword units, or different information including word confidence scores or phone confusion matrices, and so on.
Here, let me very briefly show you two examples. In the first one, we actually used the two different indexing approaches we just mentioned, confusion networks and position-specific posterior lattices; in each case we used not only word-based indexing but also indexing based on subword units, and for each we can use unigrams, bigrams, and trigrams. So we have a total of eighteen different scores, and we try to add them together with some weighting, to optimize a parameter describing the retrieval performance, which is called MAP. The MAP I mention in this talk is mean average precision, which is the area under the recall-precision curve, and which is a performance measure frequently used in information retrieval; of course there are many other metrics, but I only have time to use this one here.
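For reference, a minimal computation of MAP over ranked lists of binary relevance labels; normalizing by the number of relevant items retrieved, as done here, is a common simplification:

    def average_precision(relevance):
        """relevance: 0/1 labels of a ranked list, best first."""
        hits, precisions = 0, []
        for k, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / hits if hits else 0.0

    def mean_average_precision(ranked_lists):
        return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

    print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]]))  # ~0.708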
Now we can try to optimize this parameter using an extended version of support vector machines. Here are the results: these are the MAP values for the different scores used individually, and this is the result when we integrate them together. You see we get a net gain of about seven to eight percent in MAP, which is not bad.
we think it is possible to have context within the context dependent term weighting
in other words the same term may have different weights depending on the content
for example if the query term is the query this information series this information is very important
but if the previous speech information retrieval
then this work information is not so important because important terms are speech and retrieval
in this week different term may have different weights in different context
and these weights can be trained
and he got the results
using context-dependent wait we actually get some gain on the mlp
Okay, those were examples of score integration and weighting. Now what can we do next? Well, the first thing we think about is: how about the acoustic models, can we do something there? Just as so many experts do discriminative training of acoustic models, can we use this training data to re-estimate the acoustic models? Well, in the past, retrieval was always performed on top of the recognition output; they are two cascaded, independent stages, and so the retrieval performance really relies on the recognition accuracy. So why don't we consider these two stages together as a whole? Then the acoustic models can be re-estimated by optimizing the retrieval performance, and in this way the acoustic models may be better matched to the respective dataset. So we learned from MPE and tried to define the objective function in a similar way in this work.
And here are the results, for three different sets of acoustic models: these are speaker-independent models, these were adapted by global MLLR, and these were further adapted by MLLR plus MAP, where MAP here means maximum a posteriori adaptation, not mean average precision. And these numbers are retrieval MAP, not recognition accuracy. As you may notice, we do have some improvements, but they are relatively limited, probably because we were not able to define a good enough objective function.
Another possible reason may be that different queries really have quite different characteristics, so when we put many queries together, the different queries interfere with each other in the training data. So we are thinking: why not use query-specific acoustic models? In other words, we re-estimate the acoustic models for every query. That means this has to be done in real time, online. Is that possible? We think yes, because based on the first several utterances that the user clicks while browsing the retrieval results, all the other utterances not yet browsed can be re-ranked by the new acoustic models. That means the models can be updated and the lattices rescored very quickly. Why? Because we have only a very limited amount of training data, so the re-estimation can be very fast.
So this is the scenario: the system gives the retrieval results here, and the user clicks and browses the first several, with the clicks indicating whether they are relevant or irrelevant. These results are then fed back to re-estimate the acoustic models; we get new models, these are used to rescore the lattices, and the rescored lattices are used to re-rank the rest of the list. So what are the results? Well, we can see that with just one iteration of model re-estimation, which makes real-time adaptation possible, we do have some improvements here.
Now what else can we do? Well, how about acoustic features? Yes, we can do something with acoustic features too. For example, if an utterance is known to be relevant or irrelevant, then all the utterances similar to it are more probably relevant or irrelevant as well. So we have the same scenario: when the user sees the output and clicks the first several utterances, we can use those first several utterances as references, and the ones not yet browsed are compared with the clicked ones based on acoustic similarity, and then re-ranked. Let's see whether this is better or not.
First we need to define the similarity in terms of acoustic features. For each utterance, we define the hypothesized region: the segment of the feature vector sequence corresponding to the arc in the lattice matching the query Q with the highest score. For example, for this utterance, this is the feature vector sequence, this is the arc corresponding to the query, and this is the hypothesized region; similarly, here is another utterance with its feature sequence and its hypothesized region. Then the similarity can be derived from the DTW distance between these two regions, and in this way we can perform the scenario we just mentioned.
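A minimal sketch of such a DTW distance between two hypothesized regions, each a sequence of feature vectors; the random frames are placeholders for real features:

    import numpy as np

    def dtw_distance(a, b):
        """Classic dynamic-time-warping alignment cost between two
        feature-vector sequences a and b (Euclidean local distance)."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)   # length-normalized

    region1 = np.random.rand(40, 13)   # 40 frames of 13-dim features
    region2 = np.random.rand(55, 13)
    print(dtw_distance(region1, region2))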
Here are the results, again for the three sets of acoustic models. We may notice that in this way, using acoustic similarities, we get slightly better improvements compared to directly re-estimating the acoustic models.
Okay, so what else can we do? Well, we may consider a different approach. In the above we always assumed we needed to rely on the users to give us some feedback information. Do we really need to rely on the users? No, because we can always derive relevance information automatically: we can assume the top N utterances in the first-pass retrieval results are relevant, or actually pseudo-relevant, and this is referred to as pseudo-relevance feedback. Here is the scenario: the user enters the query, and the system produces the first-pass retrieval results, but these are not shown to the user. Instead, we simply assume the top N utterances are relevant, all the rest are compared with these top N to see whether they are similar or not, based on the similarity we re-rank the results, and only the re-ranked results are shown to the user.
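A minimal sketch of this re-ranking; the linear interpolation between the first-pass score and the average similarity to the pseudo-relevant set is my own illustrative choice, not the exact formulation in the work described:

    def prf_rerank(results, similarity, top_n=5, alpha=0.5):
        """results: list of (utterance_id, first_pass_score), best first.
        similarity(u, v): acoustic similarity, e.g. a negated DTW distance."""
        pseudo_relevant = [u for u, _ in results[:top_n]]
        rescored = [(u, alpha * score +
                     (1 - alpha) * sum(similarity(u, r) for r in pseudo_relevant) / top_n)
                    for u, score in results]
        return sorted(rescored, key=lambda x: x[1], reverse=True)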
Okay, here are the results. You can see that with this pseudo-relevance feedback, for the different acoustic models, we again have slightly better improvements here.
Now what else can we do? Well, we can further improve the above pseudo-relevance feedback approach, for example with a graph-based approach. Remember that in the pseudo-relevance feedback approach we took the top N utterances as the reference, assuming they are relevant; but of course they are not reliable. So why don't we simply consider the whole of the first-pass retrieval results globally, using a graph? In other words, we can construct a graph over all utterances in the first-pass retrieval results, where every utterance is a node and the edge weights are the acoustic similarities between them. Now we may assume that an utterance strongly connected to utterances with high scores, or very similar to utterances with high scores, should itself have a high score. For example, if X2 and X3 here have high scores, then X1 should too; similarly, if X2 and X3 all have low scores, then X1 should have a low score. In that case the scores can propagate on the graph and spread among strongly connected nodes, and in this way all the scores can be corrected. We can then re-rank all the utterances in the first-pass retrieval results using these corrected scores.
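A minimal sketch of such score propagation, in the spirit of random-walk smoothing; the interpolation weight and iteration count are illustrative assumptions:

    import numpy as np

    def propagate_scores(first_pass, similarity, alpha=0.8, iters=50):
        """first_pass: initial relevance scores, shape (n,);
        similarity: nonnegative similarity matrix, shape (n, n).
        Each score is repeatedly interpolated with the similarity-weighted
        average of its neighbors' scores."""
        W = similarity / similarity.sum(axis=1, keepdims=True)   # row-normalize
        s = first_pass.astype(float).copy()
        for _ in range(iters):
            s = alpha * first_pass + (1 - alpha) * W @ s
        return s

    sim = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
    print(propagate_scores(np.array([0.9, 0.2, 0.5]), sim))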
Here are the results, again for the three sets of acoustic models. You may notice that the graph-based approach now provides higher improvements than before. This is reasonable, because the graph-based approach really considers the first-pass retrieval results globally, rather than relying only on the top N utterances as references.
Okay, what else can we do? Well, machine learning, of course, has been used and shown useful in much work, so let me show one example using support vector machines, in the pseudo-relevance feedback scenario we just mentioned. Here is the scenario again: the user enters the query Q here, and this is the first-pass retrieval result, which is not shown to the user. Instead, we simply take the first-pass retrieval results, assume the top N utterances are relevant and take them as positive examples, and assume the bottom N are irrelevant and take them as negative examples. Then we simply extract some feature vectors from them and train a support vector machine. For the rest of the utterances, we extract the feature parameters, re-rank them with the scores of the support vector machine, and only the re-ranked results are shown to the user. So in this case, please note that we need to train an SVM for every query, online. Is that possible? Yes, because we only have a limited number of training examples, so the training can be very fast.
Now, the first thing we need to do is to define how to extract the feature parameters to be used in training the SVM. Well, again we can use the hypothesized region we just mentioned. Suppose this is an utterance and this is the corresponding lattice, here is the query arc, and this is the hypothesized region. We can divide this region into HMM states, and the feature vectors within one state can be averaged into one vector; the vectors for the different states can then be concatenated into a supervector, and that is the feature vector for this utterance.
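A minimal sketch of the supervector extraction and the per-query SVM re-ranking; the state boundaries and the use of scikit-learn here are my own illustrative assumptions, not the exact setup in the work described:

    import numpy as np
    from sklearn.svm import SVC

    def supervector(region_frames, state_boundaries):
        """Average the frames inside each state span, then concatenate."""
        return np.concatenate([region_frames[b:e].mean(axis=0)
                               for b, e in state_boundaries])

    def svm_rerank(pos_vecs, neg_vecs, rest_vecs):
        """Train on pseudo-positive/negative supervectors, score the rest."""
        X = np.vstack([pos_vecs, neg_vecs])
        y = np.array([1] * len(pos_vecs) + [0] * len(neg_vecs))
        clf = SVC(kernel="linear").fit(X, y)   # tiny training set, so fast
        return clf.decision_function(np.vstack(rest_vecs))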
So with that, what are the results? Again we can see the results for the different acoustic models. You may notice that the SVM is now much better, much better than the previous results. And of course I have to mention that all the results reported here are very preliminary; they were just obtained in preliminary experiments.
Now what else can we do? Well, all the above discussions primarily considered acoustic models and acoustic features. How about linguistic information? Yes: for example, the most straightforward linguistic information we can use is context dependency, or context consistency. In other words, the same term usually has very similar contexts, while quite different contexts usually imply that the terms are quite different. So what can we do? We can do exactly the same as we did with the SVM, except that now the feature vectors represent context information. We use exactly the same scenario: for the first-pass retrieval results, we use the top N and bottom N to train the SVM, except now with different feature vectors.
Suppose this is an utterance and this is the corresponding lattice, and here is the query arc. We can construct a left-context vector, whose dimensionality is the lexicon size; only the words appearing in the left context have their posterior probabilities as scores, and all the other words have zeros there. Similarly, we can have a right-context vector and a whole-segment context vector, and then we can concatenate them together into one feature vector, which has a dimensionality of three times the lexicon size.
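A minimal sketch of constructing these context vectors, with a toy lexicon of my own invention:

    import numpy as np

    def context_vector(context_words, lexicon_index, vocab_size):
        """context_words: list of (word, posterior) pairs seen in the context."""
        v = np.zeros(vocab_size)
        for word, posterior in context_words:
            v[lexicon_index[word]] += posterior
        return v

    lexicon_index = {"speech": 0, "information": 1, "retrieval": 2}
    left  = context_vector([("speech", 0.9)], lexicon_index, 3)
    right = context_vector([("retrieval", 0.8)], lexicon_index, 3)
    whole = context_vector([("speech", 0.9), ("retrieval", 0.8)], lexicon_index, 3)
    feature = np.concatenate([left, right, whole])   # 3 x lexicon size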
Now we can use this to do the experiments, and here are the results, again for the three sets of acoustic models. You may notice that the context information really helps.
So what else can we do? Well, how about concept matching? In other words, we wish to match the concept rather than the literal terms: we wish the system could return utterances or documents semantically related to the query but not necessarily including the query terms. For example, if the query is "White House of the United States", and an utterance includes "President Obama" but neither "White House" nor "United States", we wish it could be returned as well.
Well, many approaches have been proposed in this direction. For example, we can cluster the documents into sets, so we know which sets of documents are talking about the same concept; we can use web data to expand the query or expand the documents; we can also use latent topic models. Let me show just one example of the latent topic approach. This one is very straightforward: we just use the very popular and widely used probabilistic latent semantic analysis, or PLSA, in which we simply assume a set of latent topics between the set of terms and the set of documents, and the relationships are modeled by probabilistic models trained with the EM algorithm.
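As a rough illustration only, a minimal PLSA trainer with EM on a small term-document count matrix; the dimensions, initialization, and fixed iteration count are my own simplifications:

    import numpy as np

    def plsa(counts, n_topics=2, iters=50, seed=0):
        """counts: (n_docs, n_words) term counts; returns P(z|d), P(w|z)."""
        rng = np.random.default_rng(seed)
        n_docs, n_words = counts.shape
        p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
        p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
        for _ in range(iters):
            # E-step: P(z|d,w) proportional to P(z|d) * P(w|z)
            joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (d, z, w)
            p_z_dw = joint / joint.sum(axis=1, keepdims=True)
            # M-step: re-estimate both distributions from expected counts
            expected = counts[:, None, :] * p_z_dw             # (d, z, w)
            p_w_z = expected.sum(axis=0); p_w_z /= p_w_z.sum(1, keepdims=True)
            p_z_d = expected.sum(axis=2); p_z_d /= p_z_d.sum(1, keepdims=True)
        return p_z_d, p_w_z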
Of course there are many, many other approaches, and they are complementary. Here is an example of work we did: we have a spoken archive, and we transformed it into lattices; then, for any given query, we simply use the PLSA model just mentioned, based on the latent topics, to estimate the distance between the query and the lattices, and that gives the results.
Here are some preliminary results, shown as recall-precision curves. The three lower curves are the baseline of literal term matching, simply matching words: the lowest one is based on one-best results, and the two upper ones are based on lattices. The three curves up here are concept matching using the PLSA approach I just mentioned. As you can see, concept matching certainly helps much.
So what else can we do? Well, how about user-content interaction; isn't that important? We know that user-content interaction is important even for text content: when we retrieve text content, very often we also need a few iterations to get the desired information. For spoken content this is much more difficult, because spoken content is not easily summarized on screen; it is just signals, so it is difficult for the user to browse, to scan, and to select. So when the system gives a whole bunch of retrieval results, we cannot listen to every one of them and then decide which one we like. That is a problem.
What we propose is this: first, we can try to automatically extract key terms and construct titles and summaries to help browsing; then we try to do some semantic structuring to build a better user interface; and then we can try to have some dialogue to help the interaction between the user and the system.
So let me very briefly go through some of these. For example, key term extraction, which is very helpful in labeling the retrieval results for the user to browse. The key terms include at least two types, keywords and key phrases; a key phrase puts several words together, so for key phrases we need to detect the boundaries. There are many approaches to do this; let me use one example. Suppose "hidden markov model" is a key phrase. We may find that "hidden" is always followed by the same word, "markov", and "markov" is always followed by the same word, "model"; however, "model" is followed by many different words, and that means this is the boundary of the phrase. In this way the boundaries can be detected by context statistics.
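A minimal sketch of this kind of boundary detection from successor statistics; the toy counts are hypothetical:

    from collections import defaultdict

    def successor_counts(prefix_next_pairs):
        """prefix_next_pairs: (prefix_tuple, next_word) pairs from a corpus."""
        succ = defaultdict(set)
        for prefix, nxt in prefix_next_pairs:
            succ[prefix].add(nxt)
        return succ

    data = [(("hidden",), "markov"),
            (("hidden", "markov"), "model"),
            (("hidden", "markov", "model"), "is"),
            (("hidden", "markov", "model"), "can"),
            (("hidden", "markov", "model"), "has")]
    for prefix, nexts in successor_counts(data).items():
        print(prefix, len(nexts))
    # "hidden" and "hidden markov" each have one successor, while
    # "hidden markov model" has several: the phrase boundary falls after "model"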
Now, with the key term candidates, either words or phrases, we can use many features to identify whether they are really key terms or not. For example, prosodic features, because very often the key terms are produced with longer duration, wider pitch range, and higher energy. We can also use semantic features, for example from PLSA, because key terms are usually focused on a smaller number of topics. For example, here are distributions of topic probabilities obtained from PLSA given a candidate term: this one looks like a key term because it concentrates on only a small number of topics (the horizontal axis is the topics), and this one does not look like a key term because it is spread more uniformly over many different topics. Of course lexical features are very important too; these include term frequency, inverse document frequency, part-of-speech tags, and so on.
Here are the results of extracting key terms using different sets of features: the prosodic, lexical, and semantic features. You may notice that each single set of features is useful; however, when we integrate them together, we get the highest result.
Now, for summarization: a lot of people in this room are doing summarization, so I will go over this very quickly. Suppose this is a document that includes many utterances, and we try to recognize them into words; every circle is a word, recognized either correctly or incorrectly. What we do is try to select a small number of utterances which are most representative while avoiding redundancy, and these are used to form the summary. This is so-called extractive summarization; we can even replace these utterances with the original voice, so there are no recognition errors in the result.
Let me just show one example here. Because we are selecting the most representative utterances, it is reasonable to consider that utterances similar to the representative utterances should also be considered representative. So we can do a graph-based analysis, along the lines of the score propagation sketched earlier: every utterance is represented as a node on the graph, and then we let the scores for representativeness propagate on the graph. In this way we get better scores and select better utterances. Here are some results, which I will skip.
Title generation: titles are very often useful. If we construct titles for the retrieved documents or segments, they are useful for the browsing and selection of utterances, but they have to be very short yet readable, and tell you what the content is. Here is one approach: we perform Viterbi decoding over the summary, based on the scores obtained with some trained models, to select the terms, to order the terms, and to decide the length of the title. In this way we can get some reasonably good titles.
Semantic structuring: there can be different ways to do semantic structuring, and we do not know what the best approach is, so here I will just use one example. We can cluster the retrieved results into some kind of tree structure based on the semantic information, for example the latent topics. In this way every cluster can be labeled by some key terms, so that those key terms indicate what the cluster is talking about, and every cluster can be further expanded into the next layer, and so on.
Here is another example of semantic structuring: every retrieved spoken document or segment can be labeled by some key terms, and then the relationships among the key terms can be constructed and represented as a graph, so we know what kind of information is there.
Okay, now finally the dialogue. If we have all of this on the system side, including semantic structuring, key terms, and summaries, and the user is here providing queries, what can we do to offer a better interaction? A dialogue may be possible, and many people here in this room are very experienced in spoken dialogues, so we wish to learn something from that. For example, we may model this process as a Markov decision process, or MDP. What we can do, for example, is define some goals: the goal may be a higher task success rate, where success indicates that the user's information need is satisfied; we can also define the goal to be a small number of dialogue turns, or a small number of query terms entered. In this way we can define a reward function or something similar, and then maximize the reward function with simulated users.
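As a toy illustration only, a reward function of this shape might look like the following; the weights and signature are hypothetical, not from the work described:

    def dialogue_reward(success, n_turns, n_query_terms,
                        w_success=10.0, w_turn=1.0, w_term=0.5):
        """Reward success, penalize extra turns and extra query terms."""
        return w_success * float(success) - w_turn * n_turns - w_term * n_query_terms

    # a policy would be trained (e.g., with simulated users) to maximize the
    # expected value of this reward over complete dialogues
    print(dialogue_reward(success=True, n_turns=3, n_query_terms=4))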
Here is one example application scenario, for retrieving broadcast news. In every step, when the user enters a query, the system returns not only the retrieved results but also a list of key terms for the user to select. If the user is not satisfied with the results, he simply looks through the key term list from the top and selects the first relevant one he sees, and this key term list can be ranked by the MDP. Here are some results, which I will skip. So, above I have mentioned something about key term extraction, summaries, titles, semantic structuring, and dialogues; that is something about user-content interaction. Of course a lot more work is needed before we can do something really useful.
Okay, now let me take a few minutes to show the demo. This is on course lectures; let me go through the slides first. As we know, there are many course lectures available over the Internet; however, it takes a very long time for a user to listen to a complete course, for example forty-five minutes per lecture, and therefore it is not easy for engineers or other learners to learn new knowledge from these course lectures. We also understand there are lecture browsers available over the Internet; however, we have to bear in mind that the knowledge in course lectures is usually structured, with one concept following another. So a retrieved spoken segment may be very difficult to understand without enough background knowledge, and also, given the retrieved segment, there is no information for the learner regarding what he should learn next.
So the proposed approach is to try to structure the course lectures by slides and by key terms. We divide the course lectures by slide, derive the core content of the slides as key terms, and then construct a key term graph to represent the semantic relationships among the slides. Also, all slides are given their length and timing information in the course, summaries, key terms, and related terms and related slides based on the key term graph; so the retrieved spoken segments include all this information from the slides, if you want it. This is a system for a course on digital speech processing offered by myself at National Taiwan University, and therefore it is given in Mandarin Chinese; however, all the terminologies are produced directly in English, so this is code-mixed data.
Okay, so now let me go to the demo. This is the course, and the system was given the name NTU Virtual Instructor. It was recorded in 2006, a total of forty-five hours.
Now, suppose I heard something in a lecture about the backward algorithm, but I do not know what that is, so I try to retrieve it. However, because I do not know what it is, I can only guess at something that sounds like it, so I enter my guess and do the search. Here I am searching through the Internet, on the server at National Taiwan University, so I rely on the Internet here. And note that here I am retrieving with the voice rather than the words: the recognized query words are totally wrong, but we still retrieve a total of fifty-six results in the course.
Here, for example, the first result is an utterance about a second long; it is in slide number twelve of chapter four, which I titled "basic problem 3 for HMM". The slide is labeled by these key terms, so by looking at them, I now know this is about the backward algorithm, and also about Baum-Welch, the forward algorithm, and so on. Note that because the utterances are represented in terms of lattices of subword units, the subword unit sequence of my wrong query is very similar to that of "backward algorithm", and that is why it can be retrieved. And there are many other results, and so on, that go with that.
Now, if I think I would like to listen to this, I can click here to go to that slide in chapter four. The title here is not automatically generated; this one was done by myself, so it is a human-generated title; I do not need an automatically generated title, because every slide has a title, and here the title is "basic problem 3 for HMM". And here it says this slide has a total length of twenty-two minutes and fifty-seven seconds, so if I would like to listen to it I need to have twenty-two minutes. In addition, this shows where this slide sits within chapter four, out of the twenty-two slides in total.
only those terms on the top in yellow are the key terms used in this line
and those below here are real at times provided by the kitchen right
in other words it you demographics are completely not easy to show here so instead i just list the highly
related key terms here below every cute right
so for example when i go through here i saw here is at train quarterback right now "'cause" i'm not
support for L
and background wasn't actually relate to for education and so on
Now, if I do not understand this one and would like to know a little more before I listen to this slide, I can click here, and that tells me that this key term, "backward algorithm", actually first appeared in another slide, which comes earlier; the slide I am on is later, so probably there is not much explanation of it here. If I really do not know about the backward algorithm, I should go there; so I click this one and I go to that slide.
Okay, here is that other slide, and this is the first time the backward algorithm was mentioned. Here are the key terms in that slide; for example, in that slide we also have "backward algorithm" and "forward algorithm", the forward algorithm is actually related to the Viterbi algorithm, the Viterbi algorithm is related to this one, and so on.
Now let me show a second example. Suppose I would like to enter another query, which is "frequency", and I do the search. In this course there are a total of sixty-four results for "frequency". Here, the first utterance, six seconds long, appears in this slide, labeled by these key terms, and the second is on pre-emphasis, and so on. If I am interested in this one, I can press here to go to this slide. This is the slide on pre-emphasis, and I notice there is a summary of fifteen seconds, so I would like to listen to this summary.
[the system retrieves and plays the fifteen-second summary, in Mandarin Chinese]
Okay, that was the fifteen-second summary; it is in Mandarin Chinese, I am sorry, but the English subtitles were actually done manually, in order to show you what was said in that summary. Okay, so this is the end of the demo; let me come back to the PowerPoint.
So, in conclusion: I usually divide spoken language processing over the Internet into three parts: user interfaces; content analysis, including for example key term extraction, summarization, and so on; and user-content interaction. We notice that user interfaces have been very successful, though it is usually not easy, because users usually expect the technology to replace humans. Content analysis and user-content interaction are not easy either; however, because the technology can handle massive quantities of content, which a human being cannot, the technology does have some advantages there. Now, spoken content retrieval is the area which integrates the user interface with content analysis and user-content interaction, and therefore it may offer some interesting applications in the future.
So finally, I would like to say that I think this area is only in its infancy stage, and there is plenty of space to be developed and plenty of opportunities to be investigated in the future. I notice that many groups have been doing some work in this area, and actually many people in this room have done some work in this area, so I wish we can have more discussions and more work in the future, and hopefully we can have something much better than what we have today, just as in speech recognition we now have much better performance than several years ago. So we wish we can do something more. Okay, this concludes my presentation; thank you very much for your attention.
Thank you very much for a very interesting talk. One question I have for you: a lot of people in the audience are working on a related, somewhat related problem, which is voice search, and one of the issues that comes up in voice search is that sometimes you say something, it gets mis-recognized, and then you repeat the query to get it right, and the other sets of choices that come up may be sort of similar and overlapping. It would seem to me that has some relation to relevance feedback, in the sense that the user is sort of giving an additional set of information about the previous query that was dictated. I am just wondering whether you, or the people you work with, have looked into this sort of problem, and whether you have any opinions on whether you could get improvements by somehow taking the union of multiple queries in a voice search sort of task, to jointly improve results, using methods similar to what you talked about in your talk.
Thank you very much; I think that is certainly a very good idea. Actually, as I mentioned in the beginning, in this talk we are not talking about voice search; but as in your experience, for example, repeated queries may provide some good information or correlation about the intent of the user, so they are helpful. For example, in the dialogue work I mentioned, we actually allow the user to enter a second query and so on, and that captures the interaction or the correlation between the first and second queries. So I think that is the only thing we have done up to this moment; but I think what you say implies much more we can do, and, as I mentioned, there is just too much work to be done, so we can think about how to implement what you suggest in the future.
Thank you very much for a very interesting talk. I have a detailed technical question on your SVM slide, if you can go to it; yes, this slide, where you take the positive examples. So my question is: when you train your SVM, it seems like you are only taking high-confidence examples as positive examples, so you are pulling the training examples from far away from the margin; and in the testing phase, if you have some difficult examples that are close to the hyperplane, then you might probably perform poorly there.
Yes, certainly you are right, but, well, that is all we can do at this moment, because we know nothing about these results, right? We just have the first-pass results, and all we can do is assume the top N and bottom N and then train the SVM; of course, the ones in the middle, close to the boundary, are a problem. However, the SVM already provides some solution for that through the large-margin concept, so it tries to maximize the margin, and there are also some slack allowances. So we just try to follow the ideas from SVMs and see if we can do something there. Okay, thank you.
I have a question. Sure. Thanks for a great talk. There are a lot of parallels between the methods here and text-based web search, as was just said, even beyond voice search. Some methods that have been developed by this community would probably benefit the web search community, and vice versa, and I wanted to ask you to comment on that, on your awareness of the literature there, and on the opportunities for cross-fertilization. In web search, query rewriting is well established; also, for click feedback, web search got a lot of benefit from distinguishing ordinary clicks from good clicks, because clicks tend to be noisy, and so there has been a lot of work in the web search community on modeling clicks, determining good clicks, and weighting those more heavily for feedback. Also just basic things like editorial relevance feedback, where large groups of users determine the relevance, and using that as well. So I just wanted to ask what your thoughts are on the opportunities for cross-fertilization between these two areas.
Yes, sure, there are a lot of possibilities to learn from the experience in web search to do this kind of work; we just have not had enough time to explore all of them. As you mentioned, the clicks may be divided into several categories and can be modeled, or something like that, and I think that could be done in the future. On the other hand, we have also learned a lot from other areas such as text retrieval; for example, relevance feedback, pseudo-relevance feedback, ranking, or learning to rank; many of those ideas are borrowed from that community. So certainly cross-area interaction is very helpful, because these areas are actually interdisciplinary. On the other hand, we also try to do something more from our own speech area, for example the acoustic models, the acoustic features, and so on, and, for example, spoken dialogues; we try to follow all those good ideas and good experiences in speech and see what can be used in this scenario. And, as I mentioned, we have plenty of space to be explored in the future. Thank you very much.