that's right tree full column and weakness migrated ones are introduced
we use word from a distance from time spectrum modeling one recognition
and she's also of interest can
that's what they should trust
a huge
you see over you know trying to you is rover bachelor's and master's
operations research and industrial engineering
no you can do not one which passes spoken by what sort of quite a
long time and the
i'm happy to be able to introduce are also your colleague of solution is to
open laboratories with really and to mention risk
so much closer to speak about interpreting spoken referring expressions empirical studies
right
and have thank you
good morning
and thanks for having me here
i will be don't know how down there
challenges that arise for interpreting spoken referring expressions in physical settings
i will be grabbing the presentation in my own icsi the system but they don't
and yesterday to where some challenges mentioned already so why are we all of the
end of some of my
so
this is the dream
well above the dream in nineteen sixty two
and the for those of you more for the jetsons
and the dream was okay may example there
there we have to be these days
he's actually better than the green
actually because the woman in presence of
and i don't know if adding more actually achieve the conversational capabilities that we want
to but i
if move
like every are it will be achieved
so
one of the channel is
so and that's a little
and i do anything but for their share that computers the robot or something think
be reasoned say that on the code rate of the but they have like resampling
and the message result may still day
it because if you are in there is a reasonable in there is anything k
engine their appropriate for us
and what exactly trust probably just
you know when to what we need and you know not
in each okay
so you have different one interaction is in it that
so how this is a fixed
that i got challenges of first of all evaluation
we might be able to provide policies and sorted they actually
we thank you challenge
i read a novel
we don't trust
in addition from a game theoretic point of view
these are i
five favourite challenge is
in addition of questions yesterday
so we need to be able to deal with perceptual complexity
and i will illustrate shortly the to these challenges
we need to be able to deal with linguistic phenomena such as signal addressee and
you would be
but it's not gonna see it is not just asr error
but also position error or
several papers yesterday discuss the thai patient
and finally we need to integrate directly probably the i-th knowing about something may help
you figure
i
so noticeable for perceptual complexity
so well i
so
i see
by the way this is that one and one female prime minister
we have ueller
from the by
handy
that is the difference in the training right flowers and the right
but the lexical when you talk about three vowels is actually more security
so that has to be that we that's where
what are talking about i
we can talk about a large a the small a
there are a factor in smaller than this more bass
so sizes because either in context
no
in addition we have topological relations which are spatial relations
well carolina
so in this example the oranges
and the ball
and or infeasible
no even day
okay on the left the position
the one
or just one
no okay
in the
in the okay
the orange the scale in the bowl
but in the okay
on this i
the orders is null
thank
a
you want to say the origins in the old even though it's not that well
and the explanation the psychological explanation is that is related to one for
if you move the ball we wouldn't the order
but you know it humidity calculation the audience is not in the water
on
so in this wow
i is very clear global or and here the plan on the wall but
horizontally on the war ok
a picture
now we also have projective relations
which specify a particular direction from a landmark
so we have a dc you're still far from being too
and the last but back to the right of the day
we try to see you also directly
so it's another
i tend to congregate
okay that can referring expressions
now from the point of view of linguistic phenomena
we have enough data c
i mean i
well they want a thread and the reward is more
it was to do it sort of teen
we have on you know anybody will be in a
additional with
in the to the problem that prepositional phrases
so we have
the
a few e
because we don't know if the back to the lack of the side of you
know the plan the lamb
but not as shown in our case we have
which more
i do you get it will
even if you identify all the possible and you need at the end of the
day it doesn't matter because there is only one flower however
this is not the case in this example
well
in the case
it's the table that's near the lack what is your the flat or near that
this is to be
and yes people do that
asr error or out-of-vocabulary words
so all of these are
someone manufactured example
it is not entirely vol all the flower on the table
it is that
that would be maxent
you can
something that we on the table and this happens when people who are usually and
the main
one worked out of it can even make one or are often and all before
the user can be added there is a get because a status but no will
not come up before right
but this is just to illustrate the sort of affection from
at this time ever saw can result in our vocabulary word
and of course again if fusion errors
the
make the situation even
so what we want to do
we have no framework for spoken language understanding in this phenomena
hey
this is the store in we aim to handle the picture will or
g is the average since upon this is due to the left of the table
then we have that are also we have side scott are an example of what
little
and then it very precise description prepositional phrase
so what we want to talk about
and a few slides and one of about this interpretation process each of you know
and then i believe that our approach
then we describe
the results were right now response generation can have a chart
so this is the set of problems small
you to anybody of the speech recognizer
then some syntactic analyses in
then you may going to show my or my
so
the speech way speech recognizers such as we will now
in my o of such errors
these ones
you can always speech recognizers are really bad mode
it
after the syntactic and i is the
but also lengthening and live apart
to produce
but
and then you one semantics and but i
so if you do we in two stages of semantic interpretation for the robot
what i e
again every that about on the table again
doors the mappings are here the relation my and or
and that's prepended is wider rc
and we have label a cop not they're not in the table shows for this
particular scene
there are not be
i didn't you all table one
so this is an interpretation that is grounded e how we have
so what if we
so
the first we consider this model that i just described
well
okay
like the standard role in
we found it was insufficient
so we consider alternative interpretations
why everyone provide a system for five in a just one used to be the
base
so the little amount stage process where stages of my has not the patient
the addressee we don't want to start local maxima might not be what appears to
be a based interface
so we have a stochastic optimization process where we provide security different stages
okay we want to right
the different interpretations so we need somebody ways to make their problem
at me about being used only the recognition is speakers the
so this is illustrated our approach
the first thing we do you and you like this waterfall roles what we call
we so we have some of the presentation
and then we
products i we i
we don't they should also try
we different stages probabilistically in we can continue and you see
it's not null and of my there
that is one and one
so i don't completion officer and i
we assert that looks like
now we one o is estimated probably these
all their relations
and
we just apply bayes rule
sure if you basically with a given set my impression that this implies that all
day
no context can be anything i story
and i don't history i mean at the moment is the rule more data
and
we need like to ask for my i don't know so i want to make
more complicated but
imagine that are
think that problem is formulated from i know
so all then it is worth this problem
the first one comes directly from the speech recognizer scores we use probabilities so it is a number
between zero and one
parser generates parsers are real users probably e
here
we favour or simple interpretation sell the urinal the better
and
this is the more there are what we get the problem
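The factors listed above (a recognizer score between zero and one, a parse probability, a preference for simpler interpretations, and a match against the scene) can be combined roughly as follows. This is a minimal sketch, not the speaker's implementation: the independence assumption, the function and parameter names, the `simplicity` constant and the example candidates are all hypothetical.

```python
def interpretation_score(p_asr, p_parse, p_match, n_nodes, simplicity=0.9):
    """Score one candidate interpretation as a product of stage probabilities.

    p_asr    -- probability of the recognized text given the speech signal
    p_parse  -- probability of the parse given the text
    p_match  -- probability that the grounded objects match the description
    n_nodes  -- size of the interpretation; smaller is preferred
    All names and the simplicity prior are assumptions for illustration.
    """
    for name, p in (("p_asr", p_asr), ("p_parse", p_parse), ("p_match", p_match)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {p}")
    # Each extra node multiplies the score by a constant below one,
    # so simpler interpretations are favoured.
    prior = simplicity ** n_nodes
    return p_asr * p_parse * p_match * prior


def rank_interpretations(candidates):
    """Sort (label, stage_values) pairs from most to least probable."""
    return sorted(candidates, key=lambda c: interpretation_score(*c[1]), reverse=True)


# Hypothetical candidates: (label, (p_asr, p_parse, p_match, n_nodes))
candidates = [
    ("plant near table1", (0.3, 0.8, 0.2, 2)),
    ("plate on table1", (0.6, 0.8, 0.9, 2)),
]
best_label = rank_interpretations(candidates)[0][0]
```

Keeping every candidate with its score, rather than committing to the best output of each stage, is what lets a later stage overturn an early mistake.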
so let's illustrate this so what we have this argument of j o
this is a crime and what we want you know that
is how well each of the prime i really am i and my
the corresponding to my
so in the first one
we have a problem
that it
you will designate got three by that are not by the colour blue
then it is
well that's that relation location or could be designated by
the provisional
and whether or not goal table one
that
in addition
one who assigned a probably be i mean on the well
wow so we can see the models can you on the world and everybody these
buttons them on kind of work
over the table to be than the problem is
shell
i just a continuation of the problem but you make some simplifying assumption
so
the remote will eat corpus to the user and able to refer to
it does and of are more or fess okay why this or something that all
and it really ambitious
and he thought would have a robot and the mobile
be able to walk around the room and both
and we won one whole the role of all you see a actions that the
we you of the time i want to get a better
so that's why we make this assumption
in addition each object is
in a more label
and then his sound
the next life and deletions will assume that each object region
so it may be circumscribed by a block each object is a single and
but we have another and that's no way to from the speakers in y because
if an object is able to it
the problem the speaker is referred to we explore the and or not
so we calculate is probably e
so this is all technology channel we got a doing the learning
will improve
so you
the lexical similarity was calculated using a wordnet similarity function
that are similar to what is calculated using a particular function you one i
and
exactly about ten percent you system or changing current system origins in
similar
so long as you probably you know what it was reported e
how similar to you
the
but we are
in this i
we probably you got me
dean you know
and this was only by comparing the exercise for the bottom row
we
this is all
a be consider the
and if you're curious we used at a constant
so
we have topological relations
so the most interest while he's
where we have a function that what the is nice
represent we should for large
i hope to continue for another way that's order to be in near each other
so we have right
i'm not sure
that is done anything that they lack the thing like that and between the flower
the baseline
but what i say that these were in here
these two are not
so our function reflects this intuition
and finally relations between your sentence frame of reference
which means that
you know there may be also
we adopted it will be adopted the point of view that we are able he
where interview speaker
so this is the plan that means the right okay or speak
so
these where
this is a short overview of what i
so what can i don't think so far what we know
so this is the case where we have audience participation
so i'll
therefore it play a little the microwave
which one
the
you can sample
the time course
need only my yes can second guess we here
but none of the missile
okay about the case
but
the one okay again
i mean in do you have three factors system
that is
now i really
the label y is what we are some participants describe
in this every the screen so what the intended it is actually one it is
easy well i
okay
i want to find humour
so well this is
so the okay a
this project is a few years all
so i
our speech recognizer was really giving us a lot of all
we were using the microsoft speech api before deep learning
so what we decided we have some e
about it and e so all error correction for the speech recognizer
so what we need
each we had some steps
it is more like of course incorporated into are lower
so we had to record speech recognition errors one but i think error correction
it was a preprocessing step and robot error correction the possible across the things
and yes
now that you have been speech recognizer the impact of this it is floor
but especially what
marian discussed yesterday maybe kind of thing hand
so that the semantic error correction
in this was like every year
we propose gently words ripley's or words that have expect i'm expect the boxes
so you are described in all you get the bar in
that can expect
so use a generic were replayed
however more than we replace the
all of the problem you the new word i in a remote location so probably
be a really planet
the probability of those on a five of the problem you do not ever so
we don't around just replacing work we don't lie you have to read to make
a replace
so this is the right for example here
this is really a
we will light on the back wall
then we guess what the person actually
but
that's what they meant
but that's what we're to build played the bus stop right interpretation
so well
if we
me
i five times in the end of that side of their own set
so all
we replace you that i don't think that this is really okay
but you only have a few scenes on the cable
it's better
then
okay so no
this is what we start right now we have all these i
in america okay i and say
from one can i
which one that models like late
but only from this guy gonna different places
so no okay
it's play invented for their instead of everything that
so
i
so that's what we've done
and because one of my favourite sergeant's and she's performance me
so first describe the corpus
twenty six point six r d c back
a native english speakers counter and it is but i will resonate adopted for images
in we had a hundred and forty one descriptions
now this is the asr performance
and you would be split into a similar experiment we will a
so you see they difference in what we head
there but it hears signal
and we will now
so the word error rate was thirty percent
keep in mind that this is an older version of the microsoft speech api
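For reference, word error rate is usually computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition. The function name and the example strings are illustrative, and the talk does not say which scoring tool was actually used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
wer = word_error_rate("the flower on the table", "the flour on the table")
```

Because insertions count as errors, the rate can exceed one when the hypothesis is much longer than the reference.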
and the only fourteen percent for the asr interpretations of the top around one for
all right
what is what will now where the rate of the top ranked interpretation thirteen and
a
but
still a real
so the resulting images that we shall i participants
and some location for designed for example in this one
each requires that all here it should i don't know that have anything it is
there
so we believe it uses and parts of speech
in this work but we have seen as
so okay
we got the image and call it and we want and
car
as well as positions
this one particular
because they can use color size is it or
basically you before loading a project you've relations
and then just like real is i
what
where they had to describe the
so no
just some characterization of what people the
in terms of known it
there you know that were somewhere out of vocabulary
so not just speech recognition error but words like that you words like model with
the
and they're gonna do not and then you will see
we may
is there
we distinguish two types of one
why are descriptions
max at least one interpretation in every respect
any perfect descriptions means max k
so for a in prior description they come from multiple interpret it
so these tasks i for our core well about three or four or eight
and then apply that wordperfect in there was only one possible right side and that
makes sense
then sixty percent without which means that we're several reference mask perfectly
and then we had to kind of thing accuracy
and where only one object matches ending perfect remote one will do not depend
now performance metrics
again i'm going back to the ideal result how we wanna make explore the interpretation
he's reasonable so yes but gold standard annotation
by we my
a perfect match
like
contrary to what
this is a popular nowadays the screen
not address yesterday you say okay i is all words in the list x and
y
sorry the object but the wall
i don't care much percent of the request just retrieve the roles e
so a perfect match not present such as
it's a severe heart because at the end of the day
you want all you know
if you wanted and role
so little or no but anyways for everything you want to understand perfect what
well
in addition
we want to know if we probably their projects like you will see what problem
if you use a live recording okay
right of the roll can be a really no particular range
she'll
the roundness constantly as one unit profile of our systems that well
so what you
we have the right
a two
and
and we have the probably the deceased in this kind of the this at the
top right of the replace your
matches
the user's intention so this would be
all day however the bottom right meaning it's wrong
so it in this killer graph
they refer the reader is referred by the system
if at all
and then we have a second one is the green one and then you have
more probable one
which one
is small
so for this for probable one
you mean one and everybody three quarters of the brown
not give a great
so our main metrics
are fractional recall which is actually recall
where we is not always fractional round balls location
to deal with equiprobable interpretations
and ndcg which was defined by järvelin and kekäläinen
a in the
why does what side of the fraction that are reward
you'd also or a discount lower right
it right stand recognition does not have lower right but dct a
the normalization component that i
you divide whatever this is thinking about we'd like here
by this score of an option
where you're based on the beam was not the goal i think the situation where
you are more advanced up right one
so
you by like the score of the option and then you
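The discount and normalization described here can be sketched as follows. This uses the common log2(rank + 1) discount, which is an assumption: the talk does not state which variant of the discount or which gain values were used, and the function names are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k ranked items.

    gains[i] is the reward for the item at rank i + 1; lower ranks
    are discounted by log2(rank + 1).
    """
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the ideal ordering of the same gains."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # no correct item anywhere in the list
    return dcg_at_k(gains, k) / ideal


# The correct interpretation is ranked first, then ranked second.
top_ranked = ndcg_at_k([1, 0, 0], 3)
second_ranked = ndcg_at_k([0, 1, 0], 3)
```

Unlike plain recall at k, the score drops when the correct interpretation is present but ranked lower.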
so how do
and we did okay that's the short version but i
syllable is not actually
it's not like that or
that in our money left labelled c is
that's
better than that will allow okay
if we use their predictions that's not very interesting about that for all i k
so we might better now there is a reasonable that we have more than three
and e c g is not into one
by one or two but with a prayer
but this surprising is a use rc replacement but in a war
that
but it would be why the problem replacement pretty or does not
that's certainly not second guessing
so that a surprise
okay
let's go on to response generation
this is more control
a popular problem yees select part in particular that features such as a as a
side so that okay
for the current approach is used on the fact
there is only one acceptable
but the main more than one
maybe we will and stuff
so the goal of this last part of the result was first of all learn
what context of a response to
the weather instead we rely different schools this
and whether we
distinguish between what did you in but like our two
i think we all on the reason like a microwave
but you want your what you agree in that my there but we
not sure maybe you want you're able to be more sources than you
so the design of y
we compare the refer to convert a relations in two ways
so
you just added over from
we assume the ones that are based on the i it
we have all been we want you did they are able but that's the robot
can find at the end of that
we consider for response i
which means just what to do so on
a tool which means a
it is eager wire between v two or three k entries phrase by phrase level
of a whole
don't be a different way
and we can see what we have conducted one experiment anywhere in the process of
combat and the second experiment
so far in the first experiment we got artist incorrect responses
a silence of what's
so well i guess i want to solve a
because there are the asr
we one relay
well of the asr be
people can guess really where would you are
and i known
we train the classifier to produce acceptable responses
and okay you use a score
you're the first experiment using a
so all we
thirty five participants some of which were still from the one experiment
describe the same okay
we got
that and seventy five descriptions in to draw a little right
so you see when it is likely by not nsu and well
asr performance is all the previous slide
word error rate was only thirteen percent and by
jointly of the requested object at least are also asr errors in indulging section driver
the landmark search
so you have something that will enhance the back the
the correct ones and also interesting
and you can guess can you guess what people say
yes
and
like
larger
okay then we got it
a simple or false
where p c where a
how this all for a so i
for someone else's lazily or max a
based solely on l two
the dialogue policy and the results
and for this experiment with four participants again
both with
so this is still in the participants were show
and or something but not all the objects on
and that was all again mentioned about five
for us
and
you can see that they're talking about
yes
yes
it
in this and then that would be used in
four options to a value that is that for the purposes of this presentation participants
were not so that it is that there were a total of four intraframe
but what is it but it's a huge rooms one score
and then
for the first response is a number
from
so if you are going to fix
the request at all
in which all
so
we don't sell all and
we train some classifiers
we the trained and you are able just database and two side guy
it side
indeed it is not bad because there wasn't enough
so
influential features where they can see that the third problem efficiently
if you know that the performance you have one about nine percent of your updated
is okay
so the eventual users use percent of
wrong words in the asr how do we know words are we have a classifier
that
that's which works well
and you will be sold disease
not all right their predictions that this you are scored
so it would someday
i se
score all
locations already i meaning the task force between requires in all day
and in the u number of out-of-vocabulary words
so this is also
what we consider all the board but
to is dangerous here january english native new this
where
useful and recall and f-score of seventy four
so we were coming from
what the participants were common
this is
then
see here i all the data
and
we got
and the score of nine two
so that can be something here with the system so this is not fair
but i is the
or what you from his this is user rate and preferences are the big
so what is the main inside yes people based on the differently in fact
this is an extreme example because if you know more participants in this experiment in
the previous call also we had used very able to work on the exact same
so
again
any other
yes i placed on the right of the right
and
this is
what are participants
we saw what parts and say okay the ones that come from
one possible scores phrase
no you have the sack part
this is what the user was described
so
being courses is not about
okay so i o k v c r challenge is
is the bottom right
so we need to do
first of all we need to deal with real c
our case where there were constructed using all three tool
it sounds great but their sin
and eighty somewhere so at least i hereby are
i can be this work but we re scenes but that were causing some problem
got its own problems
because it can be very frustrating that kind of all
car is being
so that are and so that an
have a paper addresses some of the other problem
then that i
and i like one of the texture
she
that's
about okay so
frames of reference
there are lots of frames of reference speaker oriented here or the absolute
in c
but in the basic frame of reference in the fate
the front of your lips easily the front of your data doesn't matter course there
so
also you can be all frames of reference s b one seen that and incorporated
into interpretation
and context positional relation
the left of the front of the table doesn't something that somebody is
linguistic phenomena hold it is the white or by nicole all the weak lexical stimuli
yesterday there was a presentation about out-of-vocabulary words
and more work has to be done about inaccuracy in u e
perceptual i a busy
yes asr grammar scale a problem in something better problems in
v error or
i don't know in this is but you know that
is still not there's no
user adaptation
which all the different people to use right reference s
and this adaptation is to be
but what are trying to understand what people say
in this case and the way people the or there are so a sign a
nation
it also response generation
before this is why in different ways
some people prefer the system should be able just seeing something record
we need to integrate all i and
i the overall view all the interpretation rules to not
if you while seeing
we know how are preferred interpretation right context of other of c e
evaluation we need a system is reasonable
and
what i
because lack of trust
these you
we perform human evaluations yes we don't like a mass
and
we must do not based once the result here
so we need to be quite different interpretations are closing the need to you swatting
italians in different interpretations on can ask
appropriate questions
and is used in this i will tell when it does not know wow
in a just
you see response
so
that's about i'm thinking all the people
ever worked on this problem
and then you
with
i'm going to disappoint either
just looking around
there was no okay so you just look around
what meanwhile
but it is very minimal
we want it then all singing all bands and rowboat that
so we also there are
and we had these make them where we would match access to say for example
where
you can extra exam seldom in a
right of the ball or i might i heard correctly
so what but that one by the board because
reality check and we start the referring expressions are mainly
looking for things around a
i
okay a
the standard names of the
what you are and you would the
goal for
rock and category is if you one but we were very low just to name
and then one side of the wordnet for see now i might
but that was the idea of done
there's a turn
right one
i'm like i'm not in the kitchen
why don't like okay
and if i didn't or anything like a and by or not
like
there i think my house and i one and then use them
so yes i mean it's
you would contextual i
but
what if i want the flow and identically
exactly but i would be one i mean one of the sound while we are
appropriate
what we're not appropriate
so
where context and i mean exactly what
we will now that
however in this case it you are
model
on work like
i was actually haven't all possible problem i was saying is that star flower like
flower
or
our car phone or things like i a lot of normally i want to anything
other than flower
so there is
in my contextual i think that kind of like second guessing the person towards the
call
the commentary
i mentioned it is something that training with how much context relative scale
for i mean we can prove our
a slider direction problem h
hasn't been the used by lee
at the moment thing to get this unit but that are instantly
well at some point of that is
long or you have phone
and
i mean that we know why people thinking
only when they were not restrict that would just a
whatever why the point that about twenty percent of the time or
there are
so that we are going to be point
they tend to become more me now we can get that
but definitely i mean whatever right okay
that's
why didn't yourself
and that goes to the definition part in fact the there was a paper yesterday
but
an hour
the ones
kind of limited in the interpretation for by already spoken the colour
and then we using your also
is that
five around one
the that if you need for every
about that a but there are a
so i have to do this in the problem doesn't performance for
so that doesn't surprise me some point i mean
but maybe we should
the fees we are now whether we have several problems a minus right and i'll
and their be assigned to me
so how much for down within therefore it is
exactly
it's all could have an and
that there probably but when we saw those with a in can see that the
the main aim at ever or is in your in great deal in you don't
get much mileage out three
they
i think
you are looking at the fourth basically
it was somebody
okay like the first five better
because we try the that the dean at the beginning very ambitious constraints on the
object of the accent so we had the and
well we had a
actually or the actions for a particular case the what i think each other
and all that weighted by the board when we had every and six of the
i-th class
in some but once
yes definitely the four
one of ten
vol
and
and likewise if you have particular we are not sure whether they're the syllable or
goal
then
you will go back and constraint of our off
but as i said we had to know where r
and okay what the user is embedded in the very large one the
i don't know what to say to make a
what
well but the way we can design cation that we listened my only there
so estimate only relative the thing mean segmentation for
and it was incorrect hundred percent of the anybody problems with the problem and better
than that of
right
so the only thing a lot of there is if you live semantic role labeling
and you and that the thing that only or did
you really don't they can be more
this is what the you know
if there is still are a bit not like war
band
you know that c
if you
at some point get to know that you don't know
the things that the
well as the semantic in our case the semantic role labeling there was trained on
a referring expression with the various don't expect even when it's all of our paper
segment mostly in the right place but you have a
very briefly that saying it's and the expectations would be much better
i cannot
i denote better success there but for referring expressions was quite well
you mean just for the five or
well for the parse tree we got indicted from what they were trying
three
it wasn't from portals like to thank you but if one of them somebody sitting
or whatever
it is reached their maxima this work the lexical my
at all of the sixteen year and by
no like can go like
i plan
and that it is are then you get the pay to get like the score
of a second we don't like little recall
it's time for mapping but you get the very low score for that matter
that that's why we don't think that environment and that's why at home or two
two we review fire and
the slogan of efficiency
so
you know that a framework
okay let's call it could have a coffee breaks into