that's right tree full column and weakness migrated ones are introduced

we use word from a distance from time spectrum modeling one recognition

and she's also of interest can

that's what they should trust

a huge

you see over you know trying to you is rover bachelor's and master's

operations research and industrial engineering

no you can do not one which passes spoken by what sort of quite a

long time and the

i'm happy to be able to introduce are also your colleague of solution is to

open laboratories with really and to mention risk

and she is going to speak about interpreting spoken referring expressions empirical studies

right

and have thank you

good morning

and thanks for having me here

i will be don't know how down there

challenges that arise for interpreting spoken referring expressions in physical settings

i will be grabbing the presentation in my own icsi the system but they don't

and yesterday to where some challenges mentioned already so why are we all of the

end of some of my

so

this is the three

well that was the dream in nineteen sixty two

for those of you who remember the jetsons

and the dream was okay may example there

there we have to be these days

he's actually better than the green

actually because the woman in presence of

and i don't know if adding more actually achieve the conversational capabilities that we want

to but i

if move

like every are it will be achieved

so

one of the channel is

so and that's a little

and i do anything but for their share that computers the robot or something think

be reasoned say that on the code rate of the but they have like resampling

and the message result may still day

it because if you are in there is a reasonable in there is anything k

engine their appropriate for us

and what exactly trust probably just

you know when to what we need and you know not

in each okay

so you have different one interaction is in it that

so how this is a fixed

that i got challenges of first of all evaluation

we might be able to provide policies and sorted they actually

we thank you challenge

i read a novel

we don't trust

in addition from a game theoretic point of view

these are i

five favourite challenge is

in addition of questions yesterday

so we need to be able to deal with perceptual complexity

and i will illustrate these challenges shortly

we need to be able to be with linguistic phenomena such as signal addressee and

you would be

then there is inaccuracy and it is not just asr error

but also position error or

several papers yesterday discuss the thai patient

and finally we need to integrate directly probably the i-th knowing about something may help

you figure

i

so first of all perceptual complexity

so well i

so

i see

by the way this is that one and one female prime minister

we have ueller

from the by

handy

that is the difference in the training right flowers and the right

but the lexical when you talk about three vowels is actually more security

so that has to be that we that's where

what are talking about i

we can talk about a large a the small a

there are a factor in smaller than this more bass

so sizes because either in context

no

in addition we have topological relations which are spatial relations

well carolina

so in this example the oranges

and the ball

and or infeasible

no even day

okay on the left the position

the one

or just one

no okay

in the

in the okay

the orange is still in the bowl

but in the okay

on this i

the orders is null

thank

a

you want to say the orange is in the bowl even though it's not really inside it

and the explanation the psychological explanation is that is related to one for

if you move the bowl you move the orange

but if you do a geometric calculation the orange is not in the bowl

on

so in this wow

i is very clear global or and here the plan on the wall but

horizontally on the war ok

a picture

now we also have projective relations

which specify a particular direction from a landmark

so we have a dc you're still far from being too

and the last but back to the right of the day

we try to see you also directly

so it's another

i tend to congregate

okay that can referring expressions

now from the point of view of linguistic phenomena

we have enough data c

i mean i

well they want a thread and the reward is more

it was to do it sort of teen

we have on you know anybody will be in a

additional with

then there is the problem of prepositional phrases

so we have

the

a few e

because we don't know if the back to the lack of the side of you

know the plan the lamb

but not as shown in our case we have

which more

i do you get it will

even if you identify all the possible attachments at the end of the

day it doesn't matter because there is only one flower however

this is not the case in this example

well

in the case

it's the table that's near the lack what is your the flat or near that

this is to be

and yes people do that

asr error or out-of-vocabulary words

so all of these are

someone manufactured example

it is not entirely vol all the flower on the table

it is that

that would be maxent

you can

something that we on the table and this happens when people who are usually and

the main

one worked out of it can even make one or are often and all before

the user can be added there is a get because a status but no will

not come up before right

but this is just to illustrate the sort of affection from

at this time ever saw can result in our vocabulary word

and of course again vision errors

the

make the situation even

so what we want to do

we have a framework for spoken language understanding that handles these phenomena

hey

this is the store in we aim to handle the picture will or

g is the average since upon this is due to the left of the table

then we have that are also we have side scott are an example of what

little

and then it very precise description prepositional phrase

so what we want to talk about

and a few slides and one of about this interpretation process each of you know

and then i believe that our approach

then we describe

the results were right now response generation can have a chart

so this is the set of problems small

you to anybody of the speech recognizer

then some syntactic analyses in

then you may going to show my or my

so

the speech way speech recognizers such as we will now

in my o of such errors

these ones

you can always speech recognizers are really bad mode

it

after the syntactic and i is the

but also lengthening and live apart

to produce

but

and then you one semantics and but i

so if you do we in two stages of semantic interpretation for the robot

what i e

again every that about on the table again

doors the mappings are here the relation my and or

and that's prepended is wider rc

and we have label a cop not they're not in the table shows for this

particular scene

there are not be

i didn't you all table one

so this is an interpretation that is grounded

so what if we

so

the first we consider this model that i just described

well

okay

like the standard role in

we found it was insufficient

so we consider alternative interpretations

why everyone provide a system for five in a just one used to be the

base

so the little amount stage process where stages of my has not the patient

the addressee we don't want to start local maxima might not be what appears to

be a based interface

so we have a stochastic optimization process where we probabilistically revisit different stages

okay we want to rank

the different interpretations so we need some way to estimate their probability

at me about being used only the recognition is speakers the

so this is illustrated our approach

the first thing we do you and you like this waterfall roles what we call

we so we have some of the presentation

and then we

products i we i

we don't they should also try

we different stages probabilistically in we can continue and you see

it's not null and of my there

that is one and one

so i don't completion officer and i

we assert that looks like

now what we want is to estimate the probability of these

all their relations

and

we just apply bayes rule

sure if you basically with a given set my impression that this implies that all

day

no context can be anything i story

and i don't history i mean at the moment is the rule more data

and

we need like to ask for my i don't know so i want to make

more complicated but

imagine that are

think that problem is formulated from i know

so all then it is worth this problem

the first one comes directly from the speech recognizer whose scores we use as probabilities a number

between zero and one

parser generates parsers are real users probably e

here

we favour simple interpretations so the simpler the better
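
[editorial sketch] The scoring described above can be illustrated roughly as follows. This is a minimal Python sketch, assuming that an interpretation's overall score is the product of a recognizer score, a parse probability, a prior that favours simpler interpretations, and a score for how well the interpretation matches the scene. The class name, field names, decay value, and example data are all hypothetical and are not taken from the system described in the talk.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Candidate:
    """One candidate interpretation of a spoken description (hypothetical structure)."""
    text: str              # ASR hypothesis
    asr_score: float       # recognizer score in [0, 1], treated as a probability
    parse_prob: float      # probability assigned by the parser
    num_nodes: int         # size of the semantic interpretation (assumed complexity measure)
    grounding_prob: float  # how well the interpretation matches objects in the scene


def simplicity_prior(num_nodes: int, decay: float = 0.9) -> float:
    """Assumed prior: the fewer nodes an interpretation has, the higher its value."""
    return decay ** num_nodes


def score(c: Candidate) -> float:
    """Combine the per-stage scores by multiplication (an independence assumption)."""
    return prod([c.asr_score, c.parse_prob,
                 simplicity_prior(c.num_nodes), c.grounding_prob])


def rank(candidates):
    """Return candidates sorted from most to least probable."""
    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    mug = Candidate("the blue mug on the table", 0.6, 0.5, 4, 0.8)
    mud = Candidate("the blue mud on the table", 0.7, 0.5, 4, 0.1)
    for c in rank([mud, mug]):
        print(f"{score(c):.4f}  {c.text}")
```

In this toy example the hypothesis with the lower recognizer score still ranks first because it matches the scene much better, which is the point of scoring all stages jointly rather than committing to the top ASR output.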

and

this is the more there are what we get the problem

so let's illustrate this so what we have this argument of j o

this is a crime and what we want you know that

is how well each of the prime i really am i and my

the corresponding to my

so in the first one

we have a problem

that it

you will designate got three by that are not by the colour blue

then it is

well that's that relation location or could be designated by

the provisional

and whether or not goal table one

that

in addition

one who assigned a probably be i mean on the well

wow so we can see the models can you on the world and everybody these

buttons them on kind of work

over the table to be than the problem is

shell

i just a continuation of the problem but you make some simplifying assumption

so

the remote will eat corpus to the user and able to refer to

it does and of are more or fess okay why this or something that all

and it really ambitious

and he thought would have a robot and the mobile

be able to walk around the room and both

and we won one whole the role of all you see a actions that the

we you of the time i want to get a better

so that's why we make this assumption

in addition each object is

in a more label

and then his sound

the next life and deletions will assume that each object region

so it may be circumscribed by a block each object is a single and

but we have another and that's no way to from the speakers in y because

if an object is able to it

the problem the speaker is referred to we explore the and or not

so we calculate is probably e

so this is all technology channel we got a doing the learning

will improve

so you

the lexical similarity was calculated using a wordnet similarity function

that are similar to what is calculated using a particular function you one i

and

exactly about ten percent you system or changing current system origins in

similar

so long as you probably you know what it was reported e

how similar to you

the

but we are

in this i

we probably you got me

dean you know

and this was only by comparing the exercise for the bottom row

we

this is all

a be consider the

and if you're curious we used at a constant

so

we have topological relations

so the most interest while he's

where we have a function that what the is nice

represent we should for large

i hope to continue for another way that's order to be in near each other

so we have right

i'm not sure

that is done anything that they lack the thing like that and between the flower

the baseline

but what i say that these were in here

these two are not

so our function reflects this intuition

and finally relations between your sentence frame of reference

which means that

you know there may be also

we adopted it will be adopted the point of view that we are able he

where interview speaker

so this is the plan that means the right okay or speak

so

these where

this is a short overview of what i

so what can i don't think so far what we know

so this is the case where we have audience participation

so i'll

therefore it play a little the microwave

which one

the

you can sample

the time course

need only my yes can second guess we here

but none of the missile

okay about the case

but

the one okay again

i mean in do you have three factors system

that is

now i really

the label y is what we are some participants describe

in this every the screen so what the intended it is actually one it is

easy well i

okay

i want to find humour

so well this is

so the okay a

this project is a few years all

so i

our speech recognizer was really giving us a lot of errors

we were using the microsoft speech api before deep learning

so what we decided we have some e

about it and e so all error correction for the speech recognizer

so what we need

each we had some steps

it is more like of course incorporated into are lower

so we had to record speech recognition errors one but i think error correction

it was a preprocessing step and robot error correction the possible across the things

and yes

now that you have been speech recognizer the impact of this it is floor

but especially what

marian discussed yesterday maybe kind of thing hand

so that the semantic error correction

in this was like every year

we propose gently words ripley's or words that have expect i'm expect the boxes

so you are described in all you get the bar in

that can expect

so use a generic were replayed

however more than we replace the

all of the problem you the new word i in a remote location so probably

be a really planet

the probability of those on a five of the problem you do not ever so

we don't around just replacing work we don't lie you have to read to make

a replace

so this is the right for example here

this is really a

we will light on the back wall

then we guess what the person actually

but

that's what they meant

but that's what we're to build played the bus stop right interpretation

so well

if we

me

i five times in the end of that side of their own set

so all

we replace you that i don't think that this is really okay

but you only have a few scenes on the cable

it's better

then

okay so no

this is what we start right now we have all these i

in america okay i and say

from one can i

which one that models like late

but only from this guy gonna different places

so no okay

it's play invented for their instead of everything that

so

i

so that's what we've done

and because one of my favourite sergeant's and she's performance me

so first i will describe the corpus

twenty six point six r d c back

a native english speakers counter and it is but i will resonate adopted for images

and we had a hundred and forty one descriptions

now this is the asr performance

and you would be split into a similar experiment we will a

so you see they difference in what we head

there but it hears signal

and we will now

so we had a word error rate of thirty percent okay keep

in mind that this is an older version of the microsoft speech api
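
[editorial sketch] For readers unfamiliar with the measure, word error rate is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The Python sketch below shows that standard computation. The function name and the example strings are illustrative assumptions, and this is not the evaluation code used in the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("the flower on the table", "the flour on the table"))
```

Note that a single misrecognized content word, as in this example, yields a low word error rate while still changing which object the description refers to.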

and the only fourteen percent for the asr interpretations of the top around one for

all right

what is what will now where the rate of the top ranked interpretation thirteen and

a

but

still a real

so the resulting images that we shall i participants

and some location for designed for example in this one

each requires that all here it should i don't know that have anything it is

there

so we believe it uses and parts of speech

in this work but we have seen as

so okay

we got the image and call it and we want and

car

as well as positions

this one particular

because they can use color size is it or

basically you before loading a project you've relations

and then just like real is i

what

where they had to describe the

so no

just some characterization of what people the

in terms of known it

there you know that were somewhere out of vocabulary

so not just speech recognition error but words like that you words like model with

the

and they're gonna do not and then you will see

we may

is there

we distinguish two types of one

why are descriptions

max at least one interpretation in every respect

any perfect descriptions means max k

so for a in prior description they come from multiple interpret it

so these tasks i for our core well about three or four or eight

and then apply that wordperfect in there was only one possible right side and that

makes sense

then sixty percent without which means that we're several reference mask perfectly

and then we had to kind of thing accuracy

and where only one object matches ending perfect remote one will do not depend

now performance metrics

again i'm going back to the ideal result how we wanna make explore the interpretation

he's reasonable so yes but gold standard annotation

by we my

a perfect match

like

contrary to what

this is a popular nowadays the screen

not address yesterday you say okay i is all words in the list x and

y

sorry the object but the wall

i don't care much percent of the request just retrieve the roles e

so a perfect match not present such as

it's a severe heart because at the end of the day

you want all you know

if you wanted and role

so little or no but anyways for everything you want to understand perfect what

well

in addition

we want to know if we probably their projects like you will see what problem

if you use a live recording okay

right of the roll can be a really no particular range

she'll

the roundness constantly as one unit profile of our systems that well

so what you

we have the right

a two

and

and we have the probably the deceased in this kind of the this at the

top right of the replace your

matches

the user's intention so this would be

all day however the bottom right meaning it's wrong

so it in this killer graph

they refer the reader is referred by the system

if at all

and then we have a second one is the green one and then you have

more probable one

which one

is small

so for this for probable one

you mean one and everybody three quarters of the brown

not give a great

so our main metrics

are three core which is actually recall

where we is not always fractional round balls location

to do it would probably interpretations

and ndcg which was defined by järvelin and kekäläinen

a in the

why does what side of the fraction that are reward

you'd also or a discount lower right

it right stand recognition does not have lower right but dct a

the normalization component that i

you divide whatever this is thinking about we'd like here

by the score of an optimal ranking

where you're based on the beam was not the goal i think the situation where

you are more advanced up right one

so

you by like the score of the option and then you
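
[editorial sketch] The normalization step just described, dividing the discounted score of the system's ranking by the score of an optimal ranking, can be sketched in Python as follows. This uses the commonly seen log2(rank + 1) discount, which is an assumption here; the talk does not state which discount or gain values were used, and the example gain lists are hypothetical.

```python
from math import log2


def dcg(gains):
    """Discounted cumulative gain: lower-ranked items contribute less."""
    return sum(gain / log2(rank + 1)
               for rank, gain in enumerate(gains, start=1))


def ndcg(gains):
    """DCG of the given ranking divided by the DCG of the best possible ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # gains[i] is 1 if the interpretation at rank i + 1 matches the intended object.
    print(ndcg([1, 0, 0]))  # correct interpretation ranked first
    print(ndcg([0, 1, 0]))  # correct interpretation ranked second
```

With binary gains, a correct interpretation at rank one gives the maximum value, and the value drops as the correct interpretation moves down the list.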

so how do

and we did okay that's the short version but i

syllable is not actually

it's not like that or

that in our money left labelled c is

that's

better than that will allow okay

if we use their predictions that's not very interesting about that for all i k

so we might better now there is a reasonable that we have more than three

and e c g is not into one

by one or two but with a prayer

but this surprising is a use rc replacement but in a war

that

but it would be why the problem replacement pretty or does not

that's certainly not second guessing

so that a surprise

okay

let's go on to response generation

this is more control

a popular problem yees select part in particular that features such as a as a

side so that okay

for the current approach is used on the fact

there is only one acceptable

but the main more than one

maybe we will and stuff

so the goal of this last part of the result was first of all learn

what context of a response to

the weather instead we rely different schools this

and whether we

distinguish between what did you in but like our two

i think we all on the reason like a microwave

but you want your what you agree in that my there but we

not sure maybe you want you're able to be more sources than you

so the design of y

we compare the refer to convert a relations in two ways

so

you just added over from

we assume the ones that are based on the i it

we have all been we want you did they are able but that's the robot

can find at the end of that

we consider for response i

which means just what to do so on

a tool which means a

it is eager wire between v two or three k entries phrase by phrase level

of a whole

don't be a different way

and we can see what we have conducted one experiment anywhere in the process of

combat and the second experiment

so far in the first experiment we got artist incorrect responses

a silence of what's

so well i guess i want to solve a

because there are the asr

we one relay

well of the asr be

people can guess really where would you are

and i known

we train the classifier to produce acceptable responses

and okay you use a score

you're the first experiment using a

so all we

thirty five participants some of which were still from the one experiment

describe the same okay

we got

that and seventy five descriptions in to draw a little right

so you see when it is likely by not nsu and well

asr performance is all the previous slide

word error rate was only thirteen percent and by

jointly of the requested object at least are also asr errors in indulging section driver

the landmark search

so you have something that will enhance the back the

the correct ones and also interesting

and you can guess can you guess what people say

yes

and

like

larger

okay then we got it

a simple or false

where p c where a

how this all for a so i

for someone else's lazily or max a

based solely on l two

the dialogue policy and the results

and for this experiment with four participants again

both with

so this is still in the participants were show

and or something but not all the objects on

and that was all again mentioned about five

for us

and

you can see that they're talking about

yes

yes

it

in this and then that would be used in

four options to a value that is that for the purposes of this presentation participants

were not so that it is that there were a total of four intraframe

but what is it but it's a huge rooms one score

and then

for the first response is a number

from

so if you are going to fix

the request at all

in which all

so

we don't sell all and

we train some classifiers

we the trained and you are able just database and two side guy

it side

indeed it is not bad because there wasn't enough

so

influential features where they can see that the third problem efficiently

if you know that the performance you have one about nine percent of your updated

is okay

so the eventual users use percent of

wrong words in the asr how do we know words are we have a classifier

that

that's which works well

and you will be sold disease

not all right their predictions that this you are scored

so it would someday

i se

score all

locations already i meaning the task force between requires in all day

and in the u number of out-of-vocabulary words

so this is also

what we consider all the board but

to is dangerous here january english native new this

where

useful and recall and f-score of seventy four

so we were coming from

what the participants were common

this is

then

see here i all the data

and

we got

and the score of nine two

so that can be something here with the system so this is not fair

but i is the

or what you from his this is user rate and preferences are the big

so what is the main inside yes people based on the differently in fact

this is an extreme example because if you know more participants in this experiment in

the previous call also we had used very able to work on the exact same

so

again

any other

yes i placed on the right of the right

and

this is

what are participants

we saw what parts and say okay the ones that come from

one possible scores phrase

no you have the sack part

this is what the user was described

so

being courses is not about

okay so i o k v c r challenge is

is the bottom right

so we need to do

first of all we need to deal with real scenes

our case where there were constructed using all three tool

it sounds great but their sin

and eighty somewhere so at least i hereby are

i can be this work but we re scenes but that were causing some problem

got its own problems

because it can be very frustrating that kind of all

car is being

so that are and so that an

have a paper addresses some of the other problem

then that i

and i like one of the texture

she

that's

about okay so

frames of reference

there are lots of frames of reference speaker oriented hearer oriented or absolute

in c

but in the basic frame of reference in the fate

the front of your lips easily the front of your data doesn't matter course there

so

also you can be all frames of reference s b one seen that and incorporated

into interpretation

and context positional relation

the left of the front of the table doesn't something that somebody is

linguistic phenomena hold it is the white or by nicole all the weak lexical stimuli

yesterday there was a presentation about out of vocabulary words

and more work has to be done about inaccuracy in u e

perceptual i a busy

yes asr grammar scale a problem in something better problems in

v error or

i don't know in this is but you know that

is still not there's no

user adaptation

which all the different people to use right reference s

and this adaptation is to be

but what are trying to understand what people say

in this case and the way people the or there are so a sign a

nation

it also response generation

before this is why in different ways

some people prefer the system should be able just seeing something record

we need to integrate all i and

i the overall view all the interpretation rules to not

if you while seeing

we know how are preferred interpretation right context of other of c e

evaluation we need a system is reasonable

and

what i

because lack of trust

these you

we perform human evaluations yes we don't like a mass

and

we must do not based once the result here

so we need to be quite different interpretations are closing the need to you swatting

italians in different interpretations on can ask

appropriate questions

and is used in this i will tell when it does not know wow

in a just

you see response

so

that's about it and i'm thanking all the people

who ever worked on this problem

and then you

with

i'm going to disappoint either

just looking around

there was no okay so you just look around

what meanwhile

but it is very minimal

we want it then all singing all bands and rowboat that

so we also there are

and we had these make them where we would match access to say for example

where

you can extra exam seldom in a

right of the ball or i might i heard correctly

so what but that one by the board because

reality check and we start the referring expressions are mainly

looking for things around a

i

okay a

the standard names of the

what you are and you would the

goal for

rock and category is if you one but we were very low just to name

and then one side of the wordnet for see now i might

but that was the idea of done

there's a turn

right one

i'm like i'm not in the kitchen

why don't like okay

and if i didn't or anything like a and by or not

like

there i think my house and i one and then use them

so yes i mean it's

you would contextual i

but

what if i want the flow and identically

exactly but i would be one i mean one of the sound while we are

appropriate

what we're not appropriate

so

where context and i mean exactly what

we will now that

however in this case it you are

model

on work like

i was actually haven't all possible problem i was saying is that star flower like

flower

or

our car phone or things like i a lot of normally i want to anything

other than flower

so there is

in my contextual i think that kind of like second guessing the person towards the

call

the commentary

i mentioned it is something that training with how much context relative scale

for i mean we can prove our

a slider direction problem h

hasn't been the used by lee

at the moment thing to get this unit but that are instantly

well at some point of that is

long or you have phone

and

i mean that we know why people thinking

only when they were not restrict that would just a

whatever why the point that about twenty percent of the time or

there are

so that we are going to be point

they tend to become more me now we can get that

but definitely i mean whatever right okay

that's

why didn't yourself

and that goes to the definition part in fact the there was a paper yesterday

but

an hour

the ones

kind of limited in the interpretation for by already spoken the colour

and then we using your also

is that

five around one

the that if you need for every

about that a but there are a

so i have to do this in the problem doesn't performance for

so that doesn't surprise me some point i mean

but maybe we should

the fees we are now whether we have several problems a minus right and i'll

and their be assigned to me

so how much for down within therefore it is

exactly

it's all could have an and

that there probably but when we saw those with a in can see that the

the main aim at ever or is in your in great deal in you don't

get much mileage out three

they

i think

you are looking at the fourth basically

it was somebody

okay like the first five better

because we try the that the dean at the beginning very ambitious constraints on the

object of the accent so we had the and

well we had a

actually or the actions for a particular case the what i think each other

and all that weighted by the board when we had every and six of the

i-th class

in some but once

yes definitely the four

one of ten

vol

and

and likewise if you have particular we are not sure whether they're the syllable or

goal

then

you will go back and constraint of our off

but as i said we had to know where r

and okay what the user is embedded in the very large one the

i don't know what to say to make a

what

well but the way we can design cation that we listened my only there

so estimate only relative the thing mean segmentation for

and it was incorrect hundred percent of the anybody problems with the problem and better

than that of

right

so the only thing a lot of there is if you live semantic role labeling

and you and that the thing that only or did

you really don't they can be more

this is what the you know

if there is still are a bit not like war

band

you know that c

if you

at some point get to know that you don't know

the things that the

well as the semantic in our case the semantic role labeling there was trained on

a referring expression with the various don't expect even when it's all of our paper

segment mostly in the right place but you have a

very briefly that saying it's and the expectations would be much better

i cannot

i denote better success there but for referring expressions was quite well

you mean just for the five or

well for the parse tree we got indicted from what they were trying

three

it wasn't from portals like to thank you but if one of them somebody sitting

or whatever

it is reached their maxima this work the lexical my

at all of the sixteen year and by

no like can go like

i plan

and that it is are then you get the pay to get like the score

of a second we don't like little recall

it's time for mapping but you get the very low score for that matter

that that's why we don't think that environment and that's why at home or two

two we review fire and

the slogan of efficiency

so

you know that a framework

okay let's have a coffee break