hello everyone, we're ready for the final session of the conference, on discourse. I will be your session chair. The first talk is "SpaceRefNet: a neural approach to spatial reference resolution in a real city environment".

Thank you, and thanks to everybody. So, my name is Dmytro and I'm a PhD student here at KTH, and of course, as a PhD student, you get to read a lot of papers, and every time the thing you get confused by the most is the title. So I decided, when presenting mine, to make sure all of you understand what the words in the title mean.

So we start backwards. "A real city environment" refers to the domain we're working in, which is pedestrian wayfinding in a real city. And when you are in an unfamiliar city, the first thing you do is take out your smartphone and launch something like Google Maps.

And the way Google Maps and similar apps typically guide you is the same way they would guide cars: typically they present you with some kind of turn-by-turn navigation, so you get a bunch of instructions on the screen, supplemented by a map with a moving marker indicating your position. And the instructions can be voiced as well.

And they would sound something like "turn right onto Lidingövägen / Route 277, then go six hundred and fifty metres and turn left", which is not exactly the most natural thing you would expect from a guiding system, mainly for two reasons.

The first one is that it relies mostly on quantitative data, so on cardinal directions, street names and distances, and these are exactly the things that we humans try to avoid when guiding each other. Instead, according to previous research, we tend to rely more on landmarks, so salient objects in the vicinity. So what we would really like to have here is a shift from turn-by-turn navigation to landmark-by-landmark navigation.

The second reason is that the wayfinding process is inherently interactive, because it happens in a dialogue between two humans. So we would like to have more interaction from wayfinding systems as well, which led us to the idea of a spoken dialogue system that uses landmark-based navigation for pedestrian wayfinding.

Such a system would say instructions like "go forward until you see the fountain and the glass building with some sculptures", and if the person gets lost and says "I think I'm lost, but I see a yellow house to my right", the system should still be able to respond with something like "OK, and what else do you see, apart from this yellow house?"

Now, in order to do that, the basic task is to identify that "the yellow house to my right" is really referring to something, then to be able to find this geographical object, process it, and form the response accordingly. Which leads us to the basic task that we need to solve, that of spatial reference resolution, and this is the next phrase in the title.

So, what we're talking about here are referring expressions, that is, those words that people use when they refer to something, like the ones in pink over here. Then the objects that are meant by those referring expressions, the geographical objects, are called referents, like those with green frames. And what we're interested in here are street-level referring expressions: when you're walking down the street, whatever you see, those are the objects we're interested in. And then the task of reference resolution is defined simply as resolving your referring expressions to their referents.

Now, some very attentive listeners might say: wait, there is also "that" over there, and it is also a referring expression. And indeed it is, but this is a coreference: it refers to something that is inside the discourse, namely another referring expression. Whereas in this work we are going to consider only exophoric referring expressions, so those referring to something outside the discourse, to the geographical objects. And then another problem here is that we can have nested referring expressions: so here we have "the glass one down there", which refers to this small shop, and for this particular work we decided to take the maximal referring expressions, so the largest ones, in case we have nested ones.

OK, so from these examples it seems like it's pretty easy: you just take the noun phrases in your utterance and say those are the referring expressions. Is that so? Well, not really. Consider these examples. In the first one, "do you know if there is a subway station nearby?", "a subway station" is a noun phrase, but it's not a referring expression, and the reason is that there is no referent: you're not meaning any specific object, just any kind of subway station; it can be there or it can not, we don't know. And the same goes for the two examples below.

Then the last phrase is "SpaceRefNet, a neural approach", which is the method we're proposing here, and the word "neural" might hint to you that there will be neural networks. Yes indeed, there will be two. And when you think about neural networks, you think that they are really hungry for data, so what data did we use? We used a dataset called SpaceRef, and it was collected by letting ten users walk predefined routes and basically describe the walks while thinking aloud, so things like "I see a red brick building over there", "I'm going down the steps", and so on and so forth. This way, one thousand three hundred and three geographical referring expressions have been collected, which we're going to use for the purposes of this work.

So now, the problem of spatial reference resolution is decomposed into three stages. The first one is: once we have a spoken utterance, we want to identify the referring expressions in it, so those words that could potentially identify something. The second step is to find the potential referents, the potential geographical objects, which we call candidates. And the third step is the resolution itself. So we go bottom to top.

So, when we were thinking about referring expression identification, we realised that it's actually very similar to named entity recognition, because what you need to do is just find a specific kind of phrase: named entities in one case and referring expressions in the other. Well, actually, named entities can also be referring expressions. So we were thinking, OK, then maybe we can borrow or get inspired by the methods for named entity recognition. And we started by labelling the data in the same way, namely with the famous BIO labelling. And what's nice here is that if we have, say, a filler word in the middle, we can label it as O and still have one referring expression, because it seems that we can have non-contiguous referring expressions labelled this way.
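Just to make this concrete, here is a small made-up illustration in Python (my own example, not taken from the SpaceRef data) of a BIO labelling where an O-labelled filler sits inside one referring expression, and of how the tokens could be grouped back into expressions:

# Hypothetical BIO-labelled utterance: the filler "uhm" gets O, but the
# surrounding B-REF/I-REF tokens still form one (non-contiguous) expression.
tokens = ["now", "i", "see", "a", "big", "uhm", "red", "building"]
labels = ["O", "O", "O", "B-REF", "I-REF", "O", "I-REF", "I-REF"]

def extract_referring_expressions(tokens, labels):
    expressions, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B-REF":
            if current:
                expressions.append(current)
            current = [tok]
        elif lab == "I-REF" and current:
            current.append(tok)
        # an O inside an open expression is treated as a filler and skipped
    if current:
        expressions.append(current)
    return expressions

print(extract_referring_expressions(tokens, labels))
# [['a', 'big', 'red', 'building']]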

And then we were thinking that the method could also be inspired by the methods for named entity recognition, and our answer in this case is a neural network with the architecture on the right. So let's see how it works. Say we have the utterance "now I see a big red building". The first thing we do is pad it to a fixed width, because we work with fixed-width tensors. Then we feed it to the network word by word, and every word is first encoded with a fifty-dimensional GloVe embedding; those are pre-trained, we have simply downloaded them. Of course there are out-of-vocabulary cases, and in our case those are mostly Swedish street names, and for those we encode character-level information using a character-level bidirectional RNN. The reason why we're using a BiRNN is that Swedish street names tend to have these bits at the end, like "-vägen" or "-gatan", meaning roughly "the road" or "the street", so we're hoping that this small BiRNN can identify those patterns and give us some kind of information even for the words lacking GloVe vectors.

So then the final encoding for every word is the GloVe vector, then the hidden state of the forward cell of the small character-level BiRNN, and the hidden state of its backward cell. We do this for every word, so we get this sort of matrix. Now, of course, we also want to have sentence-level information there, so next there is a larger BiRNN to account for sentence-level information, and at the end we get another matrix, which we call the sub-sentence encodings. The idea here is that for each word we want to account for the information that all the preceding words and all the succeeding words are giving. So, for example, for the word "big" we take the hidden state of the forward cell, which sort of encodes the information from all the preceding words, so "now I see a", and we also take the hidden state of the backward cell, which encodes the information about all the words in the backward direction, so "big red building" and the number of paddings we have there.

Why do we need to do that? Well, let's consider two examples with the word "green". If you consider only the preceding words, in both cases they are the same, "you can see a" and "you can see", so when you're deciding whether this word will be part of a referring expression, you have to look at the succeeding words: in the first case "train station" would hopefully indicate that this is part of a referring expression, "the green train station", and in the second case "departing" would indicate that it's not, hopefully. And the same happens with the word "station", but in the other direction: there we have the same succeeding words, but the preceding words are different, so we can hopefully label them differently.

Now, the part of the network that produces this sub-sentence encoding matrix we dub RefNet, and we will be using it again later. So then we feed the output through a fully connected layer followed by a dropout, and then a final softmax layer, which gives us this kind of matrix: word by word, we get a distribution over the three labels, B-REF, I-REF and O. Then we take the maximum probability, which you see in green there, and that gives us the labelling. So "now", "I" and "see" get O, at "a" we get B-REF, so this is where the referring expression starts, then "big", "red" and "building" get I-REF, and all the paddings get O. So "a big red building" is a referring expression.
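Just to make the wiring concrete, here is a minimal PyTorch-style sketch of such a tagger: a GloVe vector plus character-level BiRNN states per word, a sentence-level BiRNN producing the sub-sentence encodings, and a softmax over the three labels. The class name, the GRU cells and the layer sizes other than the GloVe dimension are my own illustrative assumptions, not our exact implementation:

import torch
import torch.nn as nn

class RefNetTagger(nn.Module):
    # Illustrative reconstruction of the BIO tagger described above.
    def __init__(self, glove_dim=50, char_vocab=100, char_dim=16,
                 char_hidden=25, sent_hidden=64, n_labels=3):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        # character-level BiRNN: its final states can pick up suffix patterns
        # such as the Swedish "-vägen" or "-gatan"
        self.char_rnn = nn.GRU(char_dim, char_hidden,
                               bidirectional=True, batch_first=True)
        word_dim = glove_dim + 2 * char_hidden
        # sentence-level BiRNN producing the sub-sentence encodings
        self.sent_rnn = nn.GRU(word_dim, sent_hidden,
                               bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(2 * sent_hidden, n_labels)

    def forward(self, glove_vecs, char_ids):
        # glove_vecs: (batch, n_words, glove_dim); char_ids: (batch, n_words, n_chars)
        b, w, c = char_ids.shape
        _, h = self.char_rnn(self.char_emb(char_ids.view(b * w, c)))
        char_feat = torch.cat([h[0], h[1]], dim=-1).view(b, w, -1)
        word_enc = torch.cat([glove_vecs, char_feat], dim=-1)
        sse, _ = self.sent_rnn(word_enc)       # sub-sentence encodings
        logits = self.out(self.dropout(sse))   # per-word scores over B-REF / I-REF / O
        return logits.log_softmax(dim=-1)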

When it comes to evaluation, what do we consider a positively labelled data point? We are interested only in those cases where the whole referring expression is labelled, so if only a part of the referring expression is labelled, we say it's not correct, like in the second case over there. But then we also noticed that there are cases where you have filler words in between, and we label those with O, for the filler words, but the network sometimes tends to put I-REF there, and it would be a pity to count that as simply wrong, because such an example is directly usable, or at least usable with a bit of post-processing. So we introduce the notion of partial precision and recall. We say a data point is partially correct if the labelled referring expression starts at the same word as the true one, has at most one labelling error, and is longer than two words, because if you start with a one-word expression and still allow one error, that becomes trivial.
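A small sketch of how this partial-match check could be implemented; this is just my reading of the three criteria, with hypothetical argument names:

def is_partially_correct(gold_labels, pred_labels, start, end):
    # gold_labels / pred_labels: per-token label lists for one utterance;
    # [start, end) is the span of the gold referring expression.
    gold_span = gold_labels[start:end]
    pred_span = pred_labels[start:end]
    if len(gold_span) <= 2:              # short expressions must match exactly
        return pred_span == gold_span
    if pred_span[0] != gold_span[0]:     # must start at the same word
        return False
    errors = sum(g != p for g, p in zip(gold_span, pred_span))
    return errors <= 1                   # at most one labelling error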

Right. Then we have the baseline that we're comparing with, and this is the most natural baseline you can think of: just take the noun phrases. You have an utterance, you parse it to get all the noun phrases, and you say those are the referring expressions.
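Such a noun-phrase baseline is easy to sketch with an off-the-shelf parser; here is an illustration with spaCy, which may or may not be the tool we actually used:

import spacy

nlp = spacy.load("en_core_web_sm")   # any English model with noun chunks will do

def np_baseline(utterance):
    # treat every noun phrase (noun chunk) as a referring expression
    return [chunk.text for chunk in nlp(utterance).noun_chunks]

print(np_baseline("now I see a big red building next to the fountain"))
# picks up chunks like 'a big red building' and 'the fountain', but it would
# also pick up 'a subway station' in "is there a subway station nearby",
# which is a noun phrase but not a referring expression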

Let's see what results we got. So, RefNet performs better than the baseline, but this is not the most interesting result: the partial precision and recall are much better than the exact precision and recall, which indicates that, probably, if we get more data we will get much better performance even in exact precision and recall, so the whole architecture has the potential to improve.

The second step is finding the candidates, the geographical objects, and for that we use OpenStreetMap, specifically two types of objects in OpenStreetMap: ways, representing mostly streets and buildings, and nodes, representing points of interest, like, say, a fountain somewhere or a statue. Now, the way we construct the candidate set is that it contains all the objects that you can see from the point where you are standing. So let's say you are standing over there; then we know the direction you are walking in by just taking your coordinates five, ten and fifteen seconds before, and then we look in a radius of one hundred metres, from minus one hundred to plus one hundred degrees, and we call those objects visible. So in this case you get those objects over there, and on average we have thirty-three such objects in the candidate set.
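A geometric sketch of this candidate selection could look as follows; I'm assuming simple planar coordinates in metres and a flat list of objects, which is of course a simplification of the real OpenStreetMap geometries:

import math

def heading(prev_pos, cur_pos):
    # walking direction in degrees, estimated from an earlier GPS fix
    # (e.g. the coordinates five, ten or fifteen seconds before)
    return math.degrees(math.atan2(cur_pos[1] - prev_pos[1],
                                   cur_pos[0] - prev_pos[0]))

def visible_candidates(objects, pos, walk_heading,
                       max_dist=100.0, half_angle=100.0):
    # objects within 100 m whose bearing is within +/- 100 degrees of the heading;
    # `objects` is a list of (osm_id, (x, y)) pairs in metres
    candidates = []
    for osm_id, (x, y) in objects:
        dist = math.hypot(x - pos[0], y - pos[1])
        if dist > max_dist:
            continue
        bearing = math.degrees(math.atan2(y - pos[1], x - pos[0]))
        diff = (bearing - walk_heading + 180.0) % 360.0 - 180.0
        if abs(diff) <= half_angle:
            candidates.append(osm_id)
    return candidates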

Then each object is encoded in the following way. First, we have four hundred and twenty-seven automatically derived binary features from OpenStreetMap, and the way they were derived is by considering OpenStreetMap tags, both keys and values. A typical tag could be "building" with the value "university", and this would get one of those slots, a zero or a one over there. We also take a normalised distance feature, and a normalised sweep, with sweep being how much of your visual field this particular object occupies, divided by three hundred and sixty degrees.
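The mechanics of that encoding are roughly as follows; the real feature list has four hundred and twenty-seven automatically derived entries, so the short list here, and the way the distance is normalised, are only my illustrative assumptions:

# illustrative tag features; the real list has 427 automatically derived entries
TAG_FEATURES = [("building", "university"), ("building", "residential"),
                ("amenity", "fountain"), ("highway", "footway")]

def encode_candidate(osm_tags, distance_m, sweep_deg, max_dist=100.0):
    # binary OSM tag features + normalised distance + normalised sweep,
    # where sweep is how much of the visual field the object occupies
    tag_vec = [1.0 if osm_tags.get(key) == value else 0.0
               for key, value in TAG_FEATURES]
    return tag_vec + [distance_m / max_dist, sweep_deg / 360.0]

print(encode_candidate({"building": "university"}, distance_m=40.0, sweep_deg=90.0))
# [1.0, 0.0, 0.0, 0.0, 0.4, 0.25]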

So, this is the second network, as promised. It's called SpaceRefNet, and the idea here is that it operates on pairs, so on pairs of a referring expression and a candidate. For example, we have a referring expression, "the bus station", and a candidate set, which is just three objects here because it's hard to put thirty-three on the slide. It starts by encoding "the bus station" using RefNet, as we've seen before, to get the sub-sentence encodings; then it takes the last hidden state of the forward cell and the first hidden state of the backward cell and concatenates them, and this is the representation of your referring expression. Then it takes a candidate, since we're operating on pairs of referring expression and candidate, the first candidate in this case, represented with those OpenStreetMap features, distance and sweep, as we've seen a couple of slides before. Then we concatenate all of those, put it through a fully connected layer, and have a final softmax, or rather sigmoid, prediction, since this is a binary classification, and we get a label between zero and one, or rather zero or one: zero meaning that the referring expression and the candidate do not match, and one meaning that they do.

So resolving one referring expression involves, on average, thirty-three binary classification problems to solve, and after we've done that, hopefully the first candidate is labelled as one, as a referent for this referring expression, because it is indeed a bus station. Then we do the same thing with the second and the third candidates, and hopefully those get labelled as zero.
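Sketching that pairwise classifier and the resolution loop in the same illustrative PyTorch style (the sizes and names are my assumptions: the referring-expression vector is taken to come from the encoder sketched earlier, and the candidate vector is the 427 tag features plus distance and sweep):

import torch
import torch.nn as nn

class SpaceRefNetHead(nn.Module):
    # binary classifier over one (referring expression, candidate) pair
    def __init__(self, re_dim=128, cand_dim=429, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(re_dim + cand_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, re_vec, cand_vec):
        # concatenate the two representations and predict a match probability
        return torch.sigmoid(self.mlp(torch.cat([re_vec, cand_vec], dim=-1)))

def resolve(re_vec, candidate_vecs, head, threshold=0.5):
    # one referring expression against ~33 candidates = ~33 binary decisions
    matches = []
    for idx, cand_vec in enumerate(candidate_vecs):
        if head(re_vec, cand_vec).item() >= threshold:
            matches.append(idx)          # labelled 1: a predicted referent
    return matches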

Now, what kind of baseline do we have to compare to? A pretty straightforward one, the first thing you could think of. You take a referring expression, like "a very nice big park", you split it by spaces, lowercase it and remove stopwords, which gives you a set of words: "very", "nice", "big", "park". Then you look at the OpenStreetMap tags of every candidate, and if any word from this set, which we got in the first step, appears in either a tag key or a tag value, we say it's a match; otherwise it's not a match.
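That word-overlap baseline fits in a few lines; the stopword list and the tags below are simplified stand-ins for illustration:

STOPWORDS = {"a", "an", "the", "of", "to", "is"}   # simplified stopword list

def word_overlap_match(referring_expression, osm_tags):
    # match if any remaining word of the expression appears in an OSM tag key or value
    words = {w for w in referring_expression.lower().split() if w not in STOPWORDS}
    tag_words = set()
    for key, value in osm_tags.items():
        tag_words.update(str(key).lower().split())
        tag_words.update(str(value).lower().split())
    return bool(words & tag_words)

print(word_overlap_match("a very nice big park", {"leisure": "park"}))
# True -- 'park' appears as a tag value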

And these are the results. We also compared with another method previously reported in the literature, called words-as-classifiers, and SpaceRefNet performs better than that.

Which is the point where you should stop sleeping, because this is probably the fun part for everybody: the demo. And many things can go wrong, so let's see what works.

So, the blue dot represents my position, wherever I am. I'll just put myself near the building where we are now, and then I say an utterance like "I am standing near the university". There. What you see in green is the work of RefNet: we found a referring expression in the utterance.

Now we take the data from OpenStreetMap; these are the sort of outlines of all the objects that are present in OpenStreetMap. We assume that we're looking north, so this is the direction we're sort of heading in, and now we try to resolve the reference. There we go. Everything in orange denotes the candidate set, so these are the objects that have been considered by the system as possible referents, and the one in green denotes what SpaceRefNet thinks is the actual referent, and this is exactly the building we are in.

So, if we move a bit down over here and try to say "I see the fountain in front of me", and do the same trick, we again see everything in orange as the candidate set, and the one in green there is indeed the actual fountain, which shows that it does not only find the ways, so the buildings, but also the nodes, the points of interest.

Then, if we go a little bit in a different direction and say "I am passing KTH", for example, we see that it is also capable of finding multiple referents, because sometimes, out of context, your referring expression can be ambiguous, so it can be the case that you get more than one referent, especially if you give a vague referring expression.

But then, of course, it's not perfect; with sixty-four percent precision it cannot be perfect, so let's see where it is not perfect. If I stand somewhere over here and say something about where I'm standing... right, so for some reason it also selects parts of the street. Some streets, not all of them, and we don't know why yet; this is also a research question for us, to understand why in this case it selected something like eight objects. Because the streets are actually not contiguous objects: for some reason, in OpenStreetMap the streets there are just stored as bits, sort of just parts of the street, instead of one contiguous street, which definitely makes our job harder.

Right, let's try one more, somewhere here. And we say "I see the church", for example. In some cases it does not actually identify the object, although the church is up there; you can probably see the cross. And if we come a bit closer... it still doesn't work. Once more... right, OK, of course it still doesn't. So when did it work? Because it was very odd: for example, if you say "I see the church entrance", then it works. So it's sort of sensitive, and we don't know why yet, but this raises a number of research questions that we will address in the future. Thank you very much, and now to your questions.

Thank you very much for the very interesting talk. Let's have some questions, so we don't fall asleep again. Any questions?

Hi, thank you for your talk, that was great. I was wondering: in the earlier slides you had an example where the person said "now I see..." and then there was an explicit reference to the object or the building, and I was just wondering, can you handle purely anaphoric references, like if the person had just said "now I see it"?

No, it's just the exophoric references that we are handling in this paper; we consciously excluded anaphoric references because we think it's a separate problem that deserves separate treatment.

OK, great.

While we wait for other questions, I'd like to ask one myself, going back to the parameters, like the radius: what happens if the object is farther away than the distance you consider? I mean, for some object like a church or something, the user might say "I see the church" simply because it is a really salient landmark, even though it is not within that rather short distance.

That is true. So, as I said previously, the way we define the candidate set is that we take a fixed radius, of one hundred metres in this case, so if the object is really far away and the user refers to it, then, unless we increase the radius, the system will not be able to track it correctly. I hope that answers the question.

Thank you for the talk. In the demo you showed a couple of examples in which, you know, it was "I am near the university", and then there was another one with KTH, where these denote large geographical objects, right? But then, especially in the first example, the reference was resolved to the building we are in, and I was just wondering if you can speculate on how these sorts of references can be handled; they are really context-dependent, right? So, "the university": you identified that building, but actually if I am, like, in the corner of the campus, I am near the whole university in a sense, right?

True. So, the first thing is, again, that our radius is one hundred metres, so we cannot get the whole university; that is the first limitation we have. The second thing is that, again, this was more to show the imperfection of the system rather than the perfection. So actually, when you say "I am standing near the university" and it identified this building, it was just one building, because we are in this building, but really there is also the building on the right-hand side, which it didn't identify. And this is more to show that, you know, it's imperfect, it has sixty-one percent precision, so we still have some way to go.

OK, and how would you improve it?

OK, so, I'm not sure I can answer that precisely, but the obvious thing is to collect more data, try to train the same thing and see if it works. And the second thing might be, you know, probably not to take only those objects that are in the immediate vicinity, but this will probably be harder, because it will become computationally infeasible, I guess. Because, you know, you still need to have some notion of visibility, to identify which of the objects you can potentially refer to, and for this you have to run visibility computations: you have line-of-sight rays, you check whether they collide with a specific object, and only then is the object visible, and so on. I don't know if that answers the question; probably not, it seems.

Maybe we can continue this later, because we sort of have a time limit here, but we can take it offline, if that's all right.

OK, thank you.

Thanks, I think that's all the time we have for this session, so please let's thank the speaker again.