Hello everyone, we're ready for the final session of our conference on discourse. I will be the session chair. The first talk is "SpaceRefNet: a neural approach to spatial reference resolution in a real city environment".
Thank you. Good afternoon, everybody. My name is Dmytro and I'm a PhD student here at KTH. Of course, as a PhD student you get to read a lot of papers, and every time the thing you can get confused by the most is the title. So I decided, when presenting mine, to make sure all of you understand what the words in the title mean.
So we start backwards, with the phrase "a real city environment", referring to the domain we're working in, which is pedestrian wayfinding in a real city. When you are in an unfamiliar city, the first thing you do is take out your smartphone and launch something like Google Maps.
And the way Google Maps and similar apps typically guide you is the same way they would guide cars: they typically present you with turn-by-turn navigation. You get a bunch of instructions presented to you on the screen, supplemented by a map with a moving marker indicating your position, and the instructions can be voiced as well. They sound like "turn right onto Valhallavägen / Route 277, then go six hundred fifty meters and turn left", which is not exactly the most natural thing you would expect from a guiding system, mainly for two reasons.
The first is that it relies mostly on quantitative data, so on cardinal directions, street names and distances, and these are exactly the things that we humans try to avoid when guiding each other. Instead, according to previous research, we tend to rely more on landmarks, that is, salient objects in the vicinity. So what we would really like to have here is a shift from turn-by-turn navigation to landmark-by-landmark navigation. The second reason is that the wayfinding process is inherently interactive, because it happens in a dialogue between two humans, so we would like to have more interaction from wayfinding systems as well. This led us to a spoken dialogue system that uses landmark-based navigation for pedestrian wayfinding.
Such a system would say instructions like "go forward until you see the fountain" or mention, say, a glass building. And if the person gets lost and says "I think I'm lost, but I see a yellow house to my right", the system should still be able to respond with something like "OK, what do you see opposite this yellow house?". To do that, the basic task is to identify that "the yellow house to my right" is really referring to something, to be able to find this geographical object, process it, and respond accordingly. This leads us to the basic task that we need to solve, that of spatial reference resolution, and this is the next phrase in the title.
So, what we're talking about here are referring expressions, that is, the words that people use when they refer to something, like those in pink over here. The objects that are meant by those referring expressions, the geographical objects, are called referents, like those with green frames. What we are interested in here are geographical referring expressions: when you're walking down the street, the objects you see are the objects we're interested in. The task of reference resolution is then defined simply as resolving referring expressions to referents.
Now, some very attentive listeners may say: wait, "that" is also a referring expression. And indeed it is, but it is a coreference: it refers to something inside the discourse, that is, to another referring expression. In this work we are interested in exophoric referring expressions, those referring to something outside the discourse, to geographical objects. Another problem here is that we can have nested referring expressions: a phrase like "the glass building with a small shop" contains "a small shop". For this particular work we decided to take the maximal referring expressions, that is, the largest ones, in case we have nested ones.
Okay, so from these examples it seems like it's pretty easy: you just take the noun phrases in your utterance and say those are the referring expressions. Is that so? Well, not really. Consider these counterexamples. In the first question, "do you know if there is a subway station nearby?", "a subway station" is a noun phrase, but it's not a referring expression, and the reason is that there is no referent: you're not meaning any specific object, just any kind of subway station, which may or may not be there; we don't know. The same goes for the two examples below.
Then the last phrase is "SpaceRefNet: a neural approach", which is, sort of, the method we're proposing here. The word "neural" might hint that there will be neural networks, and yes, in fact there will be two. And when you think about neural networks, you know they are really hungry for data, so what data did we use? We used a dataset called SpaceRef. It was collected by letting ten users walk predefined routes while basically describing what they saw, thinking aloud: "I see a red brick building over there", "I'm going down the steps", and so on and so forth. This way, one thousand three hundred and three geographical referring expressions were collected, which we're going to use for the purposes of this work.
Now, the problem of spatial reference resolution is decomposed into three stages. The first is: given a spoken utterance, identify the referring expressions in it, that is, the words that could potentially refer to something. The second step is to find the potential referents, the potential geographical objects, which we call candidates. And the third step is the resolution itself. We will go bottom to top.
When we were thinking about referring expression identification, we realized that it's actually very similar to named entity recognition, because what you need to do is just find a specific kind of phrase: named entities in one case and referring expressions in the other. Actually, named entities can themselves be referring expressions. So we thought, okay, then we can maybe borrow from, or get inspired by, the methods for named entity recognition.
We started by labeling the data in the same way as in NER, with the famous BIO labeling. The twist here is that if we have a filler word in the middle, we can label it as O and still have a referring expression, because this way we can have non-contiguous referring expressions labeled.
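The labeling scheme just described can be sketched as follows. `bio_tags` is a hypothetical helper, not from the paper, and treats membership in a given expression word set as the only criterion, which is a simplification purely for illustration:

```python
# B marks the first word of a referring expression, I the following
# words, and O everything outside, including filler words, which yields
# non-contiguous (B ... O ... I) referring expressions.

def bio_tags(tokens, expression_words):
    """Label each token B/I/O given the set of words that belong
    to the referring expression (a simplification for illustration)."""
    tags, started = [], False
    for tok in tokens:
        if tok in expression_words:
            tags.append("I" if started else "B")
            started = True
        else:
            tags.append("O")
    return tags

print(bio_tags("now i see a big uh red building".split(),
               {"a", "big", "red", "building"}))
# ['O', 'O', 'O', 'B', 'I', 'O', 'I', 'I']
```

Note how the filler "uh" gets O while the expression continues with I afterwards, giving a non-contiguous expression.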
Then we thought the method itself could also be inspired by the methods for named entity recognition, and in this case it is a neural network with the architecture shown on the right. Let's see how it works. Say we have the utterance "now I see a big red building". The first thing we do is pad it to a fixed width, because we work with fixed-width tensors.
Then we feed it word by word to RefNet. Every word is first encoded with a fifty-dimensional GloVe word embedding; these are pre-trained, we have downloaded them. Of course there are out-of-vocabulary cases, mostly Swedish street names in our case, and for those we encode character-level information using a character-level bidirectional RNN. The reason we use a BiRNN is that Swedish street names tend to have patterns, endings like "-vägen" or "-gatan", meaning road or street. We are hoping that this small BiRNN can identify those patterns, so that we have some kind of information for the words lacking GloVe vectors.
So the final encoding for every word will be the GloVe vector, then the hidden state of the forward cell of the small character-level BiRNN, and the hidden state of the backward cell. We do this for each word, so we get a sort of matrix.
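At the level of shapes, the per-word encoding just described can be sketched as below. The 50-dimensional GloVe vectors are from the talk; the character-RNN hidden size and the padding width are assumptions, and random vectors stand in for the real embeddings and RNN states:

```python
import numpy as np

# Shape-level sketch of the per-word encoding: GloVe vector plus the
# last forward and backward states of a character-level BiRNN.
GLOVE_DIM, CHAR_HIDDEN, MAX_LEN = 50, 25, 10  # MAX_LEN: fixed padding width

rng = np.random.default_rng(0)

def encode_word(word):
    glove = rng.standard_normal(GLOVE_DIM)       # pre-trained embedding (stand-in)
    char_fwd = rng.standard_normal(CHAR_HIDDEN)  # forward char-RNN state (stand-in)
    char_bwd = rng.standard_normal(CHAR_HIDDEN)  # backward char-RNN state (stand-in)
    return np.concatenate([glove, char_fwd, char_bwd])

def encode_utterance(tokens):
    padded = tokens + ["<pad>"] * (MAX_LEN - len(tokens))
    return np.stack([encode_word(t) for t in padded])

matrix = encode_utterance("now i see a big red building".split())
print(matrix.shape)  # (10, 100): one 100-dim vector per (padded) word
```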
Of course, we also want sentence-level information, so next there is a larger BiRNN to account for that, and at the end we get another matrix, which we call the sub-sentence encodings. The idea here is that for each word we want to account for the information that all the preceding words and all the succeeding words are giving. For example, for the word "big", we take the hidden state of the forward cell, which sort of encodes the information from all the preceding words, "now I see a", plus the word "big" itself, and we also take the hidden state of the backward cell, which encodes the information about all the words from the backward direction, "big red building" and the padding we have there.
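The bookkeeping behind these sub-sentence encodings can be sketched at the level of shapes. Real RNN cells are replaced here by running sums, an assumption made purely to show how a forward state per word (summarizing preceding words) and a backward state per word (summarizing succeeding words) are concatenated:

```python
import numpy as np

# Toy stand-in for the sentence-level BiRNN: running sums instead of
# recurrent cells, 7 words with a toy dimension of 4.
word_vectors = np.arange(7 * 4, dtype=float).reshape(7, 4)

fwd = np.cumsum(word_vectors, axis=0)              # state after words 0..t
bwd = np.cumsum(word_vectors[::-1], axis=0)[::-1]  # state after words t..6
sub_sentence = np.concatenate([fwd, bwd], axis=1)  # one row per word
print(sub_sentence.shape)  # (7, 8)
```

For the first word the backward state already summarizes the whole utterance, and for the last word the forward state does, mirroring how the talk reads off context in both directions.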
Why do we need to do that? Consider two examples with the word "train". If you consider only the preceding words, in both cases they are the same: "I can see a green". So when you're deciding whether this word is part of a referring expression, you have to look at the succeeding words: in the first case "station" would hopefully indicate that it is part of a referring expression, "a green train station", and in the second case "departing" would hopefully indicate that it's not. And the same holds in the other direction: the same succeeding words but different preceding words, so we can hopefully label them differently.
The part of the network that produces this sub-sentence encoding matrix we dub RefNet, and we will be using it later.
Then we feed the output of RefNet through a fully connected layer followed by a dropout, and a final softmax layer gives us this kind of matrix: for each word we get a distribution over the three labels B, I and O. We take the maximum probability, the green cells there, which gives us, sort of, the labeling. So "now I see" gets O O O, "a" gets B, which is where the referring expression starts, "big red building" gets I I I, and all the padding gets O. So "a big red building" is a referring expression.
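The decoding step just described, argmax per word over the three labels and then reading off the B/I words, can be sketched like this; the probabilities are made up for illustration:

```python
import numpy as np

# Decode the softmax output: argmax per word over (B, I, O), then
# collect the words labeled B or I as the referring expression.
LABELS = ["B", "I", "O"]
tokens = ["now", "i", "see", "a", "big", "red", "building"]
probs = np.array([  # one (B, I, O) distribution per word (illustrative)
    [0.1, 0.1, 0.8], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7],
    [0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1],
])
tags = [LABELS[i] for i in probs.argmax(axis=1)]
expression = [t for t, tag in zip(tokens, tags) if tag in ("B", "I")]
print(expression)  # ['a', 'big', 'red', 'building']
```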
When it comes to evaluation, what do we consider a positively labeled data point? We are interested only in those cases where the whole referring expression is labeled correctly, so if only a part of the referring expression is labeled, we say it's not correct, like in the second case over there. But we also noticed that there are cases where you have filler words in between. We label O for the filler words, but the network sometimes tends to put I there, and it's a pity to count that as a wrong example directly, since it's something that could be fixed with post-processing. So we introduced the notion of partial precision and recall. We say a data point is partially correct if the labeled referring expression starts at the same word, has at most one labeling error, and is longer than two words, because if you start with one word and allow one error, the criterion loses its meaning.
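The "partially correct" notion just defined can be sketched as a small check. `partially_correct` is a hypothetical helper built only from the three conditions stated in the talk (same start word, at most one label error, longer than two words); the exact scoring in the paper may differ:

```python
# Partial-correctness check over per-word BIO label sequences.

def partially_correct(gold, pred):
    """True if pred starts at the same word as gold, the gold expression
    is longer than two words, and there is at most one label error."""
    gold_start = gold.index("B") if "B" in gold else -1
    pred_start = pred.index("B") if "B" in pred else -1
    if gold_start == -1 or gold_start != pred_start:
        return False
    length = sum(1 for t in gold if t in ("B", "I"))
    if length <= 2:
        return False
    errors = sum(1 for g, p in zip(gold, pred) if g != p)
    return errors <= 1

# A filler word mislabeled I instead of O is one error, so still
# partially correct:
print(partially_correct(["O", "B", "I", "O", "I"], ["O", "B", "I", "I", "I"]))
```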
Right. Then we have the baseline that we're comparing with, and this is the most natural baseline you can think of: just take the noun phrases. You have an utterance, you parse it to get all the noun phrases, and you say those are the referring expressions. Let's look at the results we got. RefNet performs better than the baseline.
But that is not the most interesting result. Partial precision and recall are much better than full precision and recall, which indicates that if we get more data we will probably get much better full precision and recall as well, so we think the whole architecture has potential.
The second step is finding the candidates, the geographical objects. For that we use OpenStreetMap, specifically two types of objects: ways, representing mostly streets and buildings, and nodes, representing points of interest, like, say, a fountain or a statue. The way we construct the candidate set is: all the objects that you can see from the point where you are standing. So let's say you are standing over there. We know the direction you are walking in by just taking the bearing from your coordinates five, ten and fifteen seconds before. Then we look within a radius of one hundred meters, from minus one hundred to plus one hundred degrees, and take all the objects visible. So in this case you get those objects over there, and on average you have thirty-three such objects in the candidate set.
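The candidate-set construction just described can be sketched as below. As a simplifying assumption, flat local x/y coordinates in meters replace GPS latitude/longitude, and the heading comes from averaging the positions a few seconds earlier, as in the talk:

```python
import math

# Keep objects within 100 m and within +/- 100 degrees of the heading.
RADIUS_M, HALF_ANGLE_DEG = 100.0, 100.0

def heading(prev_positions, current):
    """Direction of travel, from the average of earlier positions to now."""
    px = sum(p[0] for p in prev_positions) / len(prev_positions)
    py = sum(p[1] for p in prev_positions) / len(prev_positions)
    return math.degrees(math.atan2(current[1] - py, current[0] - px))

def candidate_set(current, prev_positions, objects):
    h = heading(prev_positions, current)
    visible = []
    for name, (ox, oy) in objects.items():
        dx, dy = ox - current[0], oy - current[1]
        dist = math.hypot(dx, dy)
        bearing = math.degrees(math.atan2(dy, dx))
        diff = (bearing - h + 180) % 360 - 180  # signed angle difference
        if dist <= RADIUS_M and abs(diff) <= HALF_ANGLE_DEG:
            visible.append(name)
    return visible

# Walking "north" (+y): a fountain 50 m ahead is kept; an object behind
# and one 500 m away are dropped.
visible = candidate_set((0.0, 0.0),
                        [(0.0, -15.0), (0.0, -10.0), (0.0, -5.0)],
                        {"fountain": (0.0, 50.0),
                         "behind": (0.0, -50.0),
                         "far": (0.0, 500.0)})
print(visible)  # ['fountain']
```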
Then we encode each object in the following way. First, we have four hundred and twenty-seven automatically derived binary features from OpenStreetMap. They were derived by considering OSM tags, both tag keys and values; a typical tag could be "building" with the value "university", and this would get one of those slots, a one among the zeros over there. We also take a normalized distance feature, and a normalized sweep, sweep being how much of your visual field this particular object occupies, which we divide by three hundred and sixty degrees.
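The candidate encoding just described can be sketched as below. The tiny slot vocabulary and the distance normalizer are assumptions for illustration; the real feature set has 427 binary slots:

```python
# Binary slots for OSM tag keys and values, plus normalized distance
# and normalized sweep (share of the 360-degree visual field).
TAG_VOCAB = ["building", "university", "amenity", "fountain", "highway"]

def encode_candidate(tags, distance_m, sweep_deg, max_dist_m=100.0):
    seen = set(tags.keys()) | set(tags.values())
    binary = [1.0 if slot in seen else 0.0 for slot in TAG_VOCAB]
    return binary + [distance_m / max_dist_m, sweep_deg / 360.0]

vec = encode_candidate({"building": "university"},
                       distance_m=50.0, sweep_deg=36.0)
print(vec)  # [1.0, 1.0, 0.0, 0.0, 0.0, 0.5, 0.1]
```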
So this is the second network, as promised. It's called SpaceRefNet, and it operates on pairs, the pairs of a referring expression and a candidate. For example, we have the referring expression "the bus station" and a candidate set, which here is just three objects, because it's hard to fit thirty-three on the slide. It starts by encoding "the bus station" using the RefNet encoder as we've seen before, getting the sub-sentence encoding matrix. It takes the last hidden state of the forward cell and the first hidden state of the backward cell and concatenates them, and this is the representation of the referring expression. Then it takes a candidate, since we are operating on pairs of referring expression and candidate, the first candidate in this case, represented with those OSM features, distance and sweep, as we've seen a couple of slides before. Then we concatenate all of those, put it through a fully connected layer, and have a final sigmoid prediction, since this is a binary classification. We get a label between zero and one, or rather zero or one, with zero meaning that the referring expression and the candidate do not match, and one meaning that they do. So resolving one referring expression involves, on average, solving thirty-three binary classification problems. After we've done that, hopefully the first candidate is labeled as a referent for this referring expression, because it is indeed a bus station. Then we do the same with the second candidate and the third, and hopefully those are labeled as non-referents.
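The resolution step just described, one binary decision per (referring expression, candidate) pair, can be sketched as below. `toy_score` is a stand-in for SpaceRefNet's sigmoid output, with made-up values:

```python
# One thresholded binary decision per candidate; candidates scoring
# above the threshold are returned as referents.

def resolve(expression, candidates, score, threshold=0.5):
    """Return the candidates classified as referents of `expression`."""
    return [c for c in candidates if score(expression, c) >= threshold]

def toy_score(expression, candidate):
    # Pretend network outputs; a real scorer would run the pair
    # (expression encoding, candidate features) through the network.
    fake_sigmoid = {"bus_station": 0.9, "house_7": 0.2, "kiosk_3": 0.1}
    return fake_sigmoid[candidate]

referents = resolve("the bus station",
                    ["bus_station", "house_7", "kiosk_3"], toy_score)
print(referents)  # ['bus_station']
```

Because each pair is decided independently, ambiguous expressions can yield several referents, and an expression can also yield none, exactly as shown later in the demo.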
Now, what kind of baseline do we have to compare to? A pretty straightforward one, the first thing you could think of. Take a referring expression, like "a very nice big park", split it by spaces, lowercase it and remove stopwords, and you get a set of words like "nice", "big", "park". Then you look at the OpenStreetMap tags for every candidate, and if any word from this set appears in either a tag key or a tag value, we say it's a match; otherwise it's not a match.
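The baseline just described can be sketched in a few lines; the stopword list here is illustrative, not the one actually used:

```python
# Lowercase the referring expression, drop stopwords, and call it a
# match if any remaining word appears among a candidate's OSM tag keys
# or values.
STOPWORDS = {"a", "an", "the", "very"}  # illustrative list

def baseline_match(expression, tags):
    words = {w for w in expression.lower().split() if w not in STOPWORDS}
    tag_terms = set(tags.keys()) | set(tags.values())
    return bool(words & tag_terms)

print(baseline_match("a very nice big park", {"leisure": "park"}))     # True
print(baseline_match("a very nice big park", {"building": "school"}))  # False
```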
And these are the results. We also compared with another method previously reported in the literature, called words-as-classifiers, and SpaceRefNet performs better than that. And now is the point where you stop sleeping, because this is probably the part everybody is waiting for: the demo. Many things can go wrong, so let's start with what works.
So, the blue dot represents my position, and I'll just put myself near the building where we are now. Then I say an utterance like "I'm standing near the university". The part in green is the work of RefNet: it found a referring expression in the utterance. Now we take the data from OpenStreetMap; these are, sort of, all the objects that are present in OpenStreetMap. We assume that we're looking north, so that is the direction we are, sort of, facing. Now we try to resolve the reference. Everything in orange denotes the candidate set, so those are the objects that have been considered by SpaceRefNet as possible referents, and the one in green denotes the actual referent, as SpaceRefNet thinks. And this is exactly the building we are in.
If we move a bit down over here and try to say "I see the fountain in front of me", the same thing happens: everything in orange is the candidate set, and the one in green is the actual fountain. So it does not only find the ways, the buildings, but also the nodes, the points of interest.
Then if we go in a bit of a different direction and say, for example, "I'm passing KTH", we see that it is also capable of finding multiple referents, because sometimes the referring expression can be ambiguous: it can be the case that you get more than one referent, and it can also be the case that you find no referent at all.
then
it of course not perfect
because you have sixty four percent precision cannot be perfect so let's see
where is a perfect
if i say something where i'm standing
and the bits
cool on the mean things
right
have
so it somehow for some reason it selects as part of the street
so i mean some streets not all of them we don't know why yet but
this is also to the research question for us
to understand why in this case it selected like something like eight object "'cause" the
streets are actually not to contain the contiguous objects some for some reason an open
street map the street there just
stored as bit sort of just part of the streets and of the one contiguous
trees which makes
definitely our job harder
Right. Let's try one more, somewhere here. We say, for example, "I see the church". In some cases it does not actually identify the object, although the church is up there; you can probably see the cross. And if we come a bit closer, it still doesn't work. Right, okay, of course it doesn't. When did it work? It was weird: for example, if you say "I see the church entrance", then it works. So it's sort of sensitive, and we don't know why yet, but this raises a number of research questions to be addressed in the future. Thank you very much.
Thank you very much for the very interesting talk. We have time for questions, so we don't fall asleep again. Are there any questions?
Thank you for your talk, that was great. I was wondering: in the earlier slides you had an example where the person said "now I see..." and then there was an explicit reference to the object or the building. What I was wondering is: can you handle purely anaphoric references, like if the person had just said "now I see it"?

No. It is exophoric references that we're handling in this paper; we consciously excluded anaphoric references, because we think it's a separate problem requiring a separate dataset.

Okay, great.
While we wait for more questions, I would like to ask one, going back to that earlier slide. What happens if the object is far away? Do you consider the distance? I mean, for some object like a church, the user might say "I see a church" when it is a really salient landmark, but also at quite a long distance.

That is true. As I said previously, the way we, sort of, form the candidate set is that we take a fixed radius of one hundred meters, so if the object is really far away and the user refers to it, we will currently not be able to track it unless we increase the radius.

I hope that answers the question.
Thank you for the talk. In the demo you showed a couple of examples in which the utterance was, you know, "I am near the university", and then there was another one with KTH, where the referents are not... they are large geographical objects, right? But then, especially in the first example, the referent you resolved was the building we are in. I'm just wondering if you can speculate on how these sorts of references can be handled. They are really context-dependent, right? For "the university" you identified that building, but actually I might be at the corner of the campus and still be near the whole university in a sense, right?
True. The first thing is, again, that our radius is one hundred meters, so we cannot get the whole university; that is the first limitation we have. The second is that, again, this was more to show the imperfection of the system rather than its perfection. Actually, when you said you see the university and it identified this building, it was just one building, because we are in this building, but really there is also the building on the right-hand side, which it didn't identify. So this is more to show that, you know, it's imperfect; it has sixty-one percent precision, so we still have some way to go.
Okay, and how would you improve it?

Okay, that is a good question. The obvious thing is to collect more data, try to train the same thing, and see if it works. The second thing might be, you know, to only take as candidates those objects that are in the immediate vicinity, but that will probably be harder, because it will become computationally infeasible, I guess. You still need to have some notion of visibility, to identify which of the objects you can potentially refer to, and for that you have to run collision computations: you have a line of sight, and you check whether it collides with a specific obstruction that blocks the vision. I don't know if that answers the question; probably not, it seems like a difficult one. We can discuss it later, because we are sort of running out of time here, but we can take it offline, right?

Okay, thank you.
Thanks. I think that's all the time we have for this session, so let's thank the speaker again.