So, the first presenter is Andrea Vanzo, so please start your presentation.
Good afternoon, everyone. My name is Andrea Vanzo, and I'm from the Interaction Lab at Heriot-Watt University. I'm going to present work we have done with Emanuele Bastianelli and Oliver Lemon about a hierarchical multi-task natural language understanding system for cross-domain conversational AI, which we call HERMIT NLU.
Now, natural language understanding is quite a wide concept. Most of the time, when we talk about it in conversational AI and dialogue systems, it refers to the process of extracting the meaning from natural language and providing it to the dialogue system in a structured way, so that the dialogue system can perform better.
And we didn't end up studying this problem just for the sake of it: we did it in the context of the MuMMER project, which was an EU H2020 project about the deployment of a robot with multimodal interaction capabilities. It was supposed to be deployed in a shopping mall in Finland, and it was supposed to interact with the users, giving them guidance, entertaining them, and doing a little bit of chit-chat. I'm going to show a video of it that may explain better what the robot was supposed to do.
I hope you can hear the audio; in any case, there are subtitles on the recording.

[Video plays]

So the robot guides the user with both gestures and voice, in different phases, and with or without accompanying them to the destination, depending on the preference of the user.
So we saw a lot of generation there, but everything started with a request from the user, and that is the bit we are focusing on today: basically, designing an NLU component which is robust enough to work in this very complex, multimodal dialogue system.
Again, most often in conversational AI, natural language understanding is a synonym of shallow semantic parsing (and this actually connects with this morning's keynote), which is the process of extracting some frame-and-argument structure that captures the meaning of a sentence. It doesn't really matter what we call these structures, whether intents and slots or something else. Most of the time these types are defined according to the application domain, or they adhere to a theory, like frame semantics, which has a higher level of abstraction and is the one we are using in our context.
But there are some problems, especially in our case, where we wanted to build an interface that was able to work across several different domains. Most of the time, when a dialogue system has a natural language understanding component, it deals with a single domain, or with very few domains at the same time. This is also because the available resources are always about booking restaurants or booking flights, while we wanted our interface to be usable in several different settings: a domestic environment, a shopping mall, or another scenario where you have to command a robot to perform actions, such as serving drinks.
So one of the first requirements was for the system to be cross-domain, and even if there may not be a definitive recipe for that, we tried to address the problem anyway.
The next big problem is that, most of the time, the sentences in datasets designed for dialogue systems only contain a single intent, or frame, while in our case there are many sentences given to the robot which contain two or more different frames, or intents. It can be very important to detect both of them, because if we ignore the temporal relation between these two different frames, we cannot fully satisfy the user, who wants us to address both the requested action and, at the same time, the need they expressed.
That is another problem that arises when you rely on these flat, intent-like structures: most of the time, two different kinds of interaction might end up being tagged with the exact same intent or frame, like in this case, while in the dialogue they actually belong to two different kinds of interaction. So what we wanted to do is not only tag the frames and the slots, but also add a layer of dialogue acts, which tells the dialogue system the context in which these things have been said. For example, in the first case we are informing the robot about where Starbucks is (imagine that we want to teach the robot how the shopping mall is laid out), while in the second one there is a customer who is asking for information about the location of Starbucks.
So, to recap quickly: we wanted to deal with different domains at the same time, if possible; we wanted to tag more than one single intent, with its arguments, per sentence; and since we are also tagging the dialogue acts, we have a multi-task problem, where we also have to deal with multiple dialogue acts.
You might ask why it is actually so important to understand the dialogue act as well in this case. If we don't, the final intent is only to give information about the location of Starbucks; but actually we might also want to understand why the user is asking for Starbucks, namely that they need a coffee. If there had been, say, a coffee machine nearer than Starbucks, we could have pointed them somewhere else. So all of this is really important.
And of course, we wanted to benchmark our NLU system against existing off-the-shelf tools; this was an input given by the people who were actually providing us with these utterances and evaluations, as we will see later.
Now, very quickly, there is nothing complicated here: we tried to tackle this problem by addressing three different tasks at the same time. These tasks are, of course, tagging the dialogue acts, the frames, and the arguments. Each task was solved with a sequence labelling approach, in which we give a label to each token of the sentence; this is something very common in NLP. Each label is composed of the class of the structure we are trying to tag for a given task, enriched with a tag that can be B, I, or O, depending on whether the token is at the beginning of the span of a structure, inside it, or outside. And here we have a very easy example.
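The tagging scheme just described can be illustrated with a tiny sketch (a made-up sentence and label set in Python, not the actual system code):

```python
# Minimal sketch of the three-layer BIO tagging scheme (hypothetical example).
# Each task (dialogue act, frame, frame element) assigns one B/I/O label per token.

tokens = ["I", "need", "a", "coffee"]

labels = {
    # Dialogue-act layer: the whole utterance is an Inform act here.
    "dialogue_act":  ["B-Inform", "I-Inform", "I-Inform", "I-Inform"],
    # Frame layer: "need a coffee" evokes a Needing frame.
    "frame":         ["O", "B-Needing", "I-Needing", "I-Needing"],
    # Argument layer: "a coffee" fills the Requirement frame element.
    "frame_element": ["O", "O", "B-Requirement", "I-Requirement"],
}

def spans(tags):
    """Decode a BIO sequence into (class, start, end_exclusive) spans."""
    out, start, cls = [], None, None
    for i, t in enumerate(tags + ["O"]):          # sentinel flushes last span
        if t.startswith("B-") or t == "O":
            if cls is not None:
                out.append((cls, start, i))
            cls, start = (t[2:], i) if t.startswith("B-") else (None, None)
        # "I-" tags simply extend the current span
    return out

print(spans(labels["frame"]))  # [('Needing', 1, 4)]
```

Decoding each layer independently like this gives one set of labelled spans per task for the same token sequence.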
Now, the problem is that this is a linear solution for a problem which is not, I would say, because language is recursive: we might end up having some structures which are nested inside other structures. For dialogue acts this basically never happens, but for frames and arguments it happens quite often, especially in the data we collected. So the solution was basically to collapse the nested structures into a single linear annotation, and then to recover whether one of these structures was actually inside a previously tagged one by using some heuristics on the syntactic relations among the words. For example, if "find" is a syntactic dependent of "need", we can, by using these syntactic heuristics, say that the Locating frame is actually embedded inside the Requirement argument of the Needing frame.
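A rough sketch of such a heuristic (my own simplified reconstruction in Python, with a hypothetical head map for "I need to find Starbucks"; the actual rules may differ):

```python
# Hypothetical sketch of the nesting heuristic: decide whether one tagged span
# is embedded in another by walking the dependency heads of its root token.
# Sentence (0-indexed): "I need to find Starbucks"
heads = {0: 1, 1: 1, 2: 3, 3: 1, 4: 3}   # token -> dependency head (1 = root "need")

def is_descendant(tok, ancestor, heads):
    """True if `tok` lies below `ancestor` in the dependency tree."""
    seen = set()
    while tok not in seen:                 # guard against cycles / the root loop
        seen.add(tok)
        if heads[tok] == ancestor and tok != ancestor:
            return True
        tok = heads[tok]
    return False

# "find" (3) is a descendant of "need" (1), so the Locating frame evoked by
# "find Starbucks" is treated as embedded inside the Requirement argument
# of the Needing frame evoked by "need".
print(is_descendant(3, 1, heads))   # True
print(is_descendant(1, 3, heads))   # False
```

In practice the head map would come from a dependency parser rather than being written by hand.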
Now, this has been solved in a multi-task fashion: we basically created a single network that deals with the three tasks at the same time. It is basically a stack of encoding blocks with CRF tagging layers, as I'm going to show in the next slide. It is nothing overly complicated, but there are two main reasons why we adopted this architecture. First of all, we wanted more or less to replicate a hierarchy of task difficulty, in the sense that we were assuming (we did not actually verify it) that tagging dialogue acts is easier than tagging frames, and that tagging frames is easier than tagging arguments. There is also a kind of structural relationship between these three layers, because many times some frames tend to appear more often in the context of some dialogue acts, and arguments are almost always dependent on frames, especially when there is a strong theory behind them, like frame semantics. So these are the reasons why the network is shaped like this.
I'm going to illustrate the network quite quickly, because this is slightly more technical stuff. The input of the network is pre-trained word embeddings, which we were not re-training. These are first encoded with a BiLSTM, and then with a step of self-attention, which is supposed to capture relationships that the BiLSTM encoder misses, because self-attention is sometimes better at capturing relationships among words which are quite distant in the sentence. Then, on top of this self-attention layer, we feed a CRF layer, which tags the sequence of BIO tags for the dialogue acts. For the frames it is basically the same thing, but we use a residual connection first, because we wanted to provide the encoder with the fresh information from the first layer, that is, the lexical information, together with the information encoded in the previous block; in this way the frame tagging is, in a sense, indirectly conditioned on what the dialogue act tagging produced. So we put the information together, serve it to the next layer, and then use a CRF for tagging, as before. Finally, for the arguments, it is again the same thing: another step of encoding with self-attention, and a CRF layer. This design came out of the experiments we have done, with some ablation studies that are in the paper, but I'm not going to bother you with them here; this is the final network we managed to tune at the very end.
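The layer wiring just described can be sketched at the shape level as follows (a minimal reconstruction where the encoders and taggers are trivial stand-ins for the actual BiLSTM-plus-self-attention blocks and CRFs, and the exact way the residual information is combined is an assumption):

```python
# Shape-level sketch of the hierarchical multi-task stack (not the real model:
# `encode` and `crf_tag` are trivial stand-ins for BiLSTM+self-attention and CRF).

def encode(seq, out_dim=4):
    """Stand-in encoder: maps each token vector to a fixed-size vector."""
    return [[sum(vec) / len(vec)] * out_dim for vec in seq]

def crf_tag(seq, label):
    """Stand-in tagger: one label per token (a real CRF decodes B/I/O tags)."""
    return [label] * len(seq)

def concat(a, b):
    """Residual connection: concatenate features token by token."""
    return [x + y for x, y in zip(a, b)]

embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]      # 3 tokens, dim 2

h1 = encode(embeddings)                 # block 1 -> dialogue-act tags
da_tags = crf_tag(h1, "DA")

h2 = encode(concat(embeddings, h1))     # residual: lexical info re-injected
frame_tags = crf_tag(h2, "FRAME")

h3 = encode(concat(embeddings, h2))     # same pattern for the argument layer
arg_tags = crf_tag(h3, "ARG")

# Every layer emits exactly one tag per input token.
print(len(da_tags), len(frame_tags), len(arg_tags))   # 3 3 3
```

The point of the sketch is only the wiring: each block sees both the raw lexical input and the representation produced by the previous block.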
So, as I was saying at the beginning, we wanted to benchmark this NLU component. Now, benchmarking an NLU for dialogue systems is quite a big issue, in the sense that, as I said before, most of the datasets out there are quite single-domain, and there is very little else. I mean, by now some datasets of this kind have started popping up, but at the beginning of this year we were still short on that side. Luckily, there was this resource, called the NLU-Benchmark, which is basically a cross-domain corpus of interactions with a home assistant or robot. It is not a collection of dialogues, but only single-utterance interactions with the system. It covers a lot of domains, as we will see later, but it is mostly not robot-oriented: there are some commands that can be used for a robot, but again, it is mostly home-automation-oriented.
The second resource is one that we started collecting along the way, and it is taking a lot of time: the ROMULUS corpus, which is called like that because it stands for RObotics-oriented MUltitask Language UnderStanding corpus. It is, again, a collection of single interactions with a robot, covering different domains, but more in terms of kinds of interaction: there is, for example, shopping, there are state commands for the robot, and there is also a lot of information you can give to the robot about the composition of the environment, names of objects, and this kind of stuff. There is quite a big overlap between the two corpora in terms of kinds of interaction, but they span different domains.
So, the first corpus, the NLU-Benchmark, provides three different semantic layers, which are called scenario, action, and entity. I know this sounds completely different from what we said before, but we had to find some mapping with the structures we wanted to tag over the sentences. The full set contains twenty-five, almost twenty-six, thousand sentences. There are eighteen different scenario types, where each scenario is basically a domain, and then fifty-four different action types and fifty-six different entity types. There is also the notion of intent, which is basically the combination of scenario plus action, and this is important for the evaluation, as we will see later. As you can see, the good thing about this dataset is that it is genuinely cross-domain, and it is multi-task, because we have three different semantic layers; but there is always one single scenario and action, so one single intent, per sentence. So what we could benchmark on this corpus were mostly the first two factors. We did the evaluation according to the paper that presented the benchmark, and this was done with a ten-fold cross-validation over about half of the sentences, that is, eleven thousand of them; this was to balance the number of classes, and it has an effect on the results.
So, as I was saying, we had to do a mapping between their tagging scheme and what we wanted to tag, which is a very general approach for extracting the semantics from sentences in the context of a dialogue system. We also saw that the kind of relationships holding between their semantic layers were more or less the same ones holding for our approach.
So, these are some results. They are reported in the paper, but they are quite old, in the sense that they were evaluated in two thousand eighteen. They have been run on all the open-source, off-the-shelf NLU components for dialogue systems that were available. There is one caveat: Watson requires a specific training for entities, and this was not possible, because there is a constraint on the number of entity types and entity examples you can pass. We did try to talk with the Watson people, but we didn't manage to get a licence, at least to run one training with the full set of entities, so you have to take that into account, unfortunately.
The intent, as I was saying, is the combination of the scenario and the action. These performances were obtained with ten-fold cross-validation; I didn't report the standard deviations because they were almost all stable, but if you want to look at them, they are in the paper.
The other important thing is that we did not take into account whether the span of a tagged structure matched exactly: following the evaluation of the original paper, we counted a true positive whenever there was an overlap of the spans. So what we are evaluating here is a rather loose metric. We can see that for the entities and for the combined setting our system was performing on average better than the others, while for the intents we were actually not performing as well as Watson, but better than the other two systems.
The other important bit is that the combined measure is actually the sum of the two confusion matrices of intents and entities, so it doesn't actually tell us anything about how the full pipeline is working. That is something we have done on our own corpus, which is much smaller and not yet available, because we are still gathering data; probably at the end of this year we are going to release it.
I don't know if this is of interest to everybody, but for people doing research on dialogue in the context of robotics, this can be an interesting resource. Here we have eleven dialogue act types and fifty-eight frame types, which, compared to the number of examples, is quite high, and eighty-four frame element types, which are the arguments. As you can see, there are many cases in which we have more than one frame per sentence, and more than one dialogue act per sentence, and the frame elements are quite numerous. The annotation fits into three semantic layers, or, more formally, only two: we have the dialogue acts, exactly as we saw during the rest of the presentation, and we also provide the semantics in terms of frame semantics, with frames and frame elements. Theoretically these last two belong to the same semantic layer, but operationally they are two different layers. And as you can see, we have a lot of embedded structures, a frame inside another frame, and this kind of stuff.
This is the mapping we had to do, again, between the different semantic layers, and it is basically the identity: dialogue acts to dialogue acts, frames to frames, and frame elements to arguments. And of course, these are the two aspects that we could tackle by using this corpus. It is not properly cross-domain, because it is not as cross-domain as the other one; what we do have is different kinds of interaction, and sentences coming from two different scenarios, the house scenario and the shopping mall scenario, the latter coming from the interactions with the MuMMER robot. But we don't want to say it is completely closed-domain; it is just that the other corpus covers many more domains than this one. It is, however, fully multi-task, and there really are multiple dialogue acts and frames in each sentence.
Okay, these results might look quite weird, so I'm going to explain why they are like this. The first measure I report here is the exact same measure that was reported for the NLU-Benchmark, so we take a prediction as correct whenever the spans of the two structures overlap. The results are quite high, and the main reason is that the corpus has not been delexicalised, so there are sentences that are quite similar, and the system performs very well on them.
But you don't have to get carried away by that. The second row is basically the evaluation using the CoNLL-2000 shared task scheme, which is a standard, and we report it for general comparison with other systems. But the most important one is the last one, which is the exact match. The number for the exact match tells us how well the whole pipeline is working: we were taking into account the exact span of all the target structures, and, on top of that, a frame was counted as correctly tagged only if the dialogue act was also correctly tagged. So it is effectively an end-to-end, pipelined measure, and that is the measure we have to chase.
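The difference between the overlap-based scoring and the exact match can be illustrated with a small sketch (made-up spans, not the official evaluation scripts):

```python
# Hypothetical sketch contrasting the loose (overlap) and strict (exact-match)
# span scoring discussed above. Spans are (label, start, end_exclusive).

def overlap_match(gold, pred):
    """Loose criterion: same label and any token overlap."""
    return gold[0] == pred[0] and max(gold[1], pred[1]) < min(gold[2], pred[2])

def exact_match(gold, pred):
    """Strict criterion: label and span boundaries must coincide."""
    return gold == pred

gold = ("Locating", 3, 6)
pred = ("Locating", 3, 5)          # right frame, slightly wrong right boundary

print(overlap_match(gold, pred))   # True  -> counted correct by the loose metric
print(exact_match(gold, pred))     # False -> rejected by the exact-match metric
```

The end-to-end measure in the talk is stricter still: the frame prediction only counts if the dialogue act above it is also correct.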
Now, to conclude, and some future work. The system that I presented, this hierarchical, cross-domain, multi-task NLU system for conversational AI that we designed, is actually running in the shopping mall in Finland. The video I showed you was filmed during the deployment we have done, and the robot is going to be deployed for three months in a row, with some pauses during the weekends to do some maintenance and reboot the system. We managed to collect a lot of data; we will maybe integrate it into the corpus and release it at the end of this year or, if we manage to annotate it properly, at the latest at the beginning of next year.
We also want to deal with the nested structures in a different way, meaning not relying on these heuristics over the syntactic structure, but actually tagging embedded sequences simultaneously, so that sequences can be one inside the other. In fact, we already have this system: we finalised it a few months ago, so we didn't have time to include it here, but it exists, and there is a branch in the repository that I can show you which is about this new system.
Another part of our work is about building a general framework for frame-like structures, so that it doesn't matter which theory or application is behind them: we are trying to create a network that can deal with all possible frame-like structure parsing. This is our long-term goal, something very big, but we are actually pushing for it.
And the last bit is mostly about dealing with the tagging of segmented utterances. We realised that in our corpus there were many small bits of sentences, because the user would stop in the middle, hesitating, so we would be missing the first part of the sentence, like "I would like to...". The ASR was actually segmenting this way: it was sending each piece to the parser, and the parser would tag it correctly, but with some bits missing. Now, when the user then said "to find the Starbucks", for example, we received this "find the Starbucks", which was contextualised as a Locating frame; but we didn't know that it was also a frame element of the previous structure. So we are studying ways to make the system aware of what has been parsed before, so that it can give more informative output in the context of the same utterance, even if it is broken up by the ASR.
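The idea can be illustrated with a toy sketch (an entirely hypothetical fragment lexicon and linking rule, just to show the intended behaviour):

```python
# Toy sketch of the segmented-utterance problem described above (hypothetical).
# Segment 1 leaves the Needing frame's Requirement slot open; segment 2 is
# parsed on its own as a Locating frame, and context linking fills the gap.

def parse_segment(text):
    """Stand-in parser: maps a few known fragments to frame annotations."""
    lexicon = {
        "i would like to": {"frame": "Needing", "Requirement": None},
        "find the starbucks": {"frame": "Locating", "Phenomenon": "the starbucks"},
    }
    return lexicon[text.lower()]

def link_to_context(prev, new):
    """If the previous frame has an unfilled slot, embed the new frame there."""
    for slot, filler in prev.items():
        if slot != "frame" and filler is None:
            prev = dict(prev)
            prev[slot] = new          # embedded frame fills the open argument
            return prev
    return prev

first = parse_segment("I would like to")
second = parse_segment("find the Starbucks")
merged = link_to_context(first, second)
print(merged["Requirement"]["frame"])   # Locating
```

The real system would of course need to decide when such linking is appropriate, rather than always filling the first open slot.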
And this is everything. Okay, thanks very much.

Okay, so there's time for questions now.
Hi, and thanks for the great talk; it is always good to see Rasa being benchmarked. I'm just curious: did you use just the default, out-of-the-box parameters, or did you do any tuning?

So, we just took the results from the paper of the benchmark, and they were only saying that they did something like a little bit of tuning and specific training for the entities, something like that. As for the version they used, it was the one using the CRF, and not the newer TensorFlow one.

Okay, so that's actually a very basic version, I suppose.
Any other questions? Okay.
So, you showed the architecture there with some intermediate layers; do they also receive intermediate supervision? These labels, are they also supervised labels?

Yes, everything is supervised; it is multi-task in the sense that we are solving the three tasks at the same time, so you need a slightly more complicated dataset for that, to have all of them supervised. We have more labels than just the intents: we need the dialogue acts, or in the other case the scenarios; we need the actions, or the frames; and then the arguments. That is basically why the dataset is called multi-task: because we have these three layers. For us it was really important to differentiate between actions and dialogue acts, because, as I showed you, there were many cases in which it was important for the robot to have a better idea of what was going on in the single sentence.

Okay.
Thanks for the talk. A question on the last slide: you mentioned "frame-like", so what is the difference between frame-like and, for example, FrameNet?

A frame-like structure, for us, is whatever can be seen as an abstraction which represents a predication in a sentence and has some arguments; this is the general, very broad notion of frame-like. It is close to a FrameNet frame; the big difference is that FrameNet has a very specific theory behind it, and there are some extra details, like the relationships between frames, the presence of special frame elements, and the lexical unit itself, which makes it easier to spot the frame in the sentence. What we would like is for it not to matter whether it is a FrameNet frame or an intent-and-slot structure from this corpus or any other corpus: we are trying to build a shallow semantic parser that can deal with all this stuff at the same time, as well as possible. It is a kind of mapping task: we are trying to incorporate these different aspects of the theories, and to deal with them in different ways, but without compromising the ability to handle all the other kinds of formats.
One other question: what tools did you use for data annotation?

For our corpus we actually had to develop our own interface. It is basically a web interface where we have all the tokens of a sentence, and we can tag everything on top of that. The corpus has been collected entirely by us; it is something we have been gathering over the last years, and it takes a long time. It is a hard task to collect these sentences, and we also had to filter out many of them, because the contexts were very different; sometimes we went to the robot lab to do the collection, and there was a lot of noise and other things being evaluated at the same time, so after a while we stopped. In the end we were always employing some people from our lab to annotate them, like two or three of them, and then doing some inter-annotator agreement on the annotation, trying to check whether they had actually understood the task and it was working. It is a very long process, and we are all computational linguists, but with different backgrounds, so it is very hard. But that is the situation with the corpus.
Okay, so we have run out of time, so let's thank the speaker again.