So, the first presenter is Andrea Vanzo, so please start your presentation.
Good afternoon, everyone. My name is Andrea Vanzo, and I'm from the Interaction Lab at Heriot-Watt University. I'm going to present work we have done with Emanuele Bastianelli and Oliver Lemon about a hierarchical multi-task natural language understanding system for cross-domain conversational AI, which we call HERMIT NLU.
Now, natural language understanding is quite a wide concept. Most of the time, when we talk about it in conversational AI and dialogue systems, it refers to the process of extracting the meaning from natural language and providing it to the dialogue system in a structured way, so that the dialogue system can perform better.
And we didn't end up studying this problem just for the sake of it: we did it in the context of the MuMMER project, which was an EU H2020 project about the deployment of a robot with multimodal interaction capabilities. It was supposed to be deployed in a shopping mall in Finland, and it was supposed to interact with the users, giving them guidance, entertaining them, and doing a little bit of chit-chat. I'm going to show a video of it that may explain better what the robot was supposed to do.
I hope you can hear the audio; in any case, there are subtitles on the recording.

[Video plays]

So the robot guides the user with both gestures and voice, in different phases, and with or without accompanying them to the destination, depending on the preference of the user.
So we saw a lot of generation there, but everything started with a request from the user, and that is the bit we are focusing on today: basically, designing an NLU component which is robust enough to work in this very complex, multimodal dialogue system.
Again, most often in conversational AI, natural language understanding is a synonym of shallow semantic parsing (and this actually connects with this morning's keynote), which is the process of extracting some frame-and-argument structure that captures the meaning of a sentence. It doesn't really matter what we call these structures, whether intents and slots or something else. Most of the time these types are defined according to the application domain, or they adhere to a theory, like frame semantics, which has a higher level of abstraction and is the one we are using in our context.
But there are some problems, especially in our case, where we wanted to build an interface that was able to work across several different domains. Most of the time, when a dialogue system has a natural language understanding component, it deals with a single domain, or with very few domains at the same time. This is also because the available resources are always about booking restaurants or booking flights, while we wanted our interface to be usable in several different settings: a domestic environment, a shopping mall, or another scenario where you have to command a robot to perform actions, such as serving drinks.
So one of the first requirements was for the system to be cross-domain, and even if there may not be a definitive recipe for that, we tried to address the problem anyway.
The next big problem is that, most of the time, the sentences in datasets designed for dialogue systems only contain a single intent, or frame, while in our case there are many sentences given to the robot which contain two or more different frames, or intents. It can be very important to detect both of them, because if we ignore the temporal relation between these two different frames, we cannot fully satisfy the user, who wants us to address both the requested action and, at the same time, the need they expressed.
That is another problem that arises when you rely on these flat, intent-like structures: most of the time, two different kinds of interaction might end up being tagged with the exact same intent or frame, like in this case, while in the dialogue they actually belong to two different kinds of interaction. So what we wanted to do is not only tag the frames and the slots, but also add a layer of dialogue acts, which tells the dialogue system the context in which these things have been said. For example, in the first case we are informing the robot about where Starbucks is (imagine that we want to teach the robot how the shopping mall is laid out), while in the second one there is a customer who is asking for information about the location of Starbucks.
So, to recap quickly: we wanted to deal with different domains at the same time, if possible; we wanted to tag more than one single intent, with its arguments, per sentence; and since we are also tagging the dialogue acts, we have a multi-task problem, where we also have to deal with multiple dialogue acts.
You might ask why it is actually so important to understand the dialogue act as well in this case. If we don't, the final intent is only to give information about the location of Starbucks; but actually we might also want to understand why the user is asking for Starbucks, namely that they need a coffee. If there had been, say, a coffee machine nearer than Starbucks, we could have pointed them somewhere else. So all of this is really important.
And of course, we wanted to benchmark our NLU system against existing off-the-shelf tools; this was an input given by the people who were actually providing us with these utterances and evaluations, as we will see later.
Now, very quickly, there is nothing complicated here: we tried to tackle this problem by addressing three different tasks at the same time. These tasks are, of course, tagging the dialogue acts, the frames, and the arguments. Each task was solved with a sequence labelling approach, in which we give a label to each token of the sentence; this is something very common in NLP. Each label is composed of the class of the structure we are trying to tag for a given task, enriched with a tag that can be B, I, or O, depending on whether the token is at the beginning of the span of a structure, inside it, or outside. And here we have a very easy example.
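The tagging scheme just described can be illustrated with a tiny sketch (a made-up sentence and label set in Python, not the actual system code):

```python
# Minimal sketch of the three-layer BIO tagging scheme (hypothetical example).
# Each task (dialogue act, frame, frame element) assigns one B/I/O label per token.

tokens = ["I", "need", "a", "coffee"]

labels = {
    # Dialogue-act layer: the whole utterance is an Inform act here.
    "dialogue_act":  ["B-Inform", "I-Inform", "I-Inform", "I-Inform"],
    # Frame layer: "need a coffee" evokes a Needing frame.
    "frame":         ["O", "B-Needing", "I-Needing", "I-Needing"],
    # Argument layer: "a coffee" fills the Requirement frame element.
    "frame_element": ["O", "O", "B-Requirement", "I-Requirement"],
}

def spans(tags):
    """Decode a BIO sequence into (class, start, end_exclusive) spans."""
    out, start, cls = [], None, None
    for i, t in enumerate(tags + ["O"]):          # sentinel flushes last span
        if t.startswith("B-") or t == "O":
            if cls is not None:
                out.append((cls, start, i))
            cls, start = (t[2:], i) if t.startswith("B-") else (None, None)
        # "I-" tags simply extend the current span
    return out

print(spans(labels["frame"]))  # [('Needing', 1, 4)]
```

Decoding each layer independently like this gives one set of labelled spans per task for the same token sequence.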
Now, the problem is that this is a linear solution for a problem which is not, I would say, because language is recursive: we might end up having some structures which are nested inside other structures. For dialogue acts this basically never happens, but for frames and arguments it happens quite often, especially in the data we collected. So the solution was basically to collapse the nested structures into a single linear annotation, and then to recover whether one of these structures was actually inside a previously tagged one by using some heuristics on the syntactic relations among the words. For example, if "find" is a syntactic dependent of "need", we can, by using these syntactic heuristics, say that the Locating frame is actually embedded inside the Requirement argument of the Needing frame.
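A rough sketch of such a heuristic (my own simplified reconstruction in Python, with a hypothetical head map for "I need to find Starbucks"; the actual rules may differ):

```python
# Hypothetical sketch of the nesting heuristic: decide whether one tagged span
# is embedded in another by walking the dependency heads of its root token.
# Sentence (0-indexed): "I need to find Starbucks"
heads = {0: 1, 1: 1, 2: 3, 3: 1, 4: 3}   # token -> dependency head (1 = root "need")

def is_descendant(tok, ancestor, heads):
    """True if `tok` lies below `ancestor` in the dependency tree."""
    seen = set()
    while tok not in seen:                 # guard against cycles / the root loop
        seen.add(tok)
        if heads[tok] == ancestor and tok != ancestor:
            return True
        tok = heads[tok]
    return False

# "find" (3) is a descendant of "need" (1), so the Locating frame evoked by
# "find Starbucks" is treated as embedded inside the Requirement argument
# of the Needing frame evoked by "need".
print(is_descendant(3, 1, heads))   # True
print(is_descendant(1, 3, heads))   # False
```

In practice the head map would come from a dependency parser rather than being written by hand.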
Now, this has been solved in a multi-task fashion: we basically created a single network that deals with the three tasks at the same time. It is basically a stack of encoding blocks with CRF tagging layers, as I'm going to show in the next slide. It is nothing overly complicated, but there are two main reasons why we adopted this architecture. First of all, we wanted more or less to replicate a hierarchy of task difficulty, in the sense that we were assuming (we did not actually verify it) that tagging dialogue acts is easier than tagging frames, and that tagging frames is easier than tagging arguments. There is also a kind of structural relationship between these three layers, because many times some frames tend to appear more often in the context of some dialogue acts, and arguments are almost always dependent on frames, especially when there is a strong theory behind them, like frame semantics. So these are the reasons why the network is shaped like this.
I'm going to illustrate the network quite quickly, because this is slightly more technical stuff. The input of the network is pre-trained word embeddings, which we were not re-training. These are first encoded with a BiLSTM, and then with a step of self-attention, which is supposed to capture relationships that the BiLSTM encoder misses, because self-attention is sometimes better at capturing relationships among words which are quite distant in the sentence. Then, on top of this self-attention layer, we feed a CRF layer, which tags the sequence of BIO tags for the dialogue acts. For the frames it is basically the same thing, but we use a residual connection first, because we wanted to provide the encoder with the fresh information from the first layer, that is, the lexical information, together with the information encoded in the previous block; in this way the frame tagging is, in a sense, indirectly conditioned on what the dialogue act tagging produced. So we put the information together, serve it to the next layer, and then use a CRF for tagging, as before. Finally, for the arguments, it is again the same thing: another step of encoding with self-attention, and a CRF layer. This design came out of the experiments we have done, with some ablation studies that are in the paper, but I'm not going to bother you with them here; this is the final network we managed to tune at the very end.
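The layer wiring just described can be sketched at the shape level as follows (a minimal reconstruction where the encoders and taggers are trivial stand-ins for the actual BiLSTM-plus-self-attention blocks and CRFs, and the exact way the residual information is combined is an assumption):

```python
# Shape-level sketch of the hierarchical multi-task stack (not the real model:
# `encode` and `crf_tag` are trivial stand-ins for BiLSTM+self-attention and CRF).

def encode(seq, out_dim=4):
    """Stand-in encoder: maps each token vector to a fixed-size vector."""
    return [[sum(vec) / len(vec)] * out_dim for vec in seq]

def crf_tag(seq, label):
    """Stand-in tagger: one label per token (a real CRF decodes B/I/O tags)."""
    return [label] * len(seq)

def concat(a, b):
    """Residual connection: concatenate features token by token."""
    return [x + y for x, y in zip(a, b)]

embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]      # 3 tokens, dim 2

h1 = encode(embeddings)                 # block 1 -> dialogue-act tags
da_tags = crf_tag(h1, "DA")

h2 = encode(concat(embeddings, h1))     # residual: lexical info re-injected
frame_tags = crf_tag(h2, "FRAME")

h3 = encode(concat(embeddings, h2))     # same pattern for the argument layer
arg_tags = crf_tag(h3, "ARG")

# Every layer emits exactly one tag per input token.
print(len(da_tags), len(frame_tags), len(arg_tags))   # 3 3 3
```

The point of the sketch is only the wiring: each block sees both the raw lexical input and the representation produced by the previous block.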
So, as I was saying at the beginning, we wanted to benchmark this NLU component. Now, benchmarking an NLU for dialogue systems is quite a big issue, in the sense that, as I said before, most of the datasets out there are quite single-domain, and there is very little else. I mean, by now some datasets of this kind have started popping up, but at the beginning of this year we were still short on that side. Luckily, there was this resource, called the NLU-Benchmark, which is basically a cross-domain corpus of interactions with a home assistant or robot. It is not a collection of dialogues, but only single-utterance interactions with the system. It covers a lot of domains, as we will see later, but it is mostly not robot-oriented: there are some commands that can be used for a robot, but again, it is mostly home-automation-oriented.
The second resource is one that we started collecting along the way, and it is taking a lot of time: the ROMULUS corpus, which is called like that because it stands for RObotics-oriented MUltitask Language UnderStanding corpus. It is, again, a collection of single interactions with a robot, covering different domains, but more in terms of kinds of interaction: there is, for example, shopping, there are state commands for the robot, and there is also a lot of information you can give to the robot about the composition of the environment, names of objects, and this kind of stuff. There is quite a big overlap between the two corpora in terms of kinds of interaction, but they span different domains.
So, the first corpus, the NLU-Benchmark, provides three different semantic layers, which are called scenario, action, and entity. I know this sounds completely different from what we said before, but we had to find some mapping with the structures we wanted to tag over the sentences. The full set contains twenty-five, almost twenty-six, thousand sentences. There are eighteen different scenario types, where each scenario is basically a domain, and then fifty-four different action types and fifty-six different entity types. There is also the notion of intent, which is basically the combination of scenario plus action, and this is important for the evaluation, as we will see later. As you can see, the good thing about this dataset is that it is genuinely cross-domain, and it is multi-task, because we have three different semantic layers; but there is always one single scenario and action, so one single intent, per sentence. So what we could benchmark on this corpus were mostly the first two factors. We did the evaluation according to the paper that presented the benchmark, and this was done with a ten-fold cross-validation over about half of the sentences, that is, eleven thousand of them; this was to balance the number of classes, and it has an effect on the results.
So, as I was saying, we had to do a mapping between their tagging scheme and what we wanted to tag, which is a very general approach for extracting the semantics from sentences in the context of a dialogue system. We also saw that the kind of relationships holding between their semantic layers were more or less the same ones holding for our approach.
So, these are some results. They are reported in the paper, but they are quite old, in the sense that they were evaluated in two thousand eighteen. They have been run on all the open-source, off-the-shelf NLU components for dialogue systems that were available. There is one caveat: Watson requires a specific training for entities, and this was not possible, because there is a constraint on the number of entity types and entity examples you can pass. We did try to talk with the Watson people, but we didn't manage to get a licence, at least to run one training with the full set of entities, so you have to take that into account, unfortunately.
The intent, as I was saying, is the combination of the scenario and the action. These performances were obtained with ten-fold cross-validation; I didn't report the standard deviations because they were almost all stable, but if you want to look at them, they are in the paper.
The other important thing is that we did not take into account whether the span of a tagged structure matched exactly: following the evaluation of the original paper, we counted a true positive whenever there was an overlap of the spans. So what we are evaluating here is a rather loose metric. We can see that for the entities and for the combined setting our system was performing on average better than the others, while for the intents we were actually not performing as well as Watson, but better than the other two systems.
The other important bit is that the combined measure is actually the sum of the two confusion matrices of intents and entities, so it doesn't actually tell us anything about how the full pipeline is working. That is something we have done on our own corpus, which is much smaller and not yet available, because we are still gathering data; probably at the end of this year we are going to release it.
I don't know if this is of interest to everybody, but for people doing research on dialogue in the context of robotics, this can be an interesting resource. Here we have eleven dialogue act types and fifty-eight frame types, which, compared to the number of examples, is quite high, and eighty-four frame element types, which are the arguments. As you can see, there are many cases in which we have more than one frame per sentence, and more than one dialogue act per sentence, and the frame elements are quite numerous. The annotation fits into three semantic layers, or, more formally, only two: we have the dialogue acts, exactly as we saw during the rest of the presentation, and we also provide the semantics in terms of frame semantics, with frames and frame elements. Theoretically these last two belong to the same semantic layer, but operationally they are two different layers. And as you can see, we have a lot of embedded structures, a frame inside another frame, and this kind of stuff.
This is the mapping we had to do, again, between the different semantic layers, and it is basically the identity: dialogue acts to dialogue acts, frames to frames, and frame elements to arguments. And of course, these are the two aspects that we could tackle by using this corpus. It is not properly cross-domain, because it is not as cross-domain as the other one; what we do have is different kinds of interaction, and sentences coming from two different scenarios, the house scenario and the shopping mall scenario, the latter coming from the interactions with the MuMMER robot. But we don't want to say it is completely closed-domain; it is just that the other corpus covers many more domains than this one. It is, however, fully multi-task, and there really are multiple dialogue acts and frames in each sentence.
Okay, these results might look quite weird, so I'm going to explain why they are like this. The first measure I report here is the exact same measure that was reported for the NLU-Benchmark, so we take a prediction as correct whenever the spans of the two structures overlap. The results are quite high, and the main reason is that the corpus has not been delexicalised, so there are sentences that are quite similar, and the system performs very well on them.
But you don't have to get carried away by that. The second row is basically the evaluation using the CoNLL-2000 shared task scheme, which is a standard, and we report it for general comparison with other systems. But the most important one is the last one, which is the exact match. The number for the exact match tells us how well the whole pipeline is working: we were taking into account the exact span of all the target structures, and, on top of that, a frame was counted as correctly tagged only if the dialogue act was also correctly tagged. So it is effectively an end-to-end, pipelined measure, and that is the measure we have to chase.
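The difference between the overlap-based scoring and the exact match can be illustrated with a small sketch (made-up spans, not the official evaluation scripts):

```python
# Hypothetical sketch contrasting the loose (overlap) and strict (exact-match)
# span scoring discussed above. Spans are (label, start, end_exclusive).

def overlap_match(gold, pred):
    """Loose criterion: same label and any token overlap."""
    return gold[0] == pred[0] and max(gold[1], pred[1]) < min(gold[2], pred[2])

def exact_match(gold, pred):
    """Strict criterion: label and span boundaries must coincide."""
    return gold == pred

gold = ("Locating", 3, 6)
pred = ("Locating", 3, 5)          # right frame, slightly wrong right boundary

print(overlap_match(gold, pred))   # True  -> counted correct by the loose metric
print(exact_match(gold, pred))     # False -> rejected by the exact-match metric
```

The end-to-end measure in the talk is stricter still: the frame prediction only counts if the dialogue act above it is also correct.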
Now, to conclude, and some future work. The system that I presented, this hierarchical, cross-domain, multi-task NLU system for conversational AI that we designed, is actually running in the shopping mall in Finland. The video I showed you was filmed during the deployment we have done, and the robot is going to be deployed for three months in a row, with some pauses during the weekends to do some maintenance and reboot the system. We managed to collect a lot of data; we will maybe integrate it into the corpus and release it at the end of this year or, if we manage to annotate it properly, at the latest at the beginning of next year.
We also want to deal with the nested structures in a different way, meaning not relying on these heuristics over the syntactic structure, but actually tagging embedded sequences simultaneously, so that sequences can be one inside the other. In fact, we already have this system: we finalised it a few months ago, so we didn't have time to include it here, but it exists, and there is a branch in the repository that I can show you which is about this new system.
Another part of our work is about building a general framework for frame-like structures, so that it doesn't matter which theory or application is behind them: we are trying to create a network that can deal with all possible frame-like structure parsing. This is our long-term goal, something very big, but we are actually pushing for it.
And the last bit is mostly about dealing with the tagging of segmented utterances. We realised that in our corpus there were many small bits of sentences, because the user would stop in the middle, hesitating, so we would be missing the first part of the sentence, like "I would like to...". The ASR was actually segmenting this way: it was sending each piece to the parser, and the parser would tag it correctly, but with some bits missing. Now, when the user then said "to find the Starbucks", for example, we received this "find the Starbucks", which was contextualised as a Locating frame; but we didn't know that it was also a frame element of the previous structure. So we are studying ways to make the system aware of what has been parsed before, so that it can give more informative output in the context of the same utterance, even if it is broken up by the ASR.
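The idea can be illustrated with a toy sketch (an entirely hypothetical fragment lexicon and linking rule, just to show the intended behaviour):

```python
# Toy sketch of the segmented-utterance problem described above (hypothetical).
# Segment 1 leaves the Needing frame's Requirement slot open; segment 2 is
# parsed on its own as a Locating frame, and context linking fills the gap.

def parse_segment(text):
    """Stand-in parser: maps a few known fragments to frame annotations."""
    lexicon = {
        "i would like to": {"frame": "Needing", "Requirement": None},
        "find the starbucks": {"frame": "Locating", "Phenomenon": "the starbucks"},
    }
    return lexicon[text.lower()]

def link_to_context(prev, new):
    """If the previous frame has an unfilled slot, embed the new frame there."""
    for slot, filler in prev.items():
        if slot != "frame" and filler is None:
            prev = dict(prev)
            prev[slot] = new          # embedded frame fills the open argument
            return prev
    return prev

first = parse_segment("I would like to")
second = parse_segment("find the Starbucks")
merged = link_to_context(first, second)
print(merged["Requirement"]["frame"])   # Locating
```

The real system would of course need to decide when such linking is appropriate, rather than always filling the first open slot.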
And this is everything. Okay, thanks very much.

Okay, so there's time for questions now.
Hi, and thanks for the great talk; it is always good to see Rasa being benchmarked. I'm just curious: did you use just the default, out-of-the-box parameters, or did you do any tuning?

So, we just took the results from the paper of the benchmark, and they were only saying that they did something like a little bit of tuning and specific training for the entities, something like that. As for the version they used, it was the one using the CRF, and not the newer TensorFlow one.

Okay, so that's actually a very basic version, I suppose.
Any other questions? Okay.
So, you showed the architecture there with some intermediate layers; do they also receive intermediate supervision? These labels, are they also supervised labels?

Yes, everything is supervised; it is multi-task in the sense that we are solving the three tasks at the same time, so you need a slightly more complicated dataset for that, to have all of them supervised. We have more labels than just the intents: we need the dialogue acts, or in the other case the scenarios; we need the actions, or the frames; and then the arguments. That is basically why the dataset is called multi-task: because we have these three layers. For us it was really important to differentiate between actions and dialogue acts, because, as I showed you, there were many cases in which it was important for the robot to have a better idea of what was going on in the single sentence.

Okay.
Thanks for the talk. A question on the last slide: you mentioned "frame-like", so what is the difference between frame-like and, for example, FrameNet?

A frame-like structure, for us, is whatever can be seen as an abstraction which represents a predication in a sentence and has some arguments; this is the general, very broad notion of frame-like. It is close to a FrameNet frame; the big difference is that FrameNet has a very specific theory behind it, and there are some extra details, like the relationships between frames, the presence of special frame elements, and the lexical unit itself, which makes it easier to spot the frame in the sentence. What we would like is for it not to matter whether it is a FrameNet frame or an intent-and-slot structure from this corpus or any other corpus: we are trying to build a shallow semantic parser that can deal with all this stuff at the same time, as well as possible. It is a kind of mapping task: we are trying to incorporate these different aspects of the theories, and to deal with them in different ways, but without compromising the ability to handle all the other kinds of formats.
One other question: what tools did you use for data annotation?

For our corpus we actually had to develop our own interface. It is basically a web interface where we have all the tokens of a sentence, and we can tag everything on top of that. The corpus has been collected entirely by us; it is something we have been gathering over the last years, and it takes a long time. It is a hard task to collect these sentences, and we also had to filter out many of them, because the contexts were very different; sometimes we went to the robot lab to do the collection, and there was a lot of noise and other things being evaluated at the same time, so after a while we stopped. In the end we were always employing some people from our lab to annotate them, like two or three of them, and then doing some inter-annotator agreement on the annotation, trying to check whether they had actually understood the task and it was working. It is a very long process, and we are all computational linguists, but with different backgrounds, so it is very hard. But that is the situation with the corpus.
Okay, so we have run out of time, so let's thank the speaker again.