Okay, so I think we can start. Hello everyone, good morning, and welcome to the third session. Today's topic is end-to-end dialogue systems and natural language generation; we'll go from natural language generation models to end-to-end systems. The first speaker is Bo-Hsiang Tseng, with the paper on a tree-structured semantic encoder with knowledge sharing for domain adaptation in NLG. This is a natural language generation model. Are we ready? Okay, go ahead, you have the floor.
Hello everyone, good morning, and welcome to my presentation. My name is Bo-Hsiang Tseng, I'm from the University of Cambridge, and today I'm going to share my work on a tree-structured semantic encoder with knowledge sharing for domain adaptation in natural language generation.
I guess most of you are pretty familiar with this pipeline dialogue system; here I just want to highlight that this work focuses on the natural language generation component. The input is the semantics from the policy network and the output is natural language.
Okay, so given a semantic representation like this, from the restaurant domain, the system wants to inform the user about the name of the restaurant, its address, and its phone number. A natural language generation model should produce a natural-language utterance for the user, and this utterance has to contain all the correct information from the semantics. That's the goal of an NLG model.
In this work we focus on domain adaptation in NLG. This means you might have plenty of data from a source domain, which you can use to pre-train your model, and then you want to use a limited amount of data from your target domain to fine-tune the model, so that it works well in the domain you're interested in. That's the domain adaptation scenario.
So how do we usually encode the semantics? Among prior work there are essentially two main approaches. In the first, people use a binary representation like this, where each element of the vector corresponds to a certain slot-value pair in your ontology. Alternatively, we can treat the semantics as a sequence of tokens and simply use an LSTM to encode it. Both approaches actually work well.
However, they don't really capture the internal structure of the semantics. For example, this semantics actually has a tree structure: under the request dialogue act there is a price slot, the piece of information the system asks the user for, and under the inform dialogue act there are three slots of information that you want to tell the user. Both dialogue acts sit under the restaurant domain. That semantic structure is not captured by the two approaches above. But do we really need to capture this kind of structure? Does it help, and if it doesn't help, why bother?
Let me give a very simple example. Again, given a semantics like this from the source domain, we have the corresponding tree. During domain adaptation, you might see a similar semantics in the target domain that shares some of the content, with its own corresponding tree structure. As you can see, most of the information is shared between the two semantics in the tree structure, apart from the domain information. So if we can come up with a better way to capture the structure within the semantics, perhaps the model will be able to share information more effectively between domains during adaptation. That's the motivation of this work.
So the question here is how to encode this structure. Here is the proposed model, the tree-structured semantic encoder; the structure is pretty much the one you saw on the previous slide. First, we have the slot layer, where all the slots in your ontology are listed. Then we have the dialogue act layer, which describes all the dialogue acts you have in your system, and then the domain layer. At the bottom of the tree we design a property layer, which describes the properties of a slot: for example, a slot such as area can be requestable, or it can be informable. We use this layer to describe the property of each slot.
Now, given a semantics like this, based on all the structural information we have, we can build the corresponding tree using this definition. First, based on the property of each slot, we build the links between the property layer and the slot layer. Then each slot connects to the dialogue act it belongs to in the semantics, and the two dialogue acts in this example connect to the restaurant domain. Finally, we take the root of the tree as the final representation. This is how we encode the tree structure of the semantics.
But what exactly do we compute in the tree? Basically, we follow the prior work on the Tree-LSTM from 2015. For example, for this node we compute the summation of the hidden states and the summation of the memory cells over all its children. Then, as in the vanilla LSTM, we compute the input gate, the forget gate, and the output gate, and finally we compute the memory cell and the hidden state.
I hope that's clear enough.
So, again with the same simple example: given the semantics in the source domain, we have the corresponding tree structure, and during adaptation you might have a similar semantics in the target domain. As we can see here, with our designed tree structure most of the information in the trees is shared, and we hope that this helps the model share information between domains.
Okay, so far we know how to encode the tree of the semantics; now let's go to the generation process. It is very straightforward to just take the final representation of the tree as the initialisation of your decoder. We follow prior work in which the values in the utterances are delexicalised into slot tokens, and in this work we design the slot token to carry domain information, dialogue act information, and slot information. We then use the standard cross-entropy loss to train our decoder.
Sounds good: we have a way to encode the tree structure. But so far we only use the very abstract information at the root of the tree, while there is a lot of information at the intermediate levels, thanks to the tree we defined. This motivated us to come up with a better way to access the information at the intermediate levels, so that the decoder can have more information about the tree structure.
So here we propose the layer-wise attention mechanism. We apply attention to the domain layer, the dialogue act layer, and the slot layer. With this layer-wise attention mechanism, whenever the decoder produces a special slot token like this, the hidden state at that time step is used as a query to trigger the attention mechanism. For example, at the slot layer, all the slot information is treated as the context for the attention, and the model computes a probability distribution over the information in each of the three layers. At the slot layer, for instance, you get a distribution over all possible slots; it basically tells the model which slot, which information, it should focus on at that time step. Of course, during training we have supervision signals from the input semantics, which can guide the model on what to focus on at each time step. We then use these attention distributions as extra information for the next time step, and the generation process goes on.
So, with the layer-wise attention mechanism, the loss function becomes the standard cross-entropy loss plus the losses from the three attention mechanisms.
That's how we train our model.
Okay, let's go to the basic setup for the experiments. We use the MultiWOZ dataset, which has ten thousand dialogues over seven domains, and each utterance can have more than one dialogue act. We have three strong baselines. The first one is SC-LSTM, which basically uses a binary representation to encode the semantics. We also have TGen and RALSTM; these two are basically seq2seq models, so they use an LSTM to encode the semantics. For evaluation, we use the standard automatic metrics such as BLEU, and also the slot error rate, because we don't only want our natural language generation model to be fluent; the content should also be correct. We also conduct a human evaluation.
Okay, let's see some numbers first. In this figure, the source domain is restaurant and the target domain is hotel. The x-axis is the amount of adaptation data and the y-axis is the BLEU score. The three baseline models are shown, along with the tree-structured encoder and its variant, the tree-structured encoder with the attention mechanism. As you can see, with the full adaptation data, one hundred percent, all the models perform pretty much the same, because there is enough data. However, with less data, such as the last point at five percent, our models start to gain an advantage, thanks to the tree structure.
Now let's look at the numbers for the slot error rate. The slot error rate captures the following: we don't want our model to produce utterances with missing slots or with redundant slots.
Again, with one hundred percent of the data all the models perform very similarly; they are all good with the full data. However, with very limited data, even at one point two five percent of the data, our models start to produce much better performance than all the baselines.
The previous slide only showed one setup; we actually ran three different adaptation setups to show that the model works in different scenarios. The first column is the one used on the previous slide, restaurant-to-hotel adaptation; the middle column is restaurant to attraction; and the third one is train to taxi. Here we just want to show that we observe similar trends and similar results across the different setups.
Okay, we all know that for the natural language generation task it is not enough to evaluate only with automatic metrics, so we also conducted a human evaluation using Amazon Mechanical Turk. Each Turker was asked to score the utterances in terms of informativeness and naturalness. Here are the basic numbers. In terms of informativeness, the tree structure with attention scores best, and the tree without attention scores second best. This tells us that if you have a better way to encode the tree structure, then information can be shared during domain adaptation, and the model tends to produce the correct semantics in the generated sentences, while still maintaining the naturalness of those sentences.
So we wondered where our improvements come from: on what kind of examples does our model really perform well? We divided the test set into seen and unseen subsets. Seen basically means that the input semantics was seen during training; otherwise the example belongs to the unseen subset. Let's look at the numbers with fifty percent adaptation data. With this much data, most of the test examples are seen, and all the models perform similarly. The numbers here are the number of wrong examples each model produces, so lower is better. However, with very limited adaptation data, out of around nine hundred unseen semantics, semantics never seen during training or adaptation, the best baseline system produces around seven hundred wrong examples, that is, wrong semantics in the generated sentences, while our tree with attention produces a very low number, just around one hundred and thirty. This implicitly tells us that our model may have a better ability to generalise to unseen semantics.
Okay, here comes my conclusion. By modelling the semantic structure, more information can be shared between domains, and this is helpful for domain adaptation. Our model, especially with the proposed layer-wise attention mechanism, generates better sentences in terms of both automatic metrics and human scores, and with very limited adaptation data our model performs the best. Thank you very much for coming; any questions and feedback are welcome. Thank you.
Thank you very much. So, questions?
You said that you're doing well with one point two five percent, which sounds good; what is the number of training examples?
Yes. For example, when we adapt from restaurant to hotel, the pre-training set is about eight point five thousand examples, but if we use only one percent here, it is probably around six hundred.
That's still quite small.
Yes.
Hi, can you go to the slide with the tree?
Yes, here it is.
So, is the attention over all of the slots, or not all of them? For a given example only the green nodes are actually present in the data, so why do you need to attend to the rest, and from which nodes do you attend?
Sorry, actually it is over the slots within the semantics: only the slots present in the semantics are activated, and those are what we use for the attention.
Okay. Another question: when you do domain transfer, what if the two domains have different sets of slots? For slots that only appear in the unseen domain, the model is never trained on them.
Because of the nature of this dataset: as you see, restaurant, hotel, and attraction share most of their slots among the three domains, although they also have their own unique slots, and train and taxi share some slots. That is why, when we defined the setups, we chose restaurant to hotel, restaurant to attraction, and train to taxi: we try to leverage the shared slots.
Hello, great, so I had a question about the evaluation that looks at the redundant and missing slots.
Yes, the slot error rate.
My question is, conceptually, why does that even need to be a problem? You could have constraints that ensure that each slot is produced exactly once during generation.
Yes, and it depends on how you impose your constraints. If you put them into the loss function during training, that doesn't guarantee anything; the model can still violate your constraints. If instead you put the constraint at the output, like a post-processing step, you may filter out some slots, which is good, but you might also end up with unnatural sentences, because you are using rules to filter things out, and you need to come up with rules to keep the result fluent around what you filtered out. So it is actually a problem, and we simply follow the prior work here.
Okay, so I guess conceptually there would be a trade-off between naturalness and coverage, but if you know in advance that a requirement is coverage, then I guess your only degree of freedom would be to give up naturalness.
Sorry, I missed that.
Sorry, maybe you missed the last part; I was just making the comment that if you know in advance that your requirement is to generate all the slots, then your only degree of freedom is to give up on naturalness.
Right, I see your point.
Thank you.
I have a question regarding the tree that you have shown here in the picture. Are the values somehow encoded in this tree, or are you only taking the slots into account?
Only the slots; we don't use the values. There are also too many values, so I don't use them.
And then I have a related question: have you thought about a model that works without delexicalisation?
Yes, that would bring up some issues. For example, the values would form a pretty much open vocabulary, right? For this dataset we have restaurant names, attraction names, hotel names, train IDs, and time slots; this becomes very complex. It is still a challenging problem in NLG.
Okay, right, I think we need to move to the next paper, so let's thank the speaker again. Thank you very much.