Hi everyone. My name is Tiancheng, from Carnegie Mellon University, and I'm going to talk about our work on Zero-Shot Dialog Generation with Cross-Domain Latent Actions. The code and data are both available on GitHub.
So this talk is going to be about generative end-to-end dialog systems, which are perhaps one of the most flexible frameworks we have nowadays to model both task-oriented and non-task-oriented conversations. The basic idea, which I'm sure everybody is already familiar with: we have a dialog context, and we have an encoder that encodes whatever is available at testing time, such as the dialog history and other information. Conditioned on that, the network generates a response. The response can be a verbal response that is sent back to the human, or it can be an API request that is sent to databases. So a single model can handle both the interaction between human and machine, and the interaction between the machine and back-end databases.
Although this framework is very powerful and flexible, most of the successful prior work has one assumption: that there is a large training dataset in the exact same task or domain that we are interested in, so that we can train the model on it.
That assumption is often not true in practice, because dialog systems can be applied to so many different domains. Even just for slot filling, we have slot filling for bus schedules, weather, flights, and so many other domains.
Many times we don't have the exact data for the domain that we are interested in. On the other hand, humans are incredibly good at transferring knowledge from domain to domain. Imagine a customer service agent who has been working in the shoe department: they can very quickly adapt to the clothing department just by reading some training materials, without needing additional training example dialogs. We want to achieve similar goals in this study.
To summarize: the first goal is to exploit the flexibility of generative models, so that one model can simultaneously acquire knowledge from multiple domains. The second goal is to make this model able to transfer knowledge from source domains to new target domains where we don't have data. This is a new problem that we formalize as a learning problem we name Zero-Shot Dialog Generation, or ZSDG.
The setup is as follows. We have source domains, which are domains where we do have dialog data, and we have a set of target domains where we don't have dialog data. For every domain, both source and target, we do have access to a domain description, which can be any type of knowledge that describes the specific information about that domain. Given this setup, the learning problem becomes the following: at training time, the model can be trained on the dialogs from the source domains and also on the domain descriptions from both source and target. At testing time, we ask the model to directly generate responses in the target domains, whose dialogs were never seen in training. That's why we call it a zero-shot dialog generation problem.
Showing this in the formulas and in the visual figure: given this setup, it is very easy to see that the design of the domain description is the most important factor here, because it covers all the domains, and that is what enables the possibility of transferring knowledge from source to target. There could be many different types of domain descriptions, and in this study we propose one type that we call the seed response.
The assumption behind seed responses is that between the source and target domains there exist some sort of shared, related discourse patterns, for example in how utterances are phrased and also in the dialog policy. Given that assumption, what is a seed response? A set of seed responses is a list of tuples, and each tuple contains three elements: x, a, and d. Here x is an example utterance, which can be spoken by either the user or the system in that domain; a is the annotation of that utterance, like the example I show here; and d is basically the domain index. For each domain we have a table like this, containing seed responses from that domain.
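To make the data structure concrete, here is a minimal Python sketch of a seed response table. The class name, field names, and example rows are my own illustration, not from our released code:

```python
# A seed response is a triple (x, a, d): utterance, annotation, domain.
from dataclasses import dataclass

@dataclass
class SeedResponse:
    utterance: str   # x: an example user or system utterance from the domain
    annotation: str  # a: the semantic annotation of that utterance
    domain: str      # d: the domain index

# Hypothetical example rows from two domains.
seed_responses = [
    SeedResponse("what kind of food do you prefer?", "request(food)", "restaurant"),
    SeedResponse("what kind of movie are you looking for?", "request(genre)", "movie"),
]
```

Note how the two rows serve the same function in their respective domains up to the domain-specific slot; this is exactly the kind of cross-domain correspondence the algorithm will exploit.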
Given the seed responses and the dialogs from the source domains, how do we use these two types of data to train a model that actually solves ZSDG? In this work we propose a new class of algorithms called Action Matching. The most important notion in this algorithm is the cross-domain latent action: we introduce a new latent space, and we assume that every possible action from the system or the user can reside in this latent space.
In Action Matching, we propose to learn three parameterized functions. The first one, R, is the recognition network; the function of R is to map utterances or annotations, from words and sentences, to latent actions. The second one is the encoder, which encodes the dialog context and tries to predict what the next latent action is. And the third one is the decoder: because we are dealing with a generative model, we expect the decoder to basically take a point in the latent action space and map it back to a sentence.
The visual here shows all the possible transformations between the four variables: utterance, annotation, latent action, and dialog context. So now we have these three parameterized networks we want to learn, and we have two types of data. How do we optimize?
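Before going into the losses, here is a minimal PyTorch sketch of the three mappings. It is only illustrative and differs from the real system: the sizes are arbitrary, the dialog context is flattened into one token sequence instead of being encoded hierarchically, and the decoder is reduced to a bag-of-words output layer so that the loss sketches below stay short.

```python
import torch
import torch.nn as nn

LATENT, EMB, VOCAB = 64, 64, 1000  # toy sizes, not the real configuration

class RecognitionNetwork(nn.Module):
    """R: maps an utterance or an annotation to a latent action z."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, LATENT, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * LATENT, LATENT)
    def forward(self, tokens):                # tokens: (batch, seq_len) ids
        _, h = self.rnn(self.emb(tokens))     # h: (2, batch, LATENT)
        return self.proj(torch.cat([h[0], h[1]], dim=-1))

class ContextEncoder(nn.Module):
    """F_e: maps a dialog context to a predicted latent action."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, LATENT, batch_first=True)
    def forward(self, context):               # (batch, seq_len) flattened context
        _, h = self.rnn(self.emb(context))
        return h[-1]                          # (batch, LATENT)

class Decoder(nn.Module):
    """F_d: maps a latent action back to a sentence (bag-of-words logits here)."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(LATENT, VOCAB)
    def forward(self, z):
        return self.out(z)                    # (batch, VOCAB)
```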
The first type of data we encounter is the seed response data: basically a bunch of sentences from different domains. The objective here is that we want the latent actions of two utterances from two different domains to match each other only when their annotations match each other. What we do is the following. Say the yellow utterance is from one domain, for instance restaurant, and the other one is from movie. We introduce the first loss function, called the domain description loss, where we basically minimize the distance between the latent action of the utterance x and the latent action of its annotation a. In this way, utterances from two domains will only be close to each other when their annotations are close to each other.
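Continuing the toy model above, a sketch of this domain description loss could look as follows; the L2 distance and the bag-of-words reconstruction are my simplifications, not the exact formulation from the paper:

```python
import torch.nn.functional as Fn

def bow_nll(logits, tokens):
    """Negative log-likelihood of each target token under bag-of-words logits."""
    logp = Fn.log_softmax(logits, dim=-1)     # (batch, VOCAB)
    return -logp.gather(1, tokens).mean()     # tokens: (batch, seq_len) ids

def domain_description_loss(R, dec, utt, ann):
    """L_dd on seed-response pairs (x, a): reconstruct x from its latent
    action, and pull R(x) toward R(a), so utterances from different domains
    end up close only when their annotations are close."""
    z_x, z_a = R(utt), R(ann)
    return bow_nll(dec(z_x), utt) + Fn.mse_loss(z_x, z_a)
```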
The second type of data we are dealing with is the dialog data from the source domains. Here the objective is that we want the predicted latent action to be accurate, so that the action predicted from a dialog context is similar to the latent action of the actual response observed in the data. For this we introduce the second loss. The bottom part of this figure is the same as the previous slide; on top we have the predicted latent action from the context encoder, and we try to minimize the distance between this predicted action and the latent action of the response x here.
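A matching sketch of this dialog loss, under the same toy assumptions (it reuses bow_nll from the previous sketch):

```python
import torch.nn.functional as Fn

def dialog_loss(R, F_e, dec, context, response):
    """L_d on source-domain dialogs: train the encoder-decoder to produce
    the observed response from the context, and pull the predicted latent
    action F_e(context) toward the recognized action R(response)."""
    z_hat = F_e(context)          # action predicted from the dialog context
    z_x = R(response)             # latent action of the actual response
    return bow_nll(dec(z_hat), response) + Fn.mse_loss(z_hat, z_x)
```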
To summarize, the Action Matching algorithm is as shown here, and it is a very simple and elegant solution: we only have two loss functions, and we alternate between them. The first is the domain description loss, which we use when dealing with data from the seed responses; its second term minimizes the distance between the latent actions of an utterance and its annotation, and its first term trains the decoder to generate the responses from both source and target. The second one, the dialog loss, is essentially the usual loss of a latent-variable encoder-decoder model; you can see that while it trains the encoder-decoder, its other term minimizes the distance I just talked about. Training with Action Matching is then basically taking data from the two streams, seed responses and dialogs, randomly picking one, and optimizing the corresponding loss function.
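Putting the pieces together, the alternating optimization can be sketched like this, again using the toy modules and losses from the previous sketches; the 50/50 sampling between streams is my simplification:

```python
import random
import torch

R, F_e, dec = RecognitionNetwork(), ContextEncoder(), Decoder()
params = list(R.parameters()) + list(F_e.parameters()) + list(dec.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def train_step(seed_batch, dialog_batch):
    """One step of Action Matching: pick a data stream, apply its loss."""
    opt.zero_grad()
    if random.random() < 0.5:                 # seed-response stream
        utt, ann = seed_batch
        loss = domain_description_loss(R, dec, utt, ann)
    else:                                     # source-dialog stream
        ctx, resp = dialog_batch
        loss = dialog_loss(R, F_e, dec, ctx, resp)
    loss.backward()
    opt.step()
    return loss.item()
```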
For the exact implementation in this study, we use a bidirectional GRU for the recognition network, and a hierarchical LSTM encoder for the context encoder. For the decoder we experimented with two kinds: one is a standard LSTM decoder with attention, and the second one is an LSTM decoder with the pointer-sentinel copy mechanism, that is, a decoder with a copy mechanism so that it can copy words from the context and put them directly into the output response. It has been shown to be pretty robust against out-of-vocabulary tokens in language modeling.
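To illustrate the copy mechanism, here is a sketch of the mixture step at the core of a pointer-sentinel style decoder. This is my simplified rendition of the general idea, not our exact decoder: the output distribution mixes a vocabulary softmax with attention mass scattered back onto the context token ids, so a word can be emitted by copying even when it is out of vocabulary.

```python
import torch

def copy_mixture(p_vocab, attn, context_ids, g):
    """p_vocab:  (batch, vocab) softmax over the vocabulary
    attn:        (batch, ctx_len) attention over context positions
    context_ids: (batch, ctx_len) token ids of the context words
    g:           (batch, 1) sentinel gate in [0, 1]
    Returns the mixed output distribution, shape (batch, vocab)."""
    p_copy = torch.zeros_like(p_vocab)
    p_copy.scatter_add_(1, context_ids, attn)  # accumulate attention per word id
    return g * p_vocab + (1.0 - g) * p_copy
```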
Here we show the full picture of what we have in this model with the copy decoder: the left figure shows how we deal with dialog data, the second figure shows how we deal with seed response data, and the three networks can be optimized jointly.
To test our method, we applied this framework to two tasks: the first one is SimDial, and the second one is the Stanford multi-domain dialog dataset. SimDial is a new open-source multi-domain dialog generator with complexity control; it is open source on GitHub, where there are more detailed instructions about how to use it. We used this generator to generate dialogs from several domains. We take three domains as the source: restaurant, bus, and weather, each with one thousand dialogs.
For the target domains we have four, and we cast them in different perspectives. The first one is restaurant; this is in-domain, because it is also seen in training. The second one is restaurant with new slots: it is still the restaurant domain, but with a completely different set of slot values. The third one is restaurant with a new style: still restaurant, but using a completely different set of natural language generation templates for both user and system. And the last one, movie, is a new domain that shares nothing with anything in the source domains, which makes it the most challenging one. For the seed responses, we take one hundred utterances from each domain, and we use the internal semantic frame as the annotation.
The second dataset we test on is the Stanford data, that is, the Stanford multi-domain dialog dataset, with dialogs from three domains: scheduling, weather, and navigation. We take a leave-one-out approach by rotation: we use one domain as the target and the other two as the source, so we have three possible configurations. We use one hundred fifty utterances from each domain as the seed responses, and we had expert annotators annotate them with semantic frames. That is all we need from the target domain: we only use those one hundred fifty utterances from the target domain in training, and we don't use any dialogs from that domain.
For evaluation, which is notoriously hard for dialog, because this is the task-oriented setting we evaluate the systems with four different metrics: BLEU score, entity F1, dialog act F1, and KB query F1. To quantify the overall performance we compute a new score, the BEAK score, which basically takes the geometric mean of the four metrics, giving one number per system as an overall performance measure.
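Concretely, the overall score is just a geometric mean; a one-line sketch, assuming all four metrics are already on a common 0-100 scale:

```python
from statistics import geometric_mean

def beak(bleu, entity_f1, act_f1, kb_f1):
    return geometric_mean([bleu, entity_f1, act_f1, kb_f1])

print(beak(40.0, 70.0, 80.0, 60.0))  # one overall number per system (illustrative values)
```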
We compare four different models. The top two are baselines: the first is a standard encoder-decoder with attention, and the second is the decoder with the copy mechanism. The two proposed methods basically add the proposed Action Matching algorithm to these two baselines, so we can see what happens when we add action matching.
Now the results. Here are the overall performances: on the left-hand side we show the BEAK score on the synthetic SimDial data, and on the right we show the overall performance on the Stanford data. We can already see some interesting things. First, the two baselines perform pretty well on the in-domain data, which is the normal train/test scenario, but when they move to the new-slot and especially the new-domain settings, their performance drops significantly. We can also see that the green bar, which is action matching plus the copy decoder, has really strong performance in those target domains that are quite different from the training data: especially in the new movie domain it is able to achieve a BEAK of about sixty-eight, whereas even the in-domain performance caps out at about eighty-two. So it is actually learning something that transfers, and compared to the two baselines the performance improvement is significant.
From this we came up with four questions that we analyze in the later experiments. The first one is: why does everything fail when moving from source to target? Second, it is interesting to see that the copy decoder alone is already doing something pretty interesting compared to the baseline, so what does the copy decoder solve? The next question is: what does action matching solve? And lastly, how does the size of the seed response set affect the performance? Now let's go into each question one by one.
First, why everything fails on a new domain. This figure shows just the dialog act F1 performance, and it is surprising to see that for all the models, both the baselines and our proposed ones, the dialog act F1 is quite similar across the different settings. What happens is that we found the baselines fail to generate the correct entities as well as the novel utterances, i.e. the novel words, in a new domain, but the dialog acts are actually okay, at least in this dataset. One good example you can see here: when the reference is "see you next time", all models are able to generate it; that kind of short response transfers across domains with no problem. But for a bad example, look at this sample: the reference asks what kind of food you prefer, and the baseline generates "hi, this is the restaurant system, how can I help you". It at least produces a plausible dialog act, but the words are completely off: it still thinks it is in the restaurant domain. The same happens in the movie domain: here the reference is about science fiction movies and what time the movie is, and the baseline only generates "what kind of restaurant are you looking for". That is the problem when a model trained on restaurant is tested on movie.
Then the next question is: what does the copy decoder solve? Here the most informative metric is the entity F1 score. We found that the copy decoder, the decoder with the copy mechanism, gets its entity score boost because it learns to copy entities from the context and output them, even if they are out-of-vocabulary tokens for the model. A good example: if the reference says something like "how does science fiction sound", the copy decoder is able to produce "science fiction" by directly copying the words from the user's speech instead of predicting them from its vocabulary. But the problem with copying shows in the bad examples. Here the reference is "I believe you said comedy movie", and the system generates a fragment like "I believe you said come...": it grabs the copied word "comedy" but does not generate a coherent sentence around it. In another example, the system says "I would recommend restaurant 55", although 55 is a movie name: it should be saying "movie 55 is a good choice".
The next question is: what does the proposed action matching solve? The answer: the most relevant score here is the BLEU score, because we want to see whether the correct utterance is being generated in the new domain. We find that action matching enables the decoder to generate entirely novel utterances that never occurred in training, not only novel entities. Here are some good examples. One is "movie 55 is a good choice", followed by asking the user to make a choice, composed for the unseen movie domain. And from the more complex human data, the model says something like "your reminder on Friday at 10 has been scheduled", even though it was only trained on the weather and navigation domains, which talk about things like weather conditions and distances; it still generates this novel scheduling utterance.
The last question is: how does the size of the seed response set affect performance? This is a plot on the Stanford human data. We vary the number of seed responses from zero to two hundred and watch how the performance changes. One thing this confirms is that performance does increase when we enlarge the seed response set, because a bigger set has wider coverage of what can happen in the data. But we can also see that the performance plateaus when we go beyond about one hundred twenty-five to one hundred fifty. That validates the practical promise of using seed responses: we don't need a huge set of them to get good performance.
To summarize: we propose a new problem called ZSDG, and we propose Action Matching, an algorithm that performs pretty well on this task under the assumption of shared discourse patterns. We ran experiments that validate the performance on both human and synthetic datasets. And lastly, we also open-sourced the entire multi-domain dialog generator, which can be used as a benchmark for future experiments.
At the end, I want to say that this is a first step towards a very big direction, and it opens up many interesting problems that we can explore in the future. For example, how do we quantify the relationship between domains, and in which situations is transfer possible? Also, how do we rely less on human annotation? Right now we depend on annotations to find the relationship between utterances across domains. Then, how do we deal with the situation where the assumption behind seed responses fails, when the target domain has a different discourse pattern or a different dialog policy — how do we handle that? And the last one: what other types of domain description are there, and how do we use them to enable ZSDG? Thank you very much.
Which figure do you mean? So, on the left we have the BEAK score; the range here is from zero to one hundred. This is the synthetic data, so it is easier to achieve high performance in these domains, even though we intentionally added a lot of complexity, such as environmental noise simulating different channel behaviors; so that is the range there. On the right, I think what is shown is also the BEAK score; for the human data it is the geometric mean of the BLEU and entity F1. Its range is also zero to one hundred, but the human dataset is much more challenging: the BLEU scores are mostly pretty low, somewhere around zero to twenty-something, so they drag the number down. So the range here is also zero to one hundred; it is just that the right one is much more challenging.
This example comes from the configuration where we treat scheduling as the target and weather and navigation as the source domains, and we test on the scheduling domain. Given the dialog history — I'll skip over the history itself — the reference system utterance is a scheduling confirmation, "okay, scheduling...". The generation is not perfect, but this is the only model that is able to generate a coherent utterance that obviously conforms to the scheduling domain, an utterance with the right kind of entities. In contrast, as the example shows, the baseline systems just do not generate coherent utterances in the scheduling domain: they are more likely to generate something like "what's the weather" or "okay, navigating to...", a kind of strong bias carried over from the source. In the transfer from source to target, ours is the only one able to shift its style completely from the source to the target.
Clearly, I think the most challenging one is the navigation domain. If you look into the conversations in the dataset, scheduling usually leads to short dialogs: a scheduling request is not very long, something like "schedule a meeting with my friend at eleven", plus a confirmation, so the conversation finishes after only about three to five turns. Navigation dialogs are much longer, and they involve much more detailed information, like navigating from one place to another, so it is much harder to get all the entities right; and sometimes users want to change the navigation destination mid-dialog. So I think navigation is the more challenging domain compared to the other two.
If you don't have the seed responses for the target domain, then you cannot do the transfer, because all the knowledge we have about the target domain comes from the seed responses. Basically, what action matching is doing is finding utterances that serve similar functions across domains. For example, one source domain may have, say, a request utterance, and the model tries to find the utterance in the new domain that fulfills a similar function, so that it can transfer knowledge about the policy to the new domain. When a request comes in in the new domain, the model will find the best-matching sentences in the target domain's seed responses. So if we don't have the target domain's seed responses, Action Matching will not work.
The definition of zero-shot here means that we don't have any dialog data from the target domain, that is, we don't have any multi-turn conversations in the target domain. The seed responses are only individual utterances, not dialogs, so they do not count as dialog data. The overall definition we propose here is built on domain descriptions: it doesn't have to be seed responses — it can be any other type of domain description, depending on the application — but here we assume that the seed responses are the only description we have about the target domain. As long as you have some sort of description, some knowledge about the target, that can act as the basis for transferring to the new domain.
On whether the latent representation we learn is interpretable: right now the latent space is continuous, so we tried to project it into 2D and plot it, and we can see some patterns where it groups similar sentences from different domains together. But I think it is an interesting direction to see how we can extract more explicit information from it for interpretation.