Hi everyone. My name is Tiancheng, from Carnegie Mellon University, and I'm going to talk about our work on Zero-Shot Dialog Generation with Cross-Domain Latent Actions. The code and data are both available on GitHub.
So this talk is going to be about generative end-to-end dialog systems, which are perhaps one of the most flexible frameworks we have nowadays to model both task-oriented and non-task-oriented conversations. The basic idea, which I'm sure everybody is already familiar with: we have a dialog context, and we have an encoder that encodes whatever is available at testing time, such as the dialog history and other information. Conditioned on that, the network generates a response. The response can be a verbal response that is sent back to the human, or it can be an API request that is sent to databases. So a single model can handle both the interaction between human and machine, and the interaction between the machine and back-end databases.
Although this framework is very powerful and flexible, most of the successful prior work has one assumption: that there is a large training dataset in the exact same task or domain that we are interested in, so that we can train the model on it.
That assumption is often not true in practice, because dialog systems can be applied to so many different domains. Even just for slot filling, we have slot filling for bus schedules, weather, flights, and so many other domains.
Many times we don't have the exact data for the domain that we are interested in. On the other hand, humans are incredibly good at transferring knowledge from domain to domain. Imagine a customer service agent who has been working in the shoe department: they can very quickly adapt to the clothing department just by reading some training materials, without needing additional training example dialogs. We want to achieve similar goals in this study.
To summarize: the first goal is to exploit the flexibility of generative models, so that one model can simultaneously acquire knowledge from multiple domains. The second goal is to make this model able to transfer knowledge from source domains to new target domains where we don't have data. This is a new problem that we formalize as a learning problem we name Zero-Shot Dialog Generation, or ZSDG.
The setup is as follows. We have source domains, which are domains where we do have dialog data, and we have a set of target domains where we don't have dialog data. For every domain, both source and target, we do have access to a domain description, which can be any type of knowledge that describes the specific information about that domain. Given this setup, the learning problem becomes the following: at training time, the model can be trained on the dialogs from the source domains and also on the domain descriptions from both source and target. At testing time, we ask the model to directly generate responses in the target domains, whose dialogs were never seen in training. That's why we call it a zero-shot dialog generation problem.
Showing this in the formulas and in the visual figure: given this setup, it is very easy to see that the design of the domain description is the most important factor here, because it covers all the domains, and that is what enables the possibility of transferring knowledge from source to target. There could be many different types of domain descriptions, and in this study we propose one type that we call the seed response.
The assumption behind seed responses is that between the source and target domains there exist some sort of shared, related discourse patterns, for example in how utterances are phrased and also in the dialog policy. Given that assumption, what is a seed response? A set of seed responses is a list of tuples, and each tuple contains three elements: x, a, and d. Here x is an example utterance, which can be spoken by either the user or the system in that domain; a is the annotation of that utterance, like the example I show here; and d is basically the domain index. For each domain we have a table like this, containing seed responses from that domain.
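To make the data structure concrete, here is a minimal Python sketch of a seed response table. The class name, field names, and example rows are my own illustration, not from our released code:

```python
# A seed response is a triple (x, a, d): utterance, annotation, domain.
from dataclasses import dataclass

@dataclass
class SeedResponse:
    utterance: str   # x: an example user or system utterance from the domain
    annotation: str  # a: the semantic annotation of that utterance
    domain: str      # d: the domain index

# Hypothetical example rows from two domains.
seed_responses = [
    SeedResponse("what kind of food do you prefer?", "request(food)", "restaurant"),
    SeedResponse("what kind of movie are you looking for?", "request(genre)", "movie"),
]
```

Note how the two rows serve the same function in their respective domains up to the domain-specific slot; this is exactly the kind of cross-domain correspondence the algorithm will exploit.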
Given the seed responses and the dialogs from the source domains, how do we use these two types of data to train a model that actually solves ZSDG? In this work we propose a new class of algorithms called Action Matching. The most important notion in this algorithm is the cross-domain latent action: we introduce a new latent space, and we assume that every possible action from the system or the user can reside in this latent space.
In Action Matching, we propose to learn three parameterized functions. The first one, R, is the recognition network; the function of R is to map utterances or annotations, from words and sentences, to latent actions. The second one is the encoder, which encodes the dialog context and tries to predict what the next latent action is. And the third one is the decoder: because we are dealing with a generative model, we expect the decoder to basically take a point in the latent action space and map it back to a sentence.
The visual here shows all the possible transformations between the four variables: utterance, annotation, latent action, and dialog context. So now we have these three parameterized networks we want to learn, and we have two types of data. How do we optimize?
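Before going into the losses, here is a minimal PyTorch sketch of the three mappings. It is only illustrative and differs from the real system: the sizes are arbitrary, the dialog context is flattened into one token sequence instead of being encoded hierarchically, and the decoder is reduced to a bag-of-words output layer so that the loss sketches below stay short.

```python
import torch
import torch.nn as nn

LATENT, EMB, VOCAB = 64, 64, 1000  # toy sizes, not the real configuration

class RecognitionNetwork(nn.Module):
    """R: maps an utterance or an annotation to a latent action z."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, LATENT, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * LATENT, LATENT)
    def forward(self, tokens):                # tokens: (batch, seq_len) ids
        _, h = self.rnn(self.emb(tokens))     # h: (2, batch, LATENT)
        return self.proj(torch.cat([h[0], h[1]], dim=-1))

class ContextEncoder(nn.Module):
    """F_e: maps a dialog context to a predicted latent action."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, LATENT, batch_first=True)
    def forward(self, context):               # (batch, seq_len) flattened context
        _, h = self.rnn(self.emb(context))
        return h[-1]                          # (batch, LATENT)

class Decoder(nn.Module):
    """F_d: maps a latent action back to a sentence (bag-of-words logits here)."""
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(LATENT, VOCAB)
    def forward(self, z):
        return self.out(z)                    # (batch, VOCAB)
```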
The first type of data we encounter is the seed response data: basically a bunch of sentences from different domains. The objective here is that we want the latent actions of two utterances from two different domains to match each other only when their annotations match each other. What we do is the following. Say the yellow utterance is from one domain, for instance restaurant, and the other one is from movie. We introduce the first loss function, called the domain description loss, where we basically minimize the distance between the latent action of the utterance x and the latent action of its annotation a. In this way, utterances from two domains will only be close to each other when their annotations are close to each other.
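Continuing the toy model above, a sketch of this domain description loss could look as follows; the L2 distance and the bag-of-words reconstruction are my simplifications, not the exact formulation from the paper:

```python
import torch.nn.functional as Fn

def bow_nll(logits, tokens):
    """Negative log-likelihood of each target token under bag-of-words logits."""
    logp = Fn.log_softmax(logits, dim=-1)     # (batch, VOCAB)
    return -logp.gather(1, tokens).mean()     # tokens: (batch, seq_len) ids

def domain_description_loss(R, dec, utt, ann):
    """L_dd on seed-response pairs (x, a): reconstruct x from its latent
    action, and pull R(x) toward R(a), so utterances from different domains
    end up close only when their annotations are close."""
    z_x, z_a = R(utt), R(ann)
    return bow_nll(dec(z_x), utt) + Fn.mse_loss(z_x, z_a)
```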
The second type of data we are dealing with is the dialog data from the source domains. Here the objective is that we want the predicted latent action to be accurate, so that the action predicted from a dialog context is similar to the latent action of the actual response observed in the data. For this we introduce the second loss. The bottom part of this figure is the same as the previous slide; on top we have the predicted latent action from the context encoder, and we try to minimize the distance between this predicted action and the latent action of the response x here.
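A matching sketch of this dialog loss, under the same toy assumptions (it reuses bow_nll from the previous sketch):

```python
import torch.nn.functional as Fn

def dialog_loss(R, F_e, dec, context, response):
    """L_d on source-domain dialogs: train the encoder-decoder to produce
    the observed response from the context, and pull the predicted latent
    action F_e(context) toward the recognized action R(response)."""
    z_hat = F_e(context)          # action predicted from the dialog context
    z_x = R(response)             # latent action of the actual response
    return bow_nll(dec(z_hat), response) + Fn.mse_loss(z_hat, z_x)
```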
To summarize, the Action Matching algorithm is as shown here, and it is a very simple and elegant solution: we only have two loss functions, and we alternate between them. The first is the domain description loss, which we use when dealing with data from the seed responses; its second term minimizes the distance between the latent actions of an utterance and its annotation, and its first term trains the decoder to generate the responses from both source and target. The second one, the dialog loss, is essentially the usual loss of a latent-variable encoder-decoder model; you can see that while it trains the encoder-decoder, its other term minimizes the distance I just talked about. Training with Action Matching is then basically taking data from the two streams, seed responses and dialogs, randomly picking one, and optimizing the corresponding loss function.
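Putting the pieces together, the alternating optimization can be sketched like this, again using the toy modules and losses from the previous sketches; the 50/50 sampling between streams is my simplification:

```python
import random
import torch

R, F_e, dec = RecognitionNetwork(), ContextEncoder(), Decoder()
params = list(R.parameters()) + list(F_e.parameters()) + list(dec.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def train_step(seed_batch, dialog_batch):
    """One step of Action Matching: pick a data stream, apply its loss."""
    opt.zero_grad()
    if random.random() < 0.5:                 # seed-response stream
        utt, ann = seed_batch
        loss = domain_description_loss(R, dec, utt, ann)
    else:                                     # source-dialog stream
        ctx, resp = dialog_batch
        loss = dialog_loss(R, F_e, dec, ctx, resp)
    loss.backward()
    opt.step()
    return loss.item()
```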
For the exact implementation in this study, we use a bidirectional GRU for the recognition network, and a hierarchical LSTM encoder for the context encoder. For the decoder we experimented with two kinds: one is a standard LSTM decoder with attention, and the second one is an LSTM decoder with the pointer-sentinel copy mechanism, that is, a decoder with a copy mechanism so that it can copy words from the context and put them directly into the output response. It has been shown to be pretty robust against out-of-vocabulary tokens in language modeling.
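To illustrate the copy mechanism, here is a sketch of the mixture step at the core of a pointer-sentinel style decoder. This is my simplified rendition of the general idea, not our exact decoder: the output distribution mixes a vocabulary softmax with attention mass scattered back onto the context token ids, so a word can be emitted by copying even when it is out of vocabulary.

```python
import torch

def copy_mixture(p_vocab, attn, context_ids, g):
    """p_vocab:  (batch, vocab) softmax over the vocabulary
    attn:        (batch, ctx_len) attention over context positions
    context_ids: (batch, ctx_len) token ids of the context words
    g:           (batch, 1) sentinel gate in [0, 1]
    Returns the mixed output distribution, shape (batch, vocab)."""
    p_copy = torch.zeros_like(p_vocab)
    p_copy.scatter_add_(1, context_ids, attn)  # accumulate attention per word id
    return g * p_vocab + (1.0 - g) * p_copy
```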
Here we show the full picture of what we have in this model with the copy decoder: the left figure shows how we deal with dialog data, the second figure shows how we deal with seed response data, and the three networks can be optimized jointly.
To test our method, we applied this framework to two tasks: the first one is SimDial, and the second one is the Stanford multi-domain dialog dataset. SimDial is a new open-source multi-domain dialog generator with complexity control; it is open source on GitHub, where there are more detailed instructions about how to use it. We used this generator to generate dialogs from several domains. We take three domains as the source: restaurant, bus, and weather, each with one thousand dialogs.
For the target domains we have four, and we cast them in different perspectives. The first one is restaurant; this is in-domain, because it is also seen in training. The second one is restaurant with new slots: it is still the restaurant domain, but with a completely different set of slot values. The third one is restaurant with a new style: still restaurant, but using a completely different set of natural language generation templates for both user and system. And the last one, movie, is a new domain that shares nothing with anything in the source domains, which makes it the most challenging one. For the seed responses, we take one hundred utterances from each domain, and we use the internal semantic frame as the annotation.
The second dataset we test on is the Stanford data, that is, the Stanford multi-domain dialog dataset, with dialogs from three domains: scheduling, weather, and navigation. We take a leave-one-out approach by rotation: we use one domain as the target and the other two as the source, so we have three possible configurations. We use one hundred fifty utterances from each domain as the seed responses, and we had expert annotators annotate them with semantic frames. That is all we need from the target domain: we only use those one hundred fifty utterances from the target domain in training, and we don't use any dialogs from that domain.
For evaluation, which is notoriously hard for dialog, because this is the task-oriented setting we evaluate the systems with four different metrics: BLEU score, entity F1, dialog act F1, and KB query F1. To quantify the overall performance we compute a new score, the BEAK score, which basically takes the geometric mean of the four metrics, giving one number per system as an overall performance measure.
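Concretely, the overall score is just a geometric mean; a one-line sketch, assuming all four metrics are already on a common 0-100 scale:

```python
from statistics import geometric_mean

def beak(bleu, entity_f1, act_f1, kb_f1):
    return geometric_mean([bleu, entity_f1, act_f1, kb_f1])

print(beak(40.0, 70.0, 80.0, 60.0))  # one overall number per system (illustrative values)
```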
We compare four different models. The top two are baselines: the first is a standard encoder-decoder with attention, and the second is the decoder with the copy mechanism. The two proposed methods basically add the proposed Action Matching algorithm to these two baselines, so we can see what happens when we add action matching.
Now the results. Here are the overall performances: on the left-hand side we show the BEAK score on the synthetic SimDial data, and on the right we show the overall performance on the Stanford data. We can already see some interesting things. First, the two baselines perform pretty well on the in-domain data, which is the normal train/test scenario, but when they move to the new-slot and especially the new-domain settings, their performance drops significantly. We can also see that the green bar, which is action matching plus the copy decoder, has really strong performance in those target domains that are quite different from the training data: especially in the new movie domain it is able to achieve a BEAK of about sixty-eight, whereas even the in-domain performance caps out at about eighty-two. So it is actually learning something that transfers, and compared to the two baselines the performance improvement is significant.
From this we came up with four questions that we analyze in the later experiments. The first one is: why does everything fail when moving from source to target? Second, it is interesting to see that the copy decoder alone is already doing something pretty interesting compared to the baseline, so what does the copy decoder solve? The next question is: what does action matching solve? And lastly, how does the size of the seed response set affect the performance? Now let's go into each question one by one.
First, why everything fails on a new domain. This figure shows just the dialog act F1 performance, and it is surprising to see that for all the models, both the baselines and our proposed ones, the dialog act F1 is quite similar across the different settings. What happens is that we found the baselines fail to generate the correct entities as well as the novel utterances, i.e. the novel words, in a new domain, but the dialog acts are actually okay, at least in this dataset. One good example you can see here: when the reference is "see you next time", all models are able to generate it; that kind of short response transfers across domains with no problem. But for a bad example, look at this sample: the reference asks what kind of food you prefer, and the baseline generates "hi, this is the restaurant system, how can I help you". It at least produces a plausible dialog act, but the words are completely off: it still thinks it is in the restaurant domain. The same happens in the movie domain: here the reference is about science fiction movies and what time the movie is, and the baseline only generates "what kind of restaurant are you looking for". That is the problem when a model trained on restaurant is tested on movie.
Then the next question is: what does the copy decoder solve? Here the most informative metric is the entity F1 score. We found that the copy decoder, the decoder with the copy mechanism, gets its entity score boost because it learns to copy entities from the context and output them, even if they are out-of-vocabulary tokens for the model. A good example: if the reference says something like "how does science fiction sound", the copy decoder is able to produce "science fiction" by directly copying the words from the user's speech instead of predicting them from its vocabulary. But the problem with copying shows in the bad examples. Here the reference is "I believe you said comedy movie", and the system generates a fragment like "I believe you said come...": it grabs the copied word "comedy" but does not generate a coherent sentence around it. In another example, the system says "I would recommend restaurant 55", although 55 is a movie name: it should be saying "movie 55 is a good choice".
The next question is: what does the proposed action matching solve? The answer: the most relevant score here is the BLEU score, because we want to see whether the correct utterance is being generated in the new domain. We find that action matching enables the decoder to generate entirely novel utterances that never occurred in training, not only novel entities. Here are some good examples. One is "movie 55 is a good choice", followed by asking the user to make a choice, composed for the unseen movie domain. And from the more complex human data, the model says something like "your reminder on Friday at 10 has been scheduled", even though it was only trained on the weather and navigation domains, which talk about things like weather conditions and distances; it still generates this novel scheduling utterance.
The last question is: how does the size of the seed response set affect performance? This is a plot on the Stanford human data. We vary the number of seed responses from zero to two hundred and watch how the performance changes. One thing this confirms is that performance does increase when we enlarge the seed response set, because a bigger set has wider coverage of what can happen in the data. But we can also see that the performance plateaus when we go beyond about one hundred twenty-five to one hundred fifty. That validates the practical promise of using seed responses: we don't need a huge set of them to get good performance.
To summarize: we propose a new problem called ZSDG, and we propose Action Matching, an algorithm that performs pretty well on this task under the assumption of shared discourse patterns. We ran experiments that validate the performance on both human and synthetic datasets. And lastly, we also open-sourced the entire multi-domain dialog generator, which can be used as a benchmark for future experiments.
At the end, I want to say that this is a first step towards a very big direction, and it opens up many interesting problems that we can explore in the future. For example, how do we quantify the relationship between domains, and in which situations is transfer possible? Also, how do we rely less on human annotation? Right now we depend on annotations to find the relationship between utterances across domains. Then, how do we deal with the situation where the assumption behind seed responses fails, when the target domain has a different discourse pattern or a different dialog policy — how do we handle that? And the last one: what other types of domain description are there, and how do we use them to enable ZSDG? Thank you very much.
Which figure do you mean? So, on the left we have the BEAK score; the range here is from zero to one hundred. This is the synthetic data, so it is easier to achieve high performance in these domains, even though we intentionally added a lot of complexity, such as environmental noise simulating different channel behaviors; so that is the range there. On the right, I think what is shown is also the BEAK score; for the human data it is the geometric mean of the BLEU and entity F1. Its range is also zero to one hundred, but the human dataset is much more challenging: the BLEU scores are mostly pretty low, somewhere around zero to twenty-something, so they drag the number down. So the range here is also zero to one hundred; it is just that the right one is much more challenging.
This example comes from the configuration where we treat scheduling as the target and weather and navigation as the source domains, and we test on the scheduling domain. Given the dialog history — I'll skip over the history itself — the reference system utterance is a scheduling confirmation, "okay, scheduling...". The generation is not perfect, but this is the only model that is able to generate a coherent utterance that obviously conforms to the scheduling domain, an utterance with the right kind of entities. In contrast, as the example shows, the baseline systems just do not generate coherent utterances in the scheduling domain: they are more likely to generate something like "what's the weather" or "okay, navigating to...", a kind of strong bias carried over from the source. In the transfer from source to target, ours is the only one able to shift its style completely from the source to the target.
Clearly, I think the most challenging one is the navigation domain. If you look into the conversations in the dataset, scheduling usually leads to short dialogs: a scheduling request is not very long, something like "schedule a meeting with my friend at eleven", plus a confirmation, so the conversation finishes after only about three to five turns. Navigation dialogs are much longer, and they involve much more detailed information, like navigating from one place to another, so it is much harder to get all the entities right; and sometimes users want to change the navigation destination mid-dialog. So I think navigation is the more challenging domain compared to the other two.
If you don't have the seed responses for the target domain, then you cannot do the transfer, because all the knowledge we have about the target domain comes from the seed responses. Basically, what action matching is doing is finding utterances that serve similar functions across domains. For example, one source domain may have, say, a request utterance, and the model tries to find the utterance in the new domain that fulfills a similar function, so that it can transfer knowledge about the policy to the new domain. When a request comes in in the new domain, the model will find the best-matching sentences in the target domain's seed responses. So if we don't have the target domain's seed responses, Action Matching will not work.
The definition of zero-shot here means that we don't have any dialog data from the target domain, that is, we don't have any multi-turn conversations in the target domain. The seed responses are only individual utterances, not dialogs, so they do not count as dialog data. The overall definition we propose here is built on domain descriptions: it doesn't have to be seed responses — it can be any other type of domain description, depending on the application — but here we assume that the seed responses are the only description we have about the target domain. As long as you have some sort of description, some knowledge about the target, that can act as the basis for transferring to the new domain.
On whether the latent representation we learn is interpretable: right now the latent space is continuous, so we tried to project it into 2D and plot it, and we can see some patterns where it groups similar sentences from different domains together. But I think it is an interesting direction to see how we can extract more explicit information from it for interpretation.