Hi everyone. I'm a student at the university, and I'll be discussing some joint work between my collaborators in the NLP group and the Ford Research and Innovation center.
And I guess before I actually get started: this talk is going to be pretty deep-learning heavy. So before you start to write me off as, you know, the harbinger of everything that is bad in dialogue today, I want to say a little bit about why I think this is worth talking about. So just bear with me.
Before I get into the work itself, I'd like to take a step back and discuss what I think are the larger motivations of dialogue research. And to do that, I'd like to talk about the film Her, which some of you may have seen.
In it, the protagonist, played by Joaquin Phoenix, essentially develops an intimate relationship, in a sparse futuristic world, with his superintelligent assistant, Samantha. And what makes her so appealing is her charisma, her ability to conduct very intelligent conversations.
And while I won't necessarily spoil the details of the movie, I would like to say that I think it does a fantastic job of illustrating what is really at the core of a lot of dialogue research. On the one hand, we are trying to build very practically useful agents; we're trying to build things that people can use on a daily basis. But more broadly, I think we should also be trying to build agents that are sociable, compassionate, empathetic, relatable, collaborative. I think in doing so we'll learn a lot about ourselves: what we as humans are, what makes us human, what is the core of our humanity. And so I think this dual motive is something that guides a lot of dialogue research, and it certainly guides a lot of the research that I like to do.
Moving now into the actual talk itself, a quick roadmap: I'm going to be discussing some background to this work, then the model that we developed, a dataset we also developed, the experiments that validated the approach, and some concluding remarks.
So, background. If we take this snippet of dialogue, where a human is asking a fairly simple query, you know, "What time is my doctor's appointment?", we would like an agent to be able to answer the query with reasonable effectiveness and say something like, "Your appointment is at 3 pm on Thursday."
Traditional dialogue systems tend to have a lot going on in the back end. We have a number of modules doing various things, including natural language understanding, interfacing with some sort of knowledge base, and then obviously natural language generation. Traditionally we have separate modules doing all of these things, and oftentimes it can be very difficult to make for a smooth interaction between all these different modules.
And so I think the promise that a lot of present-day end-to-end neural dialogue research holds is that we'll be able to automate really all of these separate modules in a way that is effective and doesn't really limit performance.
More specifically, I think one of the big challenges that a lot of present-day neural dialogue systems suffer from is interfacing with the knowledge base itself.
And so really the kind of thing we would like to see is a smooth interaction involving heterogeneous components. If we could replace all of these separate, you know, hard-working little robots with one mega-robot, i.e. the end-to-end dialogue system, then maybe we're making some sort of progress.
This is, of course, the ideal that maybe we would like to work towards.
So for the purposes of this work, I'll first discuss some previous work that has been done in this general line of inquiry.
Some work from Wen et al. has sought to essentially take the traditional modular connected paradigm and replace some or all of the components with neural equivalents.
Other work has tried to enhance the KB lookups, with the interaction with the KB going through some sort of soft operation that still maintains some form of belief state tracking.
There's another line of work that tries to find a middle ground, seeking the best of the rule-based, heuristic systems and the more neural models that are amenable to end-to-end neural training.
And then there's some work that we have been pursuing in the past that seeks to build an end-to-end system on top of the traditional Seq2Seq paradigm, enhancing that paradigm with mechanisms that allow for more effective dialogue exchanges.
The motivation of our work is then twofold. One, we would like to develop a system that can interface with the knowledge base in a more or less end-to-end fashion, without the need for explicit training of belief state trackers.
And I think a sub-problem of that is how we get a sequence-to-sequence architecture to interact nicely with intrinsically structured information. You know, we're talking about a sequential model combining with this more structured representation, and getting these to work together is something that I think is going to be a challenge going forward.
Now, some details on the model. First off, I don't know what people's general familiarity with neural sequence models is, but the encoder-decoder with attention framework is one that has been investigated in a number of different works, and for the purposes of dialogue it involves more or less the exact same starting paradigm, the same general backbone. On the encoder side we're basically feeding in a single token of the dialogue context at a time through a recurrent unit, highlighted in blue, and we unroll the recurrence for some number of time steps. After some number of computations we get a hidden state that is used to initialize the decoder, which is also a recurrent unit and is also unrolled for some number of time steps.
At each step of decoding we refer back to the encoder and essentially compute some sort of distribution over the various tokens of the encoder. This is used to generate a context vector that is then combined with the decoder hidden state to form a distribution over possible output tokens that we can argmax over to essentially generate our system response.
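To make that attention step a bit more concrete, here is a minimal sketch in PyTorch-style Python. It only illustrates the computation I just described; the layer names, the dot-product scoring, and the shapes are assumptions for illustration rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_step(decoder_hidden, encoder_outputs, W_a, W_c, vocab_proj):
    """One decoding step of a dot-product-style attention mechanism (illustrative).

    decoder_hidden:  (batch, hidden)          current decoder hidden state
    encoder_outputs: (batch, src_len, hidden) hidden states over the dialogue context
    W_a, W_c, vocab_proj: learned nn.Linear layers (names are assumptions)
    """
    # Score each encoder token against the current decoder state.
    scores = torch.bmm(encoder_outputs, W_a(decoder_hidden).unsqueeze(2)).squeeze(2)
    attn_weights = F.softmax(scores, dim=-1)              # distribution over context tokens

    # Context vector: attention-weighted sum of encoder states.
    context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)

    # Combine context with the decoder state, project to vocabulary logits, and argmax.
    combined = torch.tanh(W_c(torch.cat([context, decoder_hidden], dim=-1)))
    logits = vocab_proj(combined)                         # (batch, vocab_size)
    return logits.argmax(dim=-1), attn_weights

# Example layer shapes (hidden=200, vocab=1000):
# W_a = nn.Linear(200, 200); W_c = nn.Linear(400, 200); vocab_proj = nn.Linear(200, 1000)
```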
So with this general background, I hypothesized that in principle we should be able to take this decoder hidden state that we're already computing at a given timestep, push it one step further, and say: hey, use this exact same decoder hidden state to compute some sort of attention over the rows of a knowledge base.
The question then is how we actually represent the knowledge base in such a way that this is feasible. I mean, we are again talking about structured information, and we're trying to deal with it in more of a sequential fashion through a sequence-to-sequence model.
So again, the question really guiding the work is: how can we represent a KB effectively?
To do so, we draw inspiration from the key-value memory networks of Miller et al., which essentially showed that a key-value representation is not only a nice, elegant design paradigm, but can also be shown directly to be quite effective on a number of different tasks. So maybe it's something helpful for us.
To show how this actually plays out for our purposes, I'm going to take one row of a KB and show how we transform it into something that is amenable to a key-value representation.
So consider this single row of a KB. Here we're talking about a calendar scheduling task, and we have some of this structured information, and we want to convert it into what is essentially a subject-relation-object triple format.
And so here what we're doing is: we have some event, the dinner, which is connected to a number of different items, facts about the dinner, through some relation. So you have a time, which can be a relation, and a date, which is a relation, et cetera, et cetera. And all of the information that was originally represented in the row of the knowledge base is now collapsed into triple format.
So this is the first operation that we're going to work with: going to the subject-relation-object triple format. We then make just one small change which converts it into a key-value store: taking the subject and the relation and essentially concatenating them to form a sort of canonicalized representation that becomes our key. That is exactly what we're trying to do.
So if you look at the first row, we had the subject, relation, and object, which were the dinner, the time, and 8 pm. The subject and relation essentially become this new canonicalized mega-key called dinner_time, for lack of a better word, and the object is just mapped one-to-one to the value.
And we do the same for every single other row of the original triple-format representation.
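Just to make the conversion concrete, here is a rough sketch of what that row-to-key-value flattening could look like in Python; the field names are made up for illustration.

```python
def kb_row_to_key_value(row, subject_field="event"):
    """Flatten one KB row into a small key-value store.

    row: a dict such as {"event": "dinner", "time": "8pm", "date": "the 13th"}.
    The subject ("dinner") is concatenated with each relation (column name) to
    form a canonicalized key such as "dinner_time"; the cell itself is the value.
    """
    subject = row[subject_field]
    store = {}
    for relation, obj in row.items():
        if relation == subject_field:
            continue
        store[f"{subject}_{relation}"] = obj   # e.g. "dinner_time" -> "8pm"
    return store

# kb_row_to_key_value({"event": "dinner", "time": "8pm", "date": "the 13th"})
# -> {"dinner_time": "8pm", "dinner_date": "the 13th"}
```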
And because we're dealing with embeddings, the keys in this case end up being just the sum of the subject and relation embeddings. So dinner_time in this case is literally just the sum of the dinner embedding and the time embedding.
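Since the key is just a sum of two word embeddings, a tiny sketch of that, with illustrative names and sizes, would be:

```python
import torch
import torch.nn as nn

# Illustrative word indices; in practice these come from the model's shared embedding table.
word2id = {"dinner": 0, "time": 1, "date": 2, "8pm": 3}
embedding = nn.Embedding(num_embeddings=len(word2id), embedding_dim=200)

def key_embedding(subject, relation):
    """Embedding of a canonicalized key: e.g. emb(dinner_time) = emb(dinner) + emb(time)."""
    subj = embedding(torch.tensor(word2id[subject]))
    rel = embedding(torch.tensor(word2id[relation]))
    return subj + rel

dinner_time_vec = key_embedding("dinner", "time")   # (200,) vector for the key "dinner_time"
```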
An important detail is that now, when we're doing decoding, we're argmaxing over an augmented vocabulary, which includes not only the original vocabulary that we started off with but also these additional canonicalized key representations.
When we put it all together, we essentially have again what we started out with, the Seq2Seq encoder-decoder with attention framework, but now we've folded in this attention over the knowledge base. We compute a weight over every single row of the knowledge base. So, for example, for something like, you know, the football time at 2 pm that's visible here, there's a weight that is used to weight the appropriate entry, in this case the football_time canonicalized representation, in the distribution that we are argmaxing over. We do this for every single row of the new canonicalized KB. And that is essentially the gist of the model.
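In code form, a hedged sketch of how that KB attention can be folded into the output distribution might look like the following; the scoring function and layer names here are my own assumptions, but appending one logit per canonicalized key to the vocabulary logits is the idea I just described.

```python
import torch
import torch.nn.functional as F

def augmented_output_distribution(decoder_hidden, vocab_logits, kb_key_embeddings, W_kb):
    """Fold a KB attention into the decoder's output distribution.

    decoder_hidden:    (batch, hidden)       decoder state at this timestep
    vocab_logits:      (batch, vocab_size)   logits over the original vocabulary
    kb_key_embeddings: (num_rows, emb_dim)   one summed subject+relation embedding per KB entry
    W_kb:              learned nn.Linear from hidden size to emb_dim (name is an assumption)
    """
    # Score every canonicalized KB key against the current decoder hidden state.
    kb_logits = W_kb(decoder_hidden) @ kb_key_embeddings.t()    # (batch, num_rows)

    # The augmented vocabulary is: original tokens plus one slot per canonicalized key,
    # so the two sets of logits are concatenated and softmaxed together.
    augmented = torch.cat([vocab_logits, kb_logits], dim=-1)    # (batch, vocab_size + num_rows)
    return F.softmax(augmented, dim=-1)
```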
Moving on to the dataset that we used. First off, a quick note: data scarcity is an obvious issue in a lot of dialogue research, especially for the neural dialogue models that a lot of people are dealing with; it seems that more data often helps. But given that our collaboration is with Ford, which obviously is a car company and hence is really only interested in things relevant to the car setting, we had to go about building essentially a new dataset. It would still let us ask the same questions that we want to ask about knowledge bases, but be more relevant to their use case, that being the in-car virtual assistant domain.
So the three sub-domains we were interested in are calendar scheduling, weather, and point-of-interest navigation.
The way we went about collecting the dataset is essentially a Wizard-of-Oz scheme, adapted from the work of Wen et al. Essentially, we have crowdsourced workers playing one of two roles: they can either be the driver or the car system. And we progress dialogue collection one exchange at a time.
The driver-facing interface looks like this. You have a task that's generated automatically for the worker, and usually they're provided with the dialogue history; but because this is the first exchange of the dialogue, there's no history to begin with. The worker is then tasked with progressing the dialogue a single turn.
On the car system's side, we also provide the dialogue history so far, but the car system is additionally asked to use some private collection of information that they have access to and the user does not. They are then supposed to use that information to also progress the dialogue, iteratively working toward exactly what the user wants.
The dataset ontology has a number of different entry types and associated values across the different domains, and I guess that lends itself to a fairly large amount of diversity in the types of things that people can talk about.
When data collection was done, we had a little over three thousand dialogues, split more or less evenly across the three different domains, with an average of around five utterances per dialogue and around nine tokens per utterance.
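For a sense of what one collected example ends up looking like, here is a purely hypothetical sketch of the structure; the field names are illustrative and may differ from the released schema.

```python
# Hypothetical shape of one collected example: driver and car system alternate turns,
# and the car-system worker sees a private KB that the driver never does.
example = {
    "domain": "calendar scheduling",
    "kb": [
        {"event": "dinner",   "time": "8pm", "date": "the 13th"},
        {"event": "football", "time": "2pm", "date": "the 11th"},
    ],
    "dialogue": [
        {"speaker": "driver",    "utterance": "What time is my dinner?"},
        {"speaker": "assistant", "utterance": "Your dinner is at 8pm on the 13th."},
    ],
}
```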
Now for some experiments using this dataset and the model we propose.
The baselines that we used for benchmarking our model were two. First, we built a sort of traditional rule-based system that uses manual rules to do the natural language understanding as well as the natural language generation, and to do all of the interfacing with the KB. Then the neural competitor that we put up against our new model was the copy-augmented Seq2Seq model that we had built previously in prior work. At its core it is essentially also an encoder-decoder framework with attention, the same kind of backbone, but it augments that with an additional copy mechanism over the entities that are mentioned in the dialogue context.
We chose this because, one, it is in the exact same class of models as the new one we're proposing, namely Seq2Seq with attention; previous work has also shown that it is actually pretty competitive with other model classes, including the end-to-end memory network from Facebook; and also because the code was already there, so, you know.
For automatic evaluation we had a number of different metrics, and I'm going to say this up front and bite the bullet: we did provide some automatic evaluation, but I know that in dialogue especially, automatic evaluation is something that is a little tricky to do, and it really is a somewhat divisive topic. But there are metrics that people have reported previously, so we just followed the line of previous work.
We used BLEU, which is of course adapted from machine translation. There's some work that says it's actually an awful metric with no correlation to human judgment, and then there's some more recent work that says, you know, it's pretty decent, that the n-gram-based matching is not really all that bad.
And then we report an entity F1, which is basically a micro-averaged F1 over the set of entities mentioned in the generated response as compared to those in the target response that we're going for.
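For what it's worth, here is a small sketch of that micro-averaged entity F1 computation; the exact entity matching we used (canonicalization and so on) is more involved, so treat this as an approximation.

```python
from collections import Counter

def micro_entity_f1(predicted_entities, gold_entities):
    """Micro-averaged F1 over entities in generated vs. target responses.

    predicted_entities / gold_entities: one list of entity strings per response.
    True positives, false positives, and false negatives are pooled over the
    whole corpus before computing precision, recall, and F1.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predicted_entities, gold_entities):
        pred_counts, gold_counts = Counter(pred), Counter(gold)
        overlap = sum((pred_counts & gold_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```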
So when we pit all the models against each other, we see, first off, that the rule-based model doesn't have a particularly high BLEU, which, again, I wouldn't read into too much; that can simply be explained by the fact that maybe we didn't write as many diverse templates for natural language generation. But its entity F1 is decent, in the sense that, you know, we did build the model in such a way that it would be pretty accurate at picking out entities and accommodating search queries.
The copy network, by contrast, had a pretty decent BLEU score, which can of course be attributed to the fact that these Seq2Seq models are known to be good at language modeling, but its entity F1 is pretty bad comparatively. And this is, I guess, a function of the fact that the copy network isn't really making use of the KB directly, instead relying totally on the dialogue context to generate entities.
And then the key-value retrieval network outperforms these on the various metrics; it performs pretty well on BLEU as well as entity F1. But we also show human performance on this, and there's still naturally a gap to be filled. So while this is encouraging, I'm not reading too much into it, and it by no means suggests that one model is superior to the other; but it is there as a kind of coarse-grained evaluation.
We also provide a human evaluation, where we generate about a hundred and twenty distinct scenarios across the three different domains, ones that had never before been seen in training or test. We then paired the different model classes with AMT workers in real time, had them conduct the dialogue, and then assess the quality of the dialogue based on fluency, cooperativeness, and human-likeness on a one-to-five scale. This kind of evaluation scheme tends to be a little more consistent and a little more seriously regarded, and again the key-value retrieval network actually outperforms the various competitors,
especially getting good gains over the copy network, which is of course encouraging. Here again we also show human performance, which, as a sort of sanity check, does provide an upper bound. There's still a really large margin between even our best-performing system and human performance, so there's still a gap to be filled there.
Just as an example of a dialogue from one of these scenarios: we have here a sort of truncated knowledge base, in a point-of-interest navigation setting, and we have the driver asking for the gas station with the shortest route from where they are. The car answers appropriately; the driver then asks about the next-nearest gas station, and the car answers again, and it's answering correctly with respect to the knowledge base it's given. So it's nice to see that there is a reference to the knowledge base and the model is handling things appropriately.
Some conclusions and final thoughts. The main contributions of the work were that we have this new class of Seq2Seq-style models that is able to perform a lookup over the knowledge base in a way that is fairly effective; it does this without any slot or belief state tracking, which is a nice benefit; and it does outperform several of the baselines on a number of different metrics. In the process we also created a new dataset of roughly three thousand dialogues in a relatively unexplored domain, the in-car assistant domain.
For future directions, I think one of the main ones is scaling up the knowledge bases. Right now we're not exactly at the scale of knowledge base that people would see in real-world applications; if you think about somebody's typical Google Calendar or anything of that nature, there is obviously a disparity in the size of these knowledge bases. So we'd like to move toward the realm that is actually representative of the types of things people talk about, and the magnitude of the types of things people talk about.
We'd also like to move away from operating in the mostly static supervised regime and do more RL-based things, which would let us accommodate deviations from the typical dialogue flows that we may see.
And even further down the line, it would be nice to see models that can actually incorporate more pragmatic reasoning into the kinds of inferences they're able to make, so that for a simple query like "Will I need to wear a jacket today?", the pragmatic reasoning allows the model to say: hey, wearing a jacket is indicative of some sort of temperature-related reasoning that I'm going to have to do, and to embed that also in the model.
So that's my presentation. Thank you; I'd be happy to take questions.

[Audience question, inaudible.]
I think that's a great question. Right now, for this particular iteration of the model, I think it is relatively dependent on the types of things being talked about, because the entire lookup operation depends on embeddings, and those embeddings have to have been trained on the appropriate types of KB entries. So if you're talking about calendar scheduling for, you know, five hundred dialogues, and all of a sudden you're talking about, you know, ponies or something, it's going to be hard to have well-trained embeddings that allow you to do that. So I think this is certainly a subject of future work, and I can think of some ways, you know, using pre-trained embeddings, that may let you circumvent the need to literally train from scratch again and bootstrap a little bit more toward other kinds of things you expect to see. I think it's definitely something to explore further.
Thank you for your presentation. I just want to ask about the experiments, the training process, and the testing as well: how do you deal with unseen situations? You know, if you show the system knowledge it hasn't seen, situations unseen in the task, can it do anything to deal with that?
Sorry, in what particular sense? What exactly are you referring to?
So, if something is entirely different from what the model has seen before, or maybe just, like, some new KB values that you show it, while the task itself does not change.
I mean, I think in this case the model would have to be augmented a little bit more, with some sort of copy mechanism over the KB. I guess, as it stands, it is a little bit dependent on the kinds of things that it's seen, and in general something more would have to be done. Right now it's able to do the lookup partly from having seen these entities in training as well. In general it's something where we'd want to look at how this can be done in a way that is less dependent on the keys as they appeared during training, and I think right now it would probably turn out to be a little difficult for the model to handle. But it's something to work toward.
For the last point that you had on your slide about future directions: with structured knowledge as additional information, the system could in principle perform reasoning over it. Do you have ideas on how you would like to incorporate that?
Right, you mean allowing for these kinds of more complex styles of reasoning?
I mean, that's a really good point. I think that last direction especially is right now a little bit of a long shot, in the sense that, even though it covers the kinds of things that are common, what we handle now still more or less falls into one particular type of pattern, where there's a slot you can fill as well as a value, and you act on that. I think that right now the model would have trouble with things that, for example, obviously deal with, you know, synonyms, various figures of speech, et cetera. And I don't have a super concrete answer for what that would look like, because the model is very much of this slot-filling flavor. But I think the interplay of chit-chat systems and this kind of more structured information is one that should definitely be explored more, and I think that, you know, your question really touched on that as well.
Let's thank the speaker.