Hello, and welcome again to the next session, on policy and knowledge. We will start this session with the talk on deep reinforcement learning for modeling chit-chat dialogue with discrete attributes. The authors are Chinnadhurai Sankar and Sujith Ravi, and the presenter is Sujith Ravi, who works at Google.
Hi everyone.
Thank you for being here; it's pretty exciting to be at SIGDIAL. I'm Sujith Ravi, and let me give a little background intro to what we do. I'm from Google AI; I'm a research scientist leading multiple machine learning groups. My group is focused on a lot of deep learning problems where you actually have to inject structure into deep networks, for example combining traditional graph learning approaches with deep learning. We've released a bunch of things for doing semi-supervised learning at scale: if you use any of the Google products, Gmail, Photos, et cetera, you will actually be using stuff that we built. We also do conversational AI, and I'll show you one example of that on detecting intents, but we work on multiple problems, both for language and also for vision, using state-of-the-art vision technology.
There's a misnomer: people might think Google and large companies have a lot of resources, so we must label all the data sets that we have. How were we actually able to build the object recognition and image recognition systems that you use in Google Photos and Cloud? We have less than one percent annotation. The reason it works is, in two words, semi-supervised plus deep learning, with a lot of other optimizations going on under the hood, and my group is responsible for some of these things.
And finally, a lot of the problems that we deal with actually require a lot of compute on the cloud, so my group is also looking at how to do things on-device. Imagine you have to build a dialogue generation system, or a conversational system, that has to fit on your watch: it doesn't have access to gigabytes of memory or much compute, unlike the cloud, where you can use CPUs, GPUs and all the latest-generation hardware. So with that, hopefully you have some map of the things we work on.
This is joint work with my fabulous intern Chinnadhurai Sankar, who couldn't be here; he's from Mila. The talk is going to be about deep reinforcement learning for modeling chit-chat dialogue with discrete attributes. That's quite a mouthful; all it means is that we try to do dialogue generation, but with controllable semantics.
I will give you an overview of what we are talking about here. First off, for any generation system you have to predict responses. Here are two applications where we have to predict responses; these are not toy problems, and they are equally hard, operating at the order of millions or even billions of predictions per day. One is Smart Reply, which our team developed several years ago. How many of you are familiar with Smart Reply? Okay, quite a few. For those of you who don't know: if you use Gmail on your phone and you see those blue suggestion boxes that pop up at the bottom, that's exactly what it is.
So for any email or chat message, it contextually generates responses that are relevant for you, and if you notice, the three suggestions are actually very different responses, not necessarily the same. This is the Smart Reply system. For folks who think this is a simple encoder-decoder problem, I can assure you that to get it to work, it's definitely not; there are a lot more things going on. You can read our paper from KDD, and I'll talk about some of these attributes later in the talk as well.
But you can take this to the multimodal setting as well. We released something called Photo Reply after the initial Smart Reply version, where now you receive an image, and you have to understand the semantics of the visual content and generate an appropriate response. So if you look at the picture and it shows a baby, the system would say "so cute", and you'd probably send it, unless you don't have a heart. Or for other favorite things, if you see a skydiving video or image, it'll actually suggest "how brave". I've always joked that one more suggestion, "how stupid", should come at the end of it as well, but we control for those kinds of things.
So these are just examples of generation systems, but the task that we're trying to solve in this paper is basically modeling open-domain dialogue. I don't need to introduce task-oriented dialogue systems to everybody here; they're available in everyday systems, for booking reservations, playing music, et cetera. There is a task, and the parameters of the prediction system that you build are optimized towards solving that task. Open-ended dialogue is much harder. One of the common ways that people solve it is the standard sequence-to-sequence model, where you try to model it as a machine translation problem: you're given a history of dialogue utterance sequences, and then you're trying to translate some representation of that encoded sequence into a decoded sequence, in this case the utterance that you're going to send.
What's the problem? Almost every system, especially the neural systems that you have today, no matter which one, seems quite repetitive and sounds very redundant. So the problem, from an ML perspective, is that unlike task-oriented dialogue, the vocabulary we cover is much larger and there's high entropy: you have a few responses that are very commonly occurring, and then a long tail of rare responses. Given a choice, most of these systems maximize likelihood in some form or other, so they will actually prefer to generate the responses that give you the maximum likelihood, or the lowest perplexity.
This is a common problem, and of course it's not a new problem; anyone who's built these systems will have realized it, and there are many ways to address it. People have tried extending the loss functions, so you basically bias the system to produce longer sequences or non-redundant responses; adding an RL layer on top of the deep learning system, so that you can optimize your policy towards something non-redundant; and even injecting knowledge from sources like Wikipedia, et cetera.
In our work, what we propose instead is a conditional model, where we condition the utterance generation, the dialogue generation, on interpretable and discrete dialogue attributes. I will unpack each of those phrases within the next few slides, but here is the building block for the model.
We use the standard encoder-decoder model, but it's a hierarchical encoder-decoder model, as originally introduced in Serban et al. You can think of this as two levels of RNNs (recurrent neural networks), where the first layer operates over the words in the utterance at any given time step and generates a context state, and then another RNN operates over the sequence of time steps, so basically it operates over the multiple turns in the dialogue. Simple enough. Of course, training these things is never simple; there are all kinds of hyperparameter tunings et cetera, but we're not going to talk about that.
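The two-level structure described above can be sketched in a few lines. This is a toy NumPy version with vanilla RNN cells; the dimensions, random weights, and the `hierarchical_encode` name are all illustrative, not the actual implementation:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    # one vanilla RNN step: combine input x with previous hidden state h
    return np.tanh(x @ Wx + h @ Wh)

def hierarchical_encode(dialogue, dim=8, seed=0):
    """Two-level encoder: a word-level RNN summarizes each utterance,
    then a turn-level RNN runs over those utterance summaries."""
    rng = np.random.default_rng(seed)
    Wx_w, Wh_w = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
    Wx_t, Wh_t = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
    context = np.zeros(dim)
    for utterance in dialogue:               # each turn in the dialogue
        h = np.zeros(dim)
        for word_vec in utterance:           # word-level RNN within the turn
            h = rnn_step(word_vec, h, Wx_w, Wh_w)
        context = rnn_step(h, context, Wx_t, Wh_t)  # turn-level RNN
    return context                           # the dialogue context state

# toy dialogue: two turns of three random "word embeddings" each
rng = np.random.default_rng(1)
dialogue = [[rng.normal(size=8) for _ in range(3)] for _ in range(2)]
context = hierarchical_encode(dialogue)
print(context.shape)
```

The decoder (not shown) would start generating the response from that final context state.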
Instead, what our model does: we propose a conditional response generation model, where we try to learn a conversational network that is conditioned on interpretable and composable dialogue attributes. You have the same first layer of RNN operating over the words in the utterance, but instead of using just the context state to start decoding and generate a response, we now also model attributes, dialogue attributes, and I'll tell you what dialogue attributes are. These are interpretable and discrete attributes, unlike latent attributes, where you have continuous representations to model the dialogue state et cetera; here we use discrete attributes, which are predicted and modeled during the generation process. Once we predict the attribute at a given time step, that plus the context state are used together to generate the decoding state, and then you start generating the utterance from that point.
So what is a dialogue attribute? We intentionally chose things like dialogue acts, sentiment, emotion and speaker persona: things that we actually want to model about a dialogue. The reason is that we want to get control of the semantics. It's not just about saying "hey, does it look fluent or not"; imagine I want to make the dialogue sound more happy, or adopt a specific speaker style, or a specific emotion, or, in the extreme, and this is further along, you want your dialogue systems to start becoming empathetic, et cetera. First of all, quantifying what that means is itself a hard problem; we could have a whole talk on just that. And this is the crucial part here: we are trying to force the encoder not to just generate the contextual state, but to also use it to generate a latent but interpretable representation of the dialogue at that particular time step, and use the two together to start the generation process.
Now, these are composable, as I said. It's not just one single dialogue act or dialogue attribute that you predict; you can actually predict multiple of them, so you can have a sentiment and a dialogue act and an emotion and a style all represented in the same model, and in a few slides we'll see why this is useful. This is pretty much the gist of the model.
So the part that changes is that now you also model the attribute sequence. Predicting the attribute itself is a simple MLP (multilayer perceptron); you could have fancier things, but this is integrated into the joint model and then used during the generation process. Now you might say that this complicates the model even more: you just introduced another bunch of parameters, so obviously it's going to get better perplexity, but what are you going to do for annotation? Do you need another system just to give you manually labeled, annotated data at the attribute level for your dialogue? The good news is that you don't. Here's how you do the inference.
You start predicting the dialogue attributes from the dialogue context, so at any time step you use the context vector to predict the attribute. Then, conditioned on the previous attribute, you predict the next attribute: that means at time step i you use the attribute at i-1 to predict, say, the dialogue act, combine it with the context state at i-1, and start the generation process. And as I mentioned, attribute annotation is not required during inference; you just use it during training.
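The inference loop described above might look roughly like this. It's a toy NumPy sketch, where the attribute set size, the one-layer MLP, and fusing context and attribute by addition are my illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class AttributeConditionedDecoder:
    """Predict a discrete dialogue attribute from the context state and the
    previous attribute, then fuse it with the context to initialize decoding."""
    def __init__(self, dim=8, n_attrs=4, seed=0):
        rng = np.random.default_rng(seed)
        self.attr_emb = rng.normal(size=(n_attrs, dim))   # attribute embeddings
        self.W = rng.normal(size=(2 * dim, n_attrs))      # one-layer MLP

    def predict_attribute(self, context, prev_attr):
        feats = np.concatenate([context, self.attr_emb[prev_attr]])
        return int(np.argmax(softmax(feats @ self.W)))    # greedy attribute choice

    def decoder_init(self, context, attr):
        # decoder starts from the context state fused with the attribute
        return np.tanh(context + self.attr_emb[attr])

model = AttributeConditionedDecoder()
context = np.zeros(8)
attr = 0                         # start-of-dialogue attribute
for turn in range(3):            # simulate three turns of inference
    attr = model.predict_attribute(context, attr)
    context = model.decoder_init(context, attr)
print(attr, context.shape)
```

Note that no attribute labels appear anywhere in this loop, which matches the point that annotation is only needed at training time.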
Now, there's a whole bunch of things you can do to get away even from the attribute annotation at training time. For example, you don't need your training data to be tagged with sentiment labels or emotion labels or dialogue acts; you could learn an open-ended set of things, for example open-ended topics of the dialogue. I won't get into that in this talk, but I'd be happy to answer questions about it.
So this is the crux of the model, but of course it doesn't stop there. For most dialogue systems we also add an RL (reinforcement learning) layer on top, where you try to optimize a policy with policy gradients. Usually these objectives are slightly different from the maximum likelihood objective; that means you're trying to bias towards longer responses, or some other goal. We use the standard REINFORCE algorithm, and usually the policies are initialized from the supervised pre-training: the attribute-conditioned hierarchical recurrent encoder-decoder model is pre-trained first, and then you initialize the RL policy parameters from that state.
In standard works, this is how it looks: you formulate the policy as a token prediction problem, so the state space is basically represented by the context state, meaning the encoder state, and the action space is the token vocabulary, predicting one token at a time.
What's the problem with this? Besides the vocabulary being large for open-domain dialogue, what usually ends up happening is that these policy gradient methods exhibit high variance, and this is basically because of the large action space. And the RL, which was introduced to bias the supervised learning system away from what it was trained on and towards meaningful dialogue, instead steps away from the linguistic and natural-language phenomena, simply because certain words are more frequent than others. Again, the policy will tend to pick those words from the vocabulary that maximize its reward or utility function. And of course, training and convergence is another issue in this setting as well.
Instead, what we say is: rather than doing token generation, we formulate the policy as a dialogue attribute prediction problem. The state space now becomes a combination of the dialogue context and the contextual attributes, the dialogue attributes we defined in the previous slides. The action space is the set of dialogue attributes: something more latent, something more interpretable. In fact, think about it: if you capture some aspect of the semantics, say a sentiment, do you need all the words possible in the English vocabulary, or any language's vocabulary, to generate that specific sentiment? As soon as you capture that gist, the downstream generation can do much more interesting things. You're elevating the problem from the lexical level to the semantic level.
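To make the action-space point concrete, here is a bare-bones REINFORCE update over a handful of discrete attribute actions. The fixed reward and the flat logit parameterization are stand-ins of mine, not the paper's actual policy network or reward:

```python
import numpy as np

def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE step for a categorical policy: the gradient of
    log pi(action) w.r.t. the logits is onehot(action) - softmax(logits)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -probs
    grad[action] += 1.0
    return logits + lr * reward * grad

# action space = a handful of dialogue attributes, not a ~50k-token vocabulary
n_attrs = 4
logits = np.zeros(n_attrs)
rng = np.random.default_rng(0)
for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(n_attrs, p=probs)      # sample an attribute
    reward = 1.0 if action == 2 else 0.0       # pretend attribute 2 is rewarded
    logits = reinforce_update(logits, action, reward)
print(int(np.argmax(logits)))
```

With only four actions the policy concentrates on the rewarded attribute quickly; over a full token vocabulary the same estimator would be far noisier, which is the variance argument above.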
There's a reason why this works. People might say: okay, you introduced another attribute, another set of parameters, a latent layer; it's interpretable, it's great, and of course it's going to improve perplexity. I'll show you that it's not just about perplexity. What ends up happening is that, even from a learning theory perspective, because you're introducing these latent models with interpretable discrete variables, it actually converges better, learns to generate much more fluent and smooth responses, and explores parts of the search space that it wouldn't before. That's simply because, as we know, almost every problem in this space is nonconvex, so here you're actually using the semantics, the natural-language phenomena, to guide it to a better part of the space, so to speak.
The experimental results confirm the same. We ran on a bunch of datasets; the table shows perplexity, and the columns are how much training data the model was trained on. Obviously, as you go from left to right, the more data it's trained on, the better the perplexity of the dialogue it generates. And here are the attributes that we used to model the dialogue: "sentiment" means you're incorporating sentiment in the dialogue-attribute stage of the model prediction, "Switchboard" is basically its dialogue acts, and "Frames" is another set of dialogue acts. These can be mutually exclusive, complementary, or even overlapping.
What we noticed is that it's actually beneficial to compose these attributes, because they provide very different information. The fact that you model sentiment is not the same as the fact that you model dialogue acts, and modeling dialogue acts from one particular genre is not the same as modeling dialogue acts from a different one. So you can compose these attributes in a very flexible fashion, and in fact it improves the generation; that means the perplexity goes down. Overall, what we see is that both the attribute conditioning and the reinforcement learning part generate much better, more interesting and diverse responses.
Now, as I said, I keep harping on perplexity because every time you see a deep learning system, it's easy to improve perplexity, trust me: you add more parameters to the system, and the way it works is that more parameters, plus more data, let you improve perplexity by optimizing towards better parameter settings and configurations. So in addition we did human evals on the generated responses to see if they actually make sense, because that's the whole goal of generation; I believe every generation system should do human evals in some setting, if at all possible.
What we noticed is that, comparing a standard sequence-to-sequence model with the attribute conditioning, the attribute conditioning actually helps both diversity and relevance: it has a much better win/loss ratio compared to the baseline model. In addition, when you add the RL on top of that, meaning you do the policy optimization from the supervised pre-training step, it does even better. So the RL, as I said, moves the nicely supervised training state from that initialization to a better spot; it learns a better policy, but instead of learning it at the token level, it now learns it at the attribute level. We're injecting attribute conditioning both at the RL level and in the supervised pre-training model.
We also computed diversity scores; there are standard ways to do this in the literature. You look at the responses and do an automatic computation of these metrics, like computing the number of overlapping n-grams, or how many distinct phrases are generated by the system. Overall, the sequence-to-sequence model is worse than the attribute-conditioned model, and the RL one is even better than both.
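One such standard diversity metric, distinct-n, is easy to state in code; the responses below are made up for illustration:

```python
def distinct_n(responses, n):
    """Fraction of n-grams across all generated responses that are unique;
    higher means more diverse, while repetitive output scores low."""
    total, unique = 0, set()
    for response in responses:
        tokens = response.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / max(total, 1)

responses = ["i don't know", "i don't know", "that sounds great"]
print(round(distinct_n(responses, 1), 2))  # repeated fallbacks drag the score down
```

A system that keeps emitting the same fallback response shares most of its n-grams across outputs, so its distinct-n stays low.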
In addition, if you take the head of the response space, meaning the most likely responses, and you look at the percentage of them generated by the new systems, that percentage goes down significantly. How many times have you seen a chatbot, or any of the voice systems, where you ask a question and it says "I don't know"? That's a default fallback mechanism, but the goal is that, instead of that, we can model something, for example emotional responses or other things, that engages the user in a better fashion. What this allows you to do is avoid the standard, frustrating "I don't know"; instead you get something more nuanced. It may not answer directly, but it'll probably take the conversation in a much better vibe or direction.
Here are some examples, which I won't go through, but for standard inputs (well, they're from Reddit, so they're never standard) you get interesting responses instead of things like "I don't know" or "I have no idea". You start getting longer responses, but also things that probably make more sense, for example "I'm honestly a bit confused why no one has brought me or my books...". I don't think anything like that would come out of the sequence-to-sequence model without the attribute conditioning. Or the system says "I can't wait to see in the city"; some of the context is missing from this example, because the previous dialogue history has been cut off here, but there's something about the city being mentioned earlier, and that's why it says "the city".
Okay, just to summarize: we propose a new approach for dialogue generation with controllable and composable semantics. I think this is a super important and interesting topic, because it's very easy to do vanilla generation; we can do GANs and all kinds of things like that, but making it interpretable and controllable in this fashion, I believe, also helps the learning process, as our empirical experiments show. It's not just about saying that this is a nice natural-language phenomenon that we want to model. Both the RL and the attribute conditioning together improve over the baseline model by generating interesting and diverse responses.
There are a number of things we're looking at for the future, in addition to incorporating the multimodal part. What is the impact of pre-training the classifiers? As I said, we didn't use pre-trained classifiers for the attribute prediction problem here. How do we measure interpretability while modeling this during the training process: is the generated dialogue actually respecting the semantics of the attributes that it predicts, does it even make sense? And then, how do we do this for speaker persona, and extend it to more open-ended concepts? These are open questions and thoughts.
If you have any questions related to any of these things, I'm happy to answer them.

Q: Hi, I'm [inaudible]. I was very interested in your training corpus size. For the examples you gave for the dialogue model training, you had up to two million training examples. Obviously, in that situation, I assume you're not manually generating them; are you getting them from user examples, or where else do you get them?

A: Some of them are from Reddit and the OpenSubtitles corpora; these are available. The attributes themselves are not necessarily always manually annotated. For example, for Switchboard I believe we had annotations for one of the datasets, but what we ended up doing is: you can take standard LDA or any other tool and label the data, say with the sentiment. So you can have a high-precision classifier which actually labels your entire training corpus, and these can be silver labels, for instance.
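That kind of distant supervision can be sketched with a simple high-precision labeling rule; the word lists and the abstain-on-conflict choice here are illustrative stand-ins, not the actual labeler used:

```python
POSITIVE = {"great", "love", "happy", "awesome"}
NEGATIVE = {"bad", "hate", "sad", "awful"}

def weak_sentiment_label(utterance):
    """Tag an utterance with a sentiment attribute only when the lexicons
    fire unambiguously; abstain (None) otherwise to keep precision high."""
    tokens = set(utterance.lower().split())
    pos, neg = tokens & POSITIVE, tokens & NEGATIVE
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # abstain: this example simply goes unlabeled

corpus = ["i love this place", "this movie was bad", "see you tomorrow"]
print([weak_sentiment_label(u) for u in corpus])
```

The abstentions are what make the labels "silver" rather than gold: coverage is partial, but the labels that do get assigned are mostly trustworthy.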
The interesting part is that, after modeling all this, the accuracy of the dialogue attribute prediction doesn't have to be perfect in the latent system; even though it might only be in the eighties or so, it's still good enough for the generation system. So there's some work to be done there on how good we can get: if we bumped it up to ninety-nine percent, would that have an effect on the generation? That's among the things we are looking at.
Q: Hi, I'm from a German research lab. I just had a question: did you look at speaker persona at all? I was curious, and maybe you can speculate about it: do you think, with enough data, with the conditional model, you could model individual users, maybe like two Reddit usernames or something?
A: There's a joke from when we released Smart Reply: after the first version shipped, I think it was some professor from a university who said, "these suggestions are getting very snotty to me." Well, it's training on your own data (I mean, we don't look at the data), so it's basically reflecting yourself. So the short answer is yes, but of course you want to do this with enough data, and you also want to do it in a privacy-preserving manner, which I haven't talked about here at all. Part of my group focuses on how to do all this in a privacy-preserving manner: for example, you can build a general system, but then all the inference happens only on-device, or your data is siloed off from everybody else. And the question again is: do you really feel you have a specific personality? What you feel versus what you actually write might be very different, so there are aspects of that to be considered. I'll be here if you want to chat more.