Good afternoon, everybody. I'm from Carnegie Mellon University, and today we're going to present our work on training an end-to-end dialogue system using reinforcement learning.
To get started, my talk is going to focus on task-oriented dialogue agents. Those are agents that have an explicit goal to achieve, such as providing information from a database given the user's preferences.
Traditionally, people use this kind of pipeline to build such a system. First we have some user input, and that input is annotated by a natural language understanding module into some semantic-level annotation format. That output is piped into a state tracker, which accumulates information over turns into some kind of dialogue state that provides sufficient information to formulate a query, so it can interface with some structured external database. Conditioned on this information, a dialogue policy decides the next action to take and what to say back to the user.
Our project is going to focus on how to replace the three highlighted modules here with a single end-to-end model. Before I get into how we built such a model, let me talk about why we want to do this.
There are some limitations of the traditional pipeline. The first one is the credit assignment problem. When the system is deployed in the real world and we get feedback from end users, those users can only tell us whether it was a good dialogue or a bad dialogue, but it is not clear which module is responsible for the success or the failure given those mixed feedback signals, and errors can propagate between modules, so debugging can be even more challenging.
The second problem is the scalability of the dialogue state representation. For example, in the DSTC challenge we use the state tracker to estimate the values of a set of dialogue state variables, but those variables are handcrafted. Designing those variables requires a lot of expert knowledge, and the design usually handicaps the performance of the policy, because it may simply not provide sufficient information to make good decisions.
At the same time, it is also challenging to build an end-to-end task-oriented agent, and there are some specific challenges. The first challenge is that a task-oriented agent needs some sort of strategic planning or policy to achieve the goal under uncertainty from the ASR, the NLU and the user, which goes beyond plain supervised learning. The second challenge is that a task-oriented agent needs to interface with some external structured knowledge source that only accepts symbolic queries, for example SQL, but neural network models only have continuous intermediate representations, and it is not easy to obtain a symbolic query from them.
To address these challenges, the contribution of this project is that we introduce a reinforcement-learning-based end-to-end dialogue system that is novel from two perspectives. The first is that we show we can jointly optimize state tracking, database interaction and the dialogue policy together. The second is that we provide some evidence that a deep recurrent network can learn some sort of dialogue state representation automatically.
Here is how it works. First of all, we follow this intuition: we want a minimal symbolic dialogue state, where the dialogue state is defined as all the sufficient information in a dialogue. We noted that a symbolic representation is only needed for the parts that are related to the database, namely the entities that express the values of the database-related variables. The rest of the information, such as the discourse structure, the intent of the user, the goal of the user and what has just been said, we can model with a continuous vector. So that is the high-level intuition.
Based on that, we propose this architecture. We still follow a typical POMDP setup: we have an agent and an environment, the agent applies actions to the environment, and the environment responds with observations and a reward. In our case the environment is comprised of two elements: the first is the user and the second is the database. The agent can apply two types of actions. The first type is the verbal action, similar to what a conventional dialogue manager does: the agent chooses an action and says something back to the user. The second type we call the hypothesis action: we maintain an external piece of memory that holds the values of the variables related to the database, and the agent can apply hypothesis actions to modify this memory. The memory is then parsed into a database query, which returns the matching entities and gives instant feedback. With this architecture, the entire process of dialogue state tracking and dialogue policy is formulated as a single sequential decision-making process, so we can use reinforcement learning to learn this policy in an end-to-end fashion.
Basically, we want to approximate the Q-value function, which is the expected future return, and we want to be able to model this value function for both types of actions.
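As a rough notation sketch (the discount factor and the symbols here are my own, not taken from the slides), the quantity being approximated is the standard Q-function over both action types:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\Big|\, s_t, a_t, \pi\Big], \qquad a_t \in \mathcal{A}_{\text{verbal}} \cup \mathcal{A}_{\text{hypothesis}}$$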
Here is the neural network structure we implemented. It has several layers, and I will explain them from the bottom to the top. The bottom layer is the observation we receive at every turn, which has three elements: the action the agent took at the last turn, the observation from the user, and the observation from the database. Each of these is mapped into a low-dimensional embedding via a linear transformation, and the embeddings are passed into a recurrent network, which we hope can maintain the temporal information over time; we call its hidden state the dialogue state embedding. The output of the recurrent net then feeds into two decision networks, which are fully connected feed-forward neural networks: one models the Q-value function for the verbal actions, and the other models the Q-value function for the hypothesis actions. The network unrolls over time: at every turn the agent takes an action, receives a new observation from the environment, and the process keeps going.
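To make the dataflow concrete, here is a minimal sketch in PyTorch; the module names, layer sizes and the choice of an LSTM cell are my own assumptions for illustration, not the authors' actual code.

```python
import torch
import torch.nn as nn

class EndToEndDialogueNet(nn.Module):
    def __init__(self, user_feat, db_feat, n_actions, n_verbal, n_hypo,
                 emb_dim=64, hidden_dim=128):
        super().__init__()
        # Linear embeddings of the three observation elements.
        self.user_emb = nn.Linear(user_feat, emb_dim)
        self.db_emb = nn.Linear(db_feat, emb_dim)
        self.act_emb = nn.Embedding(n_actions, emb_dim)
        # Recurrent cell that carries the dialogue state embedding across turns.
        self.rnn = nn.LSTMCell(3 * emb_dim, hidden_dim)
        # Two decision heads: Q-values for verbal and for hypothesis actions.
        self.q_verbal = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                      nn.Linear(hidden_dim, n_verbal))
        self.q_hypo = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                    nn.Linear(hidden_dim, n_hypo))

    def step(self, last_action, user_obs, db_obs, state=None):
        x = torch.cat([self.act_emb(last_action),
                       self.user_emb(user_obs),
                       self.db_emb(db_obs)], dim=-1)
        h, c = self.rnn(x, state)        # h is the dialogue state embedding
        return self.q_verbal(h), self.q_hypo(h), (h, c)
```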
Ideally, the proposed architecture can be trained using only the dialogue success signal, which comes only at the end of a session and says whether the dialogue succeeded or failed. But this kind of sparse reward typically results in very slow learning. So we wanted to exploit the fact that sometimes we have hypothesis oracle labels, like the ones we get from DSTC. How can we include those labels to speed up the learning? I will describe two simple techniques that result in a significant speed-up in terms of the convergence of the algorithm.
The first trick is that we modify the reward model of this POMDP. We assume the correct hypothesis action follows a multinomial distribution, so there is only a single correct answer at each turn, and we add a term to the reward which is simply the probability that the chosen hypothesis action is correct at that turn.
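Written out in my own notation, the shaped reward is roughly

$$r'_t = r_t + P\big(a^{h}_t = a^{h,*}_t \mid s_t\big)$$

where $a^{h,*}_t$ is the oracle hypothesis label for that turn.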
The second trick is based on the fact that the environment is comprised of two parts, the user and the database. The user is difficult to model, but the database is just a program whose dynamics are known. So we can cheaply generate additional samples of state, action and next state by applying all possible hypothesis actions at a given turn, and we add those generated samples to the experience replay table, which is then used to update the parameters at the next training step. This is similar in spirit to the Dyna-Q learning introduced by Sutton, but the difference is that Dyna learns a separate model to estimate the transition probabilities, whereas here we do not need such a model: we just generate samples from the database, which has known dynamics.
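As an illustration of this idea, here is a self-contained toy sketch; the data structures (hypothesis memory as a slot dictionary, database as a list of attribute dictionaries) are my own simplification, not the actual system.

```python
# Because the database side of the environment has known dynamics, we can
# synthesize extra transitions by enumerating every hypothesis action at the
# current turn and appending them to the replay buffer, in the spirit of Dyna-Q.

def db_matches(database, hypothesis):
    """Entities consistent with the current hypothesis memory (slot -> value)."""
    return [e for e in database if all(e.get(k) == v for k, v in hypothesis.items())]

def generate_db_transitions(dialogue_state, hypothesis, database, hypothesis_actions):
    """Yield (state, action, db_observation, next_hypothesis) tuples for replay."""
    for slot, value in hypothesis_actions:               # each action writes one slot
        new_hypothesis = dict(hypothesis, **{slot: value})
        db_obs = len(db_matches(database, new_hypothesis))   # instant DB feedback
        yield (dialogue_state, (slot, value), db_obs, new_hypothesis)

# Example usage with a tiny famous-people database:
database = [{"gender": "male", "alive": "no"}, {"gender": "female", "alive": "yes"}]
actions = [("gender", "male"), ("gender", "female"), ("alive", "yes"), ("alive", "no")]
for sample in generate_db_transitions("state_t", {}, database, actions):
    print(sample)   # these synthetic samples go into the experience replay table
```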
For training we used a state-of-the-art value-based deep reinforcement learning setup, namely prioritized double DQN. Prioritized experience replay lets the algorithm focus on the samples that carry more information, based on the temporal-difference error, and double DQN corrects the over-estimation bias in the Q-value function that comes from the max operation. The loss function is simply the square of the temporal-difference error, and we minimize it by stochastic gradient descent.
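In standard double-DQN form (my notation, with $\theta^{-}$ the target network), the target and loss are

$$y_t = r_t + \gamma\, Q_{\theta^{-}}\big(s_{t+1}, \arg\max_{a'} Q_{\theta}(s_{t+1}, a')\big), \qquad L(\theta) = \big(y_t - Q_{\theta}(s_t, a_t)\big)^{2}$$

and prioritized replay draws transitions roughly in proportion to the magnitude of this temporal-difference error.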
To test our proposed model, we chose a conversational game called the twenty-question game. In this game, the agent acts as the questioner and the user thinks of a famous person, so there are two players, the user and the agent. The agent has access to a famous-people database, and it can select from a list of yes/no questions to ask the user. The user then answers those questions with a yes, no, or "I don't know" intent, expressed in any possible natural way. The agent can also make guesses, such as "Is it Bill Gates?" and so on. If it happens to guess the correct person, the game is considered a win for the agent; otherwise the agent loses. To be able to run experiments, we built a user simulator. The simulator first needs a famous-people database: we selected a hundred people from Freebase, each person associated with about six attributes, and we manually designed a few yes/no questions about every attribute.
You can see some example questions on the slide, for instance asking where the person is from or whether they became famous before a certain period.
Second, we also need the simulator to reply to those questions with the three intents in different possible ways. So we collected different natural ways of expressing those three intents from the Switchboard Dialog Act corpus, and eventually we obtained a few hundred different unique expressions for those intents. We also keep the frequency count of each expression so we can sample from that distribution: common expressions are used as replies more often, while rarer expressions are only sampled occasionally.
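A toy sketch of that frequency-weighted sampling could look like this; the expressions and counts below are made up for illustration, not the actual Switchboard statistics.

```python
import random

# Surface forms for each intent with illustrative frequency counts.
expressions = {
    "yes": {"yes": 120, "yeah": 80, "uh-huh": 30, "that's right": 10},
    "no": {"no": 100, "nope": 25, "i don't think so": 15},
    "unknown": {"i don't know": 60, "not sure": 20},
}

def sample_utterance(intent):
    forms = expressions[intent]
    # Sample an expression in proportion to its corpus frequency.
    return random.choices(list(forms), weights=list(forms.values()), k=1)[0]

print(sample_utterance("yes"))   # frequent forms come back more often
```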
Here is the final piece, the configuration of the POMDP. The game terminates when one of four conditions holds: the agent guesses the correct person, the dialogue runs too long and exceeds the maximum number of turns, the agent makes too many wrong guesses, or the agent produces a hypothesis that is not consistent with any person in the database. Only if the agent guesses the correct person do we count the dialogue as a success; otherwise it is a failure. If the agent wins it receives 30 points of reward, otherwise minus 30. The agent can make at most five wrong guesses, and every wrong guess induces a negative reward, so it is pushed to make more careful guesses.
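Summarized as a rough configuration sketch: the plus/minus 30 terminal reward and the five-wrong-guess cap come from the talk, while the per-guess penalty and the turn limit below are placeholder values.

```python
# Rough summary of the game's POMDP configuration as described in the talk.
GAME_CONFIG = {
    "terminate_when": [
        "agent guesses the correct person",              # success
        "dialogue exceeds the maximum number of turns",
        "agent makes too many wrong guesses",
        "hypothesis matches no person in the database",
    ],
    "reward_success": 30,
    "reward_failure": -30,
    "max_wrong_guesses": 5,
    "reward_wrong_guess": -5,   # placeholder value
    "max_turns": 30,            # placeholder value
}
```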
Okay, so we have described the model and how we train it; now let's look at the analysis. We analyze the model from three different perspectives: the dialogue policy, the state tracking, and the dialogue state representation. The first analysis is of the dialogue policy. We compare three models. The first one is the baseline, where the only difference is that we train the state tracker and the dialogue policy separately, without joint training, so they never see the errors coming from each other. The second one is the proposed model that only uses the end-of-session reward, namely success or failure. The last one uses the hypothesis labels we talked about, with the modified reward function. The table shows the results: both proposed models outperform the baseline by a large margin, and the hybrid approach performs even better than pure RL.
To get a deeper understanding of what is happening, we also plot the learning curves during training. The horizontal axis is the number of parameter updates and the vertical axis is the success rate; the green line is RL, the red line is the hybrid approach, and the purple line is the baseline. You can see they have quite distinct behaviour. The baseline model, because its state tracker is simply trained with supervised learning, converges much faster, but its performance plateaus pretty quickly, because the dialogue policy and the state tracker are not trained together and do not share information. Pure RL takes a very long time to reach good performance, but it eventually gets there. The hybrid approach benefits from both sides: it learns relatively fast in the beginning and then converges to the best performance.
Next we did some state-tracking analysis. To do this, we ran the trained models, both the baseline and the hybrid, collected about ten thousand samples from each, and report the precision and recall of the tracking for each one. Surprisingly, the baseline's tracking scores are much higher than the proposed approach's. So what happened?
We looked at some example dialogues, and then we could see what happened. On the left is the agent from the baseline and on the right is the proposed model. The agent asks whether the person is from America, and the user answers something like "I don't think so", which is kind of difficult for the model to classify. The baseline is just a classifier; it does not take the future into account and has to make a hard decision right away, so it chooses "yes", which is a wrong turn. In the second case, the proposed model finds that the answer is ambiguous, so it keeps "I don't know" for now and asks another question. This time the user simply says "no", which is much easier to classify, so now it updates the value correctly. So the main difference is that the baseline does not take the future into account and has to commit at every turn, while the proposed model, because it is trained with reinforcement learning, models the future and can do long-term planning.
Lastly, we did some dialogue state representation analysis. We want to see how well the LSTM hidden layer is learning a dialogue state representation, and we have two tasks. The first task is to see whether we can reconstruct some important variables from the dialogue state embedding. We take the models saved at 20k, 50k and 100k training steps and train a linear regression on the embedding to predict the number of guesses the agent has made so far. We use 80 percent of the data for training and 20 percent for testing, and the table shows the R-squared on the test set. Clearly, for the models with more training it is easier to reconstruct this state variable from the dialogue state embedding, so we can to some extent confirm the hypothesis that the model is implicitly encoding this information in its hidden layer, although we never explicitly asked it to do so.
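A minimal version of this probe, assuming scikit-learn and hypothetical data-loading helpers, might look like this:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def probe_r2(embeddings, num_guesses):
    """embeddings: (N, d) dialogue state embeddings; num_guesses: (N,) counts."""
    X_tr, X_te, y_tr, y_te = train_test_split(embeddings, num_guesses,
                                              test_size=0.2, random_state=0)
    reg = LinearRegression().fit(X_tr, y_tr)
    return reg.score(X_te, y_te)   # R^2 on the held-out 20 percent

# Hypothetical usage, comparing checkpoints saved at different training steps:
# for ckpt in ["20k", "50k", "100k"]:
#     print(ckpt, probe_r2(load_embeddings(ckpt), load_guess_counts(ckpt)))
```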
The second analysis is a retrieval task. Because we built the simulator, we also know its true internal state, so we have many pairs of a dialogue state embedding with the corresponding internal state of the simulator. The assumption is that if the dialogue state embedding is really learning the true internal state of the simulator, those two spaces must be highly correlated. So we can do a simple nearest-neighbour retrieval based on cosine distance in the embedding space and then compare the similarity of the retrieved true states: the true states of the retrieved neighbours should also be very similar to each other, because their embeddings are very close in embedding space.
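A minimal sketch of this retrieval check (my own implementation, not the authors' code):

```python
import numpy as np

def disagreement_rate(embeddings, true_states, var_index, k=5):
    """Fraction of the k cosine-nearest neighbours whose true state variable
    differs from the query's; lower means the embedding mirrors that variable."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)            # never retrieve the query itself
    rates = []
    for i in range(len(embeddings)):
        nbrs = np.argsort(sims[i])[-k:]        # indices of the k nearest neighbours
        vals = true_states[nbrs, var_index]
        rates.append(np.mean(vals != true_states[i, var_index]))
    return float(np.mean(rates))
```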
We ran this experiment. The horizontal axis here is basically the variable index in the simulator, and the vertical axis is the probability that the retrieved five nearest neighbours differ from each other on that variable. Again we compare the 20k, 50k and 100k models, and again we can confirm that for models with more training the probability of a difference keeps decreasing. That also means the dialogue state embedding gradually becomes more and more correlated with the internal state of the simulator, so it is actually learning the internal dynamics of this particular environment.
In conclusion, we first showed that it is possible to jointly learn dialogue state tracking and the dialogue policy together, and that this outperforms the baseline modular approach. Second, we showed that the hidden layer of a recurrent neural network is able to learn a continuous vectorial representation of the dialogue state, and that this representation is task-driven: it only retains the information that is useful for making decisions to achieve the goal and leaves the irrelevant information out. Finally, we also showed that a purely reinforcement-learning approach with a very sparse reward still suffers from slow convergence in the deep reinforcement learning setting, and how to improve this learning speed is left for future work. Thank you.
Okay, so we have some time for questions.

So the user observation that is input to the model: how do you encode the input utterance? Do you just feed in the tokens, or do you extract any features?
So the question is how we encode the user observation. We do it very simply: we just take up-to-bigram bag-of-words features of the user observation. That vector is then linearly transformed into a user utterance embedding and concatenated with the other vectors from the database observation and the previous action.
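For illustration, a minimal sketch of such an up-to-bigram bag-of-words encoding (the vocabulary handling is simplified; the real system's feature extraction may differ):

```python
from collections import Counter

def bow_vector(utterance, vocab):
    """vocab maps each known unigram/bigram string to a feature index."""
    tokens = utterance.lower().split()
    ngrams = tokens + [" ".join(b) for b in zip(tokens, tokens[1:])]
    vec = [0.0] * len(vocab)
    for ng, count in Counter(ngrams).items():
        if ng in vocab:
            vec[vocab[ng]] += count
    return vec   # this vector is then linearly projected into the utterance embedding

vocab = {"i": 0, "don't": 1, "think": 2, "i don't": 3, "don't think": 4}
print(bow_vector("I don't think so", vocab))
```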
Go ahead.

Okay, so the first question is whether we tried a simple decision tree. I tried it in the most simple setting, where the user only answers with exactly yes, no or "I don't know", so there is no ambiguity and it is just a three-way classification, and a decision tree works pretty well there. But I did not try it at a larger scale; the point is that if you compare this model with a decision tree in that simple setting, both work well, but in a more complex setting the advantage of the end-to-end model should become more obvious. And the second question was about the baseline: the baseline here has a state tracker that is trained as a three-way classifier, plus a separate model that selects the verbal action at every turn. The problem is that the two components do not know about each other, so when one makes a mistake the other is not aware of it, and there is a distribution mismatch between training and testing.
In this setting, did you try having soft outputs from the state tracker feed into the policy, that is, confidence scores or probability distributions instead of hard labels?

No, we use hard decisions in both cases.
It is a really nice paper, really interesting, but I guess I am slightly sceptical about whether this actually scales. You have only got three possible user intents, yes, no and "I don't know", with no elaborations of those; the only uncertainty comes from allowing different ways of saying yes, so essentially there is no noise. So, two things: first, why haven't you tried it with real users, so you would have the real kind of noise you get from an ASR system? And second, have you done anything which suggests this would scale to a system of the type of a question-answering or general personal agent, where the user expresses some really rather rich intents that then need to be encoded in the state you are tracking? Because as far as I can see, from the user's side the state is basically a three-state tracker.
So this model, at this point, is quite preliminary. If I go back to the proposed architecture, how we define the hypothesis actions determines the scalability of this approach. At this point, because there are only three intents, we can have just three actions that change the value of each hypothesis slot. It is true that if there are many attributes involved in the system, and we need to track a complex dialogue state that has to be symbolic, then whether this proposed approach remains a good fit is an open question. For the proposed architecture to still work, we would have to design the action set and decide how to maintain the external memory that holds the important variables interfacing with the database, and I think that will be part of future research.
Okay, I am afraid that is all the time we have, so let's thank the speaker again.