Right, for the last talk of the session I'll try to keep it fun. There are a lot of videos, it's on a game, so it's going to be fun and engaging.
So this work is on using reinforcement learning for modeling incrementality in the context of a fast-paced dialogue game.
This is joint work with my advisers David DeVault and Kallirroi Georgila.
So, incrementality: that's what this work is focused on. Human speech processing is incremental, right? We process the content word by word, and sometimes even at the sub-word level, but we try to process it as soon as the information is available. Incrementality helps us model different natural dialogue phenomena such as rapid turn-taking, speech overlaps, barge-ins, and backchannels, so modeling these things is very important to make dialogue systems more natural and efficient.
So the contributions of this work can be grouped into three points. The first is that we apply a reinforcement learning method to model incrementality. The second is that we provide a detailed analysis of what it takes: in our previous work we built a state-of-the-art, carefully designed rule-based baseline system which interacts with humans in real time, and it performs nearly as well as humans, so it's a really strong baseline. I'll show the videos and you'll get more context in the slides to come. The reinforcement learning model we introduce here actually outperforms that baseline, but please keep in mind this is an offline evaluation; we don't have a live system yet, but it does outperform the baseline. And third, we also provide some analysis of the effort and time it took to develop each approach.
The selected domain is a rapid dialogue game we call RDG-Image. It's a two-player, collaborative image-matching game, so each person is assigned a role, either the director or the matcher. The director sees the eight images you see on the screen; one of them is highlighted with a red border, and the director is supposed to describe this one. The matcher sees the same eight images in a different order and is supposed to make a selection based on the description given. The goal is to get as many matches as possible in the allotted time.
So it's fast and incremental. Let's look at an example of how this game works. Here you see two players, humans playing with one another. The person on the top is the director: they see one of the images highlighted with the red border and describe the highlighted image. The other person below is the matcher, who tries to guess the image based on the description. There is also a timer and a score displayed. [Video of two humans playing the game plays.]
Okay, so as you saw in this particular game, the dialogue is very fast and incremental; there's a lot of rapid turn-taking. It's a fast-paced game, and it's fun. We collected a lot of data from these human conversations and then we designed an incremental agent, whom we call Eve. She is a high-performance baseline system: she's trained on the human conversation data, and I'll provide more details in the coming slides. We evaluated her with one hundred twenty-five users, and she performs nearly as well as the humans.
This video will show you how the interaction between Eve and a human goes; this is Eve playing the game. On the top you will see the eight images that the human sees, and on the bottom you see Eve's eight images, with green bars going up and down; those are basically her confidence in each image, and they change based on the human's descriptions.
[Video plays. The human describes images, for example "It's a yellow bird," "A sleeping black and white cat," "A bike with handlebars," and Eve responds as her confidence rises.]
Alright, so that's Eve playing the game with humans in real time, and she's not an easy agent to begin with.
How does she work? Basically, we have the user's speech coming in and an incremental ASR (Kaldi), which provides a one-best hypothesis every hundred milliseconds. We use this hypothesis to compute a confidence distribution over all eight images on the screen. Then the dialogue policy uses these distributions and decides whether to wait, to select, or to skip. The wait action means she stays silent and keeps listening. Select means she has enough confidence to make the selection. And skip is where she's thinking, "Hey, I'm not getting much information; maybe I'll just skip, go to the next one, and hope I get that one right." The natural language generation is very simple; it's template-based. On a selection she says "Got it," as you heard in the video, and if she's skipping she says something like "Let's move on."
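To make this pipeline concrete, here is a minimal sketch of the incremental loop just described. The object and method names (asr.get_partial, nlu.image_confidences, policy.decide) are hypothetical placeholders for illustration, not the actual system's API.

```python
TEMPLATES = {"SELECT": "Got it!", "SKIP": "Let's move on."}

def run_episode(asr, nlu, policy, images):
    """Process one image sub-dialogue incrementally."""
    t = 0.0
    while True:
        partial = asr.get_partial()                     # 1-best hypothesis every 100 ms
        conf = nlu.image_confidences(partial, images)   # confidence distribution over the 8 images
        action = policy.decide(max(conf.values()), t)   # WAIT / SELECT / SKIP from P(top) and time
        if action == "SELECT":
            return TEMPLATES["SELECT"], max(conf, key=conf.get)
        if action == "SKIP":
            return TEMPLATES["SKIP"], None
        t += 0.1                                        # WAIT: keep listening; next partial in 100 ms
```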
Now, the focus of this work is the dialogue policy. The dialogue policy in the previous work uses hand-designed rules; we call them carefully designed rules, and I'll explain why in a minute. We asked: can we do better than the current baseline? So we use reinforcement learning and try to see whether it can perform better.
The carefully designed baseline uses these quantities. The first is P(top), which is the highest probability assigned to any one of the eight images. Then there are two thresholds: the identification threshold and the give-up threshold. The identification threshold (IT) is the minimum confidence P(top) should reach for a given image, above which Eve can say "Got it." The give-up threshold (GT) is the maximum time she waits, after which she says "Skip." Any time in between, she is waiting. That is the carefully designed rules (CDR) baseline system.
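As a rough sketch, the CDR baseline policy just described boils down to two threshold checks. The IT and GT values below are illustrative only; in the actual system they are tuned from the human data, as explained next.

```python
IT = 0.8    # identification threshold (illustrative value)
GT = 12.0   # give-up threshold in seconds (illustrative value)

def cdr_policy(p_top, elapsed):
    """p_top: highest NLU confidence among the 8 images; elapsed: seconds spent on this image."""
    if p_top >= IT:
        return "SELECT"   # confident enough: Eve says "Got it"
    if elapsed >= GT:
        return "SKIP"     # waited too long: move on to the next image
    return "WAIT"         # otherwise stay silent and keep listening
```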
Why do we call these carefully designed rules? In published comparisons of learned policies against rule-based ones, one thing that is often not reported is how much time was actually spent designing the rule-based system, so in this work we do report that. The identification threshold IT and the give-up threshold GT are not just some random values that we picked; they are tuned from the real human conversation data. We use something called the eavesdropper framework to obtain IT and GT; for more details please refer to our earlier paper from 2015. We spent almost one month trying to find the best way to design these policies; predicting the next word is one such example. So that's what it looks like: we designed these rules, and the baseline actually performs nearly as well as humans, so it's a really strong baseline.
But even though these are carefully designed rules, she still has a few limitations, which I've grouped into case one, case two, and case three. In this slide, the x-axis is time as the game progresses, and the y-axis is the confidence assigned by the NLU. Each point is a partial coming in from the ASR, so the confidence keeps changing. In case one, Eve is too eager to skip. In case two, she's too eager to select; sometimes what happens with incremental speech recognition is that we have a lot of unstable hypotheses, and that often leads to this kind of behavior. And in case three, Eve could actually save time by selecting or skipping earlier. So these are three cases where Eve could perform better.
So we use reinforcement learning. The state space is represented by a tuple: P(top), which is the highest confidence in any one of the eight images, and T, the time consumed so far. The actions are select, skip, or wait. The transitions come from the data rather than an explicit model, and the reward is very simple: if Eve gets the image right, she gets a reward of plus one hundred; if she gets it wrong, it's negative one hundred. The wait reward is a very small epsilon value, very close to zero, and she gets a slightly larger reward for skipping.
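Here is a small sketch of that reward structure, just to make the shape of the MDP explicit. The exact epsilon magnitude and the skip value are assumptions for illustration; the talk only says the wait reward is close to zero and the skip reward is a bit larger.

```python
EPS = 1e-3   # assumed magnitude; the talk only says "very close to zero"

def reward(action, correct=None):
    """Terminal and non-terminal rewards, in the spirit of the description above."""
    if action == "SELECT":
        return 100.0 if correct else -100.0   # right image +100, wrong image -100
    if action == "SKIP":
        return 10 * EPS                       # a bit more than waiting (relative scale assumed)
    return EPS                                # WAIT: tiny value close to zero
```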
The data that we use for this experiment comes in three flavours: the human-human data collected in the lab, the human web-interaction data collected in another experiment, and then Eve's interactions with humans, the one hundred twenty-five users I mentioned. Altogether there are more than thirteen thousand sub-dialogues. We split them by user: ninety percent of the users are used for training and ten percent for testing. For reinforcement learning we use least-squares policy iteration (LSPI), and we use radial basis functions to represent the features.
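As a sketch of what a radial-basis-function representation over the two state variables could look like, here is one possible featurizer. The grid of centers, the width, and the time normalization are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

T_MAX = 15.0   # assumed upper bound on per-image time, matching the 0-15 s axis mentioned later
CENTERS = [(p, t) for p in np.linspace(0.0, 1.0, 4) for t in np.linspace(0.0, 1.0, 4)]
WIDTH = 0.35   # assumed RBF width

def rbf_features(p_top, elapsed):
    """Map the state (P(top), time) onto a grid of Gaussian bumps plus a bias term."""
    s = np.array([p_top, min(elapsed, T_MAX) / T_MAX])    # normalize time to [0, 1]
    feats = [np.exp(-np.sum((s - np.array(c)) ** 2) / (2 * WIDTH ** 2)) for c in CENTERS]
    return np.array([1.0] + feats)
```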
So how does it operate? Every hundred milliseconds the ASR gives out a partial, P(top) is assigned by the NLU, and the policy decides whether the action is wait, select, or skip. If it's wait, the next state is sampled from the next time step, that is, what happened a hundred milliseconds later: the new value of P(top) and the new time, and the policy makes another decision. This keeps happening until we see a selection or a skip. Once it's a selection or a skip, we know the ground truth, so we can assign the reward based on that.
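To illustrate, here is one way a training episode could be rolled out from a recorded sub-dialogue, reusing the reward() sketch above. The SubDialogue fields (steps, correct_at) and the policy interface are assumptions for illustration, not the actual data format.

```python
def rollout(subdialogue, policy):
    """subdialogue.steps: list of (p_top, t) pairs, one per 100 ms ASR partial;
    subdialogue.correct_at(t): whether the top image at time t is the target."""
    samples = []
    for p_top, t in subdialogue.steps:
        action = policy.decide(p_top, t)
        if action == "WAIT":
            samples.append(((p_top, t), action, reward("WAIT")))
            continue
        correct = action == "SELECT" and subdialogue.correct_at(t)
        samples.append(((p_top, t), action, reward(action, correct)))
        break                       # SELECT or SKIP ends the sub-dialogue
    return samples                  # (state, action, reward) samples fed to LSPI
```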
This is a snapshot of how things work. On the x-axis you see the partials; each point is a partial coming in from the ASR. On the y-axis you see the confidence assigned by the NLU. In this example the baseline agent skips at this point, while the RL agent actually waits for a longer time until she sees very high confidence, and hence she gets the image right.
Okay, I want to take a little time to explain this graph; it's not so straightforward. On the horizontal axis you see three groups: on the left the wait actions, in the middle the skip actions, and on the right the select actions. This graph shows the complete state space, everything in the state. The red dots indicate the baseline agent's decisions, and the blue dots show what was learned by the reinforcement learning policy. On the vertical axis you see the time, going from zero to fifteen, and on the other axis you see the confidence, going from zero to one.
You can see that the red dots cluster tightly together: it's a rule-based system, so we can deterministically tell what action the agent takes. The blue dots are the actions learned by the reinforcement learning policy, so she's learning similar things, but there are some differences. The reinforcement learning policy learns to select an image only at very high confidence, extremely high confidence close to 1.0, when the time consumed is low. And if the time consumed is not so low, she learns to wait more. By waiting more she gets more partials, that is, more evidence, as a result of which she has a better chance of performing well in the game and hence scoring more points.
This next graph shows that; it's simpler. On the x-axis you see the average points scored for one of the image subsets, and on the y-axis you see the time consumed. The blue one is the reinforcement learning agent and the red one is the baseline agent. You see the RL agent actually waits for a longer time and scores more points, whereas the baseline system is in more of a hurry to skip or to make a selection. So here we have the RL agent scoring significantly more points than the baseline, and there's a trend that she takes more time to make her selections.
So why couldn't the CDR baseline learn what the reinforcement learning agent learned? If you go back to the policy we used for the CDR baseline, it treats the time and the confidence value P(top) independently of each other. What reinforcement learning does is learn to optimize the policy based on P(top) and the time consumed jointly, and that results in the reinforcement learning agent performing much better than the baseline agent.
This table shows how many points she scores, and PPS is the points per second, which combines both the points and the time aspect in one measure. You can see the RL agent consistently scores much higher in terms of points across all the image sets, but the points per second is of particular interest. In the baseline, you see, the points per second is 0.09 and for the RL agent it is 0.14; that means that by scoring more points she is actually doing better in the game, because her points per second is a lot higher.
And in the necklaces subset, we see that even though the baseline agent scored far fewer points, its points per second is very high. That's because the baseline agent is very eager and won some points by chance, whereas the RL agent gets more points basically by waiting more, as a result of which her points per second is lower there.
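For clarity, points per second is just the ratio of points scored to game time used; the sketch below shows the computation. The raw totals in the comment are made up purely to mirror the 0.09 versus 0.14 comparison from the talk.

```python
def points_per_second(points, seconds):
    """Points per second: total points scored divided by game time used."""
    return points / seconds

# e.g. 14 points over 100 seconds gives 0.14, versus 0.09 for 9 points in the same time;
# a higher PPS means more points earned per unit of game time.
```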
I want to discuss a little bit about the effort and the time. Rule-based systems are often criticised as being laborious and time-consuming to build, and they are, but this one actually performs nearly as well as humans, so I'm not sure the criticism is entirely fair. It also took nearly the same amount of time to build the CDR baseline as the reinforcement learning policy, excluding, of course, the data collection and infrastructure-building efforts. But the advantage we get is that the RL approach is more scalable, because adding features is easier.
For future work, we want to investigate whether these improvements transfer to live interactions, which means we want to put the reinforcement learning policy into the agent and see if it actually performs better in a real user study. Then we want to explore adding more features to the state space, and we want to learn the reward function from the data using inverse reinforcement learning.
And finally, I want to thank the anonymous reviewers for their very useful comments, NSF for supporting this work, and the people who provided the images used in this particular paper. Thank you very much; I'm happy to take questions.
Thank you very much, and now it's time for questions.
Q: Thank you very much for a nice talk, and just a clarification question regarding your reinforcement learning setup. If I'm correct, you're learning from a corpus, right? (A: Yep.) But you're using least-squares policy iteration, which is an on-policy method that requires learning from interaction, yet you're learning from a corpus.
A: Alright, so let me expand on that. We kind of treat the corpus as a real interaction. That is, for every hundred milliseconds, just as would happen in a real interaction with a user, per sub-dialogue we sample based on each time step: for the first hundred milliseconds we have a partial, and for that partial we have the probability distribution over the images and the time consumed, and we use the probability distribution and the time as the features. Then for the next time step, just as in a real interaction, the next partial comes in, and that next partial is something the user had actually spoken in the data we collected. And it keeps going on like that, so basically it's trained per sub-dialogue, per image.
Q: But I still think you would get an improvement if you used something like importance sampling, to account for the fact that the trajectory you're seeing is one that happened in the corpus, rather than an online exploration method like on-policy reinforcement learning.

A: That's a good point. We haven't explored that, but I guess it's something we could explore.
Q: Thanks for the talk. Two questions. First: can you explain a little more how you compute the confidence? Do you work with image recognition, for example using some CNN model?

A: We fake the vision.
Okay, so the way the NLU assigns confidence, the way the NLU is trained, is this: we have the human data we collected where humans are actually describing the images. We had the descriptions from the examples where the two humans were speaking and describing the target image, so we had the words associated with each image. Using real images and learning from the image itself is something we really want to do, but in this particular work we are just learning from the transcribed data.
Q: And did you play around with setting the reward for wait to actually be negative, so that you might speed up the decisions?

A: We tried a lot of different things. Before we settled on LSPI we looked at different algorithms; for example, we tried Q-learning, but it needed a lot more samples. And we did try a negative reward for the wait actions, but that would mean the agent is penalized for waiting, and we don't really want that: we want the agent to be rewarded for doing well in the game rather than shaping behavior through specific reward-function manipulation. The reward function is meant to be reflective of what's happening in the game: more points for doing well.
Q: I just wondered, have you tried switching the roles of the human and the machine in the game? What would happen if the machine has to describe the images?

A: Currently the agent only plays the matcher role; it doesn't play the role of the director. That becomes much more complex because we would have to incrementally generate the descriptions, but it's something we really want to do in future work. We don't yet know how.
Q: Thanks, really nice talk. Just a quick question about the state representation. Are you putting the partials in the state?

A: Yes, the partials...

Q: No, I mean, does the state have only those two features?

A: Just those two, yes.

Q: Okay, so you're not modeling the instability of the partials; that's not being captured. A partial might say "bicycle" and then change to "bike" or something like that. It could be faster if you put the stability in the state; the policy could learn more, because the raw partials are inconsistent.
A: That is right. Let me point out one small thing: the instability in case two is exactly why Eve scores less there. And what we actually want, and what actually happens in the game, is not just reacting to the NLU confidence fluctuating with all these blips; Eve learns a joint policy over the probability and the time, so she ends up waiting a lot more, giving the confidence a chance to stabilize.
Q: That's a fair point, but I think if you had more information in the state, the learning would probably be more successful, because it's possible you're violating the MDP assumptions a little bit.

A: So yes, adding more features to the state is something we plan to do. Right.
Thank you, thank you. I think we want to thank the speaker once again.