I would like to start the third and final invited talk. Our speaker received her degree in computer science and mathematics, and since then she has been at Cambridge University, where she received her MPhil and then her PhD in statistical dialogue systems, worked as a research associate, and most recently became a lecturer in open dialogue systems in the Department of Engineering. She is also a fellow of one of the colleges of Cambridge University. She is extremely well known, I am sure, to everyone in this community, because she is very well published, including a number of award-winning papers, and she is co-author of one of the nominees for best paper at this SIGdial. After her talk, if you still want to, you can dig even more into her and her colleagues' research: they have two posters at the poster session this afternoon. Please welcome our speaker.
Thank you for the kind introduction. SIGdial really is one big family, and if a family member asks you to do something, you kind of have to say yes. So thank you very much.
I will be talking about natural conversation: that is what we are all building towards. I will talk about how deep learning can help us along the way, and about some of the efforts we have made in the Dialogue Systems Group in Cambridge to achieve that.
I am sure we all agree that spoken conversation, and in particular dialogue, is one of the most natural ways of exchanging information between humans. We can read a book and then talk about what we just read. Machines, on the other hand, are very good at storing huge amounts of information, but not so good at sharing this information with us in a natural, human-like way.
I am sure you have read that lots of companies now have virtual personal assistants, and that they handle billions of calls. But the current models are very unnatural, limited in domain, and frustrating for users. So the research question that we want to address is: how do we build a continuously learning dialogue system capable of natural conversation?
Machine learning is very attractive for solving this task. If I had to summarize machine learning in just three words, these would be: data, model, and prediction. So what are these in our case? The data are simply dialogues, or some parts of dialogues: transcribed speech, annotated user intents, or user feedback. The model is the underlying statistical model that lets us explain the data we observe from a phenomenon we cannot model directly. And once we train the model, we can make predictions: what the user said, and what to say back to the user.
Traditionally, building statistical dialogue systems has assumed the following structure. A goal-oriented dialogue system consists of a speech understanding unit, a dialogue management unit, and a speech generation unit. When the user speaks, their speech is recognized by a speech recognizer, and a semantic decoder and state tracker produce the dialogue state, which represents where the dialogue currently is. Based on this, a policy makes a decision about what to say back to the user, and very often there is some kind of evaluator which estimates how good that decision was. Then a natural language generator produces the textual output, which is presented to the user via a text-to-speech synthesizer. Behind all of these modules is the ontology, a structured representation of the database that the dialogue system can talk about. This is the structured knowledge we use in goal-oriented dialogue systems.
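The modular pipeline just described can be sketched as a chain of stub functions. All names and module bodies below are hypothetical placeholders; only the data flow between the modules is taken from the talk.

```python
# Minimal sketch of the classic modular spoken dialogue system pipeline.
# Every module body is a toy placeholder; only the data flow matters.

def speech_recognizer(audio):
    return "i am looking for a thai restaurant"      # text hypothesis

def semantic_decoder(text):
    # map text to a dialogue act over ontology concepts
    return {"inform": {"food": "thai"}}

def state_tracker(state, user_act):
    # accumulate the dialogue state turn by turn
    state = dict(state)
    state.update(user_act.get("inform", {}))
    return state

def policy(state):
    # decide what to say back to the user
    return {"request": "area"} if "area" not in state else {"offer": state}

def generator(system_act):
    return "what part of town would you like?"       # text for TTS

state = {}
text = speech_recognizer(b"...")
act = semantic_decoder(text)
state = state_tracker(state, act)
sys_act = policy(state)
print(generator(sys_act))
```

Each stage only sees the single best hypothesis of the previous one, which is exactly the information loss discussed next.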
What has notably changed in the last decade is automatic speech recognition. Some researchers go as far as saying that machines have reached the performance of humans in speech recognition. I would not say that, but I would just like to point out that clearly much more remains to be done in the rest of the structure of a spoken dialogue system.
Now, with this modular structure there is a loss of information between the modules, and the question is what we can do to alleviate it. What we can do is pass probability distributions between these modules. When you do that, the semantic decoder and the state tracker become the belief tracker module, which produces a distribution over possible dialogue states: the belief state.
But before I go further into explaining how these systems work, let's take a step back and see how these systems are currently perceived by users. [A video plays: a user talks to his personal assistant.] "Hey PA, what time is Tim's wedding tomorrow?" "Tim's wedding is at five." "OK, never mind, wake me up with an alarm at 4:45 am." "It takes care of pretty much everything that I could probably do myself, but it's kind of cool having someone to do it for me. It did not, however, prevent me from spilling coffee all over my trousers just before the ceremony." "Hey PA, I've got a problem." "Your wedding is in one hour and twenty-five minutes." "It's Tim's wedding! And I've got coffee all over my trousers." "There is a department store nearby." "Are any stores open?" "That's an interesting question." "OK, then call me a taxi to the centre." "I'm sorry, I can't help with taxis." "What do you mean you can't call a taxi?" "I'm sorry, I cannot help you with that." "So how am I supposed to get to the wedding?" "Your wedding is in one hour and twenty-four minutes." "For the last time, it's Tim's wedding!" "Your wedding is in one hour and twenty-four minutes." "Well, thanks, PA, you're really helping me out." [End of video.] So that, I would say, resembles actual interaction with virtual personal assistants; users really are going through that.
So let's look at the problems we want to address. The most obvious one from this video is emotional awareness: this personal assistant was completely unaware of the user's emotions and their state. But there are some things we need to address even before that. The first problem is that these systems still do not scale, and it often takes a long time to adapt a dialogue system to a new context. The second problem is that the repertoire of system responses is not very rich, and the reason is that the learner chooses between a very small set of actions; if we want to build natural conversation, we need to allow our systems to choose between a wide variety of actions. And finally, systems do not adapt to different user needs. This can be interpreted in many different ways, but it is clear that we need to model the user better if we want to achieve a better dialogue system.
So let me start with belief tracking, and explain why we need to track uncertainty in the first place. This is a snippet of a dialogue with a system that can talk about restaurants. The user said: "I am looking for a Thai restaurant." "Thai" and "high" are acoustically very similar, so there is very likely to be a misrecognition, and we may get "high restaurant" in the recognition output. We then extract the dialogue state based on our ontology for the restaurant domain: the slot-value pair for food is "Thai", but with low confidence. Let's say the system is fairly sure about the domain, but not so sure about the slot value, so the system asks: "What kind of food would you like?" The user answers, and this again gets misrecognized, "Thai" as "high". If we do not do any tracking at this point — and this is largely what happened before — the probability of "Thai" stays very small, and the system has no option but to ask the same question again: "What kind of food would you like?" And this is what is particularly annoying to users: being asked the same question over and over. Now, what happens if you do track the belief? You remember that the previous turn was annotated with "Thai", so even though the probability of "Thai" within this turn is very low, the overall probability will actually be higher, and if the same answer comes a third time, higher still. The system then has the option of using a higher-level action, like confirming, which is a much better action. So we want to build systems that are aware of uncertainty, but the question is: how do we manage it?
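The effect of accumulating evidence across turns can be illustrated with a deliberately crude calculation. This is not the actual tracking model, just the intuition: if each noisy turn gives "Thai" only confidence 0.3, the probability that the user asked for Thai food at least once grows with every repetition.

```python
def accumulated(turn_confidences):
    # probability that "thai" was said in at least one turn,
    # naively treating the per-turn confidences as independent
    p_missed = 1.0
    for p in turn_confidences:
        p_missed *= (1.0 - p)
    return 1.0 - p_missed

print(accumulated([0.3]))                   # single turn: 0.3
print(round(accumulated([0.3, 0.3]), 2))    # two turns: 0.51
```

With two low-confidence repetitions the accumulated belief already exceeds one half, which is why a tracking system can afford to confirm instead of asking the same question again.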
If you think about it, this looks like a very simple problem: all you are doing is matching the concepts that you have in the ontology against the input, what the user said. But the problem is not simple, because, as we all know, there are so many ways to refer to a particular concept in natural language. Traditionally, what you had to do was build a belief tracker for each of these concepts separately, and that is something which does not scale if you want to build natural conversation. So let me talk about scaling belief tracking. The solution to this problem is to reuse the knowledge you have about one concept for other concepts, because we cannot hope to have labeled data for every kind of concept we want a dialogue system to talk about. Real humans adapt very readily to new situations, and they reuse knowledge to do that.
The essential ingredients for large-scale belief tracking are semantically constrained word vectors and shared parameters. Let me explain what we mean by semantically constrained word vectors. Typically you have some closed set of concepts: domains like restaurants, slots like price range, and values like cheap. What you want from your word vectors is that words which are semantically similar are close in the vector space. Standard distributional vectors achieve this to some extent, but you also have to take into account what kind of application you have. For instance, "queen" and "king" — heads of state — are semantically similar, and it is fine for them to be close. But if you have a dialogue system, it matters whether the user said "I'm looking for something in the north" or "I'm looking for something in the south": "north" and "south" in this context mean really different things, yet standard vectors place them close together, because they appear in similar contexts. To address these limitations, a former PhD student from our group used semantic dictionaries — synonyms and antonyms — to specialize the vector space. So "cheap" and "expensive", which co-occur in similar contexts, end up far away from each other, while "cheap" and "inexpensive", being synonyms, end up close. "Cheap" and "expensive" are concepts from the ontology, and these specialized vectors capture the ways the user may refer to those concepts. So we use these vectors for scalable tracking.
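The specialisation idea can be sketched with a toy update. The two-dimensional vectors and step sizes here are hypothetical; the real method optimises a proper objective over a full lexicon of synonym and antonym pairs.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

vec = {  # toy "distributional" vectors: the antonyms start out close
    "cheap":       [1.0, 0.9],
    "expensive":   [0.9, 1.0],
    "inexpensive": [1.0, 0.8],
}

def attract(a, b, step=0.3):
    # pull synonym vectors together
    vec[a] = [x + step * (y - x) for x, y in zip(vec[a], vec[b])]

def repel(a, b, step=0.6):
    # push antonym vectors apart
    vec[a] = [x - step * (y - x) for x, y in zip(vec[a], vec[b])]

before = cos(vec["cheap"], vec["expensive"])
attract("inexpensive", "cheap")
repel("cheap", "expensive")
after = cos(vec["cheap"], vec["expensive"])
print(before > after)   # True: the antonyms moved apart
```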
When you do belief tracking, there are typically three questions. The first question is: is what the system is saying referring to something we have in the ontology? The second question is: is what the user is saying referring to something we have in the ontology? And the third question is: what is the answer, given the context of the whole conversation? Let me go through the first question. The system utterance — "How may I help you?" or anything else — can be embedded via word vectors and passed through a feature extractor. The idea here is to make these feature extractors generic. In our case we used CNNs, but this could be any kind of feature extractor, such as bidirectional LSTMs. There is one generic extractor for the domain, a generic one for the slot, and a generic one for the value. Then, for what we have in the ontology, we have embeddings — for "restaurant", for "name", for "price range", and for each value. So we calculate the similarity between what our feature extractor produced for the domain and the domain embedding, and likewise for slots and values. The same procedure is applied to the input that comes from the real user. Then you need to take the dialogue context into account: an LSTM or a GRU — anything with recurrence — can help you keep track of the context. What you get out is a probability for the domain; then, by the same procedure, a probability for the slot and for the value; and what you are after is the probability of a particular domain, a particular slot, and a particular value together. You do this for everything in your ontology, and you get the belief state for the current turn.
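The core matching step can be sketched as follows. The two-dimensional embeddings and the single shared comparison function are toys; the real model uses learned convolutional feature extractors and the specialised word vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy ontology embeddings; in the real model these come from the
# semantically specialised word vectors.
ontology = {
    ("food", "thai"):    [0.9, 0.1],
    ("food", "chinese"): [0.1, 0.9],
}

def candidate_scores(utterance_features):
    # One shared, generic comparison for every slot-value candidate:
    # similarity of the utterance features to the candidate embedding.
    return {sv: sigmoid(dot(utterance_features, emb))
            for sv, emb in ontology.items()}

scores = candidate_scores([2.0, -1.0])   # features leaning towards "thai"
print(scores[("food", "thai")] > scores[("food", "chinese")])
```

Because the comparison is shared, adding a new value to the ontology only requires its embedding, not a new tracker.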
So then we evaluated this tracker. But how can you evaluate belief tracking? You need data annotated with these labels. In Cambridge we have run a number of efforts to create labeled datasets, and what we use is the Wizard-of-Oz setup. You have two humans: one is representing the system, so they have access to the database; the other is representing the user, and is provided with a task to complete, the user goal. The two talk to each other, in our case in a text chat, and we also ask the one playing the system to annotate what the user is saying, so we get the labels directly. The first dataset we collected this way is very small: one thousand two hundred dialogues, in only one domain, with a small number of slots and a small number of values. Recently we collected a much larger dataset, which has almost ten thousand dialogues across seven domains. And the great thing here is that the change of domain does not only happen at the dialogue level but also at the turn level; it has much longer dialogues, many more slots, and many more values.
We compared this model to a very strong baseline, the neural belief tracker, which was also developed in our group, but which does not do this knowledge sharing between different domains and different slots. On the smaller dataset, our model outperformed the neural belief tracker on every slot and on the joint goal. Now, what happens on the larger dataset? There the problem is a bit more complex, because you are also tracking domains, and the neural belief tracker was not designed to track domains, so we compared on just a single domain, and our model outperforms it there as well. Looking at the numbers, they are generally lower than on the original dataset, which shows that this dataset is much richer and more difficult to track. And while it is not really a fair baseline, just to show you how difficult this task is: a naive baseline achieves far lower accuracy, whereas the knowledge-sharing model gets ninety-three point two. This work was done with my students, and if you are here next week, one of them will talk about it in more detail.
I am now going to move to the second part: dialogue policy. What is the difference between belief tracking and policy optimisation? Dialogues evolve over time, and at any given point in the dialogue, belief tracking accumulates everything that has happened so far — belief tracking summarizes the past. What does the policy do? The policy, at this same point, looks into the future: it chooses the action, in the form of a dialogue act, that will be best in terms of whether the user will be satisfied at the end of the dialogue. So the policy has to look ahead; it is the one that plans.
And what is the machine learning framework which allows us to perform planning? That is reinforcement learning. In reinforcement learning, we have our dialogue system interacting with a user. The system is taking actions, and the user is responding with observations. Based on these observations we build the belief state, and the user occasionally gives us a reward. Here, when I say user, it may be a real user or it may be a simulated user; it does not need to be a real user at every step, which is important in practice. The policy is a mapping from belief states to actions, and we want to find the policy which gives the highest overall user satisfaction.
Let me remind you of some of the concepts from reinforcement learning that we need here. Probably the most important is the concept of the return. At a given point in time, the return is the random variable which says what the overall future reward from this point onwards is. Because it is a random variable, we can only estimate its expectation. The expectation of the return starting from a particular belief state is the value function, and if we take the expectation starting from a particular belief state and taking a particular action, it is the Q-function. Estimating the value function, the Q-function, or the policy are equivalent problems in the sense that if we find the optimal Q-function, we will also be able to find the optimal policy.
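In symbols, with discount factor $\gamma$, per-turn reward $r_t$, belief state $b$, and action $a$, the return and the two expectations just described are:

```latex
R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}, \qquad
V^{\pi}(b) = \mathbb{E}_{\pi}\left[ R_t \mid b_t = b \right], \qquad
Q^{\pi}(b, a) = \mathbb{E}_{\pi}\left[ R_t \mid b_t = b,\, a_t = a \right]
```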
In deep reinforcement learning, the value function, the Q-function, or the policy is approximated with a neural network. This is good because neural networks give us more accurate approximation, which is preferable for reinforcement learning; but none of these functions are convex any more, so the optimization can end up in a local optimum.
Probably the most famous deep reinforcement learning algorithm is the deep Q-network, DQN. What does a deep Q-network do? It approximates the Q-function with a parameterized neural network, and the gradient of the loss involves the difference between what our parameterized function estimates and a bootstrapped target, multiplied by the gradient of the Q-function with respect to the parameters. The problem with DQN is that it uses biased estimates, the data are correlated, and the targets are non-stationary, which is why DQN is a very unstable algorithm: it can often give you good results, but sometimes it just does not work at all.
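The DQN target and TD error can be sketched with a tabular stand-in for the network. The states and values below are hypothetical; a frozen copy of the table plays the role of the target network.

```python
def dqn_td_error(q, q_target, transition, gamma=0.99):
    """One-step TD error used in the DQN loss (sketch): q and q_target
    map (state, action) -> value; q_target is a periodically frozen copy,
    which is what makes the bootstrapped target more stable."""
    s, a, r, s_next, actions = transition
    target = r + gamma * max(q_target[(s_next, a2)] for a2 in actions)
    return target - q[(s, a)]

q = {("b0", "request"): 0.5,
     ("b1", "request"): 0.2,
     ("b1", "offer"):   1.0}
q_target = dict(q)   # frozen copy

err = dqn_td_error(q, q_target,
                   ("b0", "request", 0.0, "b1", ["request", "offer"]),
                   gamma=0.9)
print(round(err, 3))   # 0.4
```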
Alternatively, you may want to optimize the policy directly using a neural network. You assume a parametrization of the policy with parameters theta, and then the gradient of the objective you want to maximize — the value of the initial state — is given by the policy gradient theorem. I do not have time to prove it here, but let me just say that it is used directly in the REINFORCE algorithm. Its estimate is unbiased, unlike the one from DQN, but it has a very high variance, which again is not something that we prefer.
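The policy gradient theorem referred to here, with policy $\pi_{\theta}(a \mid b)$ over belief states, reads:

```latex
\nabla_{\theta} J(\theta) =
\mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid b)\; Q^{\pi_{\theta}}(b, a) \right]
```

REINFORCE replaces $Q^{\pi_{\theta}}$ with the sampled return, which keeps the estimate unbiased but makes it high-variance.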
You can also use the actor-critic framework, which is going to give you the best of both worlds. Here is what an actor-critic framework looks like: we have our user, and we have our policy optimizer, a model that has two parts. One part is the actor: this is the policy itself, observing belief states and taking actions. The other part is the critic, which criticises the actor: the actor takes some action, the user responds with a reward and an observation that updates the belief state, and the critic then evaluates how good the actor's action was.
Now, when we apply these methods to dialogue systems — deep reinforcement learning for the policy of a dialogue system — we often find that it takes too many iterations to train. So traditionally we resort to using a summary space: we compress the belief state, we choose from only a small set of summary actions, and we have heuristics which say what each summary action really means. Each summary action corresponds to a much larger master action space, which typically has two orders of magnitude more actions than the summary space. But this is obviously not good enough: if we really want to build natural conversation, we do not want any kind of heuristics; we want the system to explicitly learn to choose between much richer actions.
So the problem is that too many interactions are needed, and the solution in this case is experience replay — but done in a way that allows us to learn in a much larger action space. The algorithm we use is called ACER, the actor-critic with experience replay. It estimates the Q-function off-policy, it uses Retrace to compute the targets, and it uses trust region policy optimisation. Let me briefly go through these points.
First, experience replay. As you interact with your dialogue system, you generate data that can be stored in a pool, the replay memory. In order to make the most of your data, you can go through that pool and replay past experience. But by that point the system has learned something, and its current policy would take different actions, so we should not expect exactly the same rewards. Therefore we use importance sampling ratios to reweight: they express how likely the stored behaviour is under the policy we have right now compared to the policy that generated it, and we weight our gradient with these ratios.
Now, if you do this for the Q-function, you inevitably have to go through the whole trajectory, and that means multiplying importance sampling ratios along it. If you multiply many small numbers, the product vanishes; if you multiply many large numbers, it explodes. The solution is to truncate the importance weights, and to add a bias correction term, to acknowledge the error that the truncation introduces. This is what the Retrace algorithm allows us to do.
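The truncation-with-bias-correction trick can be sketched as follows. This is a minimal stand-in for the weights used in ACER/Retrace-style updates; the cap c and the ratio values are illustrative.

```python
def truncated_is_weight(rho, c=1.0):
    """Truncate the importance ratio rho = pi(a|b) / mu(a|b) at c.
    Returns (truncated_weight, bias_correction_weight); the correction
    weight [(rho - c) / rho]+ is only active when rho exceeds the cap."""
    truncated = min(rho, c)
    correction = max(0.0, (rho - c) / rho) if rho > 0 else 0.0
    return truncated, correction

# Products of truncated ratios along a trajectory can no longer explode:
ratios = [5.0, 4.0, 6.0]
prod = 1.0
for r in ratios:
    w, _ = truncated_is_weight(r, c=1.0)
    prod *= w
print(prod)   # 1.0, instead of 5 * 4 * 6 = 120
```

Capping keeps the product bounded, and the correction term accounts for the probability mass that the cap discards.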
Remember that we want to use the actor-critic framework, so we want to estimate both the policy and the Q-function. The Retrace algorithm provides the targets for the Q-function, so the loss for Q is given by the Retrace estimate. For the policy gradient, showing why this construction gives an estimate with small bias and small variance would take a while, but let me give you the intuition for why the gradients neither vanish nor explode: instead of simply multiplying our importance sampling ratios along the trajectory, we truncate them so that the product stays bounded, and the error that we make by truncating is compensated by a bias correction term, which is itself bounded by construction.
The final ingredient is trust region policy optimisation. The problem is that when you optimize the policy directly in a reinforcement learning framework, small changes in parameter space can result in very large and unexpected changes in the policy. The solution is the natural gradient, which gives you the direction of steepest descent in policy space, but it is expensive to compute. The natural gradient can be approximated via the KL divergence between the policies of subsequent parameter updates, and trust region policy optimisation approximates that KL divergence with a first-order Taylor expansion, so that the divergence between subsequent policies stays small and the policy does not jump unexpectedly. This is particularly important if you want to train in interaction with real users: you really cannot afford to behave unexpectedly.
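The trust-region idea can be written as a constrained update, with a small bound $\delta$ on how far each new policy may move from the previous one:

```latex
\max_{\theta} \; J(\theta)
\quad \text{s.t.} \quad
\mathrm{KL}\left( \pi_{\theta_{\text{old}}}(\cdot \mid b) \,\middle\|\, \pi_{\theta}(\cdot \mid b) \right) \le \delta
```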
Now, if we want to train a dialogue system that learns directly in the master action space, we need an adequate neural network architecture. The actor-critic framework means we estimate the policy and the Q-function at the same time, and in order to make the most of the data, they share a feature extractor on top of the belief state. Because we want to learn in the master action space, we have to choose between very many actions, so both the policy part and the Q-function part are factored: one part chooses the summary action — the type of dialogue act — and another part chooses which slot should complement that dialogue act. The gradient for the policy is then given by trust region policy optimisation, and the gradient for the Q-function is given by ACER.
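The factored architecture can be sketched structurally. The act and slot inventories, the identity trunk, and the toy logits below are all hypothetical placeholders; the real model learns every one of these components.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [x / z for x in e]

SUMMARY_ACTS = ["request", "confirm", "offer"]
SLOTS = ["food", "area", "pricerange"]

def shared_features(belief_state):
    # shared trunk on top of the belief state (identity stand-in here)
    return belief_state

def policy_head(h):
    # factored policy: a distribution over summary acts and one over slots
    act_logits = [sum(h)] * len(SUMMARY_ACTS)   # toy logits
    slot_logits = h[:len(SLOTS)]
    return softmax(act_logits), softmax(slot_logits)

def q_head(h):
    # one Q value per (summary act, slot) master action
    return {(a, s): 0.0 for a in SUMMARY_ACTS for s in SLOTS}

h = shared_features([0.2, 0.5, 0.3])
p_act, p_slot = policy_head(h)
print(len(q_head(h)))   # 9: 3 summary acts x 3 slots
```

Factoring the output keeps the number of parameters linear in acts plus slots, even though the master action space is their product.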
So how does this work? We applied this in the Cambridge restaurant domain. We have a reasonably large belief state, and consequently a very large number of master actions, and everything here is light enough that it trains on my laptop. Here are the results. The x-axis shows training dialogues, and the y-axis shows success rate — whether the dialogue was successfully completed or not. One model is learning in the master action space, and the other is learning with summary actions. As you can see, the policy learning in the summary space is faster, because it chooses between a much smaller number of actions; but although the master action space has two orders of magnitude more actions, learning is actually only about twice as slow. So this is good news. We also evaluated these policies in interaction with real users on Amazon Mechanical Turk, and you can see that the performance in terms of success rate is almost the same — but in the master action case, the policy learned the behaviour on its own, without heuristics. If you would like to read more, we have a journal article about this work, which has just been accepted for the Transactions on Audio, Speech and Language Processing.
OK, so one more thing, which you probably heard about earlier today — my student gave a talk about it. We also addressed the problem of having better user models. When we optimize dialogue management, we need to interact with a simulated user, and we often find that the simulated users we can use are hand-coded, not very realistic, or assume a particular structure of the interaction — unlike what happens when you have real users. The solution here is to train the user model in an end-to-end fashion, and the hoped-for outcome is a potentially more natural conversation with this simulated user.
So what does the neural user simulator look like? The simulated user consists of three parts. The first part is the goal generator; you can think of it as a random generator which generates the goals that a real user might have in the dialogue. The second part is the feature extractor, which extracts features from the dialogue state that relate to what the user's goal is. And then there is a sequence-to-sequence model which decodes the feature history into the user utterance. Here is how it works. We extract features: for instance, if the system said "I'm sorry, there is no such restaurant", the features would express that the system could not provide what the user wanted, and the user goal can then potentially change. An RNN keeps a history of these features in its hidden layer, and then the sequence decoder starts with a start-of-sentence symbol and produces, word by word, what the simulated user is going to say.
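Structurally, the three parts can be sketched like this. Everything here — the goal format, the feature names, and the rule standing in for the sequence decoder — is a hypothetical placeholder for the learned components described in the talk.

```python
import random

def goal_generator(seed=None):
    # random user goal: the constraints the simulated user will pursue
    rng = random.Random(seed)
    return {"food": rng.choice(["thai", "chinese"]), "area": "north"}

def extract_features(system_act, goal):
    # features relating the system's act to the user goal, e.g.
    # "the system could not provide what I asked for"
    return {"venue_unavailable": system_act.get("act") == "no_match",
            "goal": goal}

def utterance_decoder(history):
    # stand-in for the sequence-to-sequence decoder over the feature history
    if history[-1]["venue_unavailable"]:
        return "how about chinese food instead"
    return "i want " + history[-1]["goal"]["food"] + " food"

goal = goal_generator(seed=0)
history = [extract_features({"act": "no_match"}, goal)]
print(utterance_decoder(history))   # the user reacts to the failed request
```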
We trained the simulator on the DSTC2 dataset, because there real users talked to real systems, so we can model how real users actually behave. And we evaluated the simulated user in a slightly unorthodox way: we were not only interested in how well the user simulator generates sentences, but also in how much it can help us in training a dialogue system. So for each user simulator we built an environment in which we trained policies. One set of policies was trained with the neural user simulator, which is completely statistical, and another set with the agenda-based user simulator, which is based on rules. And for each user simulator, we took the best-performing policy and evaluated it on the other user simulator — this will become clearer on the next slide.
We then also deployed these policies to interact with real users on Amazon Mechanical Turk. So: each user simulator is used for policy training — one environment uses the neural user simulator, the other the agenda-based one — and we measure how well the best policy trained on the neural simulator performs on the agenda-based simulator, and similarly the other way around. What the results show is that if you train your policy on the agenda-based user simulator, even if it performs really well on the agenda-based simulator itself, it is not going to perform particularly well on real users. So a purely rule-based approach to building a user simulator is not ideal. If you evaluate with real users, the best-performing policy trained on the neural user simulator beats the best-performing policy trained on the agenda-based one. This suggests that end-to-end learning is promising for modeling users, though we see that combining it with the agenda-based simulator may still be needed if we want the best performance.
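The cross-evaluation protocol can be written down as a small matrix. The numbers below are invented for illustration; only the shape of the comparison — train on one simulator, evaluate the best policy on the other — comes from the talk.

```python
# Cross-evaluation of policies across simulated-user environments.
# (trained_on, evaluated_on) -> success rate; values are INVENTED.
success = {
    ("neural", "neural"): 0.95, ("neural", "agenda"): 0.90,
    ("agenda", "agenda"): 0.96, ("agenda", "neural"): 0.70,
}

def cross_transfer(trained_on, evaluated_on):
    return success[(trained_on, evaluated_on)]

# A policy that shines on its own simulator may transfer badly:
print(cross_transfer("agenda", "neural") < cross_transfer("neural", "agenda"))
```

The off-diagonal entries are the interesting ones: they approximate how a policy will behave once the training environment no longer matches the user.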
OK, so in my last five minutes I would like to talk about something that is probably close to all of us in this community: how do we effectively evaluate dialogue models, how do we compare to each other, and how can we reproduce each other's results? Until recently, only a handful of groups around the world had access to working statistical dialogue systems, and this is something we wanted to change in Cambridge, because we want to kick-start research and also allow people to compare to each other easily. So we made our toolkit for building statistical dialogue systems, PyDial, open source. It provides domains, simulated environments, and algorithms you can compare against: if you want to test a new policy, you can easily compare it to the current state of the art. We also collected the large corpus that I have just described, called MultiWOZ, and we are making it open access. This work was funded by a faculty research award.
So, just a few words about the toolkit. It contains implementations of statistical approaches to dialogue systems, and it is modular, so you can easily exchange your own module for the functionality currently available in the toolkit. It can easily be extended to other, much larger domains, and if you have an architecture of your own, you can plug the toolkit's components into your dialogue system. It offers multi-domain conversational functionality, and you are welcome to subscribe to our mailing list. This took a lot of hard work, not just from the current members but also from the previous members of the dialogue systems group, and it is constantly expanding.
In terms of benchmarking, we want a way of comparing algorithms fairly. So in the toolkit we define different domains, different user settings, and also different noise levels in the user input, to simulate the noise coming from the speech recognizer, and a number of state-of-the-art policy optimization algorithms, including the ACER algorithm I just described, are benchmarked across these settings. This effort was led by my colleagues and was presented at a symposium last year.
This brings me to the end of my talk. The summary is: machine learning allows us to address many of the problems we are facing in building natural conversation. We have seen that it allows us to share concepts in belief tracking, so that we can build larger, operational systems. In the same vein, it allows us to build policy optimization modules that choose between a wide variety of actions. And it allows us to build more realistic models of users, so that we can train more accurate policies. But there is a lot more to be done to actually achieve the goal of natural conversation; this is just the tip of the iceberg. Some of the issues are: how do we model structure, and how do we relate to a knowledge base; if we want systems for very long conversations, we need more accurate and more sophisticated reinforcement learning models; and finally, we need to achieve sentiment awareness and have a more nuanced reward function to take into account when we are building dialogue systems. I believe this can bring us closer to the long-term vision, which is to have natural conversation with goal-directed dialogue systems. Thank you very much.
so would be compared with the so there is a statistical version of the agenda-based
imitate their
but you
realise on hands on having the structure
or all the conversation in this free that you first
asks some so some parts of it are hand-coded and then it has
pockets which are trained so this is done on it
the overall problem solving the overall problem of natural conversation would not be applicable because
we still have
that structure which is fixed so we have compared to that but actually this neural
stimulator was trained on very small amount of data so
i don't know if i have exact numbers dstc two is only i think one
thousand dialogues
so that's it's not a lot
and because of that, the number of parameters was kept really small. so for instance, if I go back: we don't actually have here what the user says to the system; what we have here is the semantic form of the user's sentences, so this feature extractor is in fact very easy to build. otherwise you would need a CNN or something more sophisticated here, and that would expand the number of parameters. also, how large we take these vectors to be implies how many parameters you have in the network. so in this model everything is kept very small, just to account for the fact that you have a very small amount of data
so, it depends what you mean: whether you want to start from scratch or whether you want to reuse some of the models. if you want to start from scratch, then basically everything, everything is domain-independent in that sense. so in particular belief tracking: the belief tracker takes as input the utterance and the ontology. the ontology is just an additional input to the belief tracker, and you embed the word vectors of whatever you have in your ontology to begin with
so traditionally you would look for a particular word and check whether it appears in the user's sentence; here we take the word vector of that word and compare its similarity with our feature extraction, and we have three generic feature extractors: for the domain, the slot and the value. so, in principle, you could transfer this as it is to a different domain, right
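as a rough sketch of that idea (the toy vectors and the helper names here are made up for illustration; the actual tracker uses pretrained embeddings and learned feature extractors):

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# toy word vectors; a real tracker would use pretrained embeddings
vectors = {
    "cheap":       [0.9, 0.1, 0.0],
    "inexpensive": [0.8, 0.2, 0.1],
    "north":       [0.0, 0.9, 0.3],
}

def value_score(ontology_value, utterance_words, vectors):
    # instead of exact keyword matching, score the ontology value by its
    # best cosine similarity to any word vector in the user utterance
    scores = [cosine(vectors[ontology_value], vectors[w])
              for w in utterance_words if w in vectors]
    return max(scores, default=0.0)

# "inexpensive" never matches the ontology value "cheap" as a string,
# but its vector is close, so the tracker can still detect the constraint
score = value_score("cheap", ["inexpensive"], vectors)
```

because the extractor only ever compares embeddings, swapping in a new ontology just means embedding its new slot names and values, which is what makes the approach domain-transferable.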
an open-ended domain is a more difficult problem in that sense: you would need to redefine what the system actions are in order for this to work, and then it would depend on what your knowledge base looks like, whether you embed the words, whether they are already embedded, maybe as a particular constraint
with regard to those two works: they have the advantage of not requiring labelled data at the intermediate steps, and that is a huge advantage, because if you are generating millions of calls every week, you do not have a way to annotate them. so it is certainly a direction worth investigating; but the downside is that it does not actually work that well yet, and the reason for it is that we are still not able to figure out how to train these networks so that they do not require additional supervision. so a lot of the work along that line is about having an end-to-end differentiable neural network that you can propagate gradients through, but you still need, at some stage, supervision to allow you to actually have a meaningful output
and another problem is the evaluation, of course. the research in this area has exploded, and many people who are not originally from dialogue are doing research in this area, and they take this as a translation problem: you have user input and system output. and this is really not the case; you cannot just measure the BLEU score, because that does not say anything about the quality of the dialogue, and it does not take into account the fact that you can have a long-term conversation
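a toy illustration of why word overlap is misleading for dialogue (the example responses are invented, and this computes only clipped unigram precision, the core ingredient of BLEU, not the full metric with its brevity penalty):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    # clipped unigram precision: the fraction of candidate words that
    # also appear in the reference, with repeated words clipped
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

reference = "i would recommend the peking restaurant"
# two responses a human would judge as roughly equally appropriate
close_wording = "how about the peking restaurant"
valid_rewording = "there is a nice chinese place in town"

p_close = unigram_precision(close_wording, reference)    # shares 3 of 5 words
p_reword = unigram_precision(valid_rewording, reference) # shares no words
```

both responses are perfectly reasonable system turns, yet the second scores zero overlap with the reference; a metric built on this cannot tell a good dialogue from a bad one, let alone account for long-term coherence.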
yes, right; you're right, so to say
so, people in speech recognition have looked at this problem of having to iterate over a huge number of words, for instance in language modelling, and there are some tricks, like using noise contrastive estimation, so that you do not have to do a softmax but rather have an unnormalized output. so that is one option; but for this to work we would need to have some similarity metric, some confusability, between the different elements of the ontology, so I don't know whether I have a complete answer to how to actually do that
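a minimal sketch of the noise contrastive estimation trick mentioned above (the scoring function and the uniform noise distribution are placeholders; in practice the scores come from the network and the noise distribution from, say, unigram counts):

```python
import math

def nce_loss(score_fn, target, noise_samples, noise_prob, k):
    # noise-contrastive estimation: instead of a softmax over the whole
    # vocabulary/ontology, classify the true item against k noise samples.
    # score_fn returns an *unnormalized* log-score, so no normalizing sum
    # over all items is ever computed.
    def log_sigmoid(x):
        return -math.log1p(math.exp(-x))
    # the true item should be classified as coming from the data
    loss = -log_sigmoid(score_fn(target) - math.log(k * noise_prob(target)))
    # the noise samples should be classified as coming from the noise
    for w in noise_samples:
        loss -= log_sigmoid(-(score_fn(w) - math.log(k * noise_prob(w))))
    return loss

# placeholder scorers over a toy 10-item ontology with uniform noise
good_scorer = lambda w: 5.0 if w == "cheap" else -5.0   # ranks target high
bad_scorer = lambda w: -5.0 if w == "cheap" else 5.0    # ranks target low
uniform = lambda w: 0.1
noise = ["north", "south", "east", "west", "dontcare"]

low_loss = nce_loss(good_scorer, "cheap", noise, uniform, k=5)
high_loss = nce_loss(bad_scorer, "cheap", noise, uniform, k=5)
```

the point is that the loss only ever touches the target and k sampled items, so the cost per update no longer grows with the size of the ontology.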
potentially, you could take the ontology as a whole and then embed it, because all you actually want is a good latent space representation, so that you can almost sample from it and then decode what a particular word would be; but that is very difficult to do. so I think it is an interesting problem, and a really difficult one, and we have not actually addressed it in this work
so that mapping does not tell you which actions are good and bad; it is used in the sense that you know the ontology consists of slots, so a summary action is something to do with slots, and you know how many slots you will talk about, but you let the system learn how to do the rest
so you know especially if you don't have enough training data you can always equal
rate at the system but once we were interested here was mostly to see whether
we can be you because
if you look at the reinforcement learning tricks which are really a use a reinforcement
learning for problems which can be da five simulated and it's often discrete space it's
it's the setting of joystick what to what the action space a to choose between
a very small number of actions
and if you want to apply the time period sticks without seriously you will inevitably
have to learn a larger
state action spaces this is really what we were interested in here but obviously you
always equal rate
which you just
described
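to make the summary-action idea concrete, here is a minimal sketch of a hand-coded summary-to-master mapping (the slot names, the two summary actions, and the selection heuristics are all invented for illustration):

```python
# the policy chooses among a handful of abstract summary actions, and a
# fixed mapping expands each into a concrete "master" action using the
# belief state (a distribution over values for each slot)
belief = {
    "food":       {"chinese": 0.7, "italian": 0.3},
    "pricerange": {"cheap": 0.5, "expensive": 0.5},
}

SUMMARY_ACTIONS = ["request_most_uncertain", "confirm_top_value"]

def to_master_action(summary_action, belief):
    if summary_action == "request_most_uncertain":
        # ask about the slot whose top hypothesis is least certain
        slot = min(belief, key=lambda s: max(belief[s].values()))
        return ("request", slot, None)
    if summary_action == "confirm_top_value":
        # confirm the single most confident slot-value pair
        slot = max(belief, key=lambda s: max(belief[s].values()))
        value = max(belief[slot], key=belief[slot].get)
        return ("confirm", slot, value)
    raise ValueError(summary_action)
```

the policy only ever learns over the small summary space, while the heuristic mapping handles the combinatorics of slots and values; learning directly over the full master-action space is exactly the harder setting discussed above.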