0:00:15 | Hi. This is joint work with my PhD advisers, |
0:00:22 | and I want to talk about user adaptation |
0:00:25 | in dialogue systems. |
0:00:28 | Most of the state-of-the-art dialogue systems, and most of the production dialogue systems, |
0:00:36 | adopt a generic strategy: we have the same behaviour for every user. |
0:00:47 | What we want to do instead is to learn one strategy for each of these users. |
0:00:55 | The problem with learning a strategy from scratch is that one has to do some exploration, |
0:01:04 | and exploration leads to very bad performance in the first interactions. |
0:01:13 | So we want to design a framework which is very good during the cold-start phase, |
0:01:24 | and which must also be good during the, let's say, asymptotic phase. |
0:01:31 | So we propose a process for user adaptation, which is composed of several phases, |
0:01:41 | and it goes this way. |
0:01:44 | Let's say we have a bunch of robots, each representing a dialogue system. |
0:01:49 | Each of these robots has learned a strategy versus a specific user, |
0:01:57 | and it also keeps all the dialogues done with this user, |
0:02:04 | so all the knowledge of a robot is represented by its dialogues. |
0:02:11 | We want to elect some representatives of the whole database; |
0:02:18 | for example, here we elect the blue one and the red one. |
0:02:22 | Now we have a target user, and we don't have a system |
0:02:27 | to dialogue with this target user, so we would have to design a system from scratch. |
0:02:33 | What we are going to do instead is to transfer the knowledge of one of the representatives to this system. |
0:02:39 | So first we want to select the best representative to dialogue with our target user: |
0:02:47 | we try each representative one by one, |
0:02:52 | and at the end we select the best dialogue system, which is the blue one here. |
0:02:58 | Now we transfer all its knowledge to the new system. |
0:03:03 | So let's say we have a scratch system, and we are going to learn its strategy thanks to the knowledge transfer, |
0:03:15 | and also to all the dialogues done during the source selection phase. |
0:03:19 | Then we are going to use this new dialogue system to dialogue with this user, and we collect more dialogues. |
0:03:28 | Then we can learn a new system, more and more specialised to this target user, |
0:03:34 | and we repeat this process until we reach a system which is very specialised to the target user. |
0:03:46 | So in the end, we can add this new target system to the set of sources. |
0:03:53 | I will now detail each of these phases. |
0:03:56 | The sources are dialogue managers. The dialogue manager is a component of a dialogue system, |
0:04:04 | and this manager takes as input a representation of the user utterance, |
0:04:09 | for example "I would like to book a flight to London", |
0:04:13 | and the dialogue manager returns an action in response, for example asking for more information or confirming the request. |
0:04:21 | The usual way to design a dialogue manager is to cast the task as a reinforcement learning problem. |
0:04:31 | So first, let me define the reinforcement learning problem: we have an agent in interaction with an environment. |
0:04:40 | For example, our agent is the dialogue manager, and the environment will be the target user. |
0:04:48 | The agent can take an action, and the environment will react, and we can observe this reaction: |
0:05:01 | o' is an observation, and we also observe the reward we obtain, r'. |
0:05:09 | Given this observation, and also the action taken, the agent can update its internal state, |
0:05:19 | so here we go from the state s to the state s'. |
0:05:24 | So we can see that all the knowledge of the environment is contained |
0:05:31 | in the tuple (s, a, r', s'). |
0:05:39 | This is the batch reinforcement learning setting: we have knowledge of the environment |
0:05:47 | taking the form of samples, |
0:05:49 | and we want to design a good strategy for the dialogue manager. |
0:05:56 | The usual term for this is a policy: a function mapping states to actions, |
0:06:04 | and we want to find the optimal policy, |
0:06:08 | that is, the policy which maximises the cumulative reward |
0:06:12 | during the interaction between the dialogue manager and the target user. |
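To make the reinforcement-learning framing above concrete, here is a minimal Python sketch (not the authors' code): a transition tuple (s, a, r', s'), a policy as a mapping from states to actions, and the discounted cumulative reward that the optimal policy maximises. The state features, action names, and discount factor are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the batch-RL view of the dialogue manager.
# A transition stores everything the agent observes about the environment: (s, a, r', s').
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[float, ...]          # e.g. (asr_score, turn_count) -- illustrative features
Action = str                       # e.g. "propose", "repeat", "accept", "end"

@dataclass
class Transition:
    s: State       # state before the system acts
    a: Action      # action taken by the dialogue manager
    r: float       # reward observed after the user's reaction
    s_next: State  # updated internal state

Policy = Callable[[State], Action]  # a policy maps states to actions

def discounted_return(rewards: List[float], gamma: float = 0.95) -> float:
    """Cumulative discounted reward of one dialogue; the optimal policy maximises its expectation."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

The batch mentioned above is then simply a collection of such transitions gathered from past dialogues.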
0:06:19 | So note that there is an equivalence between the dialogue manager I talked about, the robots, and a policy. |
0:06:28 | We want to find the best policies to represent the whole database: |
0:06:36 | this is the representative selection phase. |
0:06:39 | Here we introduce the main contribution of the paper: |
0:06:43 | the policy-driven distance. |
0:06:47 | This is a metric which computes the behavioural differences between policies. |
0:06:54 | We sample some states and we look at which action is taken in each of these states. |
0:07:03 | For example, one can see that the third policy is very close to the purple one, |
0:07:10 | and the yellow one is very different from the two others. |
0:07:15 | One can view this policy-driven representation as a binary vector, |
0:07:22 | where the ones represent the action taken in a given state. |
0:07:29 | So for example, a policy will take these actions, and its binary vector will look like this. |
0:07:37 | If we concatenate all these binary vectors together, |
0:07:43 | we obtain a unique vector for each policy, on which we can compute a standard distance. |
0:07:49 | This allows us to use a clustering algorithm called K-means. |
0:07:56 | K-means will group all of our dialogue managers into clusters, |
0:08:04 | and since we want to represent them all, |
0:08:07 | we will have to learn one policy per cluster: |
0:08:12 | we gather the knowledge of each cluster and we learn a policy with that. |
0:08:18 | But we can also use another algorithm called K-medoids, |
0:08:22 | and in that case, thanks to the policy-driven distance, |
0:08:26 | we fetch the representatives directly. |
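As a rough illustration of the policy-driven distance and the election of representatives described above, here is a hedged sketch. The inputs `policies`, `sampled_states`, and `actions` are hypothetical; the Euclidean norm over concatenated one-hot action vectors and the choice of the cluster member closest to each K-means centroid (a K-medoids-style shortcut) are assumptions rather than the paper's exact procedure.

```python
# Illustrative sketch of a behaviour-based distance between policies and of representative election.
import numpy as np
from sklearn.cluster import KMeans

def behaviour_vector(policy, sampled_states, actions):
    """Concatenate one-hot encodings of the action the policy picks in each sampled state."""
    chunks = []
    for s in sampled_states:
        one_hot = np.zeros(len(actions))
        one_hot[actions.index(policy(s))] = 1.0
        chunks.append(one_hot)
    return np.concatenate(chunks)

def pd_distance(pi1, pi2, sampled_states, actions):
    """Behavioural difference between two policies: distance between their binary vectors."""
    v1 = behaviour_vector(pi1, sampled_states, actions)
    v2 = behaviour_vector(pi2, sampled_states, actions)
    return np.linalg.norm(v1 - v2)

def elect_representatives(policies, sampled_states, actions, n_clusters=2, seed=0):
    """Cluster policies on their behaviour vectors; return one representative index per cluster."""
    X = np.stack([behaviour_vector(pi, sampled_states, actions) for pi in policies])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))   # policy closest to the centroid
    return reps
```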
0:08:31 | Okay, so now we want to select the best policy to dialogue with the target user: |
0:08:39 | this is the source selection phase. |
0:08:41 | For that we can use a bandit algorithm called UCB1. |
0:08:45 | Usually one will first test each of the representatives one by one, |
0:08:51 | so we dialogue with each of them once and record its score. |
0:08:58 | Then the next system that the user will dialogue with |
0:09:05 | is the system which maximises the UCB value, so |
0:09:09 | now we will dialogue with the blue one, |
0:09:12 | and it is still the best, |
0:09:15 | so we keep dialoguing with the blue one until it gets a very bad score. |
0:09:20 | At this point, the red system has the better UCB value, so we switch robots, |
0:09:27 | and we repeat this process until we reach a maximum time limit, |
0:09:31 | for example one hundred time steps. |
0:09:36 | In the end, we select the system maximising the mean score. |
0:09:42 | The point of using UCB1 is that its exploration bonus takes |
0:09:46 | into account the high variability of the dialogues. |
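The following is a small sketch of this source-selection loop with the standard UCB1 index. The callback `dialogue_with` is hypothetical: it is assumed to run one dialogue between a candidate system and the target user and return its score; the budget of one hundred dialogues follows the example in the talk.

```python
# Hedged sketch of UCB1 for source selection (standard bandit formula; details may differ from the paper).
import math
import random

def ucb1_select(representatives, dialogue_with, budget=100):
    """Pick which representative system dialogues next with the target user; return the best one."""
    n = [0] * len(representatives)        # number of dialogues per system
    mean = [0.0] * len(representatives)   # running mean score per system
    for t in range(1, budget + 1):
        if t <= len(representatives):
            arm = t - 1                   # first, try each representative once
        else:
            arm = max(range(len(representatives)),
                      key=lambda i: mean[i] + math.sqrt(2 * math.log(t) / n[i]))
        score = dialogue_with(representatives[arm])   # one dialogue with the target user
        n[arm] += 1
        mean[arm] += (score - mean[arm]) / n[arm]
    return max(range(len(representatives)), key=lambda i: mean[i])

# Illustrative usage with a fake scorer (in practice the score comes from a real dialogue):
best = ucb1_select(["blue", "red"], lambda sys: random.gauss(1.0 if sys == "blue" else 0.5, 0.3))
```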
0:09:53 | Okay, so now we transfer the knowledge of this source to the new system: |
0:09:59 | this is the knowledge transfer phase. |
0:10:01 | Let's say we have two batches of samples, the source batch and the target batch, |
0:10:07 | and we want to remove every sample from the source batch |
0:10:11 | that is already represented in the target batch. |
0:10:14 | For that we use a filtering algorithm. |
0:10:20 | It will consider each sample of the source batch. |
0:10:24 | Let's say we start with this one: |
0:10:26 | it will look at the target samples with the same action, |
0:10:30 | so these two, |
0:10:32 | and since this source state is very different from the states of these two samples, |
0:10:38 | we can add the source sample to the final batch. |
0:10:43 | Now we consider the next sample, |
0:10:46 | and we can see that the light red state is very close to the red state, |
0:10:52 | so we do not add this sample to the batch. |
0:10:55 | We continue this for each sample of the source batch, |
0:11:01 | and in the end we have an augmented target batch, |
0:11:05 | and we will use it for learning a new policy. |
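A minimal sketch of this filtering step; the Euclidean distance between states and the threshold value are assumptions, not necessarily the paper's exact criterion.

```python
# Sketch of the sample-filtering step described above (distance and threshold are assumptions).
import numpy as np

def filter_source_batch(source_batch, target_batch, threshold=0.5):
    """Each sample is an (s, a, r, s_next) tuple. Keep a source sample only if no target
    sample with the same action has a state closer than the threshold."""
    kept = []
    for s, a, r, s_next in source_batch:
        same_action = [t for t in target_batch if t[1] == a]
        close = any(np.linalg.norm(np.array(s) - np.array(t[0])) < threshold for t in same_action)
        if not close:                              # the source sample brings new information
            kept.append((s, a, r, s_next))
    return kept + list(target_batch)               # augmented batch used to learn the new policy
```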
0:11:11 | The learning itself is done thanks to FQI. |
0:11:17 | Fitted-Q Iteration is a reinforcement learning algorithm which takes as input a bunch of samples, |
0:11:25 | and it computes the optimal policy for these samples. |
0:11:31 | Fitted-Q Iteration is an algorithm coming from fitted value iteration, and this family of algorithms comes from value iteration, |
0:11:42 | and value iteration is a very famous algorithm to solve Markov decision processes. |
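For reference, here is a compact sketch of standard Fitted-Q Iteration over such a batch, with an extra-trees regressor as function approximator. The regressor choice, the discount factor, the number of iterations, and the omission of terminal-state handling are simplifying assumptions; the paper's exact variant may differ.

```python
# Minimal Fitted-Q Iteration sketch (standard algorithm; not the paper's exact implementation).
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fqi(batch, actions, n_iterations=20, gamma=0.95):
    """batch: list of (s, a, r, s_next) tuples with s a feature vector and a an action label.
    Returns a greedy policy. (Terminal-state handling omitted for brevity.)"""
    X = np.array([list(s) + [actions.index(a)] for s, a, r, s_next in batch])
    rewards = np.array([r for s, a, r, s_next in batch])
    next_states = np.array([list(s_next) for s, a, r, s_next in batch])
    q = None
    for _ in range(n_iterations):
        if q is None:
            targets = rewards                      # first iteration: Q is the immediate reward
        else:
            next_q = np.column_stack([
                q.predict(np.column_stack([next_states, np.full(len(batch), a_idx)]))
                for a_idx in range(len(actions))])
            targets = rewards + gamma * next_q.max(axis=1)   # Bellman backup on the batch
        q = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, targets)

    def policy(state):
        scores = [q.predict(np.array([list(state) + [a_idx]]))[0] for a_idx in range(len(actions))]
        return actions[int(np.argmax(scores))]     # greedy action w.r.t. the learned Q
    return policy
```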
0:11:51 | So if we combine the filtering and the learning, |
0:11:54 | one can see that we learn a system |
0:11:59 | which is a mix between the selected source and the real user. |
0:12:04 | We are going to use this new system |
0:12:09 | to dialogue, now, with the target user. |
0:12:13 | So we add the new dialogues to the target batch, |
0:12:16 | and you can see that some samples of the source batch are very similar |
0:12:20 | to the samples of this new target batch, |
0:12:23 | so in the end, only the samples from the target batch remain. |
0:12:30 | So when we run the learning again on the new batch, |
0:12:34 | we learn a system which is very specialised to this target user. |
0:12:41 | So this is the overall adaptation process for users. |
0:12:48 | Now we want to test our framework in some experiments, |
0:12:54 | so we are going to use the negotiation dialogue game. |
0:12:57 | We focused on negotiation because |
0:13:01 | we have two actors having different behaviours, |
0:13:07 | and we want to adapt to this. |
0:13:09 | In the negotiation dialogue game there are two players, |
0:13:12 | and they are given some time slots |
0:13:17 | and preferences for each time slot. |
0:13:20 | At each round, each agent will propose a slot. |
0:13:28 | For example, this agent proposes its best slot, |
0:13:32 | and the other robot refuses and proposes its own slot. |
0:13:37 | Since the negotiation game is an abstraction of a real dialogue, we introduce noise |
0:13:45 | in the communication channel, |
0:13:47 | in the form of switching some time slots: for example, we replace the proposed time slot with the yellow one, |
0:13:56 | and the agent will receive this noisy information |
0:14:01 | in the form of an automatic speech recognition score. |
0:14:06 | Given this information, it can continue the dialogue, |
0:14:10 | it can ask the other agent to repeat the proposition, |
0:14:14 | or it can end the dialogue. |
0:14:16 | So for example, it asks to repeat, |
0:14:21 | and the other robot repeats, |
0:14:24 | and at some point the agent can accept the proposition, |
0:14:29 | or it can also refuse and end the dialogue. |
0:14:34 | At the end of the dialogue, the users are rewarded: |
0:14:39 | we have a score, |
0:14:41 | and this score is a function |
0:14:46 | of the value of the time slot that was agreed upon. |
0:14:50 | I forgot to say that the point of the game |
0:14:53 | is to find an agreement between the two players. |
0:14:58 | So here, the agent will agree with the last proposition, |
0:15:04 | since its cost is smaller. |
0:15:07 | So now we want to test on this game. |
0:15:10 | We need users interacting with the system, so |
0:15:15 | we designed simulated users |
0:15:17 | with very different profiles. |
0:15:21 | For example, we have the deterministic user, |
0:15:26 | which will propose its slots in decreasing order of preference; |
0:15:30 | we also have this one, taking random actions; |
0:15:37 | this one always proposes its best slot; |
0:15:42 | this one accepts as soon as possible; and finally, |
0:15:46 | this one ends the dialogue as soon as possible. So these are very different |
0:15:52 | behaviours, and we want to adapt to these behaviours. |
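As an illustration of how such handcrafted profiles can be encoded, here is a hedged sketch; the class names, the `act` interface, and the action labels are assumptions and not the simulator actually used in the experiments.

```python
# Illustrative sketch of simulated user profiles (names and API are assumptions, not the paper's code).
import random

class DeterministicUser:
    """Proposes its own slots in decreasing order of preference."""
    def __init__(self, preferred_slots):
        self.queue = list(preferred_slots)        # already sorted, best first
    def act(self, system_proposal):
        return ("propose", self.queue.pop(0)) if self.queue else ("accept", system_proposal)

class RandomUser:
    """Takes random actions among the available ones."""
    def __init__(self, slots):
        self.slots = slots
    def act(self, system_proposal):
        choice = random.choice(["propose", "repeat", "accept", "end"])
        return (choice, random.choice(self.slots)) if choice == "propose" else (choice, system_proposal)

class StubbornUser:
    """Always proposes its best slot."""
    def __init__(self, best_slot):
        self.best_slot = best_slot
    def act(self, system_proposal):
        return ("propose", self.best_slot)

class CompliantUser:
    """Accepts the system's proposal as soon as possible."""
    def act(self, system_proposal):
        return ("accept", system_proposal)

class QuitterUser:
    """Ends the dialogue as soon as possible."""
    def act(self, system_proposal):
        return ("end", None)
```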
0:15:55 | We also designed human model users. |
0:15:59 | Each human model is a model of a human, |
0:16:01 | built from the dialogues recorded with that human, for four humans, |
0:16:13 | and we modelled these behaviours |
0:16:17 | with a k-nearest-neighbour algorithm. |
0:16:23 | You can see in the table |
0:16:25 | the distribution of actions for each of the real humans. |
0:16:31 | You can notice that two of them are very similar, |
0:16:36 | and the two others behave quite differently. |
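Here is a minimal sketch of a k-nearest-neighbour user model of the kind described above; the feature encoding of the dialogue context, the value of k, and the function names are assumptions.

```python
# Hedged sketch of a k-nearest-neighbour user model fitted on one human's recorded dialogues.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_human_model(recorded_turns, k=5):
    """recorded_turns: list of (dialogue_context_features, action_taken) pairs from one human."""
    X = np.array([features for features, _ in recorded_turns])
    y = np.array([action for _, action in recorded_turns])
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    def simulated_human(context_features):
        return model.predict(np.array([context_features]))[0]   # act like the nearest recorded turns
    return simulated_human
```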
0:16:41 | Now we want to design the system |
0:16:43 | which will dialogue directly with these users. |
0:16:48 | It has the same set of actions as the users, to simplify the design, |
0:16:54 | although its action set is slightly restricted. |
0:16:58 | We learn this system, as we saw previously, with FQI, |
0:17:04 | and moreover it is epsilon-greedy, so that it does some exploration. |
0:17:10 | The internal state of the dialogue system, the dialogue manager, |
0:17:17 | is actually a combination of the automatic speech recognition score |
0:17:23 | and also the number of turns elapsed during the dialogue. |
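To make these last two points concrete, here is a tiny sketch of the dialogue manager's internal state features and of epsilon-greedy exploration on top of a learned policy; the exact features and the value of epsilon are assumptions.

```python
# Sketch of the state features and epsilon-greedy exploration described above (values are assumptions).
import random

def dm_state(asr_score: float, n_turns: int):
    """Internal state: the ASR confidence of the last user proposal and the number of turns so far."""
    return (asr_score, float(n_turns))

def epsilon_greedy(policy, actions, epsilon=0.1):
    """Follow the learned policy most of the time, but explore a random action with probability epsilon."""
    def act(state):
        return random.choice(actions) if random.random() < epsilon else policy(state)
    return act
```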
0:17:29 | Before testing the main framework, we want to show that learning one system per user is a good thing. |
0:17:40 | Here we have a bunch of systems, vs-u1, vs-u2, vs-u3 |
0:17:45 | and so on, and each of these systems learned its strategy |
0:17:49 | with one of the users; so for example vs-u1 learned |
0:17:53 | its strategy against the user u1. |
0:17:56 | You can notice that the bold values |
0:18:00 | actually indicate that |
0:18:02 | the best system to dialogue with a given user is |
0:18:07 | the system which learned its strategy with this user. |
0:18:10 | So there is a real need for adaptation. |
0:18:16 | We can see the same thing with the human model users, |
0:18:19 | but the differences are smaller. |
0:18:23 | Actually, for some users, for example the user Alex, |
0:18:33 | the two values, 1.74 and 1.73, |
0:18:38 | are very close, and you can say the same thing for the line below. |
0:18:45 | So now we can test the main framework for adaptation. |
0:18:50 | For that we introduce two new methods for comparison. |
0:18:55 | One is called scratch: it just learns |
0:19:01 | the system from scratch, without transferring any knowledge. |
0:19:05 | The other one is called generic: this |
0:19:09 | generic method learns one policy with all the knowledge of the database. |
0:19:14 | We generate two source-system databases, one for the simulated users and one for |
0:19:20 | the human model users. |
0:19:22 | Each source system is learned |
0:19:25 | with one thousand two hundred dialogues, |
0:19:29 | and each method is then evaluated |
0:19:33 | with two hundred dialogues. |
0:19:37 | For simulated users, |
0:19:40 | our adaptation methods show significantly better results than generic |
0:19:45 | and scratch for the two metrics, |
0:19:48 | the score and the task completion. |
0:19:51 | But on the other hand, for the human model results, |
0:19:54 | our methods are better, |
0:19:56 | but not by much, and |
0:19:59 | the reason for that is that the negotiation dialogue game is too simple for humans: |
0:20:04 | actually, most of the humans have the same behaviour on this game, |
0:20:10 | so there is no point in learning an adapted strategy, |
0:20:15 | since all the people have the same behaviour. |
0:20:21 | So, to conclude: we provide a framework for user adaptation, |
0:20:26 | and we introduce the policy-driven distance, which is a way to |
0:20:32 | compute the behavioural differences between policies. |
0:20:35 | We validate the framework on both |
0:20:40 | the simulated user and the human model user setups, |
0:20:43 | and finally we show that the overall |
0:20:47 | dialogue quality is enhanced, |
0:20:50 | based on two metrics, the task completion and the score. |
0:20:55 | So, thank you. |
0:21:23 | I wasn't sure what you scored for your cross comparison. |
0:21:28 | Could we look at this slide, |
0:21:33 | the table: what are these numbers, and what is good? |
0:21:39 | Well, |
0:21:42 | each cell represents the score |
0:21:44 | of a system, given the user it dialogues with. |
0:21:48 | So these are the systems, |
0:21:50 | and they dialogue with each user. |
0:21:54 | So for example, this system has a score of 0.44 |
0:22:01 | with this user. |
0:22:03 | What is that score? |
0:22:05 | The score is |
0:22:07 | the mean reward of the dialogues: |
0:22:10 | at the end of the dialogue there is a reward, and |
0:22:13 | we average it |
0:22:15 | over the dialogues. |
0:22:22 | And is the maximum value the maximum score? |
0:22:24 | Yes, actually it's... |
0:22:28 | sorry, the higher the better. |
0:22:48 | Okay. |
0:22:49 | The question is: could you give |
0:22:51 | more details about the reinforcement learning...? |
0:23:13 | Sorry, |
0:23:15 | could you ask the question once again? |