0:00:18 | so, as we all know, turn-taking is one of the most fundamental aspects of dialogue, and it's something that dialogue systems are still struggling with |
0:00:30 | if we look at human-human dialogue, we know that humans are very good at turn-taking: they can take the turn with barely any gap and very little overlap, and at the same time people can make pauses within their speech without the other person interrupting them |
0:00:50 | this is accomplished by a number of turn-taking cues, as many researchers have established |
0:00:59 | syntax-wise, you typically yield the turn when you are syntactically complete |
0:01:05 | if we look at prosody, pitch is normally rising or falling when you're yielding the turn; the intensity might be lower, the phoneme duration might be shorter, and you might breathe out |
0:01:19 | with gaze, you look at the other speaker to yield the turn, and gestures might also be used |
0:01:26 | we also know that the more cues we combine, the stronger the signal is |
0:01:35 | and of course, for dialogue systems to properly handle turn-taking, this is something they have to take into account |
0:01:43 | in dialogue systems there are a number of decisions that have to be made that are related to turn-taking |
0:01:48 | maybe the most common one that has been addressed is: given that the user stops speaking, should the system take the turn? |
0:01:58 | of course, it would be nice if the system could predict as early as possible whether the user is yielding the turn, so that the system can start preparing a response |
0:02:08 | another decision is: given that the user has just started to speak, is this just the beginning of a brief backchannel, or an attempt to take the turn? that affects what the system should do |
0:02:19 | also, if the system is going to produce an utterance and wants to make a pause, it would be good to know how likely it is that the user will try to take the turn, depending on the cues that the system produces |
0:02:34 | so, before, these different questions have been addressed with different models, basically |
0:02:42 | the problem, of course, is also that turn-taking is highly context-dependent, and the dialogue context, with all these different factors, is very hard to model |
0:02:54 | what I would like to have, at least, is a model that is more general: one model that can apply to many different turn-taking decisions |
0:03:05 | it should be continuous, so you can apply it continuously, not just at specific events |
0:03:12 | it should also be predictive, so you don't just classify the current state but are able to predict what will happen in the future, so that the system can start preparing |
0:03:21 | and it should also be probabilistic, not just make binary decisions |
0:03:26 | so what I propose is that we use a recurrent neural network for this |
0:03:29 | the model that I have been working on works like this: we have two speech channels from two speakers |
0:03:41 | these can be two humans, if we are predicting between two humans, but it could also be a human and the system's speech |
0:03:48 | we segment the speech into slices which are fifty milliseconds long, so twenty frames per second |
0:03:54 | we do feature extraction and feed it into a recurrent neural network, using LSTM units to be able to capture long temporal dependencies |
0:04:02 | at each frame, we make a prediction for the next three seconds: what is the likelihood that this speaker, speaker zero here, is speaking in this future time window |
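A rough sketch of the frame-level model just described, written in PyTorch purely for illustration (the implementation mentioned later in the talk used a different toolkit; the feature dimension and layer size here are assumptions):

```python
import torch
import torch.nn as nn

class TurnTakingLSTM(nn.Module):
    """At each 50 ms frame, predict the probability that speaker 0 is
    speaking in each frame of a 3-second future window (60 frames at
    20 frames per second)."""

    def __init__(self, n_features: int, hidden_size: int = 64, horizon: int = 60):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, horizon)  # one logit per future frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_features), features from both speakers
        h, _ = self.lstm(frames)
        return torch.sigmoid(self.out(h))  # (batch, time, horizon) probabilities

# trained with binary cross-entropy against what actually happens next
loss_fn = nn.BCELoss()
```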
0:04:24 | so the network sees both speakers, but we make the prediction for one speaker here, and then we train it with what is actually happening in the future: those are the training labels |
0:04:37 | when we do this, we of course want to be able to model both speakers, so if we have speakers A and B, we first train the whole thing with A as speaker zero and B as speaker one, and then we switch them around, so A is speaker one; in these experiments we trained it from both perspectives |
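In practice, training from both perspectives amounts to presenting each dialogue twice with the speaker channels swapped; a hedged sketch (function and variable names are mine):

```python
import torch

def future_targets(vad: torch.Tensor, horizon: int = 60) -> torch.Tensor:
    """Labels: for each frame t, a binary vector saying whether the
    predicted speaker is actually speaking at frames t+1 .. t+horizon."""
    T = vad.shape[0]
    return torch.stack([vad[t + 1 : t + 1 + horizon] for t in range(T - horizon)])

def both_perspectives(feats_a, feats_b, vad_a, vad_b, horizon=60):
    # each dialogue twice: once with A in the predicted-speaker slot, once with B
    ya, yb = future_targets(vad_a, horizon), future_targets(vad_b, horizon)
    T = ya.shape[0]  # drop trailing frames that lack a full future window
    yield torch.cat([feats_a, feats_b], dim=-1)[:T], ya
    yield torch.cat([feats_b, feats_a], dim=-1)[:T], yb
```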
0:04:57 | at application time, we run two networks at the same time, to make predictions for both speakers |
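So at application time, something like the following runs on every frame, reusing the hypothetical TurnTakingLSTM class sketched above (the tensors here are random stand-ins for real feature streams):

```python
import torch

T, F = 200, 11                  # 10 s of 50 ms frames; 11 features per speaker
feats_a = torch.randn(1, T, F)  # stand-in feature streams for speakers A and B
feats_b = torch.randn(1, T, F)

model = TurnTakingLSTM(n_features=2 * F)
probs_a = model(torch.cat([feats_a, feats_b], dim=-1))  # A's future speech
probs_b = model(torch.cat([feats_b, feats_a], dim=-1))  # B's future speech
```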
0:05:05 | the features that we have been using are: voice activity; pitch and power, normalized per speaker |
0:05:11 | we don't do any further transformation of these or anything, because we think that the network should figure these things out |
0:05:20 | we also use a measure of spectral stability, to capture phrase-final lengthening |
0:05:24 | and we use part-of-speech tags: at the end of each word, we feed in a one-hot representation of the part of speech that has just been produced |
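Under these assumptions, one 50 ms input frame for one speaker might be assembled like this (the exact feature layout and tag inventory are illustrative, not from the talk):

```python
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADP", "DET", "PRON", "OTHER"]  # illustrative

def frame_features(vad, pitch_z, power_z, spectral_stability, pos_tag=None):
    """One speaker, one 50 ms frame. pitch/power are normalized per
    speaker; pos_tag is set only on the frame where a word ends,
    otherwise the one-hot block stays all zeros."""
    pos = np.zeros(len(POS_TAGS))
    if pos_tag is not None:
        pos[POS_TAGS.index(pos_tag)] = 1.0
    return np.concatenate([[vad, pitch_z, power_z, spectral_stability], pos])
```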
0:05:36 | we compared the full model that uses all of these inputs, and also a prosody model that uses everything but the part-of-speech, to see how much the part-of-speech actually helps |
0:05:48 | we used the Deeplearning4j toolkit |
0:05:52 | we used the Map Task corpus for this, which we divided into training dialogues and test dialogues; that gives us about ten hours of training data |
0:06:04 | we used the manually labelled voice activity, which in a live system would of course have to be detected automatically, and the manually labelled parts of speech, whereas the prosody was extracted automatically |
0:06:15 | I can show you in a video what the predictions look like when we run the model continuously, online |
0:06:21 | so these are the predictions: the red line is the point where the prediction is made, where we are now, and the green is the probability, so the higher the curve, the more likely it is that this person will speak in this future time window |
0:06:37 | and of course, as the video plays, you will also see what was actually going to happen in the future |
0:06:45 | [video demonstration; audio not transcribable] |
0:07:03 | okay, so I have looked at two different tasks that we can use this model for |
0:07:09 | one very common task is to predict, given a pause, who is the most likely next speaker |
0:07:11 | this is an example where you can see that one person has just stopped speaking, and the model makes a fairly good prediction in this case |
0:07:26 | it predicts that it will take some time before this person continues, and that it's quite likely that the other person will produce a response, but not a very long one; so it makes a very good prediction |
0:07:40 | here is another prediction: that was a turn shift it was predicting; here it is predicting that the current speaker will actually continue speaking, a fairly high prediction, and that it is not very likely that the other person will produce a response |
0:07:54 | to make it easy, I made this into a binary classification task: at the pause, we take the average prediction over the future window for the two speakers, compare them, and say whether it is a shift or a hold |
0:08:11 | and then we can compute an F-score to see how well it does, and compare it with other methods for doing this |
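As a hedged sketch of that decision rule (names are mine; the inputs are each speaker's predicted speaking probabilities over the future window at the pause):

```python
import numpy as np

def hold_or_shift(p_current: np.ndarray, p_other: np.ndarray) -> str:
    """Whichever speaker has the higher average predicted speaking
    probability over the future window is taken as the next speaker."""
    return "hold" if p_current.mean() >= p_other.mean() else "shift"
```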
0:08:18 | this axis is the number of training epochs; the blue curve is the full model, the red is the prosody model |
0:08:23 | we can see that the prosody model stabilizes quite quickly, whereas the full model continues to learn |
0:08:32 | the best prediction we get for this is with the full model; you can see the numbers here |
0:08:42 | it's hard to know, of course, whether this is good or not good |
0:08:48 | it's impossible, of course, to get a hundred percent, because turn-taking is highly optional; it's not always obvious whether a speaker will continue speaking or not |
0:08:57 | of course, if we compare to the majority-class baseline, always hold the turn, the model is much better, but that's not very interesting |
0:09:02 | so we let humans listen to these dialogues, up to the point of the pause, and try to estimate who will be the next speaker, using crowdsourcing; and they didn't perform as well |
0:09:19 | we also tried more traditional modelling, where we try to model the features we have at that point as well as possible and make a one-shot decision with the best classifier; that did not perform as well either, as we can see |
0:09:36 | this is also comparable to what we find in the literature, where people have done similar tasks with more traditional modelling |
0:09:45 | we also compared what happens if we look at different pause lengths, that is, how quickly into the pause we make the decision |
0:09:52 | and we see that early into the pause, around fifty milliseconds in, we already make a fairly good prediction of who will be the next speaker, and after that it doesn't really matter what the pause length is |
0:10:03 | so the next task we looked at was prediction at speech onsets, and this is interesting |
0:10:09 | someone has just started to speak, as we can see here, and we want to know: is this likely to be a very short utterance, a backchannel, or is it likely to be a longer utterance? |
0:10:17 | if it is a long utterance, maybe the dialogue system, which has just stopped speaking, should let the other person take the turn, and otherwise continue speaking |
0:10:29 | here the model makes a fairly good prediction: you can see the curve is going down very quickly, so this is going to be a short utterance, whereas here it predicts a much longer utterance |
0:10:46 | we are at the same point into the utterance in both cases, and you can see that the predictions are quite different |
0:10:52 | to make the task binary again, we divide between short and long utterances, as we find them in the test data |
0:11:02 | in both cases we are one half second into the speech; short utterances are not allowed to be more than half a second longer, whereas long utterances have to be more than two and a half seconds |
0:11:17 | and then we average the speaking probability that is predicted over the future time window |
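The corresponding decision could be as simple as thresholding that average (the threshold value is an assumption; the talk only describes comparing the two distributions):

```python
import numpy as np

def short_or_long(p_speaking: np.ndarray, threshold: float = 0.5) -> str:
    """Half a second into a new utterance: a low average predicted
    speaking probability over the future window suggests a short
    utterance or backchannel; a high one suggests a long utterance."""
    return "long" if p_speaking.mean() >= threshold else "short"
```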
0:11:24 | this is a histogram showing, for the short utterances, what the average predicted speaking probability is, and the same for the longer utterances |
0:11:33 | you can see it gives a fairly good separation, and just using this very simple method, which could of course be made more sophisticated, we get an F-score of 0.76 |
0:11:45 | again, if we compare to the majority-class baseline or to more traditional modelling, we get better performance, also compared to similar tasks that have been done before |
0:12:02 | okay, so this looks very promising; the question of course is whether this can be used for a spoken dialogue system |
0:12:12 | so we took a corpus we had of human-robot interaction, which was already annotated, at the end of each user speech segment, with whether this was a good place to take the turn or not |
0:12:25 | we fed the network with the synthesized speech from the system and the user's speech, and we compared the predictions just like we did before |
0:12:36 | and of course, since these are very different types of dialogue, the map task dialogue and the human-computer dialogue, direct application of the prosody model didn't give a very good F-score; it's better than the baseline, but not very useful |
0:12:54 | so what do we do about that? |
0:12:57 | well, maybe at least we can use the recurrent neural network as a feature extractor, as a representation of the current turn-taking dialogue state |
0:13:06 | so we take the LSTM layers, and on top of them we train, with supervised learning, a logistic regression that predicts whether this is a good place to take the turn |
0:13:21 | and then we get fairly good results with cross-validation |
0:13:29 | it also worked well if we trained it with just twenty percent of the annotated data, so that's promising |
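A minimal sketch of that transfer step, assuming scikit-learn (the LSTM states here are random stand-ins for the real extracted vectors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))    # LSTM state at the end of each user segment
y = rng.integers(0, 2, size=200)  # 1 = annotated as a good place to take the turn

clf = LogisticRegression(max_iter=1000).fit(X, y)
p_take_turn = clf.predict_proba(X)[:, 1]  # probability of a turn-taking point
```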
0:13:40 | so, for future work: we think we need more human-robot interaction data like that |
0:13:49 | map task is highly specific, and of course it's not very similar to human-machine interaction, so we could for example train on Wizard-of-Oz data |
0:14:01 | also, the way we have used the model so far is very coarse: we just average these predictions and compare them, and that doesn't really do justice to the model, which makes much more fine-grained predictions |
0:14:16 | what's also interesting is that as you go along in these pauses, the prediction updates during the pause, so we can make continuous decisions while the pause is unfolding |
0:14:29 | and we could also make use of the probabilities, for example in a decision-theoretic framework |
0:14:36 | multimodal interaction: of course, we have data from face-to-face interaction, and we know that gaze and gesture and so on are very important, so that should be highly useful |
0:14:50 | and also multi-party interaction: the model applies very well to multi-party settings, since each speaker is modelled with its own network, so we could apply it to any number of speakers |
0:15:02 | thank you |
0:15:29 | A: [replying to an audience question] so we are trying to feed one feature for what's happening during each fifty-millisecond slice; if we have pitch, for example, we take the average pitch in that small window |
0:15:45 | A: sorry, so what we do is: as soon as a word is finished, we take a one-hot representation of its POS tag and feed it into the network at that frame |
0:16:02 | as soon as the word ends, the tag goes into the input, and then the POS slots are zeros again; so it's just for one frame that you get the value for that part of speech |
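A sketch of that injection scheme (continuing the illustrative feature layout from earlier; names are mine):

```python
import numpy as np

def pos_stream(n_frames: int, word_end_tags: dict, n_tags: int) -> np.ndarray:
    """Per-frame one-hot POS block: all zeros except at the single frame
    where a word ends. word_end_tags maps frame index -> POS tag index."""
    pos = np.zeros((n_frames, n_tags))
    for frame, tag in word_end_tags.items():
        pos[frame, tag] = 1.0
    return pos
```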
0:16:19 | Q: thanks for the talk, this is more of a clarification question: the two prediction tasks, are these separate networks that you were training, or the same network with two output layers? |
0:16:29 | A: it is the same network that is trained; it's not tied to the two roles or anything, we run two instances of the same network |
0:16:43 | Q: okay, so is it a kind of multi-task learning? I mean, you have two different prediction outputs, but is the latent representation the same? |
0:16:52 | A: no, at application time they are completely different; the two networks both get the input from both speakers, it's just that each network makes predictions for one of the speakers |
0:17:04 | Q: right, but the model itself, the parameters that you are learning, are they trained completely in isolation, or trained at the same time for the two prediction tasks? |
0:17:15 | A: no, the network is trained to predict what's happening at each frame, and then we can apply that same model to different tasks: we can see what the model predicts at a speech onset, and what it predicts at the beginning of a pause |
0:17:31 | that's why I wanted a general model: it's the same model that is applied to the different tasks |
0:17:47 | Q: thanks for a great talk; the model includes temporal information in the prediction, so I wanted to ask if you could talk a little bit about how you imagine systems could use that kind of temporal information |
0:18:05 | A: I talked about long versus short utterances; the system doesn't just get "okay, this is the right time for a short utterance", it gets more detailed predictions of what is about to come |
0:18:18 | so if it's a user utterance: if I expect the user to produce a short utterance, I don't have to stop speaking, I might continue speaking, because in turn-taking it's okay for someone to have a very brief utterance in overlap |
0:18:34 | whereas if the user is initiating a longer response, I might have to stop speaking and yield the turn, for example; so that's the temporal aspect |
0:18:53 | Q: just going back to the POS tags, what was the intuition for including that as a feature? |
0:18:59 | A: the part of speech has a lot to do with turn-taking, because the syntax is a strong cue: typically, if I say "I want to go to ...", you know that I'm going to continue, because that last word was a preposition |
0:19:21 | whereas in an example where I say "I want to go to the bus stop", a noun like that typically signals that I am done |
0:19:44 | Q: [partly inaudible, about using a deeper syntactic analysis] |
0:19:54 | A: in general, we tried to give the network as low-level information as possible and hope that it will figure things out |
0:20:02 | and typically I don't think you need anything much more complicated; I think it's the last few part-of-speech tags that are going to influence the decision, and my intuition is that a deeper syntactic analysis would not help that much |
0:20:17 | [session chair] okay, thank you; let's move on to the next speaker |