So, as we all know, turn-taking is one of the most fundamental aspects of dialogue, and it is something that dialogue systems are struggling with. If we look at human-human dialogue, we know that humans are very good at turn-taking: they can take the turn with very little overlap or gap. At the same time, people make pauses within their speech without the other person interrupting them. This is accomplished by a number of turn-taking cues, as many researchers have established.
So, syntax-wise, you typically yield the turn when you are syntactically complete. If we look at prosody, the pitch is normally rising or falling when you are yielding the turn, the intensity might be lower, and the final phoneme duration might be shorter; you might also breathe out. As for gaze, you look at the other person to yield the turn, and gestures might be used as well.
We also know that the more cues we combine, the stronger the signal is. And of course, for dialogue systems to handle turn-taking properly, this is something they have to take into account.
In dialogue systems, there are a number of decisions that have to be made that are related to turn-taking. Maybe the most common one that has been addressed is this: given that the user stops speaking, should the system take the turn? Of course, it would be nice if the system could detect as early as possible that the user is yielding the turn, so that it can start preparing a response. Another decision: given that the user has just started to speak, is this just the beginning of a brief backchannel, or an attempt to take the turn? That affects what the system should do.
Also, if the system is going to produce an utterance and wants to make a pause, it would be good to know how likely it is that the user will try to take the turn, depending on the cues that the system produces.
So far, these different questions have been addressed with different models, basically. Part of the problem, of course, is that turn-taking is highly context-dependent, and the dialogue context, with all these different factors, is very hard to model. What I would like to have, at least, is a model that is more general: a model that can be applied to many different turn-taking decisions; that is continuous, so it can be applied continuously and not just at specific events; that is predictive, so it does not just classify the current state but can predict what will happen in the future, so that the system can start preparing; and that is probabilistic, producing not just binary decisions.
So what I propose is that we use a recurrent neural network for this. The model that I have been working on works like this. We have two speech channels from the two speakers, which can be two humans, as if we are predicting between two humans, but it could also be a human and the system's speech. We segment the speech into slices which are fifty milliseconds long, so twenty frames per second. We do feature extraction and feed it into a recurrent neural network using LSTM units, to be able to capture long-range dependencies. At each frame, we make a prediction for the next three seconds: what is the likelihood that this speaker (speaker zero here) will be speaking in this future time window? So we feed it with both speakers, but we make predictions for one speaker here. We then train it with what is actually happening in the future; those are the training labels.
When we do this, we of course want to be able to model both speakers. So if we have speakers A and B, we first train the whole thing with A as speaker zero and B as speaker one, and then we switch them around so that A is speaker one; in these experiments, we trained it from both perspectives. At application time, we run two instances of the network at the same time, to make predictions for both speakers.
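To make this concrete, here is a minimal PyTorch sketch of the described setup; it is not the actual implementation from the talk, and the feature dimension and hidden size are illustrative assumptions. Features from both speakers arrive at 20 frames per second, and at every frame the network outputs 60 speaking probabilities, one per 50 ms frame in the next three seconds.

```python
# A minimal sketch (not the talk's implementation) of the architecture:
# per-frame features for both speakers in, 60 future speaking
# probabilities for speaker 0 out, at every frame.
import torch
import torch.nn as nn

FRAMES_AHEAD = 60          # 3 s at 20 frames per second
FEATS_PER_SPEAKER = 40     # assumed feature dimension per speaker

class TurnTakingLSTM(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        # Input at each frame: features for both speakers, concatenated.
        self.lstm = nn.LSTM(2 * FEATS_PER_SPEAKER, hidden_size,
                            batch_first=True)
        # One output per future frame; sigmoid gives a speaking probability.
        self.out = nn.Linear(hidden_size, FRAMES_AHEAD)

    def forward(self, x):                  # x: (batch, time, 2 * feats)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h))  # (batch, time, 60)

# The label at frame t is speaker 0's actual voice activity in frames
# t+1 .. t+60, so binary cross-entropy is a natural choice of loss here.
model = TurnTakingLSTM()
loss_fn = nn.BCELoss()
```

Each dialogue can then be fed through twice, once from each speaker's perspective, as described above.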
The features we have been using are: voice activity; pitch and power, normalized per speaker (we do not do any smoothing or deltas or anything like that, because we think the network should figure these things out itself); a measure of spectral stability, to capture final lengthening; and part-of-speech tags, where at the end of each word we feed in a one-hot representation of the part of speech that has just been produced. We compare two models: a full model that uses all these inputs, and a prosody model that uses everything but the part-of-speech tags, to see how much the part of speech actually helps.
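As an illustration of the per-frame input, here is a sketch of assembling one speaker's feature vector for a single 50 ms frame. The tag set and the use of per-speaker mean and standard deviation for normalization are assumptions made for the example, not details confirmed by the talk.

```python
# Sketch: one 50 ms frame's features for one speaker, as described:
# voice activity, per-speaker z-normalized pitch and power, a spectral
# stability measure, and a one-hot POS tag (non-zero only at the frame
# where a word ends).
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADP", "PRON", "OTHER"]  # assumed tag set

def frame_features(voiced, pitch, power, stability, pos_tag,
                   pitch_stats, power_stats):
    """Build the feature vector for one 50 ms frame of one speaker."""
    z_pitch = (pitch - pitch_stats[0]) / pitch_stats[1]  # per-speaker z-score
    z_power = (power - power_stats[0]) / power_stats[1]
    pos = np.zeros(len(POS_TAGS))
    if pos_tag is not None:              # a word ended in this frame
        pos[POS_TAGS.index(pos_tag)] = 1.0
    return np.concatenate(([float(voiced), z_pitch, z_power, stability], pos))
```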
We used the Deeplearning4j toolkit. We used the HCRC Map Task corpus for this, which we divided into ninety-six training dialogues and thirty-two test dialogues; that gives us about ten hours of training data.
We used the manually labelled voice activity, which we would expect to work similarly if detected automatically, and the manually labelled words, while the prosodic features were extracted automatically.
I can show you a video of what the predictions look like when we run the model continuously, online. These are the predictions: the red line is the point where the prediction is made, where we are now, and the green curve is the probability; the higher the curve, the more likely it is that this person will speak in that future time window.
In the video, of course, you can also see the future, what is actually going to happen; the model cannot see that, it is just shown to illustrate how well the predictions line up with the actual speech activity. Okay.
So, I have looked at two different tasks that we can use this model for. One very common task is to predict, given a pause, who is the most likely next speaker. This is an example where you can see that one person has just stopped speaking, and the model makes a fairly good prediction in this case: it does not think this person will continue, and it is quite likely that the other person will produce a response, though not a very long one. So it makes a very good prediction.
Here is another example. Whereas the previous one was a turn shift, what the model predicts here is that the current speaker will actually continue speaking, with fairly high probability, and that it is not very likely that the other person will produce a response.
To make it easy, I turned this into a binary classification task: at the onset of a pause, we basically take the average prediction over the future window for the two speakers, compare them, and say whether it is a turn shift or a hold. Then we can compute an F-score, see how well the model does, and compare it with other methods for doing this.
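A sketch of that binary decision rule, assuming the two network instances' 60-frame outputs at the pause onset are available as numpy arrays:

```python
# Sketch: binary hold/shift decision at a pause onset, as described:
# average each speaker's predicted speaking probability over the future
# window and pick the larger.
import numpy as np

def hold_or_shift(preds_current, preds_other):
    """Return 'hold' if the current speaker is predicted to continue."""
    return ("hold" if np.mean(preds_current) >= np.mean(preds_other)
            else "shift")
```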
This is performance over the number of training epochs; the blue curve is the full model and the red is the prosody model. We can see that the prosody model stabilizes, whereas the full model continues to learn. The best prediction we get for this, and you can see the numbers here, is an F-score of about 0.76. It is of course hard to know whether this is good or not.
It is of course impossible to get a hundred percent, because turn-taking is highly optional: it is not always obvious whether someone will take the turn or continue speaking. Of course, compared to the majority-class baseline, always predicting that the speaker holds the turn, the model is much better, but that is not very interesting. So we also let humans listen to these dialogues, up to the point of the pause, and try to estimate who will be the next speaker,
using crowdsourcing, and they did not perform as well. We also tried more traditional modeling, where we try to model the features we have at that point as well as possible and make a one-shot decision; the best such classifiers did not perform as well either, as we can see. This is also comparable to what we find in the literature, where people have done similar tasks with more traditional modeling.
We also compared what happens at different pause lengths, that is, how quickly into the pause we can make the decision. We see that once we are two hundred and fifty milliseconds into the pause, we already make a fairly good prediction of who will be the next speaker; after that, it does not really matter how much longer the pause is.
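A sketch of that analysis, assuming the per-frame model outputs are stored as pairs of numpy arrays (current speaker first) and the gold labels are "hold"/"shift" strings; the offset is given in 50 ms frames, so 5 frames corresponds to 250 ms:

```python
# Sketch: re-run the hold/shift decision at a given offset into the pause
# and measure the F-score at that offset.
from sklearn.metrics import f1_score

def f_score_at_offset(model_outputs, pause_onsets, labels, offset_frames):
    preds = []
    for onset in pause_onsets:
        preds_holder, preds_other = model_outputs[onset + offset_frames]
        preds.append("hold" if preds_holder.mean() >= preds_other.mean()
                     else "shift")
    return f1_score(labels, preds, pos_label="shift")
```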
The next task was prediction at speech onset, and this is interesting. Someone has just started to speak, as we can see here, and we want to know whether this is likely to be a very short utterance, like a backchannel, or a longer utterance. If it is a long utterance, maybe the dialogue system, if it is currently speaking, should stop and let the other person take the turn; otherwise it can continue speaking. Here the model also makes a fairly good prediction: you can see the curve going down very quickly, so this is going to be a short utterance, whereas here it predicts a much longer utterance. At the same point into the utterance, as you can see, the predictions are quite different.
To make the task binary again, we divided between short and long utterances that we found in the test data. In both cases, we are half a second into the speech: short utterances are not allowed to last more than half a second beyond that, and long utterances have to be longer than two and a half seconds. Then we average the speaking probability that is predicted over the future time window.
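A sketch of that decision, assuming the speaker's 60 future-frame probabilities are taken half a second after onset; the 0.5 threshold is an illustrative choice, not a value from the talk:

```python
# Sketch: short-vs-long classification at speech onset, as described:
# average the predicted speaking probability over the future window
# and threshold it.
import numpy as np

def is_long_utterance(preds_speaker, threshold=0.5):
    """preds_speaker: 60 future-frame probabilities at onset + 0.5 s."""
    return np.mean(preds_speaker) >= threshold
```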
This is a histogram showing the average predicted speaking probabilities for the short utterances and for the long utterances, and you can see that it gives a fairly good separation. Just using this very simple method, which could of course be made more sophisticated, we get an F-score of 0.76.
Again, compared to the majority-class baseline or to more traditional modeling, we get better performance, and the same holds if we compare to similar tasks that have been done before.
Okay, so this looks very promising; the question, of course, is whether it can be used in a spoken dialogue system. So we took a corpus we had of human-robot interaction, which was already annotated, at the end of each user speech segment, for whether this was a good place to take the turn or not. We ran the network with the synthesized speech from the system and the user's speech, and we compared the predictions just like we did before. Of course, since these are very different types of dialogue, the Map Task dialogue versus the human-computer dialogue, direct application of the prosody model did not give a very good F-score: better than the baseline, but not very useful.
So what can we do about that? Well, maybe we can at least use the recurrent neural network as a feature extractor, as a representation of the current turn-taking dialogue state. So we take the LSTM layers, and on top of them we train, with supervised learning, a logistic regression model that predicts whether this is a good place to take the turn.
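A sketch of this transfer setup, with stand-in random data; in the real experiment the inputs would be the LSTM hidden states at the end of each user speech segment, with the corpus annotations of whether it was a good place to take the turn:

```python
# Sketch: reuse the pretrained network's LSTM state as a turn-taking
# state representation and train a logistic regression on top.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 64))  # (segments, hidden_size), stand-in
labels = rng.integers(0, 2, size=500)       # 1 = good place to take the turn

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, hidden_states, labels, scoring="f1").mean())
```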
We then get fairly good results with cross-validation, and it also worked well even when we trained it with only twenty percent of the data. So that is promising.
So, for future work: we think we need more human-robot interaction data like that. Map Task is highly specific and, of course, not very similar to human-machine interaction, so we could for example train on Wizard-of-Oz data. Also, the way we have used the model so far is very coarse: we just average these two predictions and compare them, and that does not really do justice to the model, which makes much more fine-grained predictions. What is also interesting is that, as you go along, the predictions update during the pause, so we could make continuous decisions while the pause is unfolding, and also make use of the probabilities, for example in a decision-theoretic framework.
Then multimodal interaction: we have data from face-to-face interaction, and of course we know that gaze and gesture and so on are very important, so that should be highly useful. And also multi-party interaction: the model lends itself very well to the multi-party setting, since each speaker is modeled with its own network, so we could apply it to any number of speakers.
Thank you.
So, for each frame we feed in features for what is happening during those fifty milliseconds; if we have pitch, for example, we take the average pitch in that small window.
Sorry, so, as soon as a word is finished, we take a one-hot representation of its POS tag and feed it into the network at that frame. As soon as the word ends, we add the tag to the input, and after that the POS inputs are zeros again. So it is just for one frame that you get the value for that part of speech.
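A sketch of that input scheme: the POS channel is a one-frame pulse at the frame where the word ends, and zeros everywhere else; the tag inventory size is an assumption:

```python
# Sketch: build the POS input channel for one speaker, with a one-hot
# pulse only at each word-final frame.
import numpy as np

def pos_channel(n_frames, word_end_frames, tag_ids, n_tags=6):
    chan = np.zeros((n_frames, n_tags))
    for frame, tag in zip(word_end_frames, tag_ids):
        chan[frame, tag] = 1.0   # a one-frame pulse at the word's end
    return chan
```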
Thanks for the talk. A quick clarification question: for the two prediction tasks that you presented, were those separate networks that you were training, or the same network with two output layers?

It is the same network that is trained; it is not tied to the two roles or anything. We run two instances of the same network.
Okay, so is it a kind of multi-task learning? I mean, you just have two different prediction heads, but the latent representation is the same?

No, at application time they are completely separate: the two networks both get the input from both speakers; it is just that each network makes predictions for one of the speakers.
Right, but the model itself, the parameters that you are learning: are they trained in isolation, or trained at the same time for the two prediction tasks?

No, the model is trained to predict what is happening at each frame, and then we can apply that same model to different tasks. So we can look at what the model predicts at speech onset, and what it predicts at the beginning of a pause. That is exactly why I wanted a general model: it is the same model that is applied to the different tasks.
Thanks for a great talk. The model includes temporal information in the predictions, so I wanted to ask if you could talk a little bit about how you imagine systems could use that kind of temporal information. You talked about long versus short utterances; should the system just say, okay, this is the right time for a short utterance, or how detailed do you think these predictions can be?
So, take the user's utterance: if I expect the user to produce a short utterance, I do not have to stop speaking; I might continue speaking, because in turn-taking it is okay for someone to produce a very brief utterance in overlap. Whereas if they are initiating a longer response, I might have to stop speaking and yield the turn, for example. So that is one temporal aspect.
And a question about the POS tags: what exactly was the intuition for including that as a feature?

Because syntax is a strong cue. Typically, if I say "and then I want to go to...", you understand that I am going to continue, because the last word was a preposition. Whereas if I say "I want to go to the bus stop", ending on a noun is typically a signal that I might be done.
So in general, we have tried to give the model as low-level information as possible and hope that it will figure things out. And typically, I do not think you need anything much more complicated: I think it is the last couple of words that are going to influence the decision, and my intuition is that a deeper syntactic analysis would not help that much.
Okay, thank you; let's move on to the next speaker.