0:00:18 | so, as we all know, turn-taking is one of the most fundamental aspects of dialogue, and it's something that dialogue systems are still struggling with |
0:00:30 | if we look at human-human dialogue, we know that humans are very good at turn-taking: they can take the turn with barely any gap and very little overlap, and at the same time people can make pauses within their speech without the other person interrupting them |
0:00:50 | this is accomplished by a number of turn-taking cues, as many researchers have established |
0:00:59 | syntax-wise, you typically yield the turn when you are syntactically complete |
0:01:05 | if we look at prosody, pitch is normally rising or falling when you're yielding the turn; the intensity might be lower, the phoneme duration might be shorter, and you might breathe out |
0:01:19 | with gaze, you look at the other speaker to yield the turn, and gestures might also be used |
0:01:26 | we also know that the more cues we combine, the stronger the signal is |
0:01:35 | and of course, for dialogue systems to properly handle turn-taking, this is something they have to take into account |
0:01:43 | in dialogue systems there are a number of decisions that have to be made that are related to turn-taking |
0:01:48 | maybe the most common one that has been addressed is: given that the user stops speaking, should the system take the turn? |
0:01:58 | of course, it would be nice if the system could predict as early as possible whether the user is yielding the turn, so that the system can start preparing a response |
0:02:08 | another decision is: given that the user has just started to speak, is this just the beginning of a brief backchannel, or an attempt to take the turn? that affects what the system should do |
0:02:19 | also, if the system is going to produce an utterance and wants to make a pause, it would be good to know how likely it is that the user will try to take the turn, depending on the cues that the system produces |
0:02:34 | so, before, these different questions have been addressed with different models, basically |
0:02:42 | the problem, of course, is also that turn-taking is highly context-dependent, and the dialogue context, with all these different factors, is very hard to model |
0:02:54 | what I would like to have, at least, is a model that is more general: one model that can apply to many different turn-taking decisions |
0:03:05 | it should be continuous, so you can apply it continuously, not just at specific events |
0:03:12 | it should also be predictive, so you don't just classify the current state but are able to predict what will happen in the future, so that the system can start preparing |
0:03:21 | and it should also be probabilistic, not just make binary decisions |
0:03:26 | so what I propose is that we use a recurrent neural network for this |
0:03:29 | the model that I have been working on works like this: we have two speech channels from two speakers |
0:03:41 | these can be two humans, if we are predicting between two humans, but it could also be a human and the system's speech |
0:03:48 | we segment the speech into slices which are fifty milliseconds long, so twenty frames per second |
0:03:54 | we do feature extraction and feed it into a recurrent neural network, using LSTM units to be able to capture long temporal dependencies |
0:04:02 | at each frame, we make a prediction for the next three seconds: what is the likelihood that this speaker, speaker zero here, is speaking in this future time window |
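A rough sketch of the frame-level model just described, written in PyTorch purely for illustration (the implementation mentioned later in the talk used a different toolkit; the feature dimension and layer size here are assumptions):

```python
import torch
import torch.nn as nn

class TurnTakingLSTM(nn.Module):
    """At each 50 ms frame, predict the probability that speaker 0 is
    speaking in each frame of a 3-second future window (60 frames at
    20 frames per second)."""

    def __init__(self, n_features: int, hidden_size: int = 64, horizon: int = 60):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, horizon)  # one logit per future frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_features), features from both speakers
        h, _ = self.lstm(frames)
        return torch.sigmoid(self.out(h))  # (batch, time, horizon) probabilities

# trained with binary cross-entropy against what actually happens next
loss_fn = nn.BCELoss()
```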
0:04:24 | so the network sees both speakers, but we make the prediction for one speaker here, and then we train it with what is actually happening in the future: those are the training labels |
0:04:37 | when we do this, we of course want to be able to model both speakers, so if we have speakers A and B, we first train the whole thing with A as speaker zero and B as speaker one, and then we switch them around, so A is speaker one; in these experiments we trained it from both perspectives |
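In practice, training from both perspectives amounts to presenting each dialogue twice with the speaker channels swapped; a hedged sketch (function and variable names are mine):

```python
import torch

def future_targets(vad: torch.Tensor, horizon: int = 60) -> torch.Tensor:
    """Labels: for each frame t, a binary vector saying whether the
    predicted speaker is actually speaking at frames t+1 .. t+horizon."""
    T = vad.shape[0]
    return torch.stack([vad[t + 1 : t + 1 + horizon] for t in range(T - horizon)])

def both_perspectives(feats_a, feats_b, vad_a, vad_b, horizon=60):
    # each dialogue twice: once with A in the predicted-speaker slot, once with B
    ya, yb = future_targets(vad_a, horizon), future_targets(vad_b, horizon)
    T = ya.shape[0]  # drop trailing frames that lack a full future window
    yield torch.cat([feats_a, feats_b], dim=-1)[:T], ya
    yield torch.cat([feats_b, feats_a], dim=-1)[:T], yb
```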
0:04:57 | at application time, we run two networks at the same time, to make predictions for both speakers |
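So at application time, something like the following runs on every frame, reusing the hypothetical TurnTakingLSTM class sketched above (the tensors here are random stand-ins for real feature streams):

```python
import torch

T, F = 200, 11                  # 10 s of 50 ms frames; 11 features per speaker
feats_a = torch.randn(1, T, F)  # stand-in feature streams for speakers A and B
feats_b = torch.randn(1, T, F)

model = TurnTakingLSTM(n_features=2 * F)
probs_a = model(torch.cat([feats_a, feats_b], dim=-1))  # A's future speech
probs_b = model(torch.cat([feats_b, feats_a], dim=-1))  # B's future speech
```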
0:05:05 | the features that we have been using are: voice activity; pitch and power, normalized per speaker |
0:05:11 | we don't do any further transformation of these or anything, because we think that the network should figure these things out |
0:05:20 | we also use a measure of spectral stability, to capture phrase-final lengthening |
0:05:24 | and we use part-of-speech tags: at the end of each word, we feed in a one-hot representation of the part of speech that has just been produced |
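Under these assumptions, one 50 ms input frame for one speaker might be assembled like this (the exact feature layout and tag inventory are illustrative, not from the talk):

```python
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADP", "DET", "PRON", "OTHER"]  # illustrative

def frame_features(vad, pitch_z, power_z, spectral_stability, pos_tag=None):
    """One speaker, one 50 ms frame. pitch/power are normalized per
    speaker; pos_tag is set only on the frame where a word ends,
    otherwise the one-hot block stays all zeros."""
    pos = np.zeros(len(POS_TAGS))
    if pos_tag is not None:
        pos[POS_TAGS.index(pos_tag)] = 1.0
    return np.concatenate([[vad, pitch_z, power_z, spectral_stability], pos])
```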
0:05:36 | we compared the full model that uses all of these inputs, and also a prosody model that uses everything but the part-of-speech, to see how much the part-of-speech actually helps |
0:05:48 | we used the Deeplearning4j toolkit |
0:05:52 | we used the Map Task corpus for this, which we divided into training dialogues and test dialogues; that gives us about ten hours of training data |
0:06:04 | we used the manually labelled voice activity, which in a live system would of course have to be detected automatically, and the manually labelled parts of speech, whereas the prosody was extracted automatically |
0:06:15 | I can show you in a video what the predictions look like when we run the model continuously, online |
0:06:21 | so these are the predictions: the red line is the point where the prediction is made, where we are now, and the green is the probability, so the higher the curve, the more likely it is that this person will speak in this future time window |
0:06:37 | and of course, as the video plays, you will also see what was actually going to happen in the future |
0:06:45 | [video demonstration; audio not transcribable] |
0:07:03 | okay, so I have looked at two different tasks that we can use this model for |
0:07:09 | one very common task is to predict, given a pause, who is the most likely next speaker |
0:07:11 | this is an example where you can see that one person has just stopped speaking, and the model makes a fairly good prediction in this case |
0:07:26 | it predicts that it will take some time before this person continues, and that it's quite likely that the other person will produce a response, but not a very long one; so it makes a very good prediction |
0:07:40 | here is another prediction: that was a turn shift it was predicting; here it is predicting that the current speaker will actually continue speaking, a fairly high prediction, and that it is not very likely that the other person will produce a response |
0:07:54 | to make it easy, I made this into a binary classification task: at the pause, we take the average prediction over the future window for the two speakers, compare them, and say whether it is a shift or a hold |
0:08:11 | and then we can compute an F-score to see how well it does, and compare it with other methods for doing this |
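As a hedged sketch of that decision rule (names are mine; the inputs are each speaker's predicted speaking probabilities over the future window at the pause):

```python
import numpy as np

def hold_or_shift(p_current: np.ndarray, p_other: np.ndarray) -> str:
    """Whichever speaker has the higher average predicted speaking
    probability over the future window is taken as the next speaker."""
    return "hold" if p_current.mean() >= p_other.mean() else "shift"
```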
0:08:18 | this axis is the number of training epochs; the blue curve is the full model, the red is the prosody model |
0:08:23 | we can see that the prosody model stabilizes quite quickly, whereas the full model continues to learn |
0:08:32 | the best prediction we get for this is with the full model; you can see the numbers here |
0:08:42 | it's hard to know, of course, whether this is good or not good |
0:08:48 | it's impossible, of course, to get a hundred percent, because turn-taking is highly optional; it's not always obvious whether a speaker will continue speaking or not |
0:08:57 | of course, if we compare to the majority-class baseline, always hold the turn, the model is much better, but that's not very interesting |
0:09:02 | so we let humans listen to these dialogues, up to the point of the pause, and try to estimate who will be the next speaker, using crowdsourcing; and they didn't perform as well |
0:09:19 | we also tried more traditional modelling, where we try to model the features we have at that point as well as possible and make a one-shot decision with the best classifier; that did not perform as well either, as we can see |
0:09:36 | this is also comparable to what we find in the literature, where people have done similar tasks with more traditional modelling |
0:09:45 | we also compared what happens if we look at different pause lengths, that is, how quickly into the pause we make the decision |
0:09:52 | and we see that early into the pause, around fifty milliseconds in, we already make a fairly good prediction of who will be the next speaker, and after that it doesn't really matter what the pause length is |
0:10:03 | so the next task we looked at was prediction at speech onsets, and this is interesting |
0:10:09 | someone has just started to speak, as we can see here, and we want to know: is this likely to be a very short utterance, a backchannel, or is it likely to be a longer utterance? |
0:10:17 | if it is a long utterance, maybe the dialogue system, which has just stopped speaking, should let the other person take the turn, and otherwise continue speaking |
0:10:29 | here the model makes a fairly good prediction: you can see the curve is going down very quickly, so this is going to be a short utterance, whereas here it predicts a much longer utterance |
0:10:46 | we are at the same point into the utterance in both cases, and you can see that the predictions are quite different |
0:10:52 | to make the task binary again, we divide between short and long utterances, as we find them in the test data |
0:11:02 | in both cases we are one half second into the speech; short utterances are not allowed to be more than half a second longer, whereas long utterances have to be more than two and a half seconds |
0:11:17 | and then we average the speaking probability that is predicted over the future time window |
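The corresponding decision could be as simple as thresholding that average (the threshold value is an assumption; the talk only describes comparing the two distributions):

```python
import numpy as np

def short_or_long(p_speaking: np.ndarray, threshold: float = 0.5) -> str:
    """Half a second into a new utterance: a low average predicted
    speaking probability over the future window suggests a short
    utterance or backchannel; a high one suggests a long utterance."""
    return "long" if p_speaking.mean() >= threshold else "short"
```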
0:11:24 | this is a histogram showing, for the short utterances, what the average predicted speaking probability is, and the same for the longer utterances |
0:11:33 | you can see it gives a fairly good separation, and just using this very simple method, which could of course be made more sophisticated, we get an F-score of 0.76 |
0:11:45 | again, if we compare to the majority-class baseline or to more traditional modelling, we get better performance, also compared to similar tasks that have been done before |
0:12:02 | okay, so this looks very promising; the question of course is whether this can be used for a spoken dialogue system |
0:12:12 | so we took a corpus we had of human-robot interaction, which was already annotated, at the end of each user speech segment, with whether this was a good place to take the turn or not |
0:12:25 | we fed the network with the synthesized speech from the system and the user's speech, and we compared the predictions just like we did before |
0:12:36 | and of course, since these are very different types of dialogue, the map task dialogue and the human-computer dialogue, direct application of the prosody model didn't give a very good F-score; it's better than the baseline, but not very useful |
0:12:54 | so what do we do about that? |
0:12:57 | well, maybe at least we can use the recurrent neural network as a feature extractor, as a representation of the current turn-taking dialogue state |
0:13:06 | so we take the LSTM layers, and on top of them we train, with supervised learning, a logistic regression that predicts whether this is a good place to take the turn |
0:13:21 | and then we get fairly good results with cross-validation |
0:13:29 | it also worked well if we trained it with just twenty percent of the annotated data, so that's promising |
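A minimal sketch of that transfer step, assuming scikit-learn (the LSTM states here are random stand-ins for the real extracted vectors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))    # LSTM state at the end of each user segment
y = rng.integers(0, 2, size=200)  # 1 = annotated as a good place to take the turn

clf = LogisticRegression(max_iter=1000).fit(X, y)
p_take_turn = clf.predict_proba(X)[:, 1]  # probability of a turn-taking point
```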
0:13:40 | so, for future work: we think we need more human-robot interaction data like that |
0:13:49 | map task is highly specific, and of course it's not very similar to human-machine interaction, so we could for example train on Wizard-of-Oz data |
0:14:01 | also, the way we have used the model so far is very coarse: we just average these predictions and compare them, and that doesn't really do justice to the model, which makes much more fine-grained predictions |
0:14:16 | what's also interesting is that as you go along in these pauses, the prediction updates during the pause, so we can make continuous decisions while the pause is unfolding |
0:14:29 | and we could also make use of the probabilities, for example in a decision-theoretic framework |
0:14:36 | multimodal interaction: of course, we have data from face-to-face interaction, and we know that gaze and gesture and so on are very important, so that should be highly useful |
0:14:50 | and also multi-party interaction: the model applies very well to multi-party settings, since each speaker is modelled with its own network, so we could apply it to any number of speakers |
0:15:02 | thank you |
0:15:29 | A: [replying to an audience question] so we are trying to feed one feature for what's happening during each fifty-millisecond slice; if we have pitch, for example, we take the average pitch in that small window |
0:15:45 | A: sorry, so what we do is: as soon as a word is finished, we take a one-hot representation of its POS tag and feed it into the network at that frame |
0:16:02 | as soon as the word ends, the tag goes into the input, and then the POS slots are zeros again; so it's just for one frame that you get the value for that part of speech |
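A sketch of that injection scheme (continuing the illustrative feature layout from earlier; names are mine):

```python
import numpy as np

def pos_stream(n_frames: int, word_end_tags: dict, n_tags: int) -> np.ndarray:
    """Per-frame one-hot POS block: all zeros except at the single frame
    where a word ends. word_end_tags maps frame index -> POS tag index."""
    pos = np.zeros((n_frames, n_tags))
    for frame, tag in word_end_tags.items():
        pos[frame, tag] = 1.0
    return pos
```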
0:16:19 | Q: thanks for the talk, this is more of a clarification question: the two prediction tasks, are these separate networks that you were training, or the same network with two output layers? |
0:16:29 | A: it is the same network that is trained; it's not tied to the two roles or anything, we run two instances of the same network |
0:16:43 | Q: okay, so is it a kind of multi-task learning? I mean, you have two different prediction outputs, but is the latent representation the same? |
0:16:52 | A: no, at application time they are completely different; the two networks both get the input from both speakers, it's just that each network makes predictions for one of the speakers |
0:17:04 | Q: right, but the model itself, the parameters that you are learning, are they trained completely in isolation, or trained at the same time for the two prediction tasks? |
0:17:15 | A: no, the network is trained to predict what's happening at each frame, and then we can apply that same model to different tasks: we can see what the model predicts at a speech onset, and what it predicts at the beginning of a pause |
0:17:31 | that's why I wanted a general model: it's the same model that is applied to the different tasks |
0:17:47 | Q: thanks for a great talk; the model includes temporal information in the prediction, so I wanted to ask if you could talk a little bit about how you imagine systems could use that kind of temporal information |
0:18:05 | A: I talked about long versus short utterances; the system doesn't just get "okay, this is the right time for a short utterance", it gets more detailed predictions of what is about to come |
0:18:18 | so if it's a user utterance: if I expect the user to produce a short utterance, I don't have to stop speaking, I might continue speaking, because in turn-taking it's okay for someone to have a very brief utterance in overlap |
0:18:34 | whereas if the user is initiating a longer response, I might have to stop speaking and yield the turn, for example; so that's the temporal aspect |
0:18:53 | Q: just going back to the POS tags, what was the intuition for including that as a feature? |
0:18:59 | A: the part of speech has a lot to do with turn-taking, because the syntax is a strong cue: typically, if I say "I want to go to ...", you know that I'm going to continue, because that last word was a preposition |
0:19:21 | whereas in an example where I say "I want to go to the bus stop", a noun like that typically signals that I am done |
0:19:44 | Q: [partly inaudible, about using a deeper syntactic analysis] |
0:19:54 | A: in general, we tried to give the network as low-level information as possible and hope that it will figure things out |
0:20:02 | and typically I don't think you need anything much more complicated; I think it's the last few part-of-speech tags that are going to influence the decision, and my intuition is that a deeper syntactic analysis would not help that much |
0:20:17 | [session chair] okay, thank you; let's move on to the next speaker |