0:00:13 | yeah so high uh and daniel angle percent these work that the to get with a cheap |
---|
0:00:18 | you are still trying |
---|
0:00:20 | so |
---|
0:00:21 | hmmm |
---|
0:00:23 | oh |
---|
0:00:27 | okay so uh most conversational systems tend to be developed using this kind of uh uh walkie-talkie |
---|
0:00:32 | turn-taking paradigm |
---|
0:00:33 | this means that one guy is talking at a time and response times are long |
---|
0:00:39 | mostly because they use pause duration thresholds for end-of-utterance detection |
---|
0:00:43 | but |
---|
0:00:44 | uh uh |
---|
0:00:45 | in human talk |
---|
0:00:47 | uh more than fifty percent |
---|
0:00:49 | of all uh speaker shifts |
---|
0:00:51 | occur in these two situations that is in partial overlap |
---|
0:00:55 | or |
---|
0:00:55 | yeah in a gap of up to two hundred milliseconds |
---|
0:00:58 | which is supposed to be the minimum response time to pauses |
---|
0:01:03 | so if you want to do turn-taking |
---|
0:01:05 | in uh uh when you're talking to a computer |
---|
0:01:08 | uh you can divide these into two cases the first one is when you're having a long gap |
---|
0:01:13 | uh longer than two hundred milliseconds |
---|
0:01:15 | okay handle these by and and the kinds |
---|
0:01:17 | predict does |
---|
0:01:19 | uh the second case here is when you're having a little overlap |
---|
0:01:22 | or short gap |
---|
0:01:24 | this is our target for uh this study |
---|
0:01:27 | that we |
---|
0:01:29 | uh introduce a simplified approach for this by introducing these acknowledgement moves which are |
---|
0:01:34 | basically a |
---|
0:01:36 | backchannel type of dialogue act |
---|
0:01:37 | this means |
---|
0:01:39 | what people say like hmmm yeah and so on |
---|
0:01:42 | so if you want a dialogue system it has to do |
---|
0:01:44 | uh these two things that is |
---|
0:01:46 | it should be a bit to continue to talk in income than uh these signals transmit |
---|
0:01:50 | windy G in the in to complete overlap |
---|
0:01:53 | or or to compute it should be a to that speech |
---|
0:01:56 | well you you still a training one of |
---|
0:01:59 | um |
---|
0:01:59 | things |
---|
0:02:00 | so we're talking uh a lot about response times here |
---|
0:02:04 | uh this is the corpus that i used the classical edinburgh map task corpus it's scottish english |
---|
0:02:09 | a a the for face-to-face dialogues |
---|
0:02:12 | and the task is the map task where one guy explains a route to another |
---|
0:02:16 | and it has provided annotations |
---|
0:02:19 | uh among these are the acknowledgement moves |
---|
0:02:23 | and which more to do this under a yeah |
---|
0:02:26 | in in to a talk spurt |
---|
0:02:28 | they are defined here as a minimum voice activity |
---|
0:02:31 | a duration to actual on fifty milliseconds |
---|
0:02:34 | uh separated by pause durations of two hundred milliseconds |
---|
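The talkspurt definition given at this point (a minimum voice activity duration plus a minimum pause duration) can be sketched roughly as follows. This is only an illustration: the function name, the 10 ms frame size, and the order of the two steps (drop short bursts first, then bridge short pauses) are assumptions, and the 50 ms default is taken from the transcript, which is unclear at that point.

```python
def talkspurts(vad, frame_ms=10, min_voice_ms=50, min_pause_ms=200):
    """Return (start_ms, end_ms) talkspurts from a per-frame voice-activity list.

    All parameter defaults are assumptions made for this sketch.
    """
    # Collect raw runs of consecutive voiced frames.
    runs = []
    start = None
    for i, active in enumerate(vad):
        if active and start is None:
            start = i
        elif not active and start is not None:
            runs.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:
        runs.append((start * frame_ms, len(vad) * frame_ms))

    # Drop runs shorter than the minimum voice activity duration.
    runs = [(s, e) for s, e in runs if e - s >= min_voice_ms]

    # Bridge pauses shorter than the minimum pause duration.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < min_pause_ms:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


# Toy voice-activity sequence (made up): two bursts split by a 50 ms pause,
# a 20 ms blip, and a final 80 ms burst.
vad = [1] * 10 + [0] * 5 + [1] * 10 + [0] * 30 + [1] * 2 + [0] * 30 + [1] * 8
spurts = talkspurts(vad)
```

With these toy values the first two bursts are bridged into one talkspurt and the 20 ms blip is discarded.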
0:02:39 | this makes the provided segmentation more |
---|
0:02:42 | perceptually relevant and uh |
---|
0:02:44 | and more closely resemble an online condition |
---|
0:02:48 | so uh these are the uh |
---|
0:02:51 | twenty most frequent acknowledgement moves |
---|
0:02:53 | you can see that the top five here are right okay K uh and yeah |
---|
0:02:59 | uh so |
---|
0:03:02 | this might actually be the by their lexical content |
---|
0:03:05 | so how does this relate to overlap well i took the corpus and windowed it in ten millisecond |
---|
0:03:10 | frames |
---|
0:03:11 | so given that the frame is non-overlap |
---|
0:03:13 | there is a five percent probability that it is an acknowledgement move |
---|
0:03:16 | while a a if you is in the wheel that this it's the five percent probably there and the occlusion |
---|
0:03:22 | so |
---|
0:03:23 | uh this seems to be more common in overlap |
---|
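The frame-level comparison described here (how often a 10 ms frame belongs to an acknowledgement move, inside versus outside overlap) can be sketched as a pair of conditional relative frequencies. The function name and the toy frames below are invented for illustration and do not reproduce the figures from the talk.

```python
def ack_rate_by_overlap(frames):
    """frames: iterable of (in_overlap, is_ack) booleans, one pair per frame.

    Returns {in_overlap: fraction of frames that are acknowledgement moves}.
    """
    counts = {True: [0, 0], False: [0, 0]}  # in_overlap -> [ack frames, all frames]
    for in_overlap, is_ack in frames:
        counts[in_overlap][1] += 1
        counts[in_overlap][0] += int(is_ack)
    return {k: (a / n if n else float("nan")) for k, (a, n) in counts.items()}


# Made-up toy data: 10 non-overlap frames (1 acknowledgement frame)
# and 4 overlap frames (2 acknowledgement frames).
toy = ([(False, False)] * 9 + [(False, True)]
       + [(True, True)] * 2 + [(True, False)] * 2)
rates = ack_rate_by_overlap(toy)
```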
0:03:26 | so what is going on here |
---|
0:03:28 | i tried to dig a bit deeper uh by computing uh the between-speaker interval |
---|
0:03:33 | which is defined by the partial overlap |
---|
0:03:35 | and the gap |
---|
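One way to read the between-speaker measure is as a single signed value at each speaker change: positive when there is a gap, negative when the next speaker starts before the previous one has finished. The helper names below are hypothetical, and the sign convention is an assumption based on the later remark that overlap sits on the negative side of the graph.

```python
def between_speaker_interval(prev_end_ms, next_start_ms):
    """Signed interval at a speaker change: positive = gap, negative = overlap."""
    return next_start_ms - prev_end_ms


def kind(interval_ms):
    """Label an interval; a zero-length interval is counted as a gap here (assumption)."""
    return "overlap" if interval_ms < 0 else "gap"


gap = between_speaker_interval(prev_end_ms=1000, next_start_ms=1150)
overlap = between_speaker_interval(prev_end_ms=1000, next_start_ms=900)
```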
0:03:37 | so uh what are actually going for a are the target to used an assumption didn't look at or assumption |
---|
0:03:43 | of them all this month mode |
---|
0:03:45 | but it was a a a a uh uh for others to to have a reference to compare with |
---|
0:03:50 | that is |
---|
0:03:51 | a in the context |
---|
0:03:52 | oh i'm like motion low |
---|
0:03:54 | i did it without extralinguistic types of sounds and including extralinguistic sounds |
---|
0:04:00 | so this is to drop |
---|
0:04:01 | um |
---|
0:04:02 | from coming station in press |
---|
0:04:05 | and as you can see here if you include these extralinguistic uh tokens you get much more |
---|
0:04:10 | overlap |
---|
0:04:11 | which is uh |
---|
0:04:13 | is is um |
---|
0:04:15 | uh the negative scale of the graph here |
---|
0:04:19 | uh while the |
---|
0:04:21 | if you compute it |
---|
0:04:22 | uh in the context of acknowledgement moves |
---|
0:04:25 | uh there is not much difference |
---|
0:04:28 | uh a |
---|
0:04:29 | you can build for uh the in look the assumption of them cushion you get slightly more over that be |
---|
0:04:35 | as you can see here |
---|
0:04:36 | uh to the left image or |
---|
0:04:38 | so what does this mean well it seems like the worship station a closed |
---|
0:04:43 | and no will that are mostly due to interaction to complete lap |
---|
0:04:47 | but are uh but to actually want to do here is |
---|
0:04:49 | uh |
---|
0:04:50 | for both interaction direction to complete the were that are shown some of them |
---|
0:04:54 | and occlusion mode |
---|
0:04:55 | uh into action and to silence we need to classify i income speech |
---|
0:04:59 | and some acknowledgement no and off |
---|
0:05:01 | uh as |
---|
0:05:02 | by i early |
---|
0:05:04 | so uh i propose this method that i call maximum latency |
---|
0:05:08 | classification |
---|
0:05:10 | uh it's quite simple actually it's just a talkspurt segmenter uh |
---|
0:05:15 | which has a minimum speech activity |
---|
0:05:18 | threshold and a minimum pause duration threshold |
---|
0:05:21 | but i want to make the decision at all |
---|
0:05:24 | uh however |
---|
0:05:26 | uh in the first case here you tao use |
---|
0:05:29 | uh uh uh larger than |
---|
0:05:31 | the talkspurt |
---|
0:05:32 | a a a a uh duration plus |
---|
0:05:35 | minimum pause threshold you make it at |
---|
0:05:38 | at this time instead |
---|
0:05:39 | to minimize the response times |
---|
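The passage above is hard to follow in the transcript. One plausible reading is that the classification decision is made at the maximum latency, unless the talkspurt has already been detected as finished (its duration plus the minimum pause threshold) before that point, in which case the decision is made at that earlier time to keep the response time short. The sketch below encodes that reading only; the function name and the default values are assumptions.

```python
def decision_time_ms(talkspurt_ms, max_latency_ms, min_pause_ms=200):
    """Time (measured from talkspurt onset) at which the decision is made.

    Assumed reading: decide at whichever comes first, the detected end of the
    talkspurt (duration + minimum pause threshold) or the maximum latency.
    """
    end_detected_ms = talkspurt_ms + min_pause_ms
    return min(end_detected_ms, max_latency_ms)


short_spurt = decision_time_ms(talkspurt_ms=150, max_latency_ms=500)  # ends early
long_spurt = decision_time_ms(talkspurt_ms=800, max_latency_ms=500)   # still ongoing
```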
0:05:42 | so |
---|
0:05:43 | how should the upper bound be set for the maximum latency well these are the durations |
---|
0:05:48 | of these two |
---|
0:05:49 | kinds of talkspurts the acknowledgement moves and the other ones |
---|
0:05:53 | you can see here that these are much shorter |
---|
0:05:55 | so |
---|
0:05:57 | uh if you want to use duration as a feature for classification |
---|
0:06:02 | uh you uh |
---|
0:06:04 | um |
---|
0:06:05 | you basically you might have to wait the longer you wait |
---|
0:06:10 | uh the more salient uh duration is as a feature |
---|
0:06:14 | but to less of between two for |
---|
0:06:16 | a a a a lot to watch |
---|
0:06:17 | so i tried one hundred milliseconds three hundred milliseconds and five hundred milliseconds just to see what we would get |
---|
0:06:23 | uh for the acoustic features i use this kind of parametrization it's um |
---|
0:06:29 | it's a length-invariant parametrization which is basically a type of DCT |
---|
0:06:33 | which is modified such that you divide by the length of the talkspurt |
---|
0:06:38 | in this way this is quite useful because the basis functions are periodic which allows for |
---|
0:06:42 | a good interpolation on the syllabic level |
---|
0:06:45 | length invariance gives |
---|
0:06:46 | normalization for duration or speaking rate you can separate these in |
---|
0:06:50 | the classifier were a |
---|
0:06:53 | and the zeroth coefficient is equal to the arithmetic mean of the trajectory which means if you omit it |
---|
0:06:58 | you only parametrize the relative shape of the trajectory |
---|
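As a rough illustration of a length-normalized cosine transform of this kind, the sketch below projects a trajectory onto DCT-II style cosine basis functions and divides by the number of frames. It is not taken from the talk: the function name, the number of coefficients, and the exact normalization are assumptions. With this normalization the zeroth coefficient is the arithmetic mean, so dropping it leaves only the relative shape.

```python
import math


def length_invariant_dct(traj, n_coeffs=4):
    """Cosine-transform coefficients of a trajectory, divided by its length.

    Hypothetical sketch: DCT-II basis, normalized by the number of frames so
    that trajectories of different lengths map onto a comparable scale.
    """
    n = len(traj)
    coeffs = []
    for k in range(n_coeffs):
        total = sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                    for i, x in enumerate(traj))
        coeffs.append(total / n)
    return coeffs


traj = [1.0, 2.0, 3.0, 4.0]
base = length_invariant_dct(traj)

# Adding a constant offset (for example a speaker-dependent bias) only
# moves the zeroth coefficient; the higher coefficients are unchanged.
shifted = length_invariant_dct([x + 10.0 for x in traj])

# The same shape sampled with twice as many frames keeps the same mean.
stretched = length_invariant_dct([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
```

Omitting `base[0]` from the feature vector is what the talk describes as modeling only the relative shape of the trajectory.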
0:07:03 | so this is useful if you want to model uh F zero |
---|
0:07:06 | uh which has a speaker-dependent bias |
---|
0:07:09 | intensity |
---|
0:07:10 | that is because |
---|
0:07:11 | of the distance to the microphone |
---|
0:07:13 | and then MFCCs have this channel uh |
---|
0:07:16 | bias |
---|
0:07:18 | and this is the classifier setup |
---|
0:07:20 | i use F zero envelopes |
---|
0:07:21 | um |
---|
0:07:23 | the relative shapes of intensity |
---|
0:07:27 | uh the absolute |
---|
0:07:28 | i tried both the absolute and relative shapes of MFCCs |
---|
0:07:32 | because i want to see how this will affect |
---|
0:07:35 | we can get up to it |
---|
0:07:36 | and the for duration |
---|
0:07:38 | uh i used the full talkspurt duration for training while for testing we cap it at the maximum latency |
---|
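The training and testing treatment of the duration feature can be illustrated with a small helper. The name `duration_feature` and its signature are made up for this sketch; the only idea taken from the talk is that the full talkspurt duration is used in training while, at test time, the value cannot exceed the maximum latency.

```python
def duration_feature(talkspurt_ms, max_latency_ms=None):
    """Duration feature in milliseconds.

    Training: pass no max_latency_ms and the full duration is returned.
    Testing: the duration is capped at the maximum latency, since nothing
    beyond that point has been observed when the decision is made.
    """
    if max_latency_ms is None:
        return talkspurt_ms
    return min(talkspurt_ms, max_latency_ms)


train_value = duration_feature(820)                      # full duration
test_value = duration_feature(820, max_latency_ms=300)   # capped
short_value = duration_feature(180, max_latency_ms=300)  # already finished
```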
0:07:44 | then i added the spectral flux without much motivation |
---|
0:07:48 | and the classifier is uh a support vector machine with an RBF kernel |
---|
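Assuming the kernel mentioned here is the radial basis function (RBF) kernel, which is the usual default for support vector machines, the similarity it computes between two feature vectors can be sketched as below. The `gamma` value is arbitrary; the transcript gives no kernel parameters or toolkit.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])            # identical vectors
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

Identical vectors give a similarity of 1, and the value decays towards 0 as the vectors move apart.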
0:07:53 | this or that was was personal |
---|
0:07:55 | and as you can see |
---|
0:07:57 | uh it seems like F zero envelopes are the weakest feature |
---|
0:08:01 | uh followed by intensity |
---|
0:08:03 | and spectral flux |
---|
0:08:05 | while MFCCs are the strongest ones |
---|
0:08:07 | and it doesn't matter if you omit the zeroth coefficient which means that we are actually only modeling |
---|
0:08:13 | relative trajectories |
---|
0:08:15 | uh |
---|
0:08:17 | uh uh |
---|
0:08:18 | these |
---|
0:08:18 | these features |
---|
0:08:19 | as for duration |
---|
0:08:21 | uh |
---|
0:08:21 | you get nothing |
---|
0:08:22 | at one hundred milliseconds |
---|
0:08:24 | but if you wait longer |
---|
0:08:26 | at five hundred milliseconds it becomes the second most salient feature |
---|
0:08:31 | so i |
---|
0:08:32 | uh |
---|
0:08:33 | sorry |
---|
0:08:33 | yeah so i decided to uh exclude the zeroth coefficient |
---|
0:08:38 | and uh also the F zero envelopes |
---|
0:08:40 | uh because |
---|
0:08:42 | these were the weakest uh in the feature combination |
---|
0:08:45 | uh so we tried two conditions here that is the online |
---|
0:08:51 | and the offline uh the offline uses the provided annotations while the online |
---|
0:08:55 | uses um an energy-based threshold |
---|
0:08:58 | uh voice activity detector |
---|
0:09:01 | because we were a little bit um |
---|
0:09:03 | worried about how sensitive this parametrization was |
---|
0:09:08 | to to this kind of um |
---|
0:09:10 | voice activity detection |
---|
0:09:12 | it turns out that it was not that uh sensitive to this and as you can see |
---|
0:09:16 | the longer you wait the better classification you get |
---|
0:09:19 | rubber |
---|
0:09:20 | it was surprising to get quite decent classification at one hundred milliseconds |
---|
0:09:24 | which it christ |
---|
0:09:25 | uh in i think |
---|
0:09:28 | and it was surprised |
---|
0:09:31 | so |
---|
0:09:31 | uh |
---|
0:09:33 | so to conclude uh duration and MFCCs |
---|
0:09:36 | seem to be the most salient features here |
---|
0:09:40 | and uh if you want to integrate this kind of |
---|
0:09:43 | classifier in |
---|
0:09:44 | um an incremental dialogue framework |
---|
0:09:47 | this baseline framework that just |
---|
0:09:48 | based uh can handle multiple ongoing plans |
---|
0:09:52 | and my suggestion here is to run several classifiers in parallel uh perhaps the first one would prepare |
---|
0:09:58 | decisions at one hundred milliseconds |
---|
0:10:00 | and the next two execute them at three hundred and five hundred milliseconds |
---|
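The suggestion of running several classifiers in parallel at different latencies could look roughly like the sketch below. Everything here is hypothetical: the function name, the stand-in classifiers, and the 400 ms toy rule are invented, and a real system would classify acoustic features rather than duration alone. Each classifier only sees what has been observed up to its own decision time.

```python
def run_parallel(talkspurt_ms, classifiers, min_pause_ms=200):
    """Return a list of (decision_time_ms, label), earliest first.

    classifiers: dict mapping a latency in ms to a function that takes the
    observed talkspurt duration in ms and returns a label.
    """
    end_detected_ms = talkspurt_ms + min_pause_ms
    decisions = []
    for latency_ms in sorted(classifiers):
        t = min(latency_ms, end_detected_ms)
        observed_ms = min(talkspurt_ms, t)
        decisions.append((t, classifiers[latency_ms](observed_ms)))
        if t == end_detected_ms:
            break  # the talkspurt is over, later classifiers add nothing
    return decisions


def toy_rule(observed_ms):
    """Made-up stand-in classifier: short talkspurts count as acknowledgements."""
    return "ack" if observed_ms < 400 else "other"


classifiers = {100: toy_rule, 300: toy_rule, 500: toy_rule}
short_case = run_parallel(250, classifiers)
long_case = run_parallel(900, classifiers)
```

In the long case the early classifiers tentatively say "ack" and the 500 ms classifier revises that to "other", which is the kind of prepare-then-execute behaviour an incremental framework could exploit.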
0:10:05 | so and the actual implementation there was done in one spot |
---|
0:10:11 | um |
---|
0:10:12 | that's all from me i think uh any questions |
---|
0:10:16 | thank you very much |
---|
0:10:21 | know |
---|
0:10:22 | question |
---|
0:10:32 | oh or do you have any |
---|
0:10:34 | particularly the nation for is being able to here |
---|
0:10:38 | but that what you get |
---|
0:10:40 | uh no uh |
---|
0:10:42 | well |
---|
0:10:43 | it |
---|
0:10:44 | um |
---|
0:10:45 | simply the because they |
---|
0:10:46 | you can |
---|
0:10:47 | it seems like the annotators basically relied |
---|
0:10:49 | a lot on the lexical content |
---|
0:10:52 | say |
---|
0:10:53 | i a particular reason |
---|
0:10:55 | but it's not that either because if you omit the zeroth coefficient |
---|
0:11:00 | um |
---|
0:11:02 | a a which holds the information about |
---|
0:11:05 | the actual uh the formants |
---|
0:11:08 | uh |
---|
0:11:08 | if you omit that then you lose this |
---|
0:11:11 | so what we have left is kind of the relative trajectories |
---|
0:11:14 | and that is would be different |
---|
0:11:16 | seems to be more something about voice quality |
---|
0:11:19 | um uh |
---|
0:11:20 | uh i can't explain |
---|
0:11:22 | white what's alone |
---|
0:11:27 | so i have a question |
---|
0:11:33 | okay then |
---|
0:11:33 | thank you again |
---|
0:11:35 | thank you |
---|