0:00:14 First, let me tell you about our research and the background of what we are doing.
0:00:24 We have an android named ERICA, which was developed by Hiroshi Ishiguro. Part of this project is for ERICA to be able to fulfil simple social roles and perform tasks as well as a human, so she has a very realistic appearance and behaviour. We demonstrated ERICA at conferences last year.
0:00:46 In this work I am going to describe ERICA's dialogue model and her role as an attentive listener.
0:00:53 So, attentive listening: we had a bit of an example of this in the keynote this morning. Attentive listening is where ERICA basically tries to listen to the user talk and shows her interest through short interjections in the dialogue, so the speaking is done primarily by the user.
0:01:16 In this scenario we don't need ERICA to deeply understand the conversation, so we don't try to do any complex natural language processing. The intended users of this system are elderly people, who often suffer from social isolation, and in the background an interactive robot like this may help with cognitive decline as well.
0:01:40 So here is an example. [Plays a video clip of a conversation in Japanese.]
0:01:52 Sorry, it is clearly not in English, it is in Japanese, but what it shows is that ERICA says something in response to what the user presents. We want to predict appropriate responses, but obviously not with a simple heuristic lookup; we want the user to feel like ERICA is actually understanding what they are saying.
0:02:21 There are a couple of types of listening system. One type is where the robot simply continues listening and only acknowledges that the user has something to say, and the user keeps talking while the robot produces listening signals.
0:02:38 The main novelty of our system is that we use a statement response system: we actually use the content of what the user says and generate an appropriate response.
0:02:49 We also want to do that in an open domain, so we don't restrict the domain; the system should be able to respond to whatever the user says. The language model we use is quite minimalistic: we don't use any very complex models or training methods. What we want is to generate simple and coherent responses, and I will describe how we do that.
0:03:15 Just to talk about ERICA's environment: we have a Kinect sensor which tracks the user, so we know who is talking and where they are located in the room. Rather than a hand-held microphone, we use a microphone array, because we want the user to be able to talk to ERICA as they would to a human in a conversation. The automatic speech recognition is done entirely through the microphone array, so the user's hands are free.
0:03:46 This is the general architecture of the system. We have speech processing, natural language processing (one focus of which is focus-word extraction), and turn-taking. The main component is the response model. We have two subsystems: the statement response system, which produces responses to the user's utterances, and the backchanneling system, which produces backchannels.
0:04:07 Included in this is also the turn-taking model that I will describe in a later section. We haven't actually implemented it yet; it is just the conceptual idea of what we want to do, and if you watch the video I will show later, you will see why we haven't completed the turn-taking model. The backchannel and response models actually run in parallel, so we can use both of them.
0:04:34 Altogether there are three features of the system, and the first is backchanneling. There are two types of backchannel prediction that we consider. The first is IPU-based prediction: when we receive an IPU (inter-pausal unit) from the ASR system, we make a decision on whether this is a good place to insert a backchannel. The other is a time-based system, where we continuously predict whether a backchannel should be produced, without requiring an IPU.
0:05:03 We trained models for both of these types of backchanneling system. For this we used a counselling corpus. In a counselling corpus we have many examples of attentive listening, where the counsellor basically just listens to the other speaker, who talks at length, and says things like "okay" and "uh-huh".
0:05:28 The corpus is in Japanese, but the same basic idea applies. We want to predict both the backchannel timing and the form, and we consider just the most common Japanese backchannel forms, "un", "un un", and "un un un".
0:05:46 The features we use are prosodic features, plus statistics computed over them, and lexical features represented by word embeddings. The IPU-based model uses all the prosodic and lexical features within the IPU, whereas the time-based model takes continuous windows over the past speech. We train both using a simple logistic regression model.
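The time-based decision described above can be sketched as a logistic regression score over one feature window. This is only an illustration: the weights are hand-set stand-ins for a model trained on the counselling corpus, and the feature names are assumptions, not the actual features used.

```python
import math

# Hand-set illustrative weights standing in for a trained logistic regression;
# the real model is trained on prosodic and lexical features from the corpus.
WEIGHTS = {"pitch_mean": -0.8, "pitch_slope": -1.2, "energy": -0.6, "pause_len": 2.0}
BIAS = -1.0

def backchannel_probability(features):
    """Logistic-regression score for one feature window of the user's speech."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def should_backchannel(features, threshold=0.5):
    """Fire a backchannel when the predicted probability crosses the threshold."""
    return backchannel_probability(features) >= threshold
```

With these toy weights, a window with a long pause and flat, quiet speech scores high (a backchannel fires), while loud rising speech with no pause scores low.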
0:06:18 For the subjective experiment we selected ten different recordings from the counselling corpus. We took snippets from the corpus, removed the original backchannels, and generated new backchannels with our systems, synthesized using text-to-speech.
0:06:36 So we had the two trained models, the IPU-based model and the time-based model, and we compared these with a random condition and a ground-truth condition. The ground-truth condition replaces the counsellor's voice with the synthesized voice, so that there weren't any effects from the natural, human-like voice. Of course, when you replace the voice you lose the specific prosodic properties of the original backchannel, so in this case it is not an exact ground truth but a kind of synthesized ground truth: the timing is right and the form is right, but the actual prosody is different.
0:07:17 Forty subjects listened to the snippets of recordings with the backchannels and evaluated each of them on Likert scales.
0:07:33 I will give an example of the time-based prediction model, applied to this particular recording. [Plays an audio clip in Japanese with synthesized backchannels.]
0:08:08 Okay, so you can hear that ERICA produces backchannels. I should also mention that for the time-based model we got quite poor results for predicting the form of the backchannel, so instead of using the prediction we just chose a random form, weighted by frequency, whereas the IPU-based model used its actual form prediction, which was better.
0:08:31 The results of the experiment showed that the time-based model actually performs better than the IPU-based model. This is quite intuitive: we know that the IPU-based model needs some processing time to produce an IPU and then predict a backchannel, so by that time the timing of the backchannel is quite late, and the people who evaluated the samples sensed this as well.
0:09:01 The conclusion from this was that the correct timing of the backchannel matters more than its form: even though the time-based model used random backchannel forms, its timing was better, so that is the backchannel system we use for ERICA.
0:09:18 Next is the statement response system. The statement response system basically tries to generate a response based on the focus word that we extract from the user's speech.
0:09:29 The thing is, we don't want a handcrafted model for ERICA based on some keywords. Since we consider an open-domain, free-talk conversation, we cannot get away with practically handcrafting rules for all of those keywords. So instead of doing that, we extract a keyword from what the user says and then find an appropriate response.
0:09:53 We have four types of responses, depending on whether we can find a focus noun and whether we can find a predicate. The four are a question on the focus, a partial repeat with a rising tone, a question on the predicate, and, in the case that we find none of these, a formulaic expression.
0:10:20 We extract the focus phrase and the predicate using a conditional random field; this was done in previous work. From the focus word, we use a question-word model to match an appropriate question word. Let me give some examples of the types of response we can get.
0:10:36 For a question on the focus, we identify the focus and use it to generate a question which includes the focus. For example, if the user says "I ate curry", the focus word is "curry", and the system can take that and generate the question "What kind of curry?", so she extends the conversation this way.
0:11:03 For a partial repeat with rising tone, take for example "I went to America". The focus word extracted is "America", but we cannot match it with a question word, so we just say something like "Oh, America?".
0:11:14 For a question on the predicate, take "I ate a lot". Here we have a predicate, "ate", but no focus word, so ERICA asks "What did you eat?".
0:11:27 Lastly, if there is no focus word or predicate that we can find, the system will just say something like "Oh, okay". For example, if the user says "that's beautiful", the system will just say "okay".
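The four-way choice just described can be sketched as a small selection rule. This is a toy illustration: the focus word and predicate would come from the CRF extractor, and the question-word table is a stand-in for the real matching model, so all entries here are assumptions.

```python
# Illustrative question-word table; the real system uses a trained
# question-word matching model, not a hand-written dictionary.
QUESTION_WORDS = {"curry": "what kind of"}

def choose_response(focus=None, predicate=None):
    """Pick one of the four response types from the extracted focus/predicate."""
    if focus and focus in QUESTION_WORDS:
        return f"{QUESTION_WORDS[focus]} {focus}?"   # question on the focus
    if focus:
        return f"{focus}?"                           # partial repeat, rising tone
    if predicate:
        return f"what did you {predicate}?"          # question on the predicate
    return "okay"                                    # formulaic expression
```

For example, `choose_response(focus="curry")` yields a question on the focus, while an unmatched focus word falls back to the partial repeat.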
0:11:44 To test this we used data from a previous experiment with ERICA. In that previous experiment there was no statement response system, so ERICA could only give direct responses to questions, and fell back on polite formulaic expressions for everything else.
0:12:01 What we did was take this data, apply the previously unanswerable statements to the statement response system, and check whether coherent responses could be generated.
0:12:15 We found that nearly fifty percent of these previously unanswerable statements, which had fallen back to formulaic expressions, could be responded to by our system. By this we mean that the generated responses were judged by annotators to be coherent, responses that could plausibly be stated in a real conversation.
0:12:40 Okay, so lastly I will talk about turn-taking. This is going to be quite brief, because we haven't actually implemented it; it is work in progress.
0:12:51 The concept of the turn-taking system is that, rather than making a binary decision to take the turn or not take the turn, we use a graded decision, because we have probabilistic thresholds for the different actions, and we can scale ERICA's response accordingly.
0:13:14 If you look at this very simple diagram, the scale goes from not taking the turn, where we generate a backchannel, which indicates not taking the turn; then we can generate a filler, which indicates that ERICA might take the turn; and lastly she would actually take the turn and produce an utterance. So backchannels signal no turn-taking, while fillers signal possible turn-taking.
0:13:42 The benefit of this is that backchannels and fillers are not fully committed actions: we don't actually take the turn at that time, we just say something in preparation for taking it, and the user can still keep the turn. For example, if ERICA produces a filler but the user doesn't want to finish talking, they can simply continue, and it doesn't break the conversation, whereas a full response from ERICA would have interrupted them.
0:14:05 So we have this concept, but we want to know how to actually set the thresholds; what follows is just to illustrate the rationale of what we want to do.
0:14:15 We trained a turn-taking model based on logistic regression, using prosodic and lexical features, and we analysed the likelihood scores and the frequency of decisions. From these we can find example thresholds, t1 and t2.
0:14:31 We found that if the probability of the user keeping the turn is below a threshold t1 of 0.45, ERICA should just take the turn, whereas above a threshold t2 of 0.95 we say okay, ERICA should not take the turn. In the middle, because we are not quite sure, we generate either a filler or a backchannel, to try to make the user yield the turn or to signal "okay, you can continue".
0:15:04 This is the behaviour we wish to achieve; that is the basic idea.
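The graded decision with the two thresholds quoted above (0.45 and 0.95) can be sketched as follows. The probability input would come from the logistic regression turn-taking model; the action names are illustrative, not the system's actual labels.

```python
# Thresholds taken from the analysis described in the talk.
T1, T2 = 0.45, 0.95

def turn_action(p_user_keeps_turn):
    """Map the model's turn-holding probability to a graded action."""
    if p_user_keeps_turn < T1:
        return "take_turn"             # confident the user has finished: respond
    if p_user_keeps_turn > T2:
        return "wait"                  # confident the user continues: do not take the turn
    return "filler_or_backchannel"     # uncertain: non-committal signal the user can override
```

The middle band is the key design choice: a filler or backchannel there commits to nothing, so the user can keep the turn without the conversation breaking.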
0:15:09 The basic algorithm of the whole system is very simple. While the user is speaking continuously, we produce backchannels using the backchannel system, which gives us appropriate timing.
0:15:23 When we get a result from the speech recognition system, we do dialogue act tagging on the result. If the speech is a question, we handle it separately, because we can answer it by matching against a database; this uses simple natural language processing. If it is not a question, we know it is a statement, and then we use the statement response model to generate the response, based on the four types of responses.
0:15:52 The thing is, because the user keeps talking and we keep receiving new ASR results, we can overwrite our previous candidate response, so we actually respond only to the last part of the speech. Then, when we detect that the speech is finished, ERICA says the response.
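The overwrite-until-finished behaviour just described can be condensed into a small control-flow sketch. The three callbacks are hypothetical stand-ins for the real components (dialogue act tagger, database matcher, statement response model); only the control flow mirrors the description.

```python
def run_turn(ipus, is_question, answer, respond):
    """Process one user turn, keeping only the response to the latest ASR result."""
    pending = None
    for text in ipus:                  # incremental ASR results as the user talks
        if is_question(text):
            pending = answer(text)     # questions are matched against a database
        else:
            pending = respond(text)    # statements go to the response model
        # each new result overwrites the previous candidate response,
        # so ERICA replies only to the last part of the speech
    return pending                     # spoken once the speech is finished
```

With two statements in one turn, only the response to the second is kept, matching the "respond to the last part" behaviour.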
0:16:13 I will show an example of the system in action. You will see that latency is a bit of an issue, but the responses are reasonable.
0:16:34 [Plays a video of a conversation in Japanese: ERICA answers the user's question, then extracts the focus word from a statement and responds with a matching question.]
0:17:35 So here ERICA extracted the focus word and generated a question from it; in one case she could not match the focus word with a question, because it just wasn't in the model.
0:17:43 You can see that latency is our problem: there are around three seconds between responses, which is actually not that good. For this kind of listening system you want people to keep talking and to feel that the robot is actually listening. But we can see that the response generation system gives reasonably good responses, so we hope that users will keep continuing the conversation like this.
0:18:06 So that is the attentive listening system. We conducted a pilot study with only three subjects, as part of a process of iterating on the system.
0:18:18 One big problem we have is that when we tell users to interact with ERICA freely, they really do produce quite unnatural interactions: usually they are too nervous to start, and after a couple of easy questions they kind of don't know what to say. So in this case we had to explicitly tell them what to say.
0:18:41 First we got them to read from scripts taken from an existing corpus, so they would say things that were taken from a previous Wizard-of-Oz experiment. Then we instructed them to tell ERICA a story and keep talking as long as possible, or as long as they wanted, in a free-talk scenario. Once they had used the script, they kind of understood what ERICA could do, so it was no longer such a difficult scenario for them.
0:19:11 Then a separate group of judges listened to the audio of the interactions and evaluated each of ERICA's backchannels and utterances according to their timing and coherence.
0:19:23 In the results we found that the backchannel timing was quite appropriate; actually, participants found the backchannels quite useful. But as you can see from the video, and as was noted by participants, the statement responses are sometimes late, so this is something we need to work on.
0:19:41 On the other hand, in terms of the responses generated by the statement response system, more than half of them were rated as coherent. We think this is quite reasonable, since we find that coherent responses keep the conversation going at a reasonable level.
0:20:04 Here are some examples from the free-talk condition. In these instructions we didn't tell people what to say; they just talked about whatever they wanted, and we often got these unanticipated topics, as you can see in the first one.
0:20:18 The user talked about aliens. I don't know why aliens, but the story was about aliens, and ERICA actually noticed that the focus word was "aliens" and asked "Aliens from where?". This was quite surprising for the user; it appeared that ERICA was really listening to their talk. Obviously we didn't create any specific responses about aliens, since this topic doesn't come up much, but it shows that we can apply the statement response system in a wide variety of contexts.
0:20:55 Another example is maybe less successful. The user asked ERICA a question and she just said "I see", which is maybe a bit strange; then the user said it was very rainy, and the robot asked "Where is the rain?". This was a bit strange in the context of the conversation, but it was interesting and amusing to the user.
0:21:20 Now just the conclusions and future work. From the demonstrations, we find that users are most impressed when ERICA replies with a coherent question, and we can extend the conversation that way.
0:21:33 Even incoherent or strange statements can be interesting or funny. A response like "Where is the rain?" doesn't really make sense, but users can find it quite interesting that ERICA comes up with this kind of thing, so maybe the responses don't have to be perfectly grammatically correct.
0:21:56 The backchannel predictions work quite well, and the randomness of the backchannel forms is not so severe; people really found the backchannels useful. But the latency is the biggest problem with our system at the moment, so our future work will be to reduce this latency.
0:22:17 Also, following on from the keynote today, we know that emotional dialogue, and responding to emotion, is very important as well, so we want to increase the range of responses ERICA can generate and do some emotion recognition, so that when the user talks about how they feel, ERICA can actually generate a good response to that.
0:22:37 So, thank you. Any questions?
0:22:45 Thank you. Now we have some time for questions.
0:22:56 Thank you for the talk. Going back to one slide: you claim the backchannels are generally well timed but the form is random. It seems to me that if you built a system that just randomly, at some interval, created backchannels, especially in Japanese, it would work just as well. What I would like to see is a comparison of two systems, the one that you have and one with just randomly timed backchannels, to see whether people notice a difference.
0:23:29 Actually, what I didn't mention is that in the subjective experiment we did have another condition which generated randomly timed backchannels, and the difference between it and the trained models was clear: the trained models performed much better.
0:23:48 Great, thank you.
0:23:58 I was wondering: the dialogue model seems to encourage rather short utterances, and I think this kind of feedback-giving behaviour is more likely to occur when the user really tells a long story. Did you try to encourage people to behave in that way, or was it left up to them?
0:24:28 It really depends on the user. Part of why we did this is that Japanese people are quite reluctant to launch into telling stories. If we just say to people, come and speak with ERICA, they stand in front of ERICA, say one sentence, wait for the response, and it goes back and forth like that in short turns.
0:24:50 But the people in the examples that I gave are people who, when we just said "okay, talk with ERICA however you want", actually did tell a long story, talked about their day and so on, and those were the people who were most impressed. If you talk in a kind of stream of consciousness, you get responses that work quite well, whereas short question-and-response exchanges are much trickier.
0:25:26 I guess the robot could also start to produce longer sentences itself, and that would serve as a kind of model for the user.
0:25:36 Yes, we are trying to think of ideas for how to get the system to elicit longer speech from the user. ERICA could actually say "tell me a story about yourself" or something, rather than leaving it completely open. That is something we will think about.
0:26:01 Thanks for an interesting talk. Did you implement any nonverbal behaviour, for example nodding, to make the listening smoother?
0:26:14 Yes, a good question: you mean nonverbal behaviour. We haven't studied it properly yet. I think she currently does some nodding at random, timed with the backchannels, but we are looking at the options, for example whether to do only nodding, or a backchannel together with nodding, or just the verbal utterance. We need to research these behaviour distributions and how they affect the user. At the moment only backchannels are available, but in the future we will probably try other modalities like that.
0:26:55 Thank you. Please thank our speaker once again.