a dash

textual

moving

I'll take a minute to tell you about our research and give some background on what we are doing.

so

We have an android robot named ERICA, which was developed by Ishiguro et al.

Part of this project is for ERICA to be able to take on social

roles

and perform tasks as well as a human. She is very realistic looking,

and we demonstrated ERICA at this conference last year, at SIGDIAL.

In this work

I'm going to describe ERICA's dialogue model

as an attentive listener.

So,

attentive listening.

We had

a bit of an example in the keynote this morning.

Attentive listening is where basically ERICA

tries to listen to the user talk

and offers short interjections in the dialogue, so we want to stimulate more

conversation,

but primarily by the user.

In this scenario we don't need her to understand the conversation, so

we are not trying to do any complex natural language processing.

The intended target for this is elderly people, to help

with social isolation,

and talking with a robot may help them maintain their

cognitive

ability as well.

So here is an example.

suppose my son actually a model

okay

sorry so it clearly army

it is not the language

but it helps and that

that the by you saying something down to the other we present

so we wanna kinda protectors and what they're gonna but obviously not heuristic look up

at header

and have them feel like she is actually understanding what the user is saying.

So

this is not a new type of system; there is related work on

attentive listening.

since of up to listen is we animals yours applications like something to say

and elastic as a paper with its phase is a listening source trouble

So the main novelty of our system is that

we use a statement response system, so we actually use the content of what

the user says

and generate a response to it.

We also want to have that

in an open domain, so we don't restrict the domain; the system should

be able to respond to whatever the user says.

The language model we use is quite minimalistic: we don't use any

very tricky models or training methods.

What we want to do is generate simple and coherent responses,

so I'll describe how we do that.

So just to talk about ERICA's environment: we have a Kinect sensor which tracks

the user, so we know who is talking

and where they are located relative to ERICA.

Rather than a handheld microphone, we use a microphone array, because we want

the user to be able to actually talk to ERICA as they would with

a human in conversation.

So the automatic speech recognition is done entirely through the microphone array, so the

user is hands-free and things like that.

In terms of the general architecture of the system,

we have speech processing and we have natural language processing, for things like focus

word detection and dialogue act tagging.

The main thing is the response model, where we have

two things: the statement response system,

which

produces responses for the user,

and the backchanneling system, which produces backchannels.

Included in this is also the turn-taking that I will describe in a later section.

We haven't actually implemented

this; it is just the conceptual idea of what we want to do,

and

if you see the video I will show, you will know why we don't have a complete turn-taking

model.

The backchannel and response models actually run in parallel, so we

can use them together.

So altogether there are three features of the system. The first is backchanneling.

There are two types of backchanneling that we can consider. We have IPU-based

prediction, where, when we receive an IPU (inter-pausal unit) from the

ASR system, we make a decision on whether this is a

good place to generate a backchannel.

The other one is a time-based system, where we continuously recognize whether or

not a backchannel should be generated.

So we trained models, one for each of these types

of backchanneling systems.

For this we use a counseling corpus. In the counseling corpus

we have many examples of attentive listening, where

the user,

or sorry, the counselor, basically just listens to the

other speaker and basically says something like "okay".

This is in Japanese, but the same basic idea applies.

We wanted to predict the backchannel timing and the form, so we consider just

the Japanese backchannel forms

that are the most common.

The features that we use are prosodic features, with statistics

on those,

and we have lexical features as well.

The IPU-based model uses all the prosodic and lexical features within the

IPU,

whereas the time-based model just takes continuous windows from the

past,

and we train these using a simple logistic regression model.
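
As a rough sketch of the kind of model described here, and not the speakers' actual implementation, a logistic regression over a small feature vector could be trained as follows. The two features (a normalized pitch slope and a normalized pause length), the toy data, and the learning-rate settings are all assumptions made for illustration; the talk only says that prosodic and lexical features were used.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    n = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def backchannel_probability(weights, bias, x):
    """Probability that a backchannel is appropriate after this IPU or window."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical toy data: [normalized pitch slope, normalized pause length].
# Label 1 means a backchannel followed this IPU/window, 0 means it did not.
TOY_FEATURES = [[-0.9, 0.8], [-0.7, 0.9], [-0.8, 0.6],
                [0.6, 0.1], [0.8, 0.2], [0.5, 0.0]]
TOY_LABELS = [1, 1, 1, 0, 0, 0]

WEIGHTS, BIAS = train_logistic_regression(TOY_FEATURES, TOY_LABELS)
```

Under these assumptions, the IPU-based variant would call `backchannel_probability` once per IPU, while the time-based variant would call it repeatedly on features computed over a sliding window of past audio.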

So for the subjective experiment we selected different recordings from the counseling corpus. We actually

took

snippets from the counseling corpus

and edited them so that the counselor only uses backchannels,

and the backchannels were actually generated using ERICA's text-to-speech system.

We had three types of models, a fixed-form model,

the IPU-based model and the time-based model, and we compared these with a ground truth condition.

In the ground truth condition we replaced

the counselor's voice with the synthesized voice, so that

there wasn't any effect of the type of human-like voice.

But of course,

when you replace these you actually lose the specific prosodic properties of the backchannel, so

in this case it is not an exact ground truth; it is kind of

a synthesized ground truth.

The timing is right and the form is right, but the actual prosody of the

backchannel is different.

Forty subjects listened to

these recordings under each condition,

and they evaluated each of the

snippets of recordings with the backchannels on Likert scales.

So I'll give an example of the time-based prediction model.

We

apply the model to this particular recording.

Okay, so you can hear that it produces backchannels. I should also mention that

for the time-based model we actually got quite poor results for predicting the form of

the backchannel,

so instead of using the prediction we just use a random backchannel form,

whereas the IPU-based model used prediction, which was better.

The results of the experiment

showed that the time-based model actually performs better than the IPU-based model.

This is quite intuitive: we know that the IPU-based model takes some processing time.

It takes time to produce an IPU

and then to produce a backchannel,

so by that time maybe the timing of

the backchannel is quite late,

which the

people who evaluated the system noticed as well.

So

the conclusion for this was that the correct timing of the backchannel is more

important than the form.

Even though we use random backchannel forms in the time-based model, the

timing is better,

so we use this backchannel system.

Next is the statement response.

So

the statement response is basically

trying to generate a response based on the focus word which we extract from the speech of

the user.

The thing is, we don't want a handcrafted model for ERICA based on keywords.

If we consider open-domain

free-talk conversation,

we

cannot practically make handcrafted responses for all of those

keywords.

So rather than doing that, we extract a keyword from what the user says

and then we find

an appropriate response. We have four types of responses.

Which one we use depends on whether we can find a focus noun,

whether we can find a predicate,

and whether we can find a question word which matches one of these.

So these four are a question on the focus, a partial repeat with a rising tone,

a question on the predicate,

and, in the case that none of these conditions are met, we just

say a formulaic expression.

To extract the focus phrase or the predicate we use a conditional random

field;

this was done in previous work.

From the focus word,

we use an n-gram model to match the best question word. So here are some examples

of the types of response we can get.

For example, for a question on the focus, we identify the focus and we can use

that to match a question word with the focus.

So for example, if the user says they ate curry,

the focus word is "curry". The system can detect that, and

the question word is "what kind of", so the response would be "What kind of

curry?" So she extends the conversation this way.

For the partial repeat with a rising tone, for example, take a statement about going to America.

The focus word extracted is "America", but we don't have an appropriate question word,

so we just say something like "Oh, America?"

For the question on the predicate,

say the user talks about going somewhere.

We have the predicate "go" and we find a question word like

"where",

so ERICA asks "Where did you go?"

And

lastly we have the case where there is no

focus noun or predicate that we can find,

so the system will just say something like "I see" or "okay". So for example, for "That's beautiful",

the system will just say

"okay".
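
The four response types above can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the function name, its arguments and the English templates are hypothetical, the real system works on Japanese, and the order in which the cases are tried is inferred from the order they are presented in the talk. Focus and predicate extraction (done with a conditional random field in the talk) and question-word matching (done with an n-gram model) are assumed to have happened already and are passed in as plain strings.

```python
def select_response(focus=None, predicate=None,
                    focus_question=None, predicate_question=None):
    """Choose one of the four response types described in the talk.

    focus / predicate: strings extracted from the utterance, or None.
    focus_question / predicate_question: a matching question word, or None.
    Returns a (response_type, response_text) pair.
    """
    if focus and focus_question:
        # Question on focus, e.g. "What kind of curry?"
        return ("question_on_focus", f"{focus_question} {focus}?")
    if focus:
        # Partial repeat with rising tone, e.g. "America?"
        return ("partial_repeat", f"{focus}?")
    if predicate and predicate_question:
        # Question on predicate; predicate is assumed to be in base form.
        return ("question_on_predicate",
                f"{predicate_question} did you {predicate}?")
    # Nothing usable was found: fall back to a formulaic expression.
    return ("formulaic", "I see.")
```

For instance, `select_response(focus="curry", focus_question="What kind of")` gives a question on the focus, while calling it with no arguments falls through to the formulaic expression.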

To test this, we used data from a previous experiment with ERICA.

In the previous experiment there was another statement response system,

which would only give direct responses and otherwise fall back to

formulaic expressions.

What we did was take this data and apply our statement response system to the unanswerable

statements,

and we checked to see what responses could

be generated.

We found that nearly fifty percent of all these previously unanswerable statements

could be responded to with our system, so we relieved these statements from

formulaic expressions.

And

these were judged by annotators to be

coherent responses, or rather,

responses that would be stated

in a

normal conversation.

Okay, so lastly I'll talk about turn-taking. This is going to be quite

brief because we haven't

actually implemented this, so it is in progress.

The concept of our turn-taking system is that,

rather than trying to predict

take the turn or not take the turn, we use a graded decision

rather than a binary one.

So we set probabilistic

thresholds for several actions,

and we choose ERICA's response based on those.

So

if you see this very simple diagram, it

goes from not taking the turn,

to generating a backchannel, which indicates not taking the turn.

Then we can generate a filler, which indicates that we might

take the turn,

and lastly we actually take the turn.

So backchannels

indicate not taking the turn, and fillers indicate an intention to take the turn.

The benefit of this is that these are not fully committed actions, so we don't actually

take the turn at that time.

We say something

in preparation for taking the turn,

but the user can override this. So for example, ERICA gives a

filler;

maybe the user doesn't want to finish talking, so they continue talking, and it doesn't

stop the flow of conversation

as it would when ERICA gives a full response.

So we had this concept, but we want to know actually

how do we find the thresholds. This is just to illustrate

what we want to do.

We trained a turn-taking model based on logistic regression.

We used prosodic and lexical features, and we analyzed the likelihood scores, and from

the frequency distributions

we can find example thresholds T1 and T2.

We found that maybe setting threshold one at

0.45 means that below it ERICA is

completely silent, so

the user just keeps the turn,

whereas at a threshold of 0.95 we say okay, we will definitely

take the turn.

But in the middle, because we are not quite sure, we generate either a

filler or a backchannel to try and

make the user either yield the turn or signal, okay, no, you can

continue.
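
A minimal sketch of this thresholding idea follows, assuming the turn-taking model outputs a likelihood score between 0 and 1. The two thresholds are the example values quoted in the talk. How the middle region is divided between a backchannel and a filler is not stated, so the midpoint split below is purely an assumption, as are the function and action names.

```python
T1 = 0.45  # example value from the talk: below this, stay silent
T2 = 0.95  # example value from the talk: at or above this, take the turn


def turn_taking_action(score, t1=T1, t2=T2):
    """Map a turn-taking likelihood score to one of four actions."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    if score < t1:
        return "silence"      # the user keeps the turn
    if score >= t2:
        return "take_turn"    # confident that the user has finished
    # Uncertain region. Assumption: lower scores lean towards a backchannel
    # (not taking the turn), higher scores towards a filler (might take it).
    midpoint = (t1 + t2) / 2
    return "backchannel" if score < midpoint else "filler"
```

Because a backchannel or a filler is not a full commitment, the user can simply keep talking after either one, which is the benefit described above.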

So this is something that we wish to implement, and this is the basic idea.

The basic algorithm of attentive listening is very simple.

Basically, while the user is speaking, ERICA continuously does backchannels

using the backchannel system;

we get the appropriate timing for them.

When we get a result from the speech recognition system, we do

dialogue act tagging on the result.

If the speech is a question, we can handle that,

because we can manage this by keyword matching against a database or with other natural

language processing.

If it is not a question, we know that it is a statement,

and then we can use the statement response model to store a response.

The thing is that, because the user is still talking and we are continually receiving

ASR results,

we can

overwrite our previous response, so ERICA only responds to the last part of the speech.

Then, when we notice that the user's speech has ended, ERICA says the response.
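
The store-and-overwrite behaviour in this algorithm can be sketched as below. This is a simplified illustration, not the actual system: the class name, the method names and the three callbacks are hypothetical stand-ins for the dialogue act tagger, the question handler and the statement response model, and the backchannel system (which runs in parallel) is left out.

```python
class AttentiveListener:
    """Keeps only the most recent candidate response until the turn ends."""

    def __init__(self, is_question, answer_question, statement_response):
        # All three callbacks are assumptions standing in for real components.
        self._is_question = is_question
        self._answer_question = answer_question
        self._statement_response = statement_response
        self.pending = None

    def on_asr_result(self, text):
        """Called for every ASR result; overwrites any earlier candidate."""
        if self._is_question(text):
            self.pending = self._answer_question(text)
        else:
            self.pending = self._statement_response(text)

    def on_turn_end(self):
        """Called when the user's speech has ended; returns what to say."""
        response, self.pending = self.pending, None
        return response
```

With this structure, only the response to the last part of the user's speech is ever spoken, which matches the description above.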

So I'll play an example of

the system in action,

and you will see that the latency is somewhat of an issue.

but the response that in the region here

i don't know should be run on this

so that the question

similarly

because one of which is not scandalous

g

so the buttons

for so

right

so much

so he shall

extract

the focus point is that there was a

are you know

so the skies the focus of it is

right

which from implies

so make a she couldn't one the focus would and could because of then is

it your for it just wasn't and the model

So you can see that the latency is a problem: there are like two or three seconds

between responses, which is not actually that good.

For an attentive listening system you want people to keep talking

and feel that the robot is actually listening.

But we can see that the response generation system gives reasonably good responses, so

we hope that users will keep continuing the conversation like this.

So that is the attentive listening system.

We conducted a pilot study.

We only used three subjects, just as part of our iterations to see how it would work.

One big problem we have is that

when we tell users to interact with ERICA, they really

struggle to sustain the interaction.

Usually they are able to start, and after a couple of easy questions

they

kind of don't know what to say.

So in this case

we had to actually explicitly tell them what to say.

So first we got them to read from scripts taken from an existing corpus, so

they would say things that were taken from a previous

Wizard-of-Oz experiment.

After that we instructed them to tell ERICA a story and

keep talking as long as possible, or as long as they wanted to, in

a free-talk scenario.

Once they had used the script, they kind of understood what to do

in a more difficult scenario like that.

Then a group of judges listened to the audio of the interaction and

evaluated each of ERICA's backchannels and utterances according to the timing and the

coherence.

In

the results we found that the backchannel timing is quite appropriate; actually we find

it

quite useful.

But as you can see from the video, and as was noted by participants, the

statement responses came too late,

so this is something we need to work on.

On the other hand, in terms of all the responses that we generated

from the statement response system,

more than half of them were coherent, so we think this is quite

reasonable. We

find that

these responses

keep the conversation going and are reasonable.

And here are just some examples from the free-talk interaction. In this interaction we

didn't tell people what to say; they just talked with ERICA.

We often have these unrelated topics,

as you can see in the first one.

The user talks about aliens.

I don't know why aliens, but it was about

aliens,

and ERICA actually

noticed that the focus word was "aliens" and asked, "Aliens from

where?"

This was quite surprising for the user, that she was actually

listening to them talk.

So obviously we don't create

responses that are about aliens, because this doesn't come up much, but

it shows that we can

use the same response system in a wide variety of contexts.

Another one that is maybe less useful:

the human says they want a watch, and ERICA asks, "What kind of watch?"

"I see."

Maybe it is quite strange.

And then another user states that it is very rainy,

and the robot asks, "Where is the rain?" So this

is a bit strange in the context of a conversation, but it

was interesting to the user.

So just conclusions and future work.

From the demonstration, we find users are most impressed when ERICA replies with a

coherent question,

and we can extend the conversation that way.

Even incoherent or strange statements can be interesting or funny. So

even though at times she says things like "Where is the rain?",

which doesn't really make sense,

users can find it quite interesting that she can come up with this

kind of thing.

So maybe it doesn't have to be very

grammatically correct every time.

We found the backchannel system works quite well, and the randomness of the backchannels is

not so severe.

The backchannels are really useful, but

the latency is the biggest problem at the moment with our system,

so future work will be to reduce this latency.

And,

just

following on from the keynote today, we know that emotional dialogue, and responding to that,

is very important as well.

So we have to increase the range of responses ERICA can generate and do some

emotion recognition,

so that when the user talks about how they feel, she can

actually generate a good response.

So thank you. Any questions?

Thank you.

And now we have some time for questions.

Thank you for the talk. About one slide:

you claim the backchannels are generally well timed and the randomness is not a severe issue. That sort of

suggests to me that

if you built a system that just randomly, at some interval, generated

backchannels, especially in Japanese,

it would work just as well.

What I would like to see is a comparison of two systems,

this one that you have and one where it is just randomly timed backchannels, to see if

people

can tell the difference between them.

Okay, so actually what I didn't mention is that in

the subjective experiment

we actually had another system which generated random

backchannels, and there was a difference between them: the random one did not work

as well.

Great, thank you.

I was actually wondering: the dialogue seems to be more encouraging of short

utterances, and I think this kind of feedback-giving behaviour is

more likely to occur when the user actually tells a long story or

something. Did you try to encourage people to behave in

that way, or was it more that

it

really depends on the user's side?

Yeah, I don't know. We do this because Japanese people are quite reluctant to

tell stories.

Typically people come in, stand

in front of ERICA,

say one sentence,

and then wait for the response.

But the people in the examples that I gave

are people who actually,

well, we just said okay,

you talk with ERICA however you want, and they actually did that. They told a

long story, talked about whatever, and these were the people who seemed the most

impressed.

If you talk in a kind of stream of consciousness, then you are going to

get answers which are maybe

question responses.

But it is a very difficult thing to elicit,

which is tricky.

I guess the robot could also start by telling a few sentences herself,

and that would serve as a kind of example for the user.

Yes, we are hoping to do that. We are trying to

think of ideas on how to get the robot to elicit this from

the users. For example,

she could actually say "Tell me a story about yourself" or something, rather than doing nothing.

well i think we of

i

Thanks for an interesting talk.

they got the implement so she doesn't all use of two

make but sentence does smoothing is

Well, yes.

Are you asking about,

what do you mean, the nonverbal behavior?

So we haven't actually,

I think she does some nodding at random when she does a backchannel,

but we are looking at ways, for example,

to decide when we should do only nodding, when a backchannel with nodding and the verbal utterance,

or just the verbal utterance. So we need to look at the research on

those distributions and work out what is best

for the user. At the moment only verbal backchannels are available,

but in the future we will probably try to add other modalities, like nodding with the backchannel.

Thank you. Please thank our speaker once again.