Okay, let me take a minute to tell you about our research and give some background on what we are doing.
We have an android named Erica, which was developed by Hiroshi Ishiguro's lab. Part of this project is for Erica to be able to perform simple social roles and carry out tasks as well as a human, so she has a very realistic appearance. We demonstrated Erica at a conference last year, and in this work I'm going to describe our dialogue model for Erica as an attentive listener.
There was actually an example of attentive listening in the keynote this morning. Attentive listening is where Erica tries to listen to the user talking and shows engagement through interjections in the dialogue, so we want the conversation to be driven primarily by the user. In this scenario Erica doesn't need to fully understand the conversation, so we don't attempt any complex natural language processing. The intended users are mainly elderly people, to help combat social isolation, and this kind of listening interaction is thought to have cognitive benefits as well.
So here is an example: suppose the user says something and the robot gives a short reaction. Clearly the reaction carries little linguistic content, but it helps, in that the robot saying something back shows that it is present in the conversation. We want to produce reactions that are obviously not just a heuristic lookup; we want it to feel like Erica is actually understanding what the user is saying.
This is not a monologue-type system where Erica simply continues listening; compared to pure attentive listening, we assume the user sometimes wants the system to have something to say. The main novelty of our system is that we use a statement response system: we actually use the content of what the user says to generate an appropriate response.
We also want the system to be open-domain, so we don't restrict the domain: the system should be able to respond to whatever the user says. The language model we use is quite minimalistic; we don't use any very complex models or training methods. What we want is to generate simple, coherent responses, and I'll describe how we do that.
Just to talk about Erica's environment: we have a Kinect sensor that tracks the user, so we know who is talking and we can direct Erica's gaze appropriately. Rather than a hand-held microphone we use a microphone array, because we want the user to be able to talk to Erica as they would to a human in a conversation. Automatic speech recognition is done entirely through the microphone array, so the user is free to use their hands and so on.
This is the overall architecture of the system. We have speech processing, natural language processing — which is one of the focuses of this talk — and turn-taking. The main component is the response model: we have the statement response system, which produces responses to the user, and the backchanneling system, which produces backchannels. There is also the turn-taking model that I'll describe in a later section, although we haven't actually implemented it yet — it's just the conceptual idea of what we want to do, and if you see the video later you'll understand why we haven't completed the turn-taking model. The backchannel and response models actually run in parallel, so we can use both of them.
Altogether there are three features of the system, and the first is backchanneling. There are two types of backchannel prediction that we consider. One is IPU-based prediction: when we receive an IPU (inter-pausal unit) from the ASR system, we make a decision on whether this is a good place to insert a backchannel. The other is a time-based system, where we continuously predict whether a backchannel is required right now. We trained models for both of these types of backchanneling systems.
For this we use a counseling corpus. In the counseling corpus we have many examples of attentive listening, where the counselor basically just listens passively to the other speaker and says something like "okay". These are in Japanese, but the same basic idea applies. We predict both the backchannel timing and the form; for the form we consider just the most common Japanese backchannel forms.
The features we use are prosodic features, and statistics computed over those, plus lexical features represented by word vectors. The IPU-based model uses all of the prosodic and lexical features within the IPU, whereas the time-based model takes continuous windows over the preceding speech — all possible past time windows — and we train both using a simple logistic regression model.
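As an illustration of the time-based decision, here is a minimal sketch of logistic regression over windowed prosodic features. The feature names, weights, and threshold are invented for the example — the real model learns its weights from the counseling corpus and uses a much richer feature set.

```python
import math

# Hand-set weights for illustration only; the real model learns these
# from the counseling corpus.
WEIGHTS = {"pitch_slope": -1.8, "energy_mean": 0.9, "pause_len": 2.2}
BIAS = -1.0

def window_features(frames, pause_len):
    """Summarize one time window of (pitch, energy) frames."""
    pitches = [p for p, _ in frames]
    energies = [e for _, e in frames]
    slope = (pitches[-1] - pitches[0]) / max(len(frames) - 1, 1)
    return {"pitch_slope": slope,
            "energy_mean": sum(energies) / len(energies),
            "pause_len": pause_len}

def backchannel_prob(feats):
    z = BIAS + sum(WEIGHTS[k] * v for k, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic function

def should_backchannel(frames, pause_len, threshold=0.5):
    return backchannel_prob(window_features(frames, pause_len)) >= threshold

# Falling pitch followed by a pause often invites a backchannel.
falling = [(220.0, 0.6), (210.0, 0.5), (190.0, 0.4), (170.0, 0.3)]
rising = [(170.0, 0.6), (180.0, 0.6), (200.0, 0.6), (220.0, 0.6)]
print(should_backchannel(falling, pause_len=0.4))  # → True
print(should_backchannel(rising, pause_len=0.0))   # → False
```

In the real system this decision runs continuously over sliding windows; the IPU-based variant would instead compute features once per completed IPU.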
For the subjective experiment we selected ten different recordings from the counseling corpus — we took snippets from the corpus and removed the original backchannels, and then generated backchannels using our two systems. So we had the model conditions — the IPU-based model and the time-based model — and we compared these with a ground-truth condition, in which we replaced the counselor's voice with the synthesized voice, so that there wasn't any effect from the natural human-like voice.
Of course, when you replace the voice like this you lose the specific prosodic properties of the original backchannels, so in this case it is not an exact ground truth — it's a kind of synthesized ground truth. The timing is right and the form is right, but the actual prosody is different.
Forty subjects listened to the snippets, each presented in a random condition order, and they evaluated each of the snippets of recordings with the backchannels on Likert scales.
I'll give an example of the time-based prediction model; here we apply the model to this particular recording. [audio example plays]
Okay, so you can hear there that Erica produces backchannels. I should also mention that for the time-based model we got quite poor results for predicting the form of the backchannel, so instead of using its form prediction we just pick a random backchannel form, whereas the IPU-based model uses its form prediction, which performed better.
Now, the results of the experiment. We found that the time-based model actually performs better than the IPU-based model. This is quite intuitive: we know that the IPU-based model has to wait through some processing time for the ASR to produce an IPU before it can produce a backchannel, so by then the timing of the backchannel can be quite delayed. The people who evaluated the system samples noticed this as well.
The conclusion from this was that the correct timing of the backchannel matters more than its form: even though we use random forms of backchannels in the time-based model, its timing is better, and that is the thing that matters. So we use this backchannel system in what follows.
Next is the statement response system. The statement response system basically tries to generate a response based on a focus word that we extract from the speech of the user. The thing is, we don't want a handcrafted model for Erica keyed on specific keywords: since we are considering open-domain free-talk conversation, we can't get away with practically handcrafting rules for all of those keywords. So instead of doing that, we extract a keyword from what the user says and then find an appropriate response. We have four types of responses.
Depending on whether we can find a focus noun, a predicate, and a usable question word, we choose among them. The four types are: a question on the focus, a partial repeat with a rising tone, a question on the predicate, and, in the case where none of these conditions are met, a simple formulaic expression. To extract the focus phrase and the predicate we use a conditional random field — this was done in previous work. From the focus word, we then use a model to match an appropriate question word. Let me give some examples of the types of responses we can get.
For the question on the focus, we identify the focus and use it to generate a question which includes the focus. Say the user says something like "I ate curry" — the focus word is "curry", and the system can take that and generate the question "What kind of curry?", so it extends the conversation this way. For the partial repeat with rising tone, take for example "I went to America": the focus word extracted is "America", but we can't produce a good question with it, so we just say something like "Oh, America?".
For the question on the predicate: if we can find a predicate such as "went", Erica can ask something like "Where did you go?". And lastly, if there is no focus noun or predicate that we can find, the system will just say "I see" or "okay" — for example, if the user says "that's beautiful" while pointing at something, the system will simply say "okay".
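The four response types just described form a simple fallback chain, which can be sketched as follows. This is a toy illustration: the real system extracts the focus and predicate with a CRF, and the question-word table here is a hypothetical stand-in for the actual matching model.

```python
# Hypothetical question-word lookup standing in for the matching model.
QUESTION_WORDS = {"curry": "what kind of"}

def choose_response(focus=None, predicate=None):
    if focus and focus in QUESTION_WORDS:
        return f"{QUESTION_WORDS[focus]} {focus}?"  # 1. question on the focus
    if focus:
        return f"oh, {focus}?"                      # 2. partial repeat, rising tone
    if predicate:
        return f"{predicate} what?"                 # 3. question on the predicate
    return "I see."                                 # 4. formulaic expression

print(choose_response(focus="curry"))     # → what kind of curry?
print(choose_response(focus="America"))   # → oh, America?
print(choose_response(predicate="ate"))   # → ate what?
print(choose_response())                  # → I see.
```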
To evaluate this, we used data from a previous experiment with Erica. In that previous experiment there was no statement response system, so Erica could only reply to statements with formulaic polite expressions. What we did was take this data, apply the previously unanswerable statements to the statement response system, and check for how many of them a response could be generated. We found that nearly fifty percent of these previously unanswerable statements could be responded to by our system, so we believe we move beyond formulaic expressions for these statements. Moreover, these were judged by annotators to be coherent responses — responses that could plausibly be said in an everyday conversation.
Okay, lastly I'll talk about turn-taking. This is going to be quite brief because we haven't actually implemented it — it's in progress. The concept of the turn-taking system is that rather than making a binary decision to take the turn or not take the turn, we use a graded decision: because we know the probabilistic thresholds for certain actions, we can actually scale Erica's response appropriately.
If you look at this very simple diagram, it goes from not taking the turn, where we generate a backchannel, which indicates not taking the turn; then we can generate a filler, which we use in the case where we think we might take the turn; and lastly we actually take the turn and deliver a response. So backchannels indicate no turn-taking, while fillers indicate that the system intends to take the turn.
The benefit of this is that backchannels and fillers are not fully committed actions, so we don't actually take the turn at that time — we say something in preparation for taking the turn, but the user can still keep the floor. For example, if Erica produces a filler but the user hasn't actually finished speaking, they can simply continue talking and it doesn't disrupt the conversation, whereas once Erica produces a full response she takes the turn.
We had this concept, but we wanted to know how to actually set the thresholds, so this experiment was to extract the rules we wanted. We trained a turn-taking model based on logistic regression, using prosodic and lexical features, and we analyzed the likelihood scores and the frequency of decisions so that we could find suitable thresholds t1 and t2. We found that with a threshold t1 of around 0.45, below which Erica should simply not take the turn, and a threshold t2 of around 0.95, above which we say okay, the turn should definitely be taken, we get sensible behaviour; in the middle, because we are not quite sure, we deliver a filler or backchannel to try to make the user either yield the turn or signal "no, you can continue".
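The graded decision reduces to a two-threshold rule. Here is a sketch, with the probability assumed to come from the logistic regression model and the thresholds taken from the values above:

```python
def turn_action(p_take_turn, t1=0.45, t2=0.95):
    """Map a turn-taking probability to a (non-committal) action.
    Below t1: keep listening; above t2: take the turn; otherwise hedge."""
    if p_take_turn < t1:
        return "backchannel"  # signals that the user should keep the turn
    if p_take_turn >= t2:
        return "response"     # commit and take the turn
    return "filler"           # uncertain: signal intent, user can override

print(turn_action(0.20))  # → backchannel
print(turn_action(0.70))  # → filler
print(turn_action(0.99))  # → response
```

The middle band is what makes the actions non-committal: a filler costs little if the user simply keeps talking.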
So that is something we still want to implement; that's the basic idea. The basic algorithm of the whole attentive listening system is very simple. While the user is speaking continuously, we produce backchannels using the backchannel system, which gives us the appropriate timing for them. When we get a result from the speech recognition system, we do dialogue-act tagging on the result. If the speech is a question, we answer it — we can manage this with keyword matching against a database, and this is the only real natural language processing we do. If it's not a question, we treat it as a statement, and we use the statement response model to generate the response, based on one of the four response types.
The other thing is that, because the user may keep talking and we keep receiving ASR results, we can overwrite our previous response, so that we actually only respond to the last part of the speech; then, when we notice that the user has finished speaking, Erica delivers the response.
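Putting the loop together, here is a minimal sketch of one user turn — the question check, the two responders, and the ASR feed are toy stand-ins for the real dialogue-act tagger, database matcher, and statement response model:

```python
def is_question(text):
    # Toy check; the real system does dialogue-act tagging on the ASR result.
    return text.strip().endswith("?")

def run_turn(asr_segments, respond_to_question, respond_to_statement):
    """Process incremental ASR results for one user turn. Each new segment
    overwrites the pending response, so only the last segment before the
    user falls silent is actually answered."""
    pending = None
    for text in asr_segments:
        if is_question(text):
            pending = respond_to_question(text)   # keyword/database match
        else:
            pending = respond_to_statement(text)  # statement response model
    return pending  # delivered once the user stops talking

print(run_turn(["i went to kyoto", "i ate curry"],
               lambda t: "answer: " + t,
               lambda t: "statement response: " + t))
# → statement response: i ate curry
```

Backchannels are produced in parallel by the backchannel system while this loop runs.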
Now I'll show an example of the system in action — you'll see that latency is somewhat of an issue, but the responses themselves are reasonable. [video plays] So here Erica extracts the focus word from what the user said and asks a question based on it, which implies that she is following the content. And here Erica couldn't find either a focus word or a predicate in the utterance, so the model just produced a formulaic response.
You can see that the latency is a real problem — there are around three seconds between responses, which is not actually that good. For this kind of listening system you want people to keep talking and to feel that the robot is actually listening. On the other hand, the response generation system gives reasonably good responses, so we hope that users will keep continuing the conversation like this.
So that is the attentive listening system, and we conducted a pilot study with it. We only used three subjects, as part of a preliminary evaluation. One big problem we found is that when we tell users to interact with Erica freely, they really find it hard to do this kind of open-ended listening interaction: usually they are able to start, but after a couple of easy questions they kind of don't know what to say. So in this case we had to actually explicitly tell them what to say.
First we got them to read from scripts taken from an existing corpus, so they would say things that were taken from previous conversations. Then we instructed them to tell Erica a story and keep talking as long as possible, or as long as they wanted, in a free-talk scenario. Once they had used the script they had a sense of what to do, although the free scenario was still difficult for them. Then a separate group of judges listened to the audio of the interactions and evaluated each of Erica's backchannels and utterances according to their timing and coherence.
From the results we found that the backchannel timing was quite appropriate — we found the backchannels quite useful. But, as you can see from the video and as was noted by participants, the statement responses take too long, so this is something we need to work on. On the other hand, in terms of the responses generated by the statement response system, more than half of them were judged coherent, and we think this is quite reasonable: we find that the coherent responses keep the conversation going at a reasonable level.
Here are some examples from the free interactions. With the free instructions we didn't tell people what to say — they just talked about whatever they wanted — and we often got these unexpected dialogues. In the first one the user was talking about aliens — I don't know why, but it was about aliens — and Erica actually picked up the focus word and asked the user "ideas from where?". This was quite surprising for the user; Erica appeared to really be listening to them. Obviously we didn't create any responses specifically about aliens, since that topic doesn't come up much, but it shows that we can use the statement response system in a wide variety of contexts.
Another example is maybe less useful. The user said that it was very rainy, and the robot asked something like "Where is the rain?". This is a bit strange in the context of the conversation, but it was still interesting to the user.
So, finally, conclusions and future work. From the demonstrations we found that most statement responses were judged to be coherent questions, and we can extend the conversation that way. Even incoherent statements can be interesting or funny: even though there are cases like "Where is the rain?" that don't really make sense, users can find it quite engaging when Erica comes up with this kind of thing. So maybe the responses don't have to be perfectly grammatically correct every time.
As long as the timing works well, the randomness of the backchannel forms is not so severe a problem — the timing is really what makes the backchannels useful. The latency is the biggest problem with our system at the moment, so future work will be to reduce this latency.
Also, following on from the keynote today, we know that emotional dialogue — responding to the user's emotion — is very important as well. So we want to increase the range of responses Erica can generate and do some emotion recognition, so that if the user talks about how they feel, Erica can actually generate a good response to that. So, thank you — any questions?
Thank you. And now we have some time for questions.
Thank you for the talk. About one slide: you claim that the backchannels are generally well timed and that the randomness is not a severe issue. It seems to me that if you built a system that just randomly, at some interval, created backchannels — especially in Japanese — it would work just as well. What I would like to see is a comparison of two systems: the one that you have, and one where it's just randomly timed backchannels, to see whether people notice a difference.
Okay — actually, what I didn't mention is that in this experiment we did include another condition which generated randomly timed, random backchannels, and there was a clear difference between it and the trained systems as well.
Great, thank you.
I was actually wondering — the dialogue seems to be a model encouraging rather short utterances, and I think this kind of feedback-giving behaviour is more likely to occur when the user really tells a long story or something. Did you try to encourage people to behave in that way, or was it more open?
It really depends on the user's side. Part of it is that Japanese people can be quite reluctant to tell long stories to a robot. We would say to people "come and talk with Erica", and some of them would stand in front of Erica, say one sentence, and then wait for the response, treating it like a question-answering exchange. But other people — like in the examples that I gave — when we just said "talk with Erica however you want", actually told a long story, talked about their day or whatever, and those were the people who were most impressed. If you talk with Erica in a kind of stream of consciousness, then you're going to get responses which fit better, rather than a question-response-question pattern. But it is a very individual thing, which makes it tricky.
I guess the robot could also start by telling one-sentence stories itself, and that would serve as a kind of model for the user.
Yes — we are trying to think of ideas on how to get the robot to elicit this kind of behaviour from the users. Erica could actually say "tell me a story about yourself" or something like that, which would be directly usable. I think we will try that.
Thanks for the interesting talk. In the implementation, does Erica also use nodding to make the backchannel behaviour smoother?
Well, yes — if you mean the nonverbal behaviour, we haven't actually worked on it much yet. I think she does some nodding at random whenever she utters a backchannel. But we are looking at options — for example, only nodding, or a backchannel with both nodding and the verbal utterance, or just the verbal utterance — and we need to research the distributions of these three behaviours, which we haven't done yet. At the moment only verbal backchannels are available, but in the future we will probably try other modalities like that as well.
Thank you. Please thank our speaker once again.