All right, so let's slowly start the session. I'll be chairing this session. Our first speaker today is Michelle Cohn. We're going to have three talks in this session, which run until lunchtime. So we shall begin. Thank you.
Can you hear me okay?

Hi, I'm Michelle Cohn. I'm a postdoc at UC Davis, working jointly with the departments of Linguistics, Computer Science, and Psychology, and today I'll be presenting a project I did with Chun-Yen Chen and Zhou Yu.
So more and more humans are talking to voice-activated, artificially intelligent devices like Amazon's Alexa to complete daily tasks, like setting a timer or turning on the lights. And a new aspect, through the Amazon Alexa Prize competition, is the ability to engage real users in social chitchat with these systems. Many of you here have competed or are competing, but for those of you who don't know about it, the Amazon Alexa Prize is a competition to create social bots that can converse coherently and engagingly with humans on a range of topics, like food, music, technology, animals, and so on.
And what's unique, at least for researchers in academia, is the ability to deploy the chatbot in the wild, something Dan Bohus talked about yesterday. So during the competition, anyone with an Amazon Echo could say "let's chat" and get one of the competing chatbots.
You may be familiar with some of the teams from 2018, including Fantom from KTH, advised by Gabriel Skantze and led by Patrik Jonell. But today I'm going to be talking about Gunrock, the social bot developed at UC Davis, advised by Zhou Yu and led by Chun-Yen Chen, my two co-authors. And Gunrock is special, as it won first place in the 2018 competition. You can see Zhou and Chun-Yen here.
So when I met Zhou and Chun-Yen in July last summer, the Gunrock team was about halfway through the competition, and I was working on other projects related to how humans talk to voice AI, so I was interested in seeing how users would engage with a social bot like Gunrock.

So we started to collaborate, recording these user interactions; you can see my microphone there. But we noticed something as we listened to how these interactions unfolded: Alexa's speech was relatively flat, and it really lacked the dynamism of human interaction, where speakers vary their speech to show their excitement, their interest, and their understanding.
And this is important: if users, for example, were offering information about their favorite movie, Alexa really didn't sound like she cared. And others have noticed this flatness in the Alexa voice as well. Here's an Echo review where they mention that it would be nice if Alexa didn't sound so monotone, and that she needs to have a little more expression when she speaks. And another where they say that they're having a lot of fun with her, but her monotone productions can make things difficult to understand. So this flatness could also affect users' ability to understand her speech.
So this led to several research questions. The first was: how can we improve Alexa's expressiveness in a social dialogue system like Gunrock, especially given the time constraints of being in a competition?
So we know from work on human interaction that cognitive-emotional expression is important for the quality of our interactions with others. We see it readily in people's faces, such as happiness and excitement, or, say, contemplation and interest when we go to an art museum. But we also see it in the way we produce and perceive speech: for example, how emotionally expressive we are relates to perceptions of speaker enthusiasm in human conversation. So this is something we wanted to mimic in Alexa's speech.
So how do we make Alexa more expressive? One option is to completely overhaul the prosody. We really didn't have that as an option: we weren't controlling the TTS models in the competition, which are given by Amazon. We can adjust the TTS in minor ways using SSML, but again, we were on a time crunch, and we also wanted to very carefully specify where cognitive-emotional expression would be inserted. So we asked whether we could add discrete units of cognitive-emotional expression, or "voice emojis", to improve the expressiveness of the Alexa voice.
So we identified two that we were interested in. First, expressive interjections, and these are ones that were pre-recorded by the Alexa voice; here's an example: "Wow!" And second, filler words, like "um". And they're relatively easy to add in the Alexa Skills Kit, just with a simple SSML tag to adjust expressiveness (here, a "speechcon" for an interjection), or to add in a pause to make the filler words sound more natural.
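The two SSML devices just described can be sketched in a few lines. This is a minimal, illustrative sketch assuming the Alexa Skills Kit SSML conventions (a pre-recorded "speechcon" via `<say-as interpret-as="interjection">`, and a filler followed by a `<break>` pause); it is not the actual Gunrock source.

```python
# Sketch of the two SSML devices from the talk: an expressive, pre-recorded
# interjection ("speechcon"), and a filler word followed by a short pause.
# Illustrative only; assumes Alexa Skills Kit SSML tags.

def add_interjection(interjection: str, utterance: str) -> str:
    """Prefix an utterance with an expressive, pre-recorded interjection."""
    return ('<speak><say-as interpret-as="interjection">'
            f'{interjection}</say-as> {utterance}</speak>')

def add_filler(filler: str, utterance: str, pause_ms: int = 300) -> str:
    """Prefix an utterance with a filler word plus a short pause."""
    return f'<speak>{filler}<break time="{pause_ms}ms"/> {utterance}</speak>'

print(add_interjection("wow", "a perfect ten!"))
print(add_filler("um", "I think my favorite is probably the elephant."))
```

The pause length here (300 ms) is a made-up default; the talk only says a pause was added to make fillers sound more natural.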
So this is all modeled off of human interaction, where individuals signal their cognitive-emotional states using these smaller response tokens. So for this project we focused on these two types of voice emojis: interjections and fillers.
And interjections can signal different things, like the speaker's emotion, but also how interested or surprised they are about information, or whether what we're hearing is newsworthy. The other type of voice emoji is fillers, like "um" and "uh", which can also signal information about the speaker, such as the speaker needing more time to collect their thoughts when considering a topic, their degree of uncertainty about a topic, and even their level of understanding.
So while our first research question was how we can add expressiveness, our second is how people will respond to Alexa's expressiveness. Theories of computer personification, such as Clifford Nass's "Computers Are Social Actors" framework, propose that when a person senses a cue humanizing the system, we automatically treat it like a person. So our question here is really theoretically important in considering the degree to which users personify voice AI. Will users develop greater rapport with a more expressive Alexa? Or will it be creepy, falling into the uncanny valley: the idea that the more similar a nonhuman entity like a robot or Alexa is to a person, the more people like it, up to a point where they find it incredibly creepy.
So here's an overview of the rest of the talk. First we'll go over some prior work looking at interjections and fillers in human-computer interaction. Then I'll go over a study we did with our Alexa Prize chatbot, Gunrock, and then go over some conclusions and future directions.
So there are actually very few studies that have tested adding interjections and exclamations in a dialogue system; there's been a much greater focus on overall prosodic adjustments to a phrase or utterance. One study did test the impact of non-linguistic affect bursts, so buzzes and beeps, in a NAO robot, and they found that kids readily attribute emotion to those noises. And, while not using interjections per se, another group found that speech trained on a corpus of positive exclamations, like "great", resulted in higher listener ratings in a seven-utterance simulated dialogue, but they observed no such effect when the TTS was trained on negative exclamations, like "oh dear" or "oops". So really, overall, adding interjections is an understudied area in human-computer interaction.
And there's a bit more work looking at adding filler words, but the findings have been mixed. On the one hand, some studies have found a facilitative effect: for example, users have reported a greater sense of engagement with a robot when that robot uses filler words, and in another study, independent raters gave higher naturalness ratings to human-computer conversations when the voice included filler words. But others have found no positive effect of introducing filler words, or even a negative effect for some listeners. So it's really an open question how humans might respond to voice AI systems using interjections and fillers, and whether these voice emojis, for example, might be beneficial or detrimental to user experience.
Okay, so now to Gunrock.
Here's the overall architecture; I'm just going to provide a brief overview, and there's a technical report if you're curious. So the ASR and TTS models were provided by Amazon. Then we have a multi-step NLU pipeline, including sentence segmentation, constituency parsing, and dialogue act prediction. And then Gunrock has a hierarchical dialogue manager, with higher-level topic organizers as well as template-specific dialogue flows, for about ten different topics, so it includes animals, movies, news, books, and so on. And this dialogue manager pulls in information from EVI, a factual knowledge base, and from the Gunrock persona database, for questions about who Alexa is.
Next we have a template-based NLG module, where the system fills slots with data retrieved from various knowledge sources, such as IMDb. And then, finally, we adjusted the prosody by adding the fillers and interjections (this is really the focus of this presentation), which were then output by the TTS in the default Alexa voice.
Okay, so how are we going to insert interjections and fillers? We can't just insert them randomly; that's not how language works. As was mentioned in the keynote yesterday, placement of these elements is really huge. So together we created a framework for context-specific placement of interjections and fillers into existing Gunrock templates. And again, we didn't manipulate any other prosodic aspects of Alexa's speech; we just added these discrete words and phrases.
Okay, so starting with the interjections: we defined five contexts, and for each we defined a list of possible interjections which could be used in that context. So we defined a list, and then they're randomly pulled in.

The first is to signal interest. This was really important because we wanted the user to elaborate. So, for example: "Ooh, so tell me more about it!" Since the goal of the competition is to get users talking for as long as possible, we really wanted them to expand on their experience and make it seem as though Alexa was actually interested in what they had to say. So here we used a lot of different interjections, which could be randomly inserted into this word- or phrase-initial slot.
The second context was for error resolution, or to show Alexa's feelings about her misunderstanding. And this was a really important one, since Alexa often misheard the user, and we wanted to convey her disappointment in not getting it right. Again with lots of possible variations, for example: "Darn, I think you said probably. Can you say that one more time?"
The third was to accept the user's request, for example: "All righty, here is some more information." For this we didn't have as many as for signaling interest, since it's a social dialogue system, which is less task-based than Alexa usually is.
The fourth was to change topic, as if Alexa had just remembered something she wanted to share with the user. And this was part of a strategy to change the topic if the user wasn't being very responsive and was giving a lot of one-word answers: "Oh! I've been meaning to ask you, do you like animals?"
And the fifth was to express agreement with an opinion: "Yes! We share the same thoughts." And this happened less often in the Gunrock templates, so we just used two interjections here. But if you had a bot that really wanted to agree with people, you could add a lot of others, like "awesome" or "cool".
So in addition to the five contexts, we also included some interjections meant to convey Alexa's playfulness. And these were all utterance-specific and not interchangeable. So for example: "Aww, that's so cute!" And one more: "So get ready for a cheesy joke: what do you call blueberries playing the guitar? A jam session!"
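The context-specific placement framework above can be sketched as a lookup from context to candidate list, with one interjection drawn at random per turn. This is a hypothetical sketch: the lists below only contain illustrative examples from the talk, not the full Gunrock inventory.

```python
import random

# Each of the five dialogue contexts maps to a list of candidate
# interjections; one is drawn at random and placed in the phrase-initial
# slot of the existing template. Lists are illustrative, not exhaustive.
INTERJECTIONS = {
    "signal_interest":  ["wow", "ooh", "oh cool"],
    "error_resolution": ["darn", "uh oh", "oops"],
    "accept_request":   ["okey dokey", "all righty"],
    "change_topic":     ["oh"],
    "agree_opinion":    ["yes", "yay"],
}

def insert_interjection(context: str, template: str, rng=random) -> str:
    """Fill the phrase-initial slot of a template with a random interjection."""
    return f"{rng.choice(INTERJECTIONS[context])}, {template}"

print(insert_interjection("signal_interest", "tell me more about it!"))
```

Because the draw is random per turn, each user hears a slightly different sampling of interjections over the course of a conversation, which matches how the talk describes the deployment.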
Okay, and now for the fillers. Like the interjections, we identified certain types of contexts where fillers would be used in human conversation, so here are the three. The first was to change the topic; for this we also tuned the production of "so", slowing its rate and adding a pause: "Soo... I've been meaning to ask you, do you like to play video games?"
The second was when retrieving Alexa's backstory, so this was to make it seem as though Alexa had to think about a response, that it was less automatic: "Um, of all animals, I think my favorite is probably the elephant." And the third was to acknowledge the user's answer to Alexa's question: "Hmm, what an interesting choice."
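The topic-change filler tuning mentioned above (slowing the word "so" and adding a pause) can also be written directly in SSML. A minimal sketch, assuming Alexa's `<prosody rate>` and `<break>` tags; the rate and pause values here are made-up illustrations, not the tuned Gunrock settings:

```python
# Slowed filler "so" plus a pause before a topic-change question.
# Tag names follow Alexa SSML; the specific values are assumptions.
ssml = ('<speak><prosody rate="slow">so</prosody><break time="400ms"/> '
        "I've been meaning to ask you, do you like to play video games?"
        '</speak>')
print(ssml)
```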
So following that framework, this resulted in interjections in about twelve percent of the turns in a conversation, and filler words in about eighty percent of turns. The exact number of fillers and interjections varied based on which particular utterances were engaged during the conversation, so you can think of it as each user getting a slightly different sampling of these interjections and fillers based on the course of their conversation.
Okay, so here's an example with a typical user. So they'd say, "Alexa, let's chat."

"Hi, this is an Alexa Prize social bot. Glad to meet you. How are you doing today?"

"Doing okay. How are you?"

"Um, good. So tell me more about yourself. Do you like sports?"

"Not really."

"So, have you seen any movies lately?"

"A Star Is Born."

"Oh, I've seen A Star Is Born too. What would you rate this movie on a scale from one to ten?"

"Ten."

"Wow! A perfect ten! It really is an exceptional movie. Here is something I just learned: some of the scenes were filmed at the Coachella festival in 2017, where Lady Gaga was the first female headliner in ten years. Festival-goers paid ten dollars to see the scenes being filmed, and all recording devices were locked away to prevent any footage being leaked. Any thoughts?"

So you can see it's really a discrete phrase in a very long utterance.
Okay, so does adding interjections and fillers have an impact? We conducted a user study through the devices themselves, so this is in the wild, as part of the Alexa Prize competition. We had four conditions: one with interjections, one with fillers, one with both, and one with neither. And these conditions were pushed live to all Alexa-enabled devices from November 28th to December 3rd. So this was after the competition was over, and no other code updates were happening; that's very crucial.
And this methodology extends prior work on human-computer interaction, giving us a large sample size of over five thousand unique users: individuals who actually wanted to talk to the device and were doing so in the place most comfortable to them, their own homes. And the raters: at the end of the conversation, users would rate the conversation on a scale from one to five, so the raters were actually the users in the conversation itself.
This also consists of a broad sample of users, anyone with the device, so it's not constrained to the eighteen-to-twenty-two-year-old slice that we generally test, though it's still likely skewed by socioeconomic status. And finally, users have more experience with this specific system, so perhaps they have more familiarity and rapport with their Alexa.
So we analyzed the rating at the end of the conversation with a linear mixed-effects model, with condition as a fixed effect and users as random intercepts. We only included data from conversations with at least ten turns, and, for the filler, interjection, and combined conditions, conversations that actually contained at least one or two of those options.
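The shape of this analysis can be sketched on synthetic data. Note this is a simplified stand-in: the study fit a linear mixed-effects model with per-user random intercepts, whereas the sketch below uses plain least squares with dummy-coded conditions, and the effect sizes are made up to echo the numbers reported in the talk (baseline around 2.8, combined-condition gain around 0.75).

```python
import numpy as np

rng = np.random.default_rng(0)
conditions = ["baseline", "interjections", "fillers", "both"]
true_effect = {"baseline": 0.0, "interjections": 0.5,
               "fillers": 0.4, "both": 0.75}        # invented for illustration

# Simulate 200 conversation ratings per condition.
labels, ratings = [], []
for cond in conditions:
    for _ in range(200):
        labels.append(cond)
        ratings.append(2.8 + true_effect[cond] + rng.normal(0, 0.5))

# Design matrix: intercept plus one dummy per non-baseline condition.
X = np.column_stack(
    [np.ones(len(labels))]
    + [[1.0 if lab == c else 0.0 for lab in labels] for c in conditions[1:]]
)
beta, *_ = np.linalg.lstsq(X, np.array(ratings), rcond=None)
print(dict(zip(["intercept"] + conditions[1:], beta.round(2))))
```

The intercept recovers the baseline mean, and each dummy coefficient recovers that condition's lift over baseline, which is how the "main effect of condition, relative to baseline" results in the talk should be read.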
So I'll take you through the results one by one. Here we have the conditions on the x-axis and the rating on the y-axis. Here we can see the baseline condition, the one without interjections and fillers, had an average rating around 2.8. The model revealed a main effect of condition: we see significantly higher ratings for conversations with interjections, all relative to the baseline. We also see higher ratings for the conversations with fillers, and also for the conversations with both, with an average increase of about 0.75.
We were curious to see if the combined condition was different from the single interjection and filler conditions, and we did indeed find that was the case.
So adding voice emojis in appropriate contexts improves user ratings. And this shows that even adding discrete elements may improve the overall expressiveness of a social dialogue system. This provides support for the "Computers Are Social Actors" framework, as humans appear to be responding positively to human-like displays of cognitive-emotional expression in the Alexa voice, and may in some ways be responding to the system more like a person.
We also see that the effect is additive for the different types of voice emojis: users gave the highest ratings to conversations with both fillers and interjections. And overall this effect is robust: we see it over thousands of unique users and conversations.
But one limitation, as perhaps you're already thinking, is that these ratings are really a holistic measure of the overall conversation. So we wanted to do a more controlled study to confirm that the voice emojis do indeed improve the ratings of the conversations. So we did a Mechanical Turk experiment with ninety-five Turkers,
with the same condition structure as in the user study, and with two dialogues: one to signal interest and one to resolve an error. So just as in the main study, we had the baseline, one with fillers, one with interjections, and one with both. Here's an example:
"Movies can be really fun. Soo... I've been meaning to ask you, what else are you interested in? Do you like animals?"

"Wow, I love animals!"

"Um, I think my favorite animal is the elephant."
And then the same for the error resolution dialogue: one with neither fillers nor interjections, one with fillers only, one with interjections only, and one with both:

"That's pretty interesting. So, have you seen any movies lately?"

(an unintelligible user response)

"Darn! I didn't catch that. Can you say that again?"
So these aren't real user interactions; these are ones we scripted, loosely based off of topics in Gunrock. So the Turkers heard these two dialogues in all possible conditions, in random order, and then for each dialogue they heard, they rated the Alexa voice on sliding scales: how engaged does Alexa sound, how expressive does Alexa sound, how likable, and how natural. And we analyzed these ratings with separate linear mixed-effects models.
Since I'm running low on time, I'll go through this quickly.
So here's what we found. As with the overall user study, we found a main effect of condition, again relative to the baseline. (My computer is having some issues.) So we see an increase for conversations with interjections, shown in red, with significantly higher ratings on all of those social variables, for those four dimensions.
I'll just give you a quick summary, since my computer has frozen. So overall, the results for the user study mirrored what we observed in the Mechanical Turk study in terms of the social ratings. But we saw something a little bit different with the fillers: the Mechanical Turkers actually rated the voice as having lower likability and lower engagement when that voice had the fillers. So this is a little bit different, and it suggests that the role of the rater makes a difference. If you're the person in the conversation, you tend to like the interjections and also the fillers; but if you're an external rater listening in on the conversation, you really pick up on those fillers, and that mirrors what we've seen in research on human interaction.
Thank you.

We have time for some questions.
Very interesting topic. I'm wondering: given the way that you're adding these fillers and interjections, it seems like it's somewhat stochastic as to when they come out in an utterance. Did all the dialogues that included them have roughly the same percentage, or number per turn, or was there a big variance across the different dialogues? And if there's variance, did you look more carefully at whether having more fillers versus fewer fillers changed the rating?

That's actually a good question. We didn't look at that directly; we looked at the number of fillers and interjections in a particular conversation, and there didn't seem to be a relationship, at least with rating. It's related to the overall number of turns, but that's to be expected.
That was fascinating, and the results are convincing. I was wondering, having looked at the data, do you think there's a signal for building a model that can, you know, look at context and decide yes or no, we're going to put an interjection here?

Right, so this was just a very simple way to test this; it was not the most sophisticated way that we could do it, definitely.

But I mean, if you look at the conversations, the ones where it looks like it's going well, do you think there's some signal there that a model could be trained on?

I noticed in the recorded user studies that the users would smile if you had an interjection, and some actually mentioned the filler words themselves. So, I mean, that's a very explicit sort of cue, but if you're able to record video, you could, you know, use the smiling, the facial expressions, to know if it's going well, if it's appropriate.
One more question: since you built the TTS changes to keep people engaged for longer, what was the effect on the length of conversation?

There wasn't a clear relationship. So there are two goals: we want to keep people engaged as long as possible, but also in a meaningful conversation. So there was no relationship with the number of utterances, only with rating.
This is more a comment than a question: sometimes people hear new stories, and they like the story the first time, but after a while it gets old. Have you thought of making an experiment over time, to see if this really works in the long term?

That's a great question. No, we haven't, but that's a good direction down the road.
And we have time for one last question.

Just for clarification: your fillers seem to be all sort of turn-initial. Did you have them, you know, like the most notable fillers, inside noun phrases, for instance?

So we didn't; we just put them in the same locations as the interjections. But you're absolutely right, they occur in a lot of different places. If you have a hesitation, for example, or a false start, sometimes you get fillers there as well. We were just trying to keep it very simple.
But that simplifies the speaker model, yes.