Alright, so let's slowly start the session. My name is [unintelligible], and I'm chairing this session.

Our first speaker today is going to be Michelle Cohn. We're going to have three talks in the session, which run until lunchtime. So we shall start.

Thank you.

Can you hear me okay?

Hi, I'm Michelle Cohn. I'm a postdoc at UC Davis, working jointly with the Departments of Linguistics, Computer Science, and Psychology, and today I'll be presenting a project I did with Chun-Yen Chen and Zhou Yu.

So, more and more humans are talking to voice-activated, artificially intelligent devices like Amazon Alexa to complete daily tasks, like setting a timer or turning on the lights. A new aspect, through the Amazon Alexa Prize competition, is the ability to engage real users in social chitchat through these systems. Many of you here have competed or are competing, but for those of you who don't know about it: the Amazon Alexa Prize is a competition to create socialbots that can converse coherently and engagingly with humans on a range of topics, like food, music, technology, animals, and so on.

And what's unique, at least for researchers in academia, is the ability to deploy the chatbot right in the wild, something Dan Bohus talked about yesterday. So during the competition, anyone with an Amazon Echo could say "let's chat" and get one of the competing chatbots.

You may be familiar with some of the teams from 2018, including one from KTH: Fantom, advised by Gabriel Skantze and led by Patrik Jonell. But today I'm going to be talking about Gunrock, the socialbot developed at UC Davis, advised by Zhou Yu and led by Chun-Yen Chen, my two co-authors. And Gunrock is special, as it won first place in the 2018 competition. You can see Zhou and Chun-Yen here.

So when I met Zhou and Chun-Yen in July last summer, the Gunrock team was about halfway through the competition, and I was working on other projects related to how humans talk to voice AI. So I was interested in seeing how users would engage with a socialbot like Gunrock. We started to collaborate, recording these user interactions; you can see my microphone there. But we noticed something as we listened to how these interactions unfolded: Alexa's speech was relatively flat and really lacked the dynamism of human interaction, where speakers vary their speech to show their excitement, their interest, and their understanding. And this is important: as users, for example, were offering information about their favorite movie, Alexa really didn't sound like she cared.

And others have noticed this flatness in the Alexa voice as well. Here's an Echo review where they mention that it would be nice if Alexa didn't sound so monotone, and that she needs to have a little more expression when she speaks. And another where they say that they're having a lot of fun with her, but that her monotone productions can make things difficult to understand. So this flatness could also affect users' ability to understand her speech.

So this led to several research questions. The first was: how can we improve Alexa's expressiveness in a social dialogue system like Gunrock, especially given the time constraints of being in a competition?

We know from work on human interaction that cognitive-emotional expression is important for the quality of our interactions with others. We see it readily in people's faces, such as happiness and excitement; we need only go to the Vasa Museum to see contemplation and interest. But we also see it in the way we produce and perceive speech: for example, how emotionally expressive we are relates to perceptions of speaker enthusiasm in human conversation. So this is something we wanted to mimic in Alexa's speech.

So how do we make Alexa more expressive? One option is to completely overhaul the prosody. We really didn't have that as an option: we weren't controlling the TTS models in the competition, which are given by Amazon. We could adjust the TTS in minor ways using SSML, but again, we were on a time crunch, and we also wanted to very carefully specify where cognitive-emotional expression would be inserted. So we asked whether we could add discrete units of cognitive-emotional expression, or "voice emojis", to improve the expressiveness of the Alexa voice.

So we identified two that we were interested in: expressive interjections, which are pre-recorded by the Alexa voice (here's an example: "Wow!"), and filler words, like "um". And they're relatively easy to add in the Alexa Skills Kit, with a simple SSML tag to adjust expressiveness (here, for a "speechcon" interjection) or to add in a pause to make the filler words sound more natural.
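The kind of markup involved can be sketched like this: a minimal illustration using Alexa Skills Kit SSML elements; the exact Gunrock templates are not shown in the talk.

```xml
<speak>
  <!-- Pre-recorded expressive interjection ("speechcon") -->
  <say-as interpret-as="interjection">wow</say-as>,
  tell me more about it!
  <!-- Filler word: slowed down and followed by a pause
       so it sounds more natural -->
  <prosody rate="slow">um</prosody><break time="500ms"/>
  so, I've been meaning to ask you, do you like animals?
</speak>
```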

This is all modeled off of human interaction, where individuals signal their cognitive-emotional states using these smaller response tokens.

So for this project we focused on these two types of voice emojis: interjections and fillers. Interjections can signal different things, like the speaker's emotion, but also how interested or surprised they are about information, or whether what they're hearing about is newsworthy. The other type of voice emoji is the filler, like "um", which can also signal information about the speaker, such as needing more time to collect their thoughts and consider a topic, their degree of uncertainty about a topic, and even their level of understanding.

So while our first research question was how to add expressiveness, our second was how people will respond to Alexa's expressiveness. Theories of computer personification, such as Clifford Nass's "Computers Are Social Actors" framework, propose that when a person senses a cue humanizing the system, we automatically treat it like a person. So our question is theoretically important in considering the degree to which users personify voice AI. Will users develop greater rapport with a more expressive Alexa? Or will it be creepy, falling into the uncanny valley: the idea that the more similar a nonhuman entity, like a robot or Alexa, is to a person, the more people like it, up to a point where they find it incredibly creepy.

So here's an overview of the rest of the talk. First we'll go over some prior work looking at interjections and fillers in human-computer interaction. Then I'll go over a study we did during our Alexa Prize track with Gunrock, and then go over some conclusions and future directions.

So there are actually very few studies that have tested adding interjections and exclamations in a dialogue system; there's been a much greater focus on overall prosodic adjustments to phrases or utterances. One study did test the impact of non-linguistic affect bursts, so buzzes and beeps, in a NAO robot, and they found that kids readily attribute emotion to those noises. And while not using interjections per se, another group found that TTS trained on a corpus of positive exclamations, like "great", resulted in higher listener ratings in a seven-utterance simulated dialogue, but they observed no such effect when the TTS was trained on negative exclamations, like "dear" or "oops". So really, overall, adding interjections is an under-studied area in human-computer interaction.

And there's a bit more work looking at adding filler words, but the findings have been mixed. On the one hand, some studies have found a facilitative effect: for example, users have reported a greater sense of engagement with a robot if that robot uses filler words, and in another study, independent raters gave higher naturalness ratings to human-computer conversations when the voice included filler words. But others have found no positive effect of introducing filler words, or even a negative effect for some listeners. So it's really an open question how humans might respond to voice AI systems using interjections and fillers, and whether these voice emojis might be beneficial or detrimental to user experience.

Okay, so now to Gunrock. Here's the overall architecture; I'm just going to provide a brief overview (there's a technical report if you're curious). The ASR and TTS models were provided by Amazon. Then we have a multi-step NLU pipeline, including sentence segmentation, constituency parsing, and dialogue act prediction. Gunrock has a hierarchical dialogue manager, with higher-level topic organizers as well as template-specific dialogue flows for about ten different topics, including animals, movies, news, books, and so on. The dialogue manager pulls in information from Evi, a factual knowledge base, and from the Gunrock persona database, for questions about who Alexa is. Next we have a template-based NLG module, where the system fills slots with data retrieved from various knowledge sources, such as IMDb.
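A toy sketch of that slot-filling step; the template string and the "title" field are invented for illustration, not taken from Gunrock:

```python
# Toy sketch of template-based NLG slot filling: the system retrieves a fact
# (e.g., a movie title from a source like IMDb) and fills a slot in a
# response template. The template text and "title" key are hypothetical.
template = ("Ah, I see, {title}. What would you rate this movie "
            "on a scale from one to ten?")
retrieved = {"title": "A Star Is Born"}

response = template.format(**retrieved)
print(response)
```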

Then, finally, we adjusted the prosody by adding the fillers and interjections (this is really the focus of this presentation), and the result was output by the TTS in the default Alexa voice.

Okay, so how are we going to insert interjections and fillers? We can't just insert them randomly; that's not how language works. As was mentioned in the keynote yesterday, the placement of these elements really matters. So together we created a framework for context-specific placement of interjections and fillers into existing Gunrock templates. And again, we didn't manipulate any other prosodic aspects of Alexa's speech; we just added these discrete words and phrases.

Okay, so starting with the interjections: we defined five contexts, and for each we defined a list of possible interjections that could be used in that context; these were then randomly pulled in.
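A minimal sketch of that scheme; the context names and candidate lists below are illustrative stand-ins, not the actual Gunrock inventory:

```python
import random

# Sketch of context-specific interjection insertion: each context maps to a
# list of candidate interjections, one of which is drawn at random and
# prepended to the template utterance. Lists here are hypothetical examples.
INTERJECTIONS = {
    "signal_interest": ["Wow!", "Ooh!", "Nice!"],
    "error_resolution": ["Darn!", "Oops!"],
    "accept_request": ["Alrighty!", "Sure!"],
    "change_topic": ["Well,", "Oh!"],
    "agree_opinion": ["Yes!", "Totally!"],
}

def add_interjection(context: str, utterance: str) -> str:
    """Prepend a randomly chosen interjection for the given context."""
    return f"{random.choice(INTERJECTIONS[context])} {utterance}"

print(add_interjection("signal_interest", "Tell me more about it."))
```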

The first is to signal interest. This was really important because we wanted the user to elaborate, for example: "Wow! So tell me more about it." Since the goal of the competition is to get users talking as long as possible, we really wanted them to expand on their experience, and to make it seem as though Alexa was actually interested in what they had to say. So here we used a lot of different interjections, which could be randomly inserted into this word- or phrase-initial slot.

The second context was for error resolution, or to show Alexa's feelings about her misunderstanding. This was a really important one, since Alexa often misheard the user, and we wanted to convey her disappointment at not getting it right. Again, there were lots of possible variations, for example: "Darn! I think you said... can you say that one more time?"

The third was to accept the user's request, for example: "Alrighty! Here is some more information." Here we didn't have as many options as for signaling interest, since this is a social dialogue system, less task-based than Alexa usually is.

The fourth was to change topic, as if Alexa had just remembered something she wanted to share with the user. This was part of a strategy to change the topic if the user wasn't being very responsive and was giving a lot of one-word answers: "Well, I've been meaning to ask you, do you like animals?"

And the fifth was to express agreement with an opinion: "Yes! We share the same thoughts." This didn't happen as often in the Gunrock templates, so we just used two interjections here; but if you had a bot that really wanted to agree with people, you could add a lot of others, like "awesome" or "cool".

In addition to the five contexts, we also included some interjections meant to convey Alexa's playfulness. These were all utterance-specific and not interchangeable, for example: "Aww! That's so cute." And another one: "Get ready for a cheesy joke: what do you call blueberries playing the guitar? A jam session! Whoa!"

Okay, and now for the fillers. Like the interjections, we identified certain types of contexts where fillers would be used in human conversation; here are the three. The first was to change the topic. Here we also tuned the production of the filler, slowing its rate and adding pauses: "Um... so, I've been meaning to ask you, do you like to play video games?"

The second was when retrieving Alexa's backstory. This was to make it seem as though Alexa had to think about a response, that it was less automatic: "Hmm... of all animals, I think my favorite is probably the elephant." And the third was to acknowledge the user's answer to Alexa's question: "Hmm... an interesting choice."

Following that framework, this resulted in interjections in about twelve percent of turns in a conversation, and filler words in about eighty percent of turns. The exact number of fillers and interjections varied based on which particular utterances came up in the conversation; you can think of it as each user getting a slightly different sampling of these interjections and fillers, based on the course of their conversation.

Okay, so here's an example; I'll play the user's side too.

User: "Alexa, let's chat."
Gunrock: "Hi, this is an Alexa Prize socialbot. Glad to meet you! How are you doing today?"
User: "Doing okay, how are you?"
Gunrock: "Um... good! So tell me more about yourself: do you like sports?"
User: "Not really."
Gunrock: "Hmm... so, have you seen any movies lately?"
User: "A Star Is Born."
Gunrock: "Ah, I see, A Star Is Born. What would you rate this movie on a scale from one to ten?"
User: "Ten."
Gunrock: "Wow! A perfect ten! It really is an exceptional movie. Here is something I just learned: some of the scenes were filmed at the Coachella festival in 2017, where Lady Gaga was the first female headliner in ten years. Festival-goers paid ten dollars to see the scenes being filmed, and all the recording devices were locked away to prevent any footage being leaked. Any thoughts?"

So you can see it's really a discrete phrase in a very long utterance.

Okay, so does adding interjections and fillers have an impact? We conducted a user study through the devices themselves, so this is in the wild, as part of the Alexa Prize competition. We had four conditions: one with interjections, one with fillers, one with both, and one with neither. These conditions were pushed live to all Alexa-enabled devices from November 28th to December 3rd. This was after the competition was over, and no other code updates were happening; that's very crucial.
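A generic sketch of bucketing incoming conversations into the four conditions. The talk does not describe the actual assignment mechanism, so the deterministic hashing here is purely an assumption for illustration:

```python
import hashlib

CONDITIONS = ["baseline", "interjections", "fillers", "both"]

def assign_condition(conversation_id: str) -> str:
    """Deterministically bucket a conversation into one of four conditions.

    This is a standard A/B-style assignment sketch, not the mechanism used
    in the study; hashing keeps the assignment stable per conversation.
    """
    digest = hashlib.sha256(conversation_id.encode()).hexdigest()
    return CONDITIONS[int(digest, 16) % len(CONDITIONS)]

print(assign_condition("conversation-0001"))
```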

This methodology extends prior work on human-computer interaction, giving us a large sample size of over five thousand unique users: individuals who actually wanted to talk to the device, and who were doing so in the place most comparable to natural use, their own homes.

As for the raters: at the end of the conversation, users would rate the conversation on a scale from one to five, so the raters were actually the users in the conversation itself. This also gives a broader set of users, anyone with the device, so it's not constrained to the eighteen-to-twenty-two-year-old slice that we generally test; but it's still likely skewed by socioeconomic status. And finally, users have more experience with this specific system, so perhaps they have more familiarity and rapport with their Alexa.

We analyzed the rating at the end of the conversation with a linear mixed-effects model, with condition as a predictor and user as a random intercept. We only included data from conversations with at least ten turns, and, for the filler, interjection, and combined conditions, conversations that actually contained at least one of those options.
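The inclusion criteria and the per-condition comparison can be sketched as follows. The actual analysis was a linear mixed-effects model with user as a random intercept; the conversations and ratings below are fabricated for illustration only:

```python
from statistics import mean

# Each fabricated record: (condition, n_turns, n_voice_emojis, rating).
# Inclusion criteria from the talk: at least ten turns, and (for non-baseline
# conditions) at least one filler/interjection actually occurred.
conversations = [
    ("baseline", 12, 0, 2.8),
    ("baseline", 8, 0, 4.0),        # excluded: fewer than ten turns
    ("interjections", 15, 3, 3.6),
    ("interjections", 11, 0, 5.0),  # excluded: no interjection occurred
    ("both", 20, 5, 3.5),
]

kept = [
    (cond, rating)
    for cond, turns, emojis, rating in conversations
    if turns >= 10 and (cond == "baseline" or emojis >= 1)
]
means = {c: mean(r for cc, r in kept if cc == c) for c, _ in kept}
print(means)
```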

So I'll take you through the results one by one. Here we have the conditions on the x-axis and the rating on the y-axis. The baseline model, the one without interjections and fillers, had an average around 2.8. The linear regression model revealed a main effect of condition: we see significantly higher ratings for conversations with interjections (this is all relative to the baseline), higher ratings for the conversations with fillers, and also for the conversations with both, with an average increase of about 0.75. We were curious to see whether the combined condition was different from the single interjection and filler conditions, and we did indeed find that was the case. So adding voice emojis in appropriate contexts improves user ratings.

and

this shows that even adding discrete elements may improve overall expressiveness of a social dialogue

system in this provides support forecaster frameworks as humans appear to be responding positively to

human like displays of cognitive emotional expression

in an alexi voice

in may in some ways be responding to the system or like a person

we also see that the effect is additive for different types of voice m o

g so users keep the high ratings or conversations with both fillers and interjections

and overall this effect is robust we see it over thousands of unique

users can conversations

But one limitation, as perhaps you're already thinking, is that these ratings are really a holistic measure of the overall conversation. So we wanted to do one more controlled study to confirm that the voice emojis do indeed improve the ratings of the conversations.

So we did a Mechanical Turk experiment with ninety-five Turkers, using a similar condition structure as in the user study, with two dialogues: one to signal interest and one to resolve an error. Just as in the main study, we had the baseline, one with fillers, one with the interjection, and one with both. Here's an example:

Gunrock: "Movies can be really fun. Um... so, I've been meaning to ask you, what else are you interested in? Do you like animals?"
User: "I like animals."
Gunrock: "Hmm... I think my favorite animal is the elephant."

And the same for the error-resolution dialogue: one with neither fillers nor interjections, one with fillers only, one with interjections only, and one with both.

Gunrock: "That's pretty interesting. So, have you seen any movies lately?"
User: (unintelligible response)
Gunrock: "Darn! I didn't catch that. Can you say that again?"

So these aren't real user interactions; these are ones we scripted, loosely based off of topics in Gunrock. The Turkers heard these two dialogues in all possible conditions, in random order, and after each dialogue they rated the Alexa voice on a sliding scale: how engaged does Alexa sound, how expressive does Alexa sound, how likable, and how natural? We analyzed these ratings with separate linear mixed-effects models.

Since I'm running low on time, I'll go through this quickly. Here's what we found: as with the overall user study, we found a main effect of condition, again relative to the baseline. (My computer is having some issues.) The dialogues with interjections, shown in red, received significantly higher ratings on all of those social variables, on all four dimensions.

I'll just give you a quick summary, since my computer is frozen. Overall, the results of the user study matched what we observed in the Mechanical Turk study in terms of the social ratings, but we saw something a little bit different with the fillers: the Mechanical Turkers actually rated the voice as having lower likability and lower engagement when the voice had the fillers. So this is a little bit different, and it suggests that the role of the rater, who the rater is, makes a difference. If you're the person in the conversation, you tend to like the interjections, and you also like the fillers; but if you're an external rater listening in on the conversation, you really pick up on those fillers. And that really mirrors what we've seen in research on human interaction.

thank you

We have time for some questions.

Q: Very interesting topic. I'm wondering: given the way that you're adding these fillers and interjections, it seems somewhat stochastic as to when they come out. Did all the dialogues that included them have roughly the same percentage, or number per turn, or was there a big variance between the different dialogues? And if there's variance, did you look more carefully at whether having more fillers versus fewer fillers changed the rating?

A: That's actually a good question; we didn't look at that directly. We looked at the number of fillers and interjections in a particular conversation, and there didn't seem to be a relationship, at least with rating. It is related to the overall number of turns, but that's to be expected.

Q: Thanks, fascinating results. I was wondering, having looked at the data: do you think there's scope for building a model that can, you know, look at the context and decide yes or no whether to put an interjection in?

A: Definitely, yes. This was just a very simple way to test this; it was not the most sophisticated way that we could do it, definitely.

Q: But if you look at the conversations, the ones where it looks like it's going well, do you think there's some signal there that a model could be trained on?

A: I noticed in our in-lab user studies that users would smile when there was an interjection, and some actually mentioned the filler words themselves. So that's a very explicit sort of cue: if you're able to record video, you could use the smiling, the facial expressions, to know if it's going well, if it's appropriate.

One more question.

Q: Since you built these TTS adjustments to keep people engaged for longer, what was the effect on the length of conversation?

A: There wasn't a clear relationship. There are two goals: we want to keep people engaged as long as possible, but we also want a meaningful conversation. There was no relationship with the number of utterances, only with rating.

Q: This is more a comment than a question. Sometimes people hear these things and they like them the first time, but after a while they can get old. Have you thought of running the experiment over time, to see if this really works in the long term?

A: That's a great question. No, we haven't, but that's a great direction.

And we have time for one last question.

Q: Just for clarification: your fillers seem to be all turn-initial. Did you have them in other positions too, you know, like the most natural fillers, inside noun phrases?

A: No, we just put them in the same location as the interjections. But you're absolutely right, they occur in a lot of different places: if you have a hesitation, for example, or a false start, sometimes you get fillers there as well. We were just trying to keep it very simple.

Let's thank the speaker once more.