Thanks for coming back for this session.
This is work by three students, mostly Sarah Plane, almost entirely; they're all undergraduate students. They kind of converged at the same time, were interested in this, and now they've all moved on and are doing other things, so I'm just the person presenting it for them. Our institution is Boise State.
If in the next couple of minutes you look up Boise, or wonder whether Idaho is a real state in the United States: it does exist, and Boise is the capital of that state, if you didn't know. It's a nice university; I've really enjoyed being there. I run the Speech, Language and Interactive Machines group, a sort of junior research group there; I've only been there for about two years.
So, let's just start.
I actually wanted to draw attention to this bottom reference here: what we're doing in this paper builds a lot on the Novikova et al. paper, which came out of Oliver Lemon's lab. They did research on basically social robotics, which is pretty similar to what we're doing, and we follow a lot of their methodology here.
What we wanted to look at: we have this little robot, and we wanted to do some language grounding studies with it. Then one of my students asked a question that we couldn't let go of. She said, well, are people actually going to treat this robot the way we want them to treat it, like a first language learner? And I was thinking, well, I don't know; maybe we should study this. That's actually how this paper happened.
A lot of the motivation comes from all of the great work in grounded semantics and symbol grounding, and from lots of other people not all mentioned here that we build on. The point is this: if you're a person interacting with a child, and the child is learning language, the child doesn't know language to the degree that an adult knows it. The child sees an object, and pretty much all objects have an annotation, a phrase or a single word, and maybe the child doesn't know the annotation for that object. So the adult says "that's a ball," and the child remembers it. It's quite amazing, and this is essentially what grounding is doing: when you do this with a machine like a robot, the robot has to perceive the object somehow and represent it somehow. A lot of the work up until now has used vision as the main modality for grounding language into some perceptual modality.
But once you have a robot, an embodied agent, people start assigning anthropomorphic characteristics to it based on how it looks and how it acts. As soon as they see a robot, they immediately think: is this a man or a woman, how tall is it, is it sympathetic, how can I interact with this thing, what can I expect? And as soon as someone says "this is a robot," people assume it has adult intelligence, which you don't want if you have a first language acquisition task that you want the robot to do.
That was the question my student asked. If we have this little robot and want to do a first language acquisition task in a setting very similar to the way children acquire their language, we cannot assume that people who interact with the robot are going to treat it like a child. So that's what we set out to do: predict what age, or what academic level, people assign to a robot. The main research question is this: does the way a robot verbally interacts affect how humans perceive the age of the robot? The short answer is yes, so if you want to go ahead and put your head down and have a little rest, feel free. But if you do care, we can tease this apart a little and show you what we did.
We ran an experiment. We had some robots and varied their appearance (three different ones, which I'll show you in a moment), and we varied the way the robots verbally interacted. Participants showed the robot how to build a simple puzzle; that was nominally the language grounding task, though no grounding was actually happening. They were interacting with the robots in this very simple dialogue setting, and we recorded them: we had a camera pointed at them as they interacted with the robots, and we recorded their speech. After they interacted with each robot, they filled out a questionnaire about their perceptions. Once we had gathered all this data, we analyzed it: facial emotions, prosody, and linguistic complexity. We found correlations between those measures and the perceived age, and from that we can predict it.
These are the three robots we used, partly because we had them, and because we wanted one robot that was kind of anthropomorphic and one that wasn't. Here is the non-anthropomorphic robot, the Kobuki; it looks basically like a Roomba with a bracket on it. Then this is Anki's Cozmo; I don't know if you've seen it. It's a very small robot, marketed as a toy, and it has a nice Python SDK. And then we had a disembodied, non-physical spoken dialogue system, which we affectionately named the no-robot. The no-robot: not a robot. So those are the three robots.
It's kind of embarrassing how little we did with the robots, but we had two speech settings that we wanted to test, because we wanted to see how people treated the robot based on how it interacted. The only speech we had the robots produce was feedback, and there were two settings of this feedback. One was minimal feedback like "yes" or "okay," which basically marked phonetic receipt; we call this the low setting. It says "I heard that," but whether or not the robot understood is left open. The other feedback setting marked semantic understanding, with things like "sure," "okay, I see," or a repeat of what was said, to show "I understood you correctly." These are all just feedback: the robot is not really taking the floor, and there isn't really a lot of dialogue going on, but there are these two settings, and we found that they make quite a difference. Other than that, the robots didn't move. On the Kobuki a light was on, and that was it; Cozmo, in its default setting, had these little animated eyes that just kind of looked around, but it didn't move either. As far as the participants could tell, it was just talking.
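To make the two settings concrete, the wizard's feedback repertoires could be written down like this. This is a sketch: the specific utterance lists and the selection logic are illustrative assumptions based on the talk, not the study's actual inventory.

```python
# Two wizard feedback settings, as described in the talk.
# The specific utterances here are illustrative, not the study's full inventory.
FEEDBACK = {
    # Low setting: marks phonetic receipt only ("I heard you").
    "low": ["yes", "okay"],
    # High setting: marks semantic understanding ("I understood you"),
    # including repeating part of what the participant said.
    "high": ["sure", "okay, I see", "I understood: {repeat}"],
}

def wizard_feedback(setting, heard=""):
    """Pick a feedback utterance for the current setting (random or
    round-robin selection would both fit; here we just take the first)."""
    utterance = FEEDBACK[setting][0]
    return utterance.format(repeat=heard) if "{repeat}" in utterance else utterance

print(wizard_feedback("low"))  # yes
```

The key design point is that both repertoires are pure feedback: neither setting ever takes the floor or asks a question.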
So altogether we had six settings: three robots times two speech settings.
The task was this. We would set a robot down right here, whether the Kobuki or the Cozmo robot, or put nothing there for the no-robot setting, and we had these cameras here to record the participant. On the desk we had these little puzzle pieces; I don't know if you recognize them. On this paper there were three different target shapes that could be built with the three pieces, and each of the shapes had a name. The only instructions we gave the participants were: show the robot how to build each of these shapes, make sure at the end you tell the robot what the name is, and just do one after another. As they interacted with the robot, the robot would give some feedback, depending on the setting, while they were talking to it; of course, it was controlled by a wizard.
The procedure went like this. We randomly put a robot here, the participant would interact with it and fill out a questionnaire about that interaction, and then we gave them a new set of puzzle tiles and a new list of target shapes. They would interact with the next robot and fill out the questionnaire again for that interaction. Then they would have the third robot, with a new set of puzzle pieces and target shapes, and then the final questionnaire.
The things we randomly assigned were the robot presentation order and the order of the puzzles. We had two different Amazon voices, one male and one female, for the Kobuki and the spoken dialogue system, randomly assigned; Cozmo had its own voice. And then there was the language setting: the high or low feedback setting stayed the same for all three interactions. We just flipped a coin at the beginning, and the participant got that setting for all three robots.
We collected data from the camera facing the participants, which gave us audio and video, and of course the questionnaires. In the end we had twenty-one participants, ten male and eleven female, and each interacted with all three robots, yielding the sixty-three interactions we collected, and fifty-eight questionnaires: five had to be thrown out because they weren't correctly filled out.
Then we moved on to the data analysis. For each interaction with an individual robot, we took a snapshot every five seconds and averaged over the emotion distributions from the Microsoft Emotion API. If you're not familiar with this API: you send it an image, and it gives you a distribution over eight different emotions. Here's an example: this is someone who is mostly neutral, with a little bit spread over the other emotions. Here's someone who's happy, with a little bit on the others. And here's someone mostly neutral but with more contempt; look at that contempt there. Contempt actually came up a little bit in our study. So we collected this data.
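The per-interaction emotion feature can be sketched like this. The snapshot scores below are hypothetical stand-ins for what the (now retired) Microsoft Emotion API returned; the actual API call is omitted, and only the averaging step from the talk is shown.

```python
# The eight emotion categories returned by the Microsoft Emotion API.
EMOTIONS = ["anger", "contempt", "disgust", "fear",
            "happiness", "neutral", "sadness", "surprise"]

def interaction_emotion_profile(frame_scores):
    """Average per-snapshot emotion distributions (one snapshot every
    five seconds) into a single distribution for the whole interaction."""
    n = len(frame_scores)
    sums = [sum(col) for col in zip(*frame_scores)]
    return {emo: s / n for emo, s in zip(EMOTIONS, sums)}

# Hypothetical scores for three snapshots of a mostly neutral participant.
frames = [
    [0.0, 0.1, 0.0, 0.0, 0.1, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.7, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.2, 0.6, 0.0, 0.0],
]
print(round(interaction_emotion_profile(frames)["neutral"], 3))  # 0.7
```

Because each snapshot is a probability distribution, the averaged profile is one too, which makes per-setting comparisons (like the happiness percentages below) straightforward.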
To give you some numbers about the emotions we found: most of the time people were simply neutral, about eleven percent of the time they were happy, and surprise and contempt were the next most common. The other emotions were negligible, less than one percent on average, across all settings, all robots, everything.
Then we looked at the robots in the different settings individually. If you marginalize out the robots and just compare the low and high settings, we find that people spent a lot more time being happy in the low setting, which is just giving phonetic receipt, than in the high setting. Part of this is that in the high setting the robot is marking "I semantically understood you," and people got really frustrated because they expected more interaction from the robots, but the robots weren't doing anything more than giving this verbal feedback. So people weren't very happy with any robot in the high setting. The robots themselves tell a similar story: there was a little more happiness with Cozmo, and people would rather interact with Cozmo or with the disembodied spoken dialogue system than with the Kobuki, for whatever reason. You can tease apart the individual settings here; I'll refer you to the paper if you want to dig into more detail.
We also looked at prosody, very simply: for each interaction we averaged the F0 over the entire interaction, maybe a couple of minutes of speech, and just for the participant, not the robot.
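The talk doesn't name the pitch tracker, so here is a minimal sketch of the idea under that assumption: estimate F0 per frame with a crude autocorrelation peak pick, then average over the interaction. It is checked on a synthetic tone rather than real recordings.

```python
import numpy as np

def frame_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Crude autocorrelation pitch estimate for one frame of audio."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible pitch lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def mean_f0(signal, sr, frame_len=2048, hop=512):
    """Average F0 over all frames of an interaction's audio."""
    f0s = [frame_f0(signal[i:i + frame_len], sr)
           for i in range(0, len(signal) - frame_len, hop)]
    return float(np.mean(f0s))

# Synthetic sanity check: a pure 220 Hz tone should come out near 220 Hz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
print(mean_f0(tone, sr))
```

A production analysis would use a proper pitch tracker (e.g. Praat or a YIN implementation) and would skip unvoiced frames; the averaging step is the part that mirrors the talk.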
Here are some results for that. If you marginalize out the robots, people had a higher pitch in the low setting, whereas in the high setting they did not. This fits the literature: people who talk to children raise their pitch a little bit, which is what we wanted, and even this small difference in feedback affected the pitch across all the robots. If you instead marginalize out the low and high settings and just compare the robots, people talked to the Cozmo robot at a much higher pitch than to the other two, which were close to each other: a little bit different, but not a whole lot. So the way the robot looks and the way the robot talks both make a difference here; prosody tells us that.
We then transcribed each user's speech using an automatic speech recognizer. Of course it makes some mistakes, but we just went with it, and we segmented the transcriptions into sentences by pause detection. This was pretty rough; we didn't tinker with it too much, we just spot-checked the transcriptions and passed them through some tools that gave us lexical complexity and syntactic complexity. For lexical complexity we used a lexical complexity analyzer, which gives us lexical diversity, via the mean segmental type-token ratio (MSTTR), and lexical sophistication; these are nice measures that we can use. For syntactic complexity we used the D-Level analyser, which gives a value between zero and seven: zero means a very short, one- or two-word, syntactically simplistic sentence, and seven means a long sentence with a lot of syntactic complexity.
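MSTTR itself is simple to state: chop the token stream into fixed-size segments and average the type-token ratio per segment. Here is a toy sketch; the segment size the analyzer tool actually uses is not given in the talk, so the sizes below are illustrative.

```python
def msttr(tokens, segment_size=50):
    """Mean segmental type-token ratio: average the type-token ratio
    over consecutive, non-overlapping, full segments of the token stream."""
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens) - segment_size + 1, segment_size)]
    if not segments:
        raise ValueError("need at least one full segment of tokens")
    ttrs = [len(set(seg)) / segment_size for seg in segments]
    return sum(ttrs) / len(ttrs)

# Toy example with 4-token segments: the two segment TTRs are 1.0 and 0.75.
tokens = "put the red piece next to the the the".split()
print(msttr(tokens, segment_size=4))  # 0.875
```

Segmenting before averaging is what makes MSTTR comparable across interactions of different lengths, unlike a raw type-token ratio, which shrinks as people talk more.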
With the lexical diversity and MSTTR, the results are very similar to what we got for prosody: in the low setting people used less sophisticated vocabulary. The thing that was surprising, and that I want to show you here, is the syntactic complexity: in the low setting we saw higher syntactic complexity, more level-seven, longer sentences, than in the high setting. For the most part people say very short one- or two-word sentences in all settings with all robots, but in some cases they speak in longer sentences. We dug into this a little and found some literature that supports what we found in our data: in the low setting the robot only gives phonetic receipt, it's not signalling semantic understanding, so people just kept talking, and the sentences got syntactically more complex even if the vocabulary stayed simple. So that's how the measures come out: low lexical sophistication but high syntactic complexity, because they just kept talking.
Looking at the questionnaires: for each interaction we used the Godspeed questionnaire, which consists of contrasting pairs, each rated on a five-point scale. Some examples: artificial versus lifelike, unfriendly versus friendly, incompetent versus competent, confusing versus clear. We then added the following two questions, which gave us the information we were really interested in. First: if you could give the robot you interacted with a human age, how old would you say it is? We binned the ages into these ranges: under two, two to five, six to twelve, thirteen to seventeen, eighteen to twenty-four, twenty-five to thirty-four, and thirty-five and older. Second: what level of education would be appropriate for the robot you interacted with? This is another proxy for age; we listed preschool, kindergarten, each grade with its own value, and then of course the college levels.
Just looking at the questionnaires on their own: in the low setting people assigned lower ages on average, and in the high setting higher ages on average, which is kind of expected. Looking at the robots: the Kobuki and the no-robot got higher ages, with the disembodied no-robot getting the highest; I think people see it as the most intelligent, the smartest, the oldest. Cozmo got the youngest, six to twelve, which is not surprising. Education tells a similar story: the low setting gets a much lower education level on average than the high setting. And the difference between the settings is not much, right? It's just phonetic receipt versus signalling semantic understanding, just a different feedback strategy, but it makes a huge difference. And of course people treat the robots differently: Cozmo tops out around tenth grade, and the other ones get undergraduate.
Then we put what we found from the questionnaires together with some of the other features we had. I want to point out a few things here. In the low setting, if you look at prosody (the average F0) against the questionnaire values, they correlate: a higher pitch goes with rating the robot as friendly, intelligent, conscientious, knowledgeable, and higher lexical complexity goes with rating it as more friendly. In the high setting, different items come up: sensible, enjoyable, natural, and humanlike correlate with lexical diversity. And then there's lexical sophistication, which I think is the interesting one in the high setting: if I'm using more complicated words to talk to the robot, it's more likely that I'm frustrated with the robot and feeling contempt toward it. I think that was the interesting result: people had high expectations of the robot in the high setting. "Well, you understood me, so say more, do more." They asked follow-up questions, and the wizard wasn't allowed to say anything beyond the simple feedback.
Some of the other correlations tell a similar story if you look at the robots individually instead of just the low and high settings; it's kind of the same picture. Sadness is negatively correlated here, and the other robots show some correlations as well, including features negatively correlated in the low setting. You can dig into this a bit more in the paper.
So, to predict the perceived age and academic level: now that we have this data, we want to use our prosodic, emotion, and language features to predict the age. We had fifty-eight data points, used five-fold cross validation, and just used a simple logistic regression classifier; nothing terribly complicated here, since there isn't very much data. If we use all seven labels, we don't do very well. But if we pick a splitting criterion, say, split at eighteen years old and see how well it does, we can predict fairly well whether someone thinks the robot is a minor or an adult. For academic level we did much the same thing and found we can split at preschool with reasonable accuracy, so we can tell whether someone thinks a robot is preschool age. Taken together, we can tell whether someone is assigning adulthood or minority to a robot, and furthermore whether they are assigning a preschool academic level to it. That's exactly what we want to be able to determine: do they think my robot is at preschool age, the language learning stage?
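The prediction step can be sketched as follows. The feature values here are synthetic stand-ins (the real features were the prosodic, emotion, and lexical measures), but the pipeline shape follows the talk: binarize the seven age bins at the eighteen-year boundary, then run five-fold cross-validated logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 58 interactions with a handful of features each; the numbers are
# synthetic stand-ins for the real prosodic, emotion, and lexical measures.
X = rng.normal(size=(58, 6))
age_bin = rng.integers(0, 7, size=58)  # seven age bins, 0 = "under 2"

# Binarize at the eighteen-year boundary: bins 0-3 are minors, 4-6 adults.
y = (age_bin >= 4).astype(int)

# Nudge one synthetic feature so the toy data is actually learnable.
X[:, 0] += 2.0 * y

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)  # five-fold cross validation
print(scores.mean())
```

The same shape applies to the academic-level prediction by swapping the threshold to preschool versus everything else.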
So that's it. We did some other analyses along with what I showed you, and they confirm the findings of the Novikova et al. work. The way a robot verbally interacts, which is the main takeaway, and the way it looks change the way human participants perceive the robot's age and academic level, and perceived age and academic level can be predicted using multiple features. As for future work, what we've kind of verified is that Cozmo is the right robot for the job for a first language acquisition task, and it doesn't need to look human for that. Thank you for your attention.
Question: I was curious why you used exactly that split. Preschool is really small children, and you could have split the education level in many different ways, right?
Answer: We did try a couple of other things; that one worked, and it also makes sense. Minor versus adult seems like a reasonable splitting criterion, so we used it. Of course it's not the one we're really looking for, which is whether people treat the robot like a young child, and that's what the preschool split does pretty well. It just worked out that way, I'm sorry.
Question: When you have this chart of the predicted age for the low and high settings, if I read it correctly, the low setting is more likely to be perceived as a child, but also somewhat likely to be perceived as an adult, and really unlikely to be a teenager?
Answer: Yes, there are these pesky undergraduate assignments. In general, if you look at the academic level here, both get some undergraduate answers, and this one gets a few additional ones, but there's a lot more preschool here, and more kindergarten and first grade here. On average it is quite a bit younger, but there are some people who assigned it high, and that's what's quite interesting.
Question: I may have missed something, but in your questionnaire results, when you say people had expectations of the robot, is that what people told you, or is it your explanation of the data based on other things you found, such as that they assessed the robots as knowledgeable and so on?
Answer: The Q values are what they said in the questionnaires; the other values, the P, the L, and the E, come from the data. So Q means it came from the questionnaire, and E means it came from the emotion estimates we got from the Microsoft Emotion API, which we just read off. We have what they're telling us and what we're getting from the data we collected, and we computed correlations between them, so some of it is our interpretation. Take the contempt case: in the high setting, we detected with our tools that participants used high lexical diversity, and we detected from the emotion API that they showed high contempt; those two things were correlated. The other items are what they reported: they thought it was enjoyable, sensible, and so on. So in the low setting, for example, when there was high lexical sophistication, they would also have given a high score on the questionnaire.
It's a testable interpretation, yes.

Okay, thank you.