Thanks for coming back for this session.
This is work by three students, mostly Sarah Plane, almost entirely; they're all undergraduate students. They kind of converged at the same time, were interested in this, and now they've all moved on and are doing other things, so I'm just the person presenting it for them. Our institution is Boise State.
If in the next couple of minutes you look up Boise, or wonder whether Idaho is a real state in the United States: it does exist, and Boise is the capital of that state, if you didn't know. It's a nice university; I've really enjoyed being there. I run the Speech, Language and Interactive Machines group, a sort of junior research group there; I've only been there for about two years.
So, let's just start.
I actually wanted to draw attention to this bottom reference here: what we're doing in this paper builds a lot on the Novikova et al. paper, which came out of Oliver Lemon's lab. They did research on basically social robotics, which is pretty similar to what we're doing, and we follow a lot of their methodology here.
What we wanted to look at: we have this little robot, and we wanted to do some language grounding studies with it. Then one of my students asked a question that we couldn't let go of. She said, well, are people actually going to treat this robot the way we want them to treat it, like a first language learner? And I was thinking, well, I don't know; maybe we should study this. That's actually how this paper happened.
A lot of the motivation comes from all of the great work in grounded semantics and symbol grounding, and from lots of other people not all mentioned here that we build on. The point is this: if you're a person interacting with a child, and the child is learning language, the child doesn't know language to the degree that an adult knows it. The child sees an object, and pretty much all objects have an annotation, a phrase or a single word, and maybe the child doesn't know the annotation for that object. So the adult says "that's a ball," and the child remembers it. It's quite amazing, and this is essentially what grounding is doing: when you do this with a machine like a robot, the robot has to perceive the object somehow and represent it somehow. A lot of the work up until now has used vision as the main modality for grounding language into some perceptual modality.
But once you have a robot, an embodied agent, people start assigning anthropomorphic characteristics to it based on how it looks and how it acts. As soon as they see a robot, they immediately think: is this a man or a woman, how tall is it, is it sympathetic, how can I interact with this thing, what can I expect? And as soon as someone says "this is a robot," people assume it has adult intelligence, which you don't want if you have a first language acquisition task that you want the robot to do.
That was the question my student asked. If we have this little robot and want to do a first language acquisition task in a setting very similar to the way children acquire their language, we cannot assume that people who interact with the robot are going to treat it like a child. So that's what we set out to do: predict what age, or what academic level, people assign to a robot. The main research question is this: does the way a robot verbally interacts affect how humans perceive the age of the robot? The short answer is yes, so if you want to go ahead and put your head down and have a little rest, feel free. But if you do care, we can tease this apart a little and show you what we did.
We ran an experiment. We had some robots and varied their appearance (three different ones, which I'll show you in a moment), and we varied the way the robots verbally interacted. Participants showed the robot how to build a simple puzzle; that was nominally the language grounding task, though no grounding was actually happening. They were interacting with the robots in this very simple dialogue setting, and we recorded them: we had a camera pointed at them as they interacted with the robots, and we recorded their speech. After they interacted with each robot, they filled out a questionnaire about their perceptions. Once we had gathered all this data, we analyzed it: facial emotions, prosody, and linguistic complexity. We found correlations between those measures and the perceived age, and from that we can predict it.
These are the three robots we used, partly because we had them, and because we wanted one robot that was kind of anthropomorphic and one that wasn't. Here is the non-anthropomorphic robot, the Kobuki; it looks basically like a Roomba with a bracket on it. Then this is Anki's Cozmo; I don't know if you've seen it. It's a very small robot, marketed as a toy, and it has a nice Python SDK. And then we had a disembodied, non-physical spoken dialogue system, which we affectionately named the no-robot. The no-robot: not a robot. So those are the three robots.
It's kind of embarrassing how little we did with the robots, but we had two speech settings that we wanted to test, because we wanted to see how people treated the robot based on how it interacted. The only speech we had the robots produce was feedback, and there were two settings of this feedback. One was minimal feedback like "yes" or "okay," which basically marked phonetic receipt; we call this the low setting. It says "I heard that," but whether or not the robot understood is left open. The other feedback setting marked semantic understanding, with things like "sure," "okay, I see," or a repeat of what was said, to show "I understood you correctly." These are all just feedback: the robot is not really taking the floor, and there isn't really a lot of dialogue going on, but there are these two settings, and we found that they make quite a difference. Other than that, the robots didn't move. On the Kobuki a light was on, and that was it; Cozmo, in its default setting, had these little animated eyes that just kind of looked around, but it didn't move either. As far as the participants could tell, it was just talking.
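To make the two settings concrete, the wizard's feedback repertoires could be written down like this. This is a sketch: the specific utterance lists and the selection logic are illustrative assumptions based on the talk, not the study's actual inventory.

```python
# Two wizard feedback settings, as described in the talk.
# The specific utterances here are illustrative, not the study's full inventory.
FEEDBACK = {
    # Low setting: marks phonetic receipt only ("I heard you").
    "low": ["yes", "okay"],
    # High setting: marks semantic understanding ("I understood you"),
    # including repeating part of what the participant said.
    "high": ["sure", "okay, I see", "I understood: {repeat}"],
}

def wizard_feedback(setting, heard=""):
    """Pick a feedback utterance for the current setting (random or
    round-robin selection would both fit; here we just take the first)."""
    utterance = FEEDBACK[setting][0]
    return utterance.format(repeat=heard) if "{repeat}" in utterance else utterance

print(wizard_feedback("low"))  # yes
```

The key design point is that both repertoires are pure feedback: neither setting ever takes the floor or asks a question.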
So altogether we had six settings: three robots times two speech settings.
The task was this. We would set a robot down right here, whether the Kobuki or the Cozmo robot, or put nothing there for the no-robot setting, and we had these cameras here to record the participant. On the desk we had these little puzzle pieces; I don't know if you recognize them. On this paper there were three different target shapes that could be built with the three pieces, and each of the shapes had a name. The only instructions we gave the participants were: show the robot how to build each of these shapes, make sure at the end you tell the robot what the name is, and just do one after another. As they interacted with the robot, the robot would give some feedback, depending on the setting, while they were talking to it; of course, it was controlled by a wizard.
The procedure went like this. We randomly put a robot here, the participant would interact with it and fill out a questionnaire about that interaction, and then we gave them a new set of puzzle tiles and a new list of target shapes. They would interact with the next robot and fill out the questionnaire again for that interaction. Then they would have the third robot, with a new set of puzzle pieces and target shapes, and then the final questionnaire.
The things we randomly assigned were the robot presentation order and the order of the puzzles. We had two different Amazon voices, one male and one female, for the Kobuki and the spoken dialogue system, randomly assigned; Cozmo had its own voice. And then there was the language setting: the high or low feedback setting stayed the same for all three interactions. We just flipped a coin at the beginning, and the participant got that setting for all three robots.
We collected data from the camera facing the participants, which gave us audio and video, and of course the questionnaires. In the end we had twenty-one participants, ten male and eleven female, and each interacted with all three robots, yielding the sixty-three interactions we collected, and fifty-eight questionnaires: five had to be thrown out because they weren't correctly filled out.
Then we moved on to the data analysis. For each interaction with an individual robot, we took a snapshot every five seconds and averaged over the emotion distributions from the Microsoft Emotion API. If you're not familiar with this API: you send it an image, and it gives you a distribution over eight different emotions. Here's an example: this is someone who is mostly neutral, with a little bit spread over the other emotions. Here's someone who's happy, with a little bit on the others. And here's someone mostly neutral but with more contempt; look at that contempt there. Contempt actually came up a little bit in our study. So we collected this data.
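The per-interaction emotion feature can be sketched like this. The snapshot scores below are hypothetical stand-ins for what the (now retired) Microsoft Emotion API returned; the actual API call is omitted, and only the averaging step from the talk is shown.

```python
# The eight emotion categories returned by the Microsoft Emotion API.
EMOTIONS = ["anger", "contempt", "disgust", "fear",
            "happiness", "neutral", "sadness", "surprise"]

def interaction_emotion_profile(frame_scores):
    """Average per-snapshot emotion distributions (one snapshot every
    five seconds) into a single distribution for the whole interaction."""
    n = len(frame_scores)
    sums = [sum(col) for col in zip(*frame_scores)]
    return {emo: s / n for emo, s in zip(EMOTIONS, sums)}

# Hypothetical scores for three snapshots of a mostly neutral participant.
frames = [
    [0.0, 0.1, 0.0, 0.0, 0.1, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.7, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.2, 0.6, 0.0, 0.0],
]
print(round(interaction_emotion_profile(frames)["neutral"], 3))  # 0.7
```

Because each snapshot is a probability distribution, the averaged profile is one too, which makes per-setting comparisons (like the happiness percentages below) straightforward.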
To give you some numbers about the emotions we found: most of the time people were simply neutral, about eleven percent of the time they were happy, and surprise and contempt were the next most common. The other emotions were negligible, less than one percent on average, across all settings, all robots, everything.
Then we looked at the robots in the different settings individually. If you marginalize out the robots and just compare the low and high settings, we find that people spent a lot more time being happy in the low setting, which is just giving phonetic receipt, than in the high setting. Part of this is that in the high setting the robot is marking "I semantically understood you," and people got really frustrated because they expected more interaction from the robots, but the robots weren't doing anything more than giving this verbal feedback. So people weren't very happy with any robot in the high setting. The robots themselves tell a similar story: there was a little more happiness with Cozmo, and people would rather interact with Cozmo or with the disembodied spoken dialogue system than with the Kobuki, for whatever reason. You can tease apart the individual settings here; I'll refer you to the paper if you want to dig into more detail.
We also looked at prosody, very simply: for each interaction we averaged the F0 over the entire interaction, maybe a couple of minutes of speech, and just for the participant, not the robot.
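The talk doesn't name the pitch tracker, so here is a minimal sketch of the idea under that assumption: estimate F0 per frame with a crude autocorrelation peak pick, then average over the interaction. It is checked on a synthetic tone rather than real recordings.

```python
import numpy as np

def frame_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Crude autocorrelation pitch estimate for one frame of audio."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible pitch lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def mean_f0(signal, sr, frame_len=2048, hop=512):
    """Average F0 over all frames of an interaction's audio."""
    f0s = [frame_f0(signal[i:i + frame_len], sr)
           for i in range(0, len(signal) - frame_len, hop)]
    return float(np.mean(f0s))

# Synthetic sanity check: a pure 220 Hz tone should come out near 220 Hz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
print(mean_f0(tone, sr))
```

A production analysis would use a proper pitch tracker (e.g. Praat or a YIN implementation) and would skip unvoiced frames; the averaging step is the part that mirrors the talk.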
Here are some results for that. If you marginalize out the robots, people had a higher pitch in the low setting, whereas in the high setting they did not. This fits the literature: people who talk to children raise their pitch a little bit, which is what we wanted, and even this small difference in feedback affected the pitch across all the robots. If you instead marginalize out the low and high settings and just compare the robots, people talked to the Cozmo robot at a much higher pitch than to the other two, which were close to each other: a little bit different, but not a whole lot. So the way the robot looks and the way the robot talks both make a difference here; prosody tells us that.
We then transcribed each user's speech using an automatic speech recognizer. Of course it makes some mistakes, but we just went with it, and we segmented the transcriptions into sentences by pause detection. This was pretty rough; we didn't tinker with it too much, we just spot-checked the transcriptions and passed them through some tools that gave us lexical complexity and syntactic complexity. For lexical complexity we used a lexical complexity analyzer, which gives us lexical diversity, via the mean segmental type-token ratio (MSTTR), and lexical sophistication; these are nice measures that we can use. For syntactic complexity we used the D-Level analyser, which gives a value between zero and seven: zero means a very short, one- or two-word, syntactically simplistic sentence, and seven means a long sentence with a lot of syntactic complexity.
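MSTTR itself is simple to state: chop the token stream into fixed-size segments and average the type-token ratio per segment. Here is a toy sketch; the segment size the analyzer tool actually uses is not given in the talk, so the sizes below are illustrative.

```python
def msttr(tokens, segment_size=50):
    """Mean segmental type-token ratio: average the type-token ratio
    over consecutive, non-overlapping, full segments of the token stream."""
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens) - segment_size + 1, segment_size)]
    if not segments:
        raise ValueError("need at least one full segment of tokens")
    ttrs = [len(set(seg)) / segment_size for seg in segments]
    return sum(ttrs) / len(ttrs)

# Toy example with 4-token segments: the two segment TTRs are 1.0 and 0.75.
tokens = "put the red piece next to the the the".split()
print(msttr(tokens, segment_size=4))  # 0.875
```

Segmenting before averaging is what makes MSTTR comparable across interactions of different lengths, unlike a raw type-token ratio, which shrinks as people talk more.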
With the lexical diversity and MSTTR, the results are very similar to what we got for prosody: in the low setting people used less sophisticated vocabulary. The thing that was surprising, and that I want to show you here, is the syntactic complexity: in the low setting we saw higher syntactic complexity, more level-seven, longer sentences, than in the high setting. For the most part people say very short one- or two-word sentences in all settings with all robots, but in some cases they speak in longer sentences. We dug into this a little and found some literature that supports what we found in our data: in the low setting the robot only gives phonetic receipt, it's not signalling semantic understanding, so people just kept talking, and the sentences got syntactically more complex even if the vocabulary stayed simple. So that's how the measures come out: low lexical sophistication but high syntactic complexity, because they just kept talking.
Looking at the questionnaires: for each interaction we used the Godspeed questionnaire, which consists of contrasting pairs, each rated on a five-point scale. Some examples: artificial versus lifelike, unfriendly versus friendly, incompetent versus competent, confusing versus clear. We then added the following two questions, which gave us the information we were really interested in. First: if you could give the robot you interacted with a human age, how old would you say it is? We binned the ages into these ranges: under two, two to five, six to twelve, thirteen to seventeen, eighteen to twenty-four, twenty-five to thirty-four, and thirty-five and older. Second: what level of education would be appropriate for the robot you interacted with? This is another proxy for age; we listed preschool, kindergarten, each grade with its own value, and then of course the college levels.
Just looking at the questionnaires on their own: in the low setting people assigned lower ages on average, and in the high setting higher ages on average, which is kind of expected. Looking at the robots: the Kobuki and the no-robot got higher ages, with the disembodied no-robot getting the highest; I think people see it as the most intelligent, the smartest, the oldest. Cozmo got the youngest, six to twelve, which is not surprising. Education tells a similar story: the low setting gets a much lower education level on average than the high setting. And the difference between the settings is not much, right? It's just phonetic receipt versus signalling semantic understanding, just a different feedback strategy, but it makes a huge difference. And of course people treat the robots differently: Cozmo tops out around tenth grade, and the other ones get undergraduate.
Then we put what we found from the questionnaires together with some of the other features we had. I want to point out a few things here. In the low setting, if you look at prosody (the average F0) against the questionnaire values, they correlate: a higher pitch goes with rating the robot as friendly, intelligent, conscientious, knowledgeable, and higher lexical complexity goes with rating it as more friendly. In the high setting, different items come up: sensible, enjoyable, natural, and humanlike correlate with lexical diversity. And then there's lexical sophistication, which I think is the interesting one in the high setting: if I'm using more complicated words to talk to the robot, it's more likely that I'm frustrated with the robot and feeling contempt toward it. I think that was the interesting result: people had high expectations of the robot in the high setting. "Well, you understood me, so say more, do more." They asked follow-up questions, and the wizard wasn't allowed to say anything beyond the simple feedback.
Some of the other correlations tell a similar story if you look at the robots individually instead of just the low and high settings; it's kind of the same picture. Sadness is negatively correlated here, and the other robots show some correlations as well, including features negatively correlated in the low setting. You can dig into this a bit more in the paper.
So, to predict the perceived age and academic level: now that we have this data, we want to use our prosodic, emotion, and language features to predict the age. We had fifty-eight data points, used five-fold cross validation, and just used a simple logistic regression classifier; nothing terribly complicated here, since there isn't very much data. If we use all seven labels, we don't do very well. But if we pick a splitting criterion, say, split at eighteen years old and see how well it does, we can predict fairly well whether someone thinks the robot is a minor or an adult. For academic level we did much the same thing and found we can split at preschool with reasonable accuracy, so we can tell whether someone thinks a robot is preschool age. Taken together, we can tell whether someone is assigning adulthood or minority to a robot, and furthermore whether they are assigning a preschool academic level to it. That's exactly what we want to be able to determine: do they think my robot is at preschool age, the language learning stage?
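The prediction step can be sketched as follows. The feature values here are synthetic stand-ins (the real features were the prosodic, emotion, and lexical measures), but the pipeline shape follows the talk: binarize the seven age bins at the eighteen-year boundary, then run five-fold cross-validated logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 58 interactions with a handful of features each; the numbers are
# synthetic stand-ins for the real prosodic, emotion, and lexical measures.
X = rng.normal(size=(58, 6))
age_bin = rng.integers(0, 7, size=58)  # seven age bins, 0 = "under 2"

# Binarize at the eighteen-year boundary: bins 0-3 are minors, 4-6 adults.
y = (age_bin >= 4).astype(int)

# Nudge one synthetic feature so the toy data is actually learnable.
X[:, 0] += 2.0 * y

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)  # five-fold cross validation
print(scores.mean())
```

The same shape applies to the academic-level prediction by swapping the threshold to preschool versus everything else.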
So that's it. We did some other analyses along with what I showed you, and they confirm the findings of the Novikova et al. work. The way a robot verbally interacts, which is the main takeaway, and the way it looks change the way human participants perceive the robot's age and academic level, and perceived age and academic level can be predicted using multiple features. As for future work, what we've kind of verified is that Cozmo is the right robot for the job for a first language acquisition task, and it doesn't need to look human for that. Thank you for your attention.
Question: I was curious why you used exactly that split. Preschool is really small children, and you could have split the education level in many different ways, right?
Answer: We did try a couple of other things; that one worked, and it also makes sense. Minor versus adult seems like a reasonable splitting criterion, so we used it. Of course it's not the one we're really looking for, which is whether people treat the robot like a young child, and that's what the preschool split does pretty well. It just worked out that way, I'm sorry.
Question: When you have this chart of the predicted age for the low and high settings, if I read it correctly, the low setting is more likely to be perceived as a child, but also somewhat likely to be perceived as an adult, and really unlikely to be a teenager?
Answer: Yes, there are these pesky undergraduate assignments. In general, if you look at the academic level here, both get some undergraduate answers, and this one gets a few additional ones, but there's a lot more preschool here, and more kindergarten and first grade here. On average it is quite a bit younger, but there are some people who assigned it high, and that's what's quite interesting.
Question: I may have missed something, but in your questionnaire results, when you say people had expectations of the robot, is that what people told you, or is it your explanation of the data based on other things you found, such as that they assessed the robots as knowledgeable and so on?
Answer: The Q values are what they said in the questionnaires; the other values, the P, the L, and the E, come from the data. So Q means it came from the questionnaire, and E means it came from the emotion estimates we got from the Microsoft Emotion API, which we just read off. We have what they're telling us and what we're getting from the data we collected, and we computed correlations between them, so some of it is our interpretation. Take the contempt case: in the high setting, we detected with our tools that participants used high lexical diversity, and we detected from the emotion API that they showed high contempt; those two things were correlated. The other items are what they reported: they thought it was enjoyable, sensible, and so on. So in the low setting, for example, when there was high lexical sophistication, they would also have given a high score on the questionnaire.
It's a testable interpretation, yes.

Okay, thank you.