Speech Transcript - Supporting Spoken Assistant Systems with a Graphical User Interface that Signals Incremental Understanding and Prediction State

thank you good afternoon

i am casey kennington

currently at boise state university but this is work that i did

while i was at bielefeld university along with david schlangen

and i'm gonna give my two cents on

a continuation i guess on yesterday's discussion on personal assistants

because i'm gonna tell you a little bit about a personal assistant that we've

been working on

and if you don't know what a personal assistant is you're in the wrong conference

you've heard of them you've used them and they're great i mean they're useful

and we dialogue people aren't the only ones using them laypeople are using them

quite often quite regularly

but

when these laypeople use these

systems

these dialogue systems essentially these personal assistants they do weird things with them and they

complain about myriad things

and so today i want to talk about a few of those things and maybe make

an approach to addressing a couple of them

one thing is that they kind of have difficulty signalling affordances someone showed a book

yesterday on things you can do with siri

why does it need a book

that needs to be signalled somehow and a lot

of these show speech recognition output and sometimes it's great perfect

but you know well

that speech recognition even if it is perfect is not you know understanding

that's something else that needs to happen here

they don't know that it understood until it finally does something and comes back and the results

are

maybe what they wanted maybe not

another thing is the user has to

express their intent in one go

they have to say the whole thing wait for it to get back to them

and then they can continue

sort of like this back and forth with the system

looking into that a little bit more if you consider a

personal assistant on a continuum then at one extreme you have these

personal assistants that don't even really want to talk to you

they

want to it's apparently easier to predict your life than it is to predict

what you're trying to say and so google now is trying to do this and this

is useful

on the other side of the continuum you have the full turn

personal assistant that is expecting you to

give an entire intent and then it

does all its understanding and gives you some kind of response maybe there's something

in the middle that would be a little bit nicer

sub-turn a little bit to the left here so

i say call mom and there's some sort of feedback that it understood me

and i know that it understood me so far and then i can

say on speakerphone and okay good

and we can move this maybe even a little bit more to the left

and say something like call and it shows your mom

on speakerphone

yes

exactly that's what i meant to say

so there's a little bit of prediction it's not trying to predict your entire life it's

allowing you to give at least part of the intent but it's doing some prediction

there so we can maybe make our dialogue systems fit somewhere on this continuum that's useful

for any particular user

we want to look at this a little bit

really quick related work some inspiration joyce tries work on misalignment manners signalling understanding and

others work

on backchannels stuff on arts and

work on guis which we kind of are gonna do here and then of course

the luis project

we take inspiration from all of these

for some reason none of these people are here

but we're gonna do something using all this all of these as a sort of

inspiration so we're gonna signal ongoing understanding

using a gui

assuming here of course that people have a way to display a gui so this might

not work on something like the amazon

echo but most people have other phones with them and can use the personal assistant

with the display

and with this the backchannels don't overlap with speech so if they're talking and it's

updating and showing them its understanding then it's not gonna have any problems importantly it works

incrementally

that is word for word i'll explain that in a moment a little bit more and

it works with

minimal or no training data

the rest of the talk is as follows i'm gonna explain our system

and the components of it and then

see if that system is worth its salt

well first the system

at first blush looks like any other dialogue system you've ever seen there's speech recognition there's

nlu there's dialogue management there's some way to convey the

response to the user

usually with nlg but in this case a gui

the speech recognition i'm not gonna go into too much it's

google asr we have it modularised here nicely to give us incremental

results so word by word it's coming back to us and we take that incremental

output from the asr and give it to our nlu

and our nlu is working in lockstep with that so it takes a word

and we're gonna use the simple incremental update model which we introduced at

sigdial in two thousand thirteen

and without getting technical you can look at the paper if you like

equations and things like that what you get is you give it a word

and it's going to produce a distribution over slots

and that can be given to the dm the dialogue manager is gonna use

that somehow

with this little provision when someone utters a word

asr gives us a word

that is the same as or similar to

a value that could fill a candidate slot

then that's gonna get more credit and this is how we are able to make

the system work with little or no training data and then build up from there
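
The word-by-word crediting described here can be illustrated with a small Python sketch. It is not the simple incremental update model from the paper, only a simplified stand-in: the slot names, candidate values, the `IncrementalSlotScorer` class and the use of `difflib` string similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical slots and candidate values, made up for illustration.
CANDIDATES = {
    "cuisine": ["thai", "italian", "chinese"],
    "contact": ["peter", "anna", "mom"],
}


def similarity(word, value):
    """String similarity in [0, 1]; 1.0 means the word equals the value."""
    return SequenceMatcher(None, word.lower(), value.lower()).ratio()


class IncrementalSlotScorer:
    """Keeps a running score per (slot, value) pair, updated word by word."""

    def __init__(self, candidates, smoothing=0.01):
        # Every candidate starts with a small score so nothing is ruled out.
        self.scores = {
            (slot, value): smoothing
            for slot, values in candidates.items()
            for value in values
        }

    def update(self, word):
        """Credit each candidate by its similarity to the new word."""
        for key in self.scores:
            _slot, value = key
            self.scores[key] += similarity(word, value)
        return self.distribution()

    def distribution(self):
        """Normalise the running scores into a probability distribution."""
        total = sum(self.scores.values())
        return {key: score / total for key, score in self.scores.items()}


scorer = IncrementalSlotScorer(CANDIDATES)
for word in ["i", "want", "thai"]:
    dist = scorer.update(word)  # one distribution per incoming word
best = max(dist, key=dist.get)
print(best)
```

Because the credit comes from similarity to a list of candidate values rather than from learned parameters, a sketch like this needs only the value lists to start working, which matches the "little or no training data" point.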

that's the nlu

but the dialogue manager is taking these

word for word the nlu is giving these slot

distributions to the dialogue manager and the dialogue manager has to do something with that

so

in fact it's making one of four

fairly simple decisions one is

i get a slot i look at its confidence value and what i can do is i

can wait

if the confidence value is low just sort of ignore it

that particular value isn't enough to make the slot the one that i

want

or i can select something

if it's above some confidence threshold then the slot is good let's fill it with this

value

or the two others here we're close to that threshold

but not quite there so let's make a clarification request and somehow display that in the gui

and then of course they have to be able to confirm that request
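
As a rough illustration of these four decisions, here is a minimal Python sketch. The threshold values, the `Action` names and the `decide` function are hypothetical; the talk only says that one threshold triggers selection and that confidences close to it trigger a clarification request.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    SELECT = "select"
    REQUEST = "request_clarification"
    CONFIRM = "confirm"


def decide(confidence, pending_clarification=False, user_confirmed=False,
           select_threshold=0.8, clarify_threshold=0.5):
    """Pick one of four actions for a slot value.

    The two thresholds are illustrative assumptions, not values from the talk.
    """
    # A clarification was displayed and the user said yes: confirm the value.
    if pending_clarification and user_confirmed:
        return Action.CONFIRM
    # Confident enough: fill the slot with this value.
    if confidence >= select_threshold:
        return Action.SELECT
    # Close to the threshold but not there: ask and show it in the gui.
    if confidence >= clarify_threshold:
        return Action.REQUEST
    # Otherwise ignore the value for now and wait for more words.
    return Action.WAIT


for conf in (0.2, 0.6, 0.9):
    print(conf, decide(conf).value)
```

Since `decide` is called after every word, the display can change from waiting to a clarification request to a filled slot while the user is still speaking.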

i want to point out here that it is here between the nlu

and the dialogue manager

where the endpointing is done we're not doing endpointing with speech recognition that's

just always on

and it's here

so they can stop and pause and think and then say something it'll wait for

them to finish so they can do things in instalments so it's sort of semantically

driven endpointing

and we use opendial

for this it's sort of rule-based at the moment but the provisions are

there now for

reinforcement learning and learning on-line to improve the system as people interact with it

now

the dialogue manager decides which slots are to be filled and it says gui here's

the decision i've made please convey this information to the user

and the gui you'll notice right off the bat we

obviously aren't ui designers

but here's the idea you turn the system on and

this comes up it's in javascript

and it just looks like a right-branching tree and really that's all it is

but right here you can already see what the affordances are oh we can

do these five things that's nice

i don't have to guess i don't have to play with it and figure out what

it knows and what it doesn't know

and so i look at this thing and say well you know i am kinda

hungry and it will go then into the food domain and sort of open up

the tree and its slots

if you're hungry then

you know it wants to know what you want and where

you're gonna eat

and i can say you know i'm in the mood for some thai food and at

that point it'll

go to the top here and

show this node in red with a question mark for this clarification state did you say thai

and this to me is

intuitive in that it

is trying to understand me and all i have to do is say yes or i

mean thai and that would

basically fill that slot which

conveyed visually means that it just collapses that part of the tree and shows it like this

so here's a frame that is filled

and it's shown visually like this

that's our system

recall right

now we did some experiments to see if that system was everything we

hoped it would be and we put some people in front of it

so

we want to test a couple of things about this system so we're gonna break

it up into basically four different

different settings

we want to test

we want to see if our incremental system is better than or more useful i

suppose than the traditional one

so we're gonna let them play with it first and give them a trial

phase here's our system here are some tasks to do and get used to the

interface and then we're gonna

start on the very right side of the continuum where they're

doing this

traditional

turn-taking full-intent

personal assistant

so it endpoints

as usual

kind of like the traditional personal assistant

so we

then move on the continuum a little bit to the left and

now it's incremental now we're doing sub-turns

and you can

do things in instalments

and then we have phase three where we're moving that

a little bit more to the left on the continuum and saying

now it's going to adapt to you a little bit and try to predicts and

fill some these slots for you

to expand on that a little bit phase one acted like a standard personal assistant silence-based

endpointing before the gui would even show and the asr was shown like it

is in your standard personal assistant

phase two is the incremental phase so they did phase one for four minutes

and then they began phase two phase two did not display asr just the gui and

it just was always there always showing always updating

and the endpointing as i mentioned was done semantically

after phase two ended there was a questionnaire and we just asked them you know

what do you think about

these different systems so there were ten questions and we asked them you know

whether they preferred the first system the second system either or both

and phase three started this was the adaptive phase

which is basically the same as phase two with adaptation and the way it worked was a

very simple way

if they did a task

basically filled a slot or frame

and they

did that same thing again it would remember it and start to

just immediately ask a clarification so instead of saying i want this i

want thai food they would say i'm hungry and

they'd just have to say yes and it would fill the slots for them

and then after three times we'd just fill in the frame entirely for them

and i'll show an example of that in a video in a moment
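
The adaptation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AdaptiveMemory` class, its method names and the exact counts (`clarify_after=1`, `autofill_after=3`) are hypothetical readings of "did that same thing again" and "after three times".

```python
from collections import Counter


class AdaptiveMemory:
    """Remembers completed frames per intent and suggests shortcuts."""

    def __init__(self, clarify_after=1, autofill_after=3):
        # Both counts are assumptions about how the talk's rule is triggered.
        self.clarify_after = clarify_after
        self.autofill_after = autofill_after
        self.counts = Counter()

    def record(self, intent, frame):
        """Store one completed frame, for example {'cuisine': 'thai'}."""
        self.counts[(intent, tuple(sorted(frame.items())))] += 1

    def suggest(self, intent):
        """Return ('none' | 'clarify' | 'fill', frame) for a new request."""
        best_frame, best_count = None, 0
        for (seen_intent, items), count in self.counts.items():
            if seen_intent == intent and count > best_count:
                best_frame, best_count = dict(items), count
        if best_count >= self.autofill_after:
            return "fill", best_frame      # fill the whole frame right away
        if best_count >= self.clarify_after:
            return "clarify", best_frame   # offer it as a yes/no question
        return "none", None


memory = AdaptiveMemory()
first = memory.suggest("food")             # nothing remembered yet
memory.record("food", {"cuisine": "thai"})
second = memory.suggest("food")            # seen once: ask for confirmation
memory.record("food", {"cuisine": "thai"})
memory.record("food", {"cuisine": "thai"})
third = memory.suggest("food")             # seen three times: fill it in
print(first, second, third)
```

With this shape, saying only "i'm hungry" is enough to trigger the remembered suggestion, and the user's "yes" then plays the role of the confirmation step.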

and then after phase three we had another questionnaire that compared phases two and three

so here's that video

so this is in german i'm doing this

so if you speak german i apologise for my accent and so anyway

i'm saying something like this i'm hungry i want to eat something around here

maybe thai food

and it does a clarification so i say exactly

and then i repeat this several times to show you the adaptability of this

this isn't something you would do you're not gonna take your personal assistant and repeat

yourself five times

it's gonna give us a lot

but just to show the functionality of this

stress

are

it's filter not just one more kiss

we are hungry and now it's also

i feel like

and i don't see that same thing i am hungry

and then the last time i said calmly

if someone else

i'm a pretty

pretty easy guy to predict yes but this is common

people who use these personal assistants do the same thing

over and over again

my brother here's an anecdote my brother every day twice a day opens up his

iphone says hey siri

or google voice show me the traffic

every day

he does it just like that and it gets the response he wants and people do

this and it could probably just pop up and show the traffic

where am here

so we got fourteen participants to come and sit down with our system so we

sat them down at a table there is

a screen that shows the task that they were to do i'll explain that in a moment

and then there is a tablet that was turned on its side it

shows the gui and the gui was as i showed you and

it's javascript so it was in a web browser basically

and then there's a keyboard to push a button

to signal that the task was complete so the tasks were like

this there are five possible tasks call reminder

find a restaurant leave a message or find a route between two cities

and the tasks were shown as icons and the task items were randomly chosen a randomly chosen task

with randomly chosen slots that we wanted them to convey to the system and then

there was a fifty percent chance later that the task would be repeated

here's an example

they'd be sitting down playing with the system and then something

like this would pop up on the screen and they would say oh call

peter

and the system would then

do its magic then

show its gui and once they

recognised that it understood then they would push a button and a new task would pop up

and they were charged with doing as many of these tasks as possible

we wanted to do this

and not just let them play with it because the tasks

help us

collect some objective measures as well if we tell them we want them to do

as many tasks as possible in the four minutes they have to interact with

each setting of the system then we can learn a little bit more about how

productively they work

so here are the other tasks they would see stuff like this

so we have the twenty most common german names you know the most populous

cities in germany bielefeld it turns out is among them

and you know everything else so there's quite a few possibilities that

could be said here

but again

we didn't train this at all we just sort of typed these in and got a

list of stuff and threw it into the system imported it and that was the end

of it and it worked

but here are some results from the questionnaire we can conclude

the following based on some significance scores they generally liked the gui

they found it intuitive to use and easy and understandable

and that was our main focus our main goal

the gui allowed mistakes to be taken care of locally and they did this a lot

if a slot was filled with the wrong thing they

would immediately try to fix it

they didn't always just push a button and move on to the next task or

there was a keyword they could say that would restart from the beginning they

generally tried to fix it right there and it was able to do it

most of the time

and they didn't generally notice between phase two and phase three the incremental and

adaptive phases they didn't really know there was

something adapting but for those who did notice which was about half of them they

noticed that it was in phase three and there's a listing of all

the questions and there's more in the results section of the paper on

this

these are just some things we want to highlight from that

the objective results these tell an interesting story so we

just counted the number of tasks they were able to do in the different

settings

and once they get to the incremental and adaptive ones they're able to do quite a few more tasks

at least they thought the tasks were complete

and the next row is frame accuracy so when all the slots in

the frame were the same as the ones that we wanted them to convey in the

task that we showed

and the adaptive one

does quite well because part of the time the slots are already filled

for them

so score one for google now

i guess trying to predict your life is actually maybe easier than learning how to

understand language

the other two tell a more interesting story we get f-score which is

basically maybe the entire frame wasn't correct but this gives an idea of

the correctness of the slots of the frame maybe one or two of the slots were correct and

one wasn't
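
The difference between the two measures can be made concrete with a short Python sketch. The talk does not say how the f-score was averaged, so the micro-averaged slot f-score below is an assumption, and the example frames are made up.

```python
def frame_accuracy(predicted, gold):
    """Fraction of frames where every slot matches the intended frame."""
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)


def slot_f_score(predicted, gold):
    """Micro-averaged f-score over individual (slot, value) pairs."""
    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        p_items, g_items = set(p.items()), set(g.items())
        tp += len(p_items & g_items)   # slots filled correctly
        fp += len(p_items - g_items)   # slots filled with a wrong value
        fn += len(g_items - p_items)   # intended slots that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up frames: the second one has two correct slots and one wrong slot.
gold = [
    {"intent": "call", "contact": "peter"},
    {"intent": "food", "cuisine": "thai", "location": "nearby"},
]
predicted = [
    {"intent": "call", "contact": "peter"},
    {"intent": "food", "cuisine": "thai", "location": "downtown"},
]
acc = frame_accuracy(predicted, gold)
f1 = slot_f_score(predicted, gold)
print(acc, round(f1, 3))
```

Here the second frame counts as entirely wrong for frame accuracy but still earns partial credit in the slot f-score, which is the distinction being drawn between the two rows.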

and

in this case incremental is lower and then look at the time the time is about

the same across all and this tells us that the gui was

intuitive enough that in the

phase where they are just playing with it in the trial phase

they learn enough about it and experience enough that they aren't just getting used to it

over time

and

both these rows kind of tell that story

so it helps them to be a little bit more productive especially in the

adaptive

setting

so they're kinda nice results not the most stellar this thing isn't you

know going to be in everyone's phone next month

but

like i said we didn't use any training data and it was fairly robust

some discussion here

our incremental personal assistant or ipa i suppose allows users to fix mistakes

easier and sooner allows the users to interpret the state of the system's understanding

and under the adaptive settings it allows users to be more productive you get more

tasks done in this kind of setting where we're driving them to do tasks

like this

and endpointing based on semantics not based on silence

is a nice thing

future work

online learning is the obvious thing we have a system with no training data let's interact

with it and it should start to learn and do things better

and the mechanisms are there sium the nlu model we have the dialogue manager we

have all have provisions for this we just need some kind of a supervision signal

which we have if the frame is filled and gets sent on they're happy with that

we can give feedback now to say those utterances led to this then that should

help the nlu and hopefully the dialogue manager work better

same for additive

and better user modelling and adaptability

could be improved

also web-based authoring luis does this a lot of other systems do this

right now it's not too bad you can author a json file and it'll import it there's

tools for that and it's actually fairly quick and easy but web-based authoring might

be nice and then of course we need to scale up to

larger domains the gui is the bottleneck here and it's sort of a two-edged sword you

wanna show your stuff but also be able to handle lots and lots of general

things so

that is it thank you

note that focus on

if the

right

right like i said we're not ui

experts bring us to if you're right it's gives call i guess on but what

we have right now is sort of a max of seven or eight nodes

that is just sort of dot the thing you have to do there is there

and what gets shown what are the top seven that you will show and if

those are if there's something that's not english on their then you doing something wrong

so there's more user modelling that happens in that regard what gets shown on the

gui

better nlu would help with that

better user modelling would help with that

good question

research future stuff

i q

right that's future work i mean the

provisions are there also in this you can click on it

clicking doesn't do anything but the idea is kind of like the stuff larson

did with his gui you can talk about the gui itself and navigate and go

and say no i don't want any of those go down a little bit we start

right there are some exactly

exactly so you can flip through it put stuff and you can add something if

it's not there that would be nice to and i guess but right and system

in as becomes intent that you can use in the future the gui should be

able to help with that

okay

right so it

right so the question or comment was on the semantic endpointing bit of it

something to look at i don't have

an answer

definitely something worth considering

right

agree

no not

i want to be really clear on that in the trial phase maybe they've

done all the adapting they're done adapting but the system is so rudimentary and simple

and the gui is such that it doesn't do much you know there's only

a couple of things that it does they learn about it very quickly

that's why the time didn't really change

you know the average time

per task

so they weren't just

getting used to it over time because they were already used to it before they even

started the first phase that's kind of the takeaway i got from the objective

scores

that's something we were concerned with that's why we designed it this way

i knew someone would ask that question i'm glad somebody did exactly

because of the way we wanted to do the comparisons we wanted to do

these subjective comparisons and we wanted to do some objective scores and this was a

debate we had but we ended up doing it this way with the hope that

if we designed it the right way

they'd get used to it right at the beginning we wouldn't have those effects and the numbers

can show that

i'm glad you asked that

Oral session 4: Incremental processing

Casey Kennington and David Schlangen