welcome to the special session so actually the time has passed so let's start

now so this year we proposed the special session entitled future directions of dialogue-based

intelligent personal assistants so i'm yoichi matsuyama from carnegie mellon

and alexandros papangelis is from toshiba research

so next meeting so we have one hundred i'm sorry have one or how our

from now

i sorry seven

yes so today

it's an era of personal assistants so many tech giants

released dialog-based personal assistants including amazon google and microsoft and many others as a

front end of their whole services so it's a big deal so now dialogue

system research is

very crucial for that kind of service so we all dialogue

system researchers

are rock stars in you know society in this era i believe so that's why

we actually proposed this special session to discuss our next future direction

or vision

so

let's get started

so this is today's agenda so we're gonna quickly give an introduction to establish

the common ground and then the panel discussion

so we have four notable panelists from academia and industry so let us introduce them

maybe later and then q and a

so this is actually a very flexible kind of discussion so we are happy to

get your questions and anything from the audience so we are happy to have your

opinions anytime

so

by the way so what's a personal assistant now so you know what it is

so this is siri or cortana from microsoft or facebook and also

alexa from

amazon so maybe we can call them personal

assistants of different kinds but

assuming they are personal assistants so anyway wikipedia says that a personal assistant is

like this so a software agent that can perform tasks or services for an individual so

basically that's it so in this session we're gonna define personal assistants as something like

this just simply personalised task management plus spoken dialogue capability

so that is

our

definition of personal assistants as a common ground in this session so that's

it

so

so let's look back a little bit at

the past

so

we think the current personal assistant has two major streams one is

task management and the other is spoken dialog research i think i'm not the right person to

describe that history

but on the side of personalised task management the best

vision one of the visions would be apple's knowledge navigator so that

was actually a very visionary vision so

actually presently siri just follows that vision

i believe also in two thousand three darpa announced the pal project which is

a very big project so i'm gonna describe it a little more

so this is the knowledge navigator so

some of you most of you may already know this but

this is actually the video so a very visionary video even now it is very

interesting so we don't have time so

okay i

this research cheating one just checking

a short circuits last year postech second extension was translated this mary his i am

i right this is sufficient

is that in i started lots of twelve o'clock

you need to take there are actually on schedule and

and you have a lecture on deforestation in the amazon rainforest

let me see the lecture notes from last semester

no that's not enough

i need to review more recent literature pull up all the new articles i haven't read yet journal

articles only

fine your friend jill gilbert has published an article about deforestation in the sahara

dialogue it is also rainfall some sarah

it also covers a classifier absolute reduction in africa

and increasing importance of so

context you like this one but sorry i'm sorry should increase for secure features it's

so please go to youtube there will be the

famous video so even now siri for example cannot reach that kind of quality

so meanwhile darpa

announced their big project called pal personalized assistant

that learns

and then the darpa award went to sri and cmu so actually

pal is a common general architecture and calo at sri and radar

at cmu are

each instances of the pal architecture

so and then this was the pal slash calo architecture so the main focus was learning

so learn from the user

and calo had a lot of capabilities in terms of task management

so for example one of the most dialogue-related

capabilities was the meeting assistant i think so it has

for example dialogue act detection also summarisation and so on

so

the other kinds of

verincation there

so and then radar radar is not i think if i

know this correctly so that was not so much a dialogue

system but it was a mail management and auto scheduling task agent

so and siri this was the earliest

slide of siri

so it was the very first agent that had

spoken dialogue and also task management so but

as you can see the conversational dialog management

interface used to have very small capabilities so this is you know

that's the past and

i have a few comments about where we are now and then maybe we start

the conversation

like many people say we are

going through the spring of artificial intelligence

and we are also transitioning to a new era of intelligent personal assistants

there's many factors that have led to this

and

this can mean

assistants that maybe we can build long-term relationships with whether it is one day

a year or a lifetime

and that these a sequence could be so sure

so many factors contribute to this but i think there are two main ones mostly

the advancements in hardware we have cheap and powerful hardware and we have

a lot of pervasive smart devices whether they are smartphones rings bracelets whatever

so this creates a lot of big

and very useful data so we have

powerful machines and big data that we can make machine learning work with

this enables us to tackle problems that were previously

a little harder to solve

and this is evident by the availability of the tools that we have now on

the web

whether open source or not

so tools like nucleus are for speech recognition or bot frameworks or

many other things

how can we combine all of these things into an intelligent personal assistant and

one of the ways we can think about it and of course this is open

to debate

is that we can simply divide the assistant into the cognitive functions and into

the communication channels

so the assistant

needs to be able to reason about the world about the knowledge about everything it needs

to be able to communicate with the user

so it needs human computer interaction

and it also needs a lot of interfaces to communicate with devices whether that is

a light or a smartphone

cars wearables anything

i either

so maybe like was also mentioned earlier

that an agent needs to handle multiple complex tasks maybe sometimes

other characteristics are like in the video we showed before

we need seamless and context aware

understanding and generation we need maybe some kind of

ability to incorporate new knowledge into what the agent knows and so on

and there are all sorts of challenges like for example communicating with all of these devices

maybe is not a very interesting thing research wise for you know new students

but it's a very big problem when we try to productise this

and the one that i wanted to mention

was that

the agent needs to be able to build an evolving relationship with the user like

i mentioned before maybe for one day or five days or maybe for a

lifetime

it needs to be able to reason about the world

and this sort of

tracking the context over time like i say here so events change things change in the

world and we need to be able to refer to the past to the future

the present

and these are all points for

discussion so what is the future of the personal assistant

we have here

four topics for discussion just to start our conversation

what is the current state

of personal assistants in research and in industry

what are some big technical bottlenecks that absolutely need to be solved before we can

get the next generation personal assistants

how can we handle big data

in terms of collecting the data what kind of sources do we need

how do we manage privacy issues how do we you know all of these things learn

from data

how do we process

store and manage all of these things and then we have

a topic about the future a vision of the future of what these

assistants can be and what they cannot be

and so we have four notable panelists to introduce

i will keep the introductions short in the interest of time

first we have professor steve young who is a professor of information engineering

in the information engineering division at the university of cambridge

he has a long track record of research on spoken dialogue

particularly speech synthesis recognition dialogue management among other things

he is the recipient of a number of awards for his contributions

i'm just going to name a few

the signal processing society technical achievement award

the medal for scientific achievement from isca

among other things

the james flanagan speech and audio processing award in the interest of time i'm sorry there are many other

awards best paper awards

on proficient german

we have thomas schaaf

who is a senior speech scientist at amazon

he has worked on several voice-enabled products at amazon

and is currently leading a group of researchers working on models for

the alexa voice service and also holds an adjunct position at the language technologies institute

at carnegie mellon

he has a lot of experience in speech recognition and translation and in the past he has worked for

sony multimodal technologies and toshiba research

we have professor jeffrey bigham who is an associate professor in the human-computer interaction institute at carnegie

mellon

with a long track record in crowdsourcing

and crowd-powered systems and in work on natural language applications

prior to joining carnegie mellon he was an assistant professor at the university of

rochester

he has received a number of awards among them the sloan fellowship and the nsf career

award as well as being named one of

mit technology review's thirty five innovators under thirty five

and we have zhuoran wang

who is a co-founder and c e o of trio dot a i

a startup company that develops conversational interfaces

he has many years of natural language processing and spoken dialogue systems research and development

experience

he formerly worked for baidu toshiba research europe heriot-watt university

and university college london from which he got his p h d

so next we will ask our panelists to introduce themselves with a little bit of

position talks

after that we will use seed questions and whenever you want you can raise your

hand and

ask different questions

so

i start okay

there we have the first question which

should be written down

current state and bottlenecks i don't really think there's very much to

say about that

you may disagree but i think broadly speaking

we have enough in place that we know how to build the different bits

of a system

from speech input through understanding response generation

interface to the backend

i mean when people use siri as available now we pick up the edge cases and

we laugh at you know siri's failure to do this and so on

but if you actually focus on what these systems can do

and you compare with what they could do five years ago it's actually pretty remarkable progress

in my view

it's mostly engineering

and i think that over the next few years it will mostly be engineering

that makes these systems ever more capable broader coverage and makes fewer stupid

mistakes

within each individual sort of subtopic it's clear we can always do better

and i think there is no shortage of things for people interested in

research to focus on but fundamentally my view is that there isn't a huge kind

of missing piece that no one knows how to do and until we solve that

we can't build a virtual personal assistant not one like in the film which i

haven't seen by the way well i started watching it but i

think i fell asleep after twenty minutes

people told me i should watch it but i haven't actually managed it but we're not

gonna get to that stage any time soon but i think

there's a long way to go and it's mostly engineering

i think one of the big problems that perhaps the academic community has

is the data problem

there is

i've worked on spoken dialogue for a long time

mechanical turk

in a sense revolutionised what we could do

because

once mechanical turk became widely available we could build a system we could

deploy it and we could pay people to use it

and we could get several thousand dialogues if you like and see what it's

doing and in a reasonable time we could measure the performance

but apple now processes about a hundred million

siri conversations a week

and i'm sure google

and amazon and facebook are handling similar quantities now the kind of machine learning

you can do on that with that kind of data throughput is really very different to

what any academic can even contemplate doing

so

i think one of the issues is to really have academia and industry if you like

work together

so that the academics can actually focus on the real datasets

where the real information is on the real data flow and find ways

to work

and i mean that leads me onto what is in my view one of the

biggest

questions about taking these systems forwards and that is the privacy issue it's something

that different companies have a different take on

but i think the public at the moment are pretty much asleep

on this issue

many people don't know what the privacy

issues are when you sign up to use siri for example you scroll through

this stuff no one reads

well almost no one reads you click the button and you agree and most people

have no idea what it is they've agreed to so

apple has really rather strict privacy

protocols

and actually its researchers

don't get to see private information and i can't speak for the other companies

but it seems to me that

these are issues which need dealing with and i think we need more transparency

and maybe even some legislation but without transparency and some clear rules that sort

of everyone's working to i think we're going to come unstuck because there is

a danger that something happens which is not good

and there's a backlash and then these systems become

i don't know

degraded or people don't want to use them for reasons that are just a fallout

of that but back to privacy there's also some very interesting research that you can do

so one of the things that we work on at apple is differential privacy

the basic idea is if you have a client and you want to collect data

from them what you can do is you collect the data on the

device and then you add noise to it you add sufficient

noise that you can't actually any longer identify the person or

any of the private content but when you take that data and

aggregate it with the similarly noisy data from a hundred million devices you effectively filter

the noise and get the statistics you're looking for

without ever seeing the private information that was on any individual's device

i think there's very interesting research starting to emerge along those lines and maybe

it's a route to being able to make that kind of information more

accessible to academics

if there are the right channels for doing this which claim to protect individual privacy but

actually allow the data to be more widely engaged i don't know
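
[editor's sketch] The add-noise-on-the-device, aggregate-on-the-server idea described by the speaker can be illustrated with randomized response, a classic local differential privacy mechanism. This is an assumption chosen for illustration and is not a description of any company's actual mechanism. The function names and the parameters `p_truth`, `n_devices` and `true_rate` are hypothetical.

```python
import random


def randomize(bit, p_truth, rng):
    """Runs on the device: report the true bit with probability p_truth,
    otherwise report a fair coin flip, so one report reveals little."""
    if rng.random() < p_truth:
        return bit
    return rng.random() < 0.5


def estimate_rate(reports, p_truth):
    """Runs on the server: remove the known bias introduced by the noise.
    E[observed] = p_truth * true_rate + (1 - p_truth) * 0.5"""
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p_truth) * 0.5) / p_truth


def simulate(n_devices=200_000, true_rate=0.3, p_truth=0.5, seed=0):
    """Simulate many devices, each holding one private yes/no value."""
    rng = random.Random(seed)
    private_bits = [rng.random() < true_rate for _ in range(n_devices)]
    # Only the noisy reports leave the devices.
    reports = [randomize(b, p_truth, rng) for b in private_bits]
    return estimate_rate(reports, p_truth)


if __name__ == "__main__":
    print(f"estimated rate from noisy reports: {simulate():.3f}")
```

With many devices the estimate lands close to the underlying rate even though no single report can be trusted, which is the "filter the noise" effect mentioned in the talk. A smaller `p_truth` gives stronger protection per device but needs more devices for the same accuracy.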

and the final thing i just wanted to say is on the

vision thing

i think there are really big stakes for the companies that are doing this

because

first of all i think we will move to a situation where

individuals have one personal assistant and they use that personal assistant for everything

why would you want to switch if this one personal assistant knows everything about you

it knows your history your timeline what you like what you don't like what you

did a week ago a year ago and it can offer what would be a

real service to that person

now that's going to be a stickier

thing than anything that you might have with anything that's been before so facebook

wants you to be always talking to facebook

and so they try to make facebook as sticky as they can and amazon and

various others will try to do the same thing but once the virtual personal assistant really

gets going then you will have your own one and it'll be very difficult very

hard to think about switching to anyone else's

so if people really start using court on their in rows of this time to

look you know larry data from three or too much

obvious tracking which can we are working for conducted a larry but anyway so

whichever it is siri cortana alexa

if people get really attached to this they won't

leave and then the money will start to flow in due course

so who owns the personal assistant is gonna be a very big deal in the

future

thank someone

that with

like word

or not

the with some or all

is to actually greater this topic like

and compared to other processes not exactly hours of the

not a one-to-one it really cool you a bit about that rollment feature vectors goal

of money

that is the different for your processes

nevertheless so the alexa voice service has the goal to bring alexa everywhere so

i

and also doesn't degrade offering

it's a service

so i to integrate this it will

all different sorts of artwork therefore the that you might actually

have

a lexical or also

it's just all over to like your foot or

in your research

and about the vision the idea is to

create a space where you actually always have access to one of the personal

assistants

one of the bottlenecks

there i think it is likely that all of what you have a big

number of devices to what it's looked at all

that you want to you interact with the okay so that many different devices

and you where what is the one can do and

to enable is kind of six q

you have to scale

capabilities of

what the system is able to do

so one way to do that "'cause" i think that

that's right up to

enabling us to develop skills looks a little bit

i think what has something similar to that

of things about which allows

enabling

a number all a pattern it

telling the functionality of the system

to make life of the word

see

comparable

and accuracy problem rather

coupled a perspective

is the most important thing right so you have to create value for the

people having a personal assistant that's one thing but actually it has to do something useful

more or

with respect to data

is a

privacy is really a

topic also so

we have this companion app where you can go through and see exactly

what alexa heard

and request

after an utterance or to get them you'd

so it's very important to keep the trust of the

customer so i don't know

using the other companies but not on

it's below

all of the utterances

having so many different devices actually or planning to support so many different devices us

a kind of representation

for building statistical model is one of

problem

we have to write an article so you don't wanna send every time you have

you like your

data vector notation

for

from a doctor

what i

feasible

but it's redundant and a wasteful

so

wrong

from the perspective of scalable annotation scheme i think

i think that is

what role

all

otherwise

like i think scale actually and we understand what we want to do or what

to

as i can listen to what

the customer service

and actually

no

like to pose if a customer service

that's what color and they tell us what's wrong

and

but isn't

one way a novel way

is quoted a system for a lot of local

people discuss to go

for make records the true

what

what do cool

this respect to dialogue a cue at lexical walls one

there are several

so it is not likely to the pride

i

the i think dialogue this will definitely something that a lexicon who are right now

it's really want like from a machine so i wanna but on the light can

spectral light

if it doesn't really realise

recognise which light you want to turn on then it will have

a short conversation with you but it's really very

task-oriented in that sense

a hurting

a longer conversation to collect at the moment is this

what we are not

and

i think there was a question about how to commercialise

systems i think you really have to create the trust in the process

at

that a that

the system will work well so if someone tries it and it doesn't work

they won't go for it again

you lost a personal to we can with some criterion

and

models one for

what is

actually i think it sometimes

doing something small as well then although promise

and

one of the technical bottlenecks that i currently see is related to machine learning if

you get more data

you do not always converge to the same local minimum

which functions

and

some people get a better experience but for some people it breaks right

so you have to have a way to make sure that actually not

too much regression happens for large crowds of people or actually sometimes things

in general are great a very with all

and

from an engineering point of view you need a mechanism to make sure

you can fix these issues so it's kind of an engineering problem also

to design a system

that allows you to

grow the system

in production and make it more

maintainable

why have slight of

i is a remember

well i'm jeff bigham from carnegie mellon and i'm in the hcii and lti

institutes there and so i'm here because my group over the past couple of years

has been developing crowd powered dialogue systems which i'll introduce you to

and over the past couple of years we've also been working to automate them so i'll kind

of explain what that means

so i'll answer some questions as was mentioned and so you know where are we

well as we all know people are actually using

these systems now they're talking to their devices which is pretty exciting right

many of you raised your hands about alexa i talk to my watch and

i'm not always just talking to myself

clearly most people i've interacted with have a few specific functions that they use those

devices for that they've learned somehow those devices are pretty good at and so

to kind of illustrate this point i was at the local library the other

one open actual value your go

and i found this book

and this book is called

talking to siri right it's a great book

it will tell you all of the things that you can talk to siri about

and the recently reasonably large but which is pretty impressive at the bottom recorder and

may happen shriek utterance but and it update now maybe it about inspect now

but i think that that's where we're at we're at the point where

we have systems that can reliably do a few functions maybe a fair

number of functions and we're teaching people how

to access those functions

and so what we've been trying to do is put out a system a crowd powered

system that explores what we might be able to do if our systems could

be as robust as a human system and so our system is called chorus which

we developed it's a crowd powered system and the way it works is that people

talk to it via google hangouts so they can talk to it either with

speech recognition or type to it those messages go to a crowd that we

recruit on demand within a

minute or so we get a group of workers who then suggest responses

and other workers vote on whether they think those responses are good and if they are good

we forward them back to

the user and if the user wants they can reply and some of the same

workers and maybe new workers who have joined because the others have left will then respond

and they can have a dialogue in this fashion we have explored how we might

maintain consistency over time so there's a memory space over here where the crowd workers can

kind of note facts they learned about the user as they have a conversation with

him or her so maybe i've learned that you are allergic to shellfish so

the next time you ask me for a restaurant recommendation i should not recommend a

seafood place
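
[editor's sketch] The propose, vote and forward loop with a shared memory, as the speaker describes it, could be modelled roughly as below. This is only an illustration under stated assumptions and not the actual chorus implementation: the class name `CrowdConversation`, the fixed `vote_threshold` of two, and the rule that a proposer counts as the first vote are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CrowdConversation:
    vote_threshold: int = 2                         # assumed: votes needed before forwarding
    memory: list = field(default_factory=list)      # shared notes about the user
    proposals: dict = field(default_factory=dict)   # response text -> set of worker ids

    def propose(self, worker_id, text):
        # The proposer implicitly counts as the first vote (an assumption).
        self.proposals.setdefault(text, set()).add(worker_id)

    def vote(self, worker_id, text):
        # Votes for unknown responses are ignored; a set prevents double voting.
        if text in self.proposals:
            self.proposals[text].add(worker_id)

    def forward_accepted(self):
        # Responses with enough votes are sent to the user and removed from the pool.
        accepted = [text for text, voters in self.proposals.items()
                    if len(voters) >= self.vote_threshold]
        for text in accepted:
            del self.proposals[text]
        return accepted

    def remember(self, fact):
        # Facts persist across turns, so later workers can see them.
        if fact not in self.memory:
            self.memory.append(fact)
```

A later group of workers would read `memory` before proposing, which is how a note such as an allergy could steer a restaurant recommendation in a later conversation.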

chorus does all kinds of different things and because it's people maybe it isn't so surprising that chorus is

pretty good at responding

people have asked it for travel advice or for some idea of how to make spaghetti

all kinds of crazy things

and you too can ask it things

by going to talking to the crowd dot org

i hope and i encourage you to try it i wouldn't say it's perfect right

we're doing a lot of things in the backend to try to coordinate responses but

i think it'll be

surprising and even though you know it's people i think you'll be surprised at

the breadth and robustness of responses it can make

also so right so this might be what you're thinking so what

this is just people talking to people that's the

most obvious thing in the world okay and you're mostly right i mean there are

some challenges when we introduce a crowd where workers are only doing a

short task they've never done before

and if you work with mechanical turk you might be surprised that we can get

people quickly and that they do more or less the thing they're supposed to do

you might really be surprised at the quality of answers we get back but again so

what well one reason why we might care about this is that

by deploying a system that we wish we could automate

we might learn about what we don't know about how people actually want to interact with

a system like this we get a lot of insight i think into that by

deploying a system something that you don't necessarily get in artificial scenarios

we get data driven improvement right so we're actually collecting a bunch of data we'll

release it as we go and it's real data from real people asking

you know questions you know the first question or two they're just asking

out of curiosity but eventually they ask because they actually wanna

know the answer i think maybe the more interesting one though is that we're

thinking about hybrid workflows that combine automation with people so two examples of

things that we've worked on just a taste to give you a sense of you

know where this is going the first is a system called guardian this is a crowd

powered dialogue system

for web apis and so the gist is we use mostly non-expert crowd

workers mechanical turk workers to convert an api on programmable web into a little

dialogue system which then the crowd helps to run so they do the slot filling

they do transitioning of states and

formulating of responses

and what's kind of neat about that is that we're collecting data at different

levels right so we're actually having

the crowd provide data not only at the level of you gave me a message and i provided a

response which may be difficult to learn from but at the lower level that

you need to start running the dialogue system with the crowd

we've then been trying to push this away from just information queries into

actions the idea is to have the crowd start to work with the user in a

dialogue to create rules that their phone can then run i don't know how many

people have used something called ifttt if this then that

okay so it's basically a way that you can set if then rules for

things like you know

i was late for my meeting this morning the crowd might say well why were you late

well it snowed last night and i had to scrape off my car and

so from that they can work with you to say well if the assistant

had access to a weather api that says it snowed overnight then

it may alter your alarm to be a little bit earlier so you

wake up in time
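
[editor's sketch] An if-then rule of the kind the speaker describes, where overnight snow moves the alarm earlier, can be written as a condition paired with an action. This is a minimal illustration and not the real ifttt service or its api: the `Rule` class, the `overnight_snow_cm` field standing in for a weather api reading, and the twenty minute shift are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]   # the "if this" part
    action: Callable[[dict], dict]      # the "then that" part


def snowed_overnight(state):
    # 'overnight_snow_cm' is a hypothetical field filled in from a weather api.
    return state.get("overnight_snow_cm", 0) > 0


def move_alarm_earlier(state, minutes=20):
    # Return a new state rather than mutating the caller's dictionary.
    updated = dict(state)
    updated["alarm"] = state["alarm"] - timedelta(minutes=minutes)
    return updated


def run_rules(rules, state):
    # Apply every rule whose condition holds, in order.
    for rule in rules:
        if rule.condition(state):
            state = rule.action(state)
    return state


snow_rule = Rule("earlier alarm after snow", snowed_overnight, move_alarm_earlier)
```

In the scenario from the talk, the crowd and the user would arrive at something like `snow_rule` through dialogue, and the phone would then run it on its own each morning.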

right so it turns out that this idea of using people along with your automated system

is not actually new right so most software companies or at least

the many startups have efforts in this paper they have them and sometimes a very

exploded so we already have is but in mention

so that is creativity vc this is one of a call centres what have you

know their crowd their workers through a obviously can and you can guarantee more about

confidentiality in other things well there's another example is likely to be your artwork and

things like that i think what's really interesting about this is that

we don't have to rely on just automation anymore

right so whether it's a call centre like this or it's apple engineering you know

more and more templates that can respond to the specific functions that it can

support we are actually relying on people or amazon building out this skills feature

so that

its crowd of developers can build more skills into alexa we are kind of

relying on this and so here i'm just saying well maybe we can even

push this further right so we push it out so that this can happen on

the fly with a completely non-expert crowd and so your dialogue system with a little

bit of human input

might be able to do whatever you want the first time you ask

thank you

okay and

i'm very honoured to be invited

to this panel and have the opportunity to meet and exchange ideas with the three

senior pioneers in this area

and i'm speaking on behalf of a new startup in

china named trio dot a i so please allow me to

briefly introduce what we are doing at trio

so what we are working on right now is a chinese conversation platform

for creating conversational interfaces like chatbots

so i mean so like sends it special because

i see a lot of the last the technique that i saw reads as

i see a lot of advanced technology has been developed for english and a

lot of major european languages but chinese is a language that has a higher

complexity is the independence facts

so it's it would be a lot of more challenges in presiding training for example

like knowledge graphs companies like google and microsoft have

put a lot of effort into building knowledge graphs and they are

class on really need to use but in

chinese i mean the knowledge graphs that can be used

are much less reliable so there's a lot of noise so that's you

know like

the bottleneck or difficulty we face in chinese and also for example

right now we're mining the vibe to find different ways also you scenes and the

for example of a little to say were you in chinese we see like or

force all the different

expressions

that's really disasters still

and so what we are trying to do for chinese is we optimize

the technologies that people have developed for other languages maybe you know so we

restructure and reshape them to adapt to chinese for example to avoid a

big noisy knowledge graph we try to

mine the reliable knowledge in particular domains and try to create you know

relatively smaller ontologies or smaller knowledge graphs and rely on those to

resolve the ambiguity you know higher labeled you wouldn't there are also using the homes

using the information

kind of

then i'll try to use tools to solve the

actual ambiguities and also we are going you know customised solutions for

in to have can unite of things all warfare characters in like computer games i

and animations it's gone seems away unique optimizer the system for our oak lines of

python other companies that require these a conversational interfaces

so

we provide like open-domain chitchat style systems like xiaoice built by microsoft and

we do task oriented dialogue systems as well and we also provide a

highway of system solutions like similar to the a

which was is then we by the way i mean i are used to measure

for this product because they are the leading chinese systems in this area and

also because

another cofounder of our company was actually the product creator of one of these

system and we week only the power to the otter project i you previous employer

so

the experience we gained from the previous systems is carried over to our new

company

and the

now thinking about you know the evolution of these systems we

believe that

the future virtual assistant in addition to being more

capable

has to be more human like i mean it should have distinguishable personalities and

emotions

that's

that's what we gained from you know the previous query logs

so we see in the previous products we did

so for example we see you know

the chitchat queries actually take a significant proportion of the

query logs in the previous systems we built and this is even more you

know

more common than the task oriented queries

so i mean this is

maybe partially due to the subculture of users in china

but in reality people are actually looking for companionship more than actually solving

tasks using this virtual assistant

and also we think the virtual assistant would be

actually not only reactive but also proactive

i shall try to find the right timings will be the there always died off

you know passed to be very simple

and is

so that's what we are committed to do at real and where we are

we are working on it to make it happen

so in order to do that we also have a lot of

challenges to solve for example to customise the personality of a virtual assistant that's a

very difficult task

and

requires a lot of engineering work and sometimes a lot of dirty work

but i mean

there will always be solutions so we try some different you know technologies to

rewrite and also reshape sentences to make the language style

consistent and also you know

you know humans can be in the loop in this kind of task like

it can be

you know a community and we can actually ask the real users to

contribute to these

also

sorry

also we

also for asr everybody we can okay yes prior is used sorry

sorry

like join you

i

approaches it is very important because

what we do is

well not fit well i believe you everybody is doing that a very carefully and

is back to the initial and i you addition to use of the user's privacy

we also face like the political really you critical use rules

region though use rules and the this than you know there's nothing we can award

the visitors have to do with the you know where a cow previous

strong classifiers pastoral are used strict you know the dictionary threshold the ins and mightily

to

you know to solve that the rights to so that's but this is also very

important because what we i actually because we are between the chit chat system we

after hasn't chitchat energies had park a lot

because i

we believe that you know conversation is a process that generates information and

in chitchat or maybe motivated by chitchat a piece of information can be generated

in a conversation and this piece of information can be redistributed in

another conversation

and maybe get commented on in another one so if we can

manage to design this cycle

the system can then be self sufficient

so in that way i mean we don't need to actually

improve the system anymore and it itself you know can

gain the knowledge by itself but then privacy becomes a

very crucial part 'cause if you disclose information that the user tells

your system to another person that's really horrible and could ruin the whole

product or company so we are like i don't know i mean i don't have

a better solution at this point and

would like to discuss that and hear ideas about this

so

that's my predictions all and before share you

a star with the more you're that is

sense

okay thank you very much so now we realise that we have thirty minutes or so

maybe we can

i mean we can extend a little bit but

so thank you for the introductions so it sounds like we

are like professor steve young said so we are in an engineering

phase now so then we don't need more modeling or

it's done or i don't know and so given that so maybe each panelist

pointed out that there is a privacy issue

two

to make real

more realistic

the service

so

i was asked the point you think what's i maybe so let's go back to

so

so question one and maybe two what's the bottleneck we are facing

so

in terms of technology what's our biggest thing

i want to ask again

at the

you know setting aside the engineering issue and the privacy issue so

what's the technical problem we're facing

what do you think

i think we have a bunch of silos and i'd love to see them

put together that would be really useful right

well as i said i think you know we have all the bits none

of them are perfect and all of them can be improved we have been

a bit slow putting together systems that work

you can if you use modularity and if we can seamlessly switch

for example look at maxine over there so maxine is building a system at cmu which is

actually an integration of different dialogue systems from research groups around the world

if there are enough of those and a bit like the chorus

then if the user is talking to it and it appears to have huge coverage

expertise in many different areas

the fact that it's actually modular and multiple different systems

is completely you know the user can't see this they're oblivious to it obviously

if you start switching topics the way humans can do within

within the conversation it might fall apart but we can do a lot by building modular

systems and scaling them

you know i think there's a long way to go

but i'm not aware of a specific thing that is

stopping us building these systems but maybe we should ask the others
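
[editor's illustration] The modular arrangement described above, where several independent dialogue systems sit behind one front end and the user never sees the switching, can be sketched roughly as below. This is only a toy sketch under stated assumptions: `keyword_system`, `route`, the confidence score, and the `min_confidence` threshold are all hypothetical names invented for this illustration, and nothing here is taken from the cmu system the speaker mentions.

```python
from typing import Callable, List, Tuple

# Assumption: every sub-system maps an utterance to (confidence, reply).
SubSystem = Callable[[str], Tuple[float, str]]


def keyword_system(keywords: List[str], reply: str) -> SubSystem:
    """Build a toy sub-system that is more confident the more keywords it sees."""
    def respond(utterance: str) -> Tuple[float, str]:
        words = utterance.lower().split()
        hits = sum(1 for keyword in keywords if keyword in words)
        return hits / len(keywords), reply
    return respond


def route(utterance: str, systems: List[SubSystem],
          min_confidence: float = 0.3,
          fallback: str = "sorry i cannot help with that yet") -> str:
    """Ask every sub-system and return the reply of the most confident one."""
    scored = [system(utterance) for system in systems]
    confidence, reply = max(scored, key=lambda pair: pair[0])
    # The user only ever sees one reply, not which module produced it.
    return reply if confidence >= min_confidence else fallback


weather = keyword_system(["weather", "rain"], "it looks sunny today")
bus = keyword_system(["bus", "route"], "the next bus leaves in ten minutes")
print(route("when is the next bus", [weather, bus]))
```

A router like this handles one utterance at a time; it does nothing about the topic switching within a conversation that the speaker says may make such systems fall apart.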

yes please

well i think that sometimes something i think

i do well the question is whether it's an engineering problem for the research

community as i said none of the components we have are perfect i

mean

and so asr isn't done you know we have the dialog state tracking

challenge there's lots of

there's lots of things one could do to improve slu and so on

well i think the real challenge is actually how we make the data available

so that academics can actually work on serious datasets

and not

some laboratory datasets you know of a thousand a few thousand dialogues

that's interesting stuff but that's not what you know microsoft or apple or

amazon have they have datasets that are several orders of magnitude bigger

and it would be really great if we could bridge the academic community to

actually be able to work on something that is

really very large i don't know whether dialport or something like that can generate a similarly

large dataset

that we can work on in the research community that would be great but i think

that's probably the major challenge

well i think the answer to all these questions is unless you have a

system which

real users are motivated to use

then it's very difficult to get data in large quantities so the reason that

google and apple and so on have so much data at a certain

time is that people actually are motivated to use siri and google now and

alexa and so on

and now one of the things that we don't know maybe this is the

you know my view on it is the degree to which the

algorithms we develop are generic and the extent to which we can move them from

one application to another without having large amounts of data now you know

sirikit has only just come out where we're actually

inviting developers to essentially attach their third party apps to use siri as a front-end

alexa has a similar ecosystem the way these things will work is in fact

that certainly for initial deployment if you have a company which specialises in dialogue

software to interact with patients

setting aside all the very real

you know kind of issues that those applications may have then the

algorithms the models we build may well be generic enough to bootstrap a reasonable working

system and then the more data you collect the better it gets

so i think to some extent this will evolve in time and you'll have better

tools to explore some of those issues

i agree inhibit the topic of this of this section is virtual best lexus

and are they were defined that the beginning i think that you're pretty much on

the edge of o

that's all that's the v i justification

no that's the vision actually i don't know that that's true you know i think

what apple doesn't want you to do is to be locked into

alexa such that when you buy something

you're gonna use alexa for the advice to buy it and as the mechanism

to buy it also obviously i'm sure that's what jeff bezos is thinking about as

well you know how he's going to make sure that amazon is

the channel for buying it

that does work really well on alexa right now

i

well i don't know going work i

and

cell i would like to raise and grouping users here i think are very annotated

and that little change

an elephant you that had training case you grad asking theory things that from when

he missed seeing ninety year old whenever i got addressed like first lock average and

i found in for a year we went round of entity that's you get and

here we extract lexical where satellite that scares the crap data is

every time and that is wonderful they don't you know interactively and in that we

relevant having to our home or our parking or slightly altering the trajectory of bare

it's any cell i'd like to have what energy in a research why their wedding

picnic excited about

to do research on an outright

regression so i always

e value and to that but also have a possible children and they are able

to talk to alexa actually the older one loves to talk to him and the

younger one which is always understood it runs are set him

i don't know how many people were also inspired by the young lady's

primer of the diamond age but i think that's pretty fascinating obviously there are privacy

and confidentiality concerns but you know the children are the future and they

will be the ones using these devices and

i think we should be listening to them and finding ways that they can shape the

direction of these

devices because they'll be the ones living with them

so maybe what's your onto split

so if you have any bad experiences the as you all children

developed have it's all and things that you from this talking to say every that

you're i really wish they haven't

the injury

along that line there was an article i don't remember if it was

google or alexa that discussed the behavior of kids that they don't use polite

words like

alexa can you do this please right because the machine

doesn't need that and the parents got

very upset about that because it changes

the whole kid okay so i'm not sure how to really fix

that

and i think

it's the tip of the iceberg

so i have an extant counter most the time we just use its it like

played of adl and listen to that purple ham sound you know don't have a

copy anymore from forty years ago whenever

i

one thing i have noticed recently is that one of the things i

do all the time is set a timer and i don't have to

do much you just say you know alexa set a timer

for five minutes or ten minutes or whatever and the echo starts it and my natural

reaction is actually to say thanks

since she's opened the channel

right given this task she's opened the channel she completes the action starts the

timer and i say thanks and the time we just keep skyline

so that i k

however i thanks

the timing chest each guy i think is thanks that means a turn at the

time there it's not that fast so what you have there is this crossover between

social dialogue behaviour like a greeting or thanks and this task oriented behaviour

and i think we have no idea

how to get pragmatics in that situation to integrate those two things and i

think it's it is unlikely that i at and i spoken it in trying to

explore the and it and here's another one and these are kind of maybe it's

just engineering and i'm really next step

and i hate came in aston lex's our and on the top

and i stand that real ran on the topic

i never seen a tuple are so kinda separate that i don't understand your query

i said it's right in the context you know she she's right she's in this

stage is no its not achieve any kind of a state right

i think i hear some or i might pay you have to do next that

this was a nice to christ's h cisco i mean and then she said saddam

stay you know in nineteen fifty seven elvis presley made his first you know whatever

right so i eight o'clock in the money i think and i

i was i hate and make it more accent and she tells me exactly the

same thing every time so i think we just have no

we don't really know how to integrate this kind of social behaviour with the

task oriented behaviour

and

the that the right thing is probably you love the wifi but that i was

a response from the device itself so it can give you an answer about politeness

that we found out when we did the first spoken dialogue challenge

and i guess by now we're allowed to say that there were three systems there

was at&t cmu and cambridge and they served the community of greater

pittsburgh to answer the phone and we found out that when people spoke to

the cambridge system they were much more polite

i don't know but we think it has something to do with the

accent

absolutely awesome

i do not so i

so i have i have a question from a user point of view about all

this i think there are a certain number of tipping points okay do

you remember when the internet was first used

when we were all using it and wondering when the general public was going to use this stuff

and the general public used it when aol made an interface

that was super easy to use

right now there was another tipping point for siri and with the far-field

microphone array in alexa and i use it you know you walk out of

the shower and say what's the weather and then you know what to get out

of the closet

i think that is a huge thing and i do not know how the

vision

is going to follow that unless you have cameras in all parts of your house

how vision is gonna look around the corner because the microphone can hear around the corner

and so i think there are still some other tipping points

for the user where the user says

gee i can use this and i'm gonna buy this as two hundred and some

people approximately have done with

alexa so far

so what do you think these tipping points are

but you don't sound maxine who or what

i didn't get the vision that

i actual computer vision all your vision you have used

i think that when the expectation changes from thinking about what i

know it can do to me just assuming that it can do whatever i say

that will be a huge tipping point

i mean also what we see in china is you know

people are looking for companionship more than the actual task oriented queries so

i mean i combined to give social be like a you like of all of

these all than the lunch each accent but also we have a response you we

idea of your require is and you can just it's of the their own goals

very smoothly you can just i tried to be the exact what are what you

take it as a front no you know applied or whatever so then even by

the we i we try to combine these chitchat r is the task oriented are

also make the

were choice a stand and the then we see missus the over significant can the

part of you know people are looking for you know the chitchat instead

of actually completing the task

so i think that actually at some level what's attracting people is

the human like character of this device and also you know how to

solve the you know the this also offline the use of you the system but

it also brings problems comes up to

i think it's not a technical secret that in building the open-domain chitchat the way

we do it is we mine you know conversations from social media from the internet and we got

like you know lots of human conversations and we design you know features

there will be a number of features to

you know to score candidates to see how well a reply fits with the

query and retrieve the most suitable replies

most suitable replies and randomly choose one of them to reply but the problem

is

when you do this it is really difficult to control what it is going to

say

you know most of the time it says you know proper things but

eventually it may say something bad or something you know you

did not expect it to say so that's currently an issue to be

solved i mean from my point of view and

i mean you know what we see is that you know

generative systems like rnn based conversation models could provide a solution in

some sense you know it's a

reach way to model yourself the already spike to a reference so

i don't know yet i mean that's to

we are exploring that direction
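
[editor's illustration] The retrieval style of chitchat described in this turn (score mined conversation pairs against the user's query, keep the best-matching replies, pick one of them at random) might look roughly like the sketch below. It is a toy under stated assumptions: the tiny `corpus`, the word-overlap score `jaccard`, and the function `pick_reply` are hypothetical stand-ins, and the speaker does not say which features or ranking method the real system uses.

```python
import random
from typing import List, Optional, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap score between two utterances (a stand-in for real features)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def pick_reply(query: str, corpus: List[Tuple[str, str]], top_k: int = 2,
               rng: Optional[random.Random] = None) -> str:
    """Rank mined (post, reply) pairs by similarity of post to query,
    keep the top_k replies, and choose one of them at random."""
    rng = rng or random.Random()
    ranked = sorted(corpus, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    candidates = [reply for _, reply in ranked[:top_k]]
    return rng.choice(candidates)


# Hypothetical mined conversation pairs: (post, reply).
corpus = [
    ("how are you today", "fine thanks"),
    ("how are you", "pretty good"),
    ("what is the weather", "sunny i think"),
]
print(pick_reply("how are you", corpus, top_k=2, rng=random.Random(0)))
```

Nothing in this sketch checks what a retrieved reply actually says, which is the control problem the speaker raises: whatever is in the mined data can come back out.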

well i think it's a mistake maxine just to tie things

and associate alexa with the thing that amazon puts on a counter so

all you need is a microphone you need the channel and what you want is

that same voice

with the same knowledge about you to be accessible in as many different contexts

as possible so when you get in your car and you have the same

you know you ask questions you want to access the same system when you're

in the home wherever you are whether you're using your phone or your watch

you're talking to a loudspeaker you're talking to a television you want it to go through to

the same place and it knows the same things

in the same so that you don't have to learn different protocols different

you just want the same interface now

i'm still not sure if by vision you mean cameras but you know in some

circumstances there will be more inputs and that is a big thing that's not

really been done very well so far integrating gestures what you can see around

you into these systems

but primarily

the personal assistant is detached from hardware

it is just that you know it's maybe in the cloud it may be running

on your personal ecosystem

but it's yours and it belongs to you and it's accessible wherever you need

to access it

i mean that's one way this could go but it seems kind of odd to me to think

of these embodied agents that are with me at all times but that's very different

than my lived experience now right like when i'm at home

i interact with people who know me but know me in a different way in a different

context than at work or when i travel right

i

i'm not i'm not sure i'm not sure what people want but it is not

clear to me that it's the same agent everywhere

but it's so much more powerful if it is the same agent the same

voice the same thing

so the lower than maybe not everybody knows but i'd like a little on the

for like that

also available via we

so would like you have any of the a and a also like the on

the cycle of

well

what i did what the system you want a lexical

and hear the topic

and i can see the shopping the

i get a one problem the other five

and i could also ask for the same dropping the support of what so it

it's the same the information

so in that sense if it would be my car or somewhere else what

got the same propping this work

so following up on those last two points

i think that this issue of personal assistant and what it really means so is

it is it something that only

knows about you and you only know about it and it doesn't interact with other

people at all

or is it more of something that is an assistant for you in a social

interaction if we think about human assistants or executive assistants and so on yes they

report to one person maybe as you know

an authority but they have to interact with a lot of people and the issue

about your wife's shopping list and whether you should have access to that

i think that is a big scientific rather than engineering question and

as far as we've come really is just to say we can have

locks to prevent certain people from accessing devices or certain functions but i don't think

we've got more sophisticated

in terms of saying how they would interact differently except that everybody has their own

personalisation

and you know i may want other people to access some of my information from

my personal system but not everything and how should that work i'm very curious about what's

in

alexa now for managing groups even how do you

how do you deal with people fighting about which music they want to

play what kind of answers are people bringing about in terms of multiple users

interacting with multiple people

and

one or more assistants

with respect to multiple users the device is assigned to the owner but obviously

it is a family device so it's in the living room everybody can use it

or if my wife wants to put something on the shopping list i follow

what is on the shopping list because everybody can

access it in the family so it's like the whiteboard right so

but are one of your a virtual want to respond plus the right so to

speak

so one or one point development but it works system was true words presumed

but system should

not respond to complete remote room impulse hmms

random no marketers store parking number system how to understand

but the observations the ones used or talking participants after a while useful realtor machine

that syllables

for stuff are just so do this actually works a lot better

so that maybe we're going through this transitional true people have been not require a

very

proper way to address this time

once the models as culturally establish some researchers who grew where

personal monocle with a rich close to the room response would really precludes ago

room and removed from you

remember system at a machines were introduced

the room but remember to o

behind one person or something wrong understand talking to the machine trying to read through

the correct then

this is not gonna happen to the

the cultural norms what we do this leads to each other we watched

something simple true but you know to some prior to the throat a double point

one from which rooms

basically what is the remote to do

i think it's more dimension room specific query

the some form of words has to do acquisition of norwegian structure

this is because we were able preschooler we do with the room with the parlance

pixel value some criminals

the reason we will use the language going to be homesick how exactly those with

map onto a actions the three but the main the bark number four through four

that's a little work on them

remote for worst possible sorts porcelain rooms room

would be most of these machines could produce the problem the room release brute

will come

work

and remove you have any thoughts i agree okay i

for syllables but the numbers knowledge so removing so

familiar google knowledge order to

from these works

but open remote chance to come from somewhere remote will have to remove solve a

problem

i keep being misquoted so to be clear i didn't say that all we

have is engineering to solve these things i think what i said was we now

know by engineering we can build

systems which are going to be

significantly more capable than they are today

but that doesn't mean to say

a lot of the things that have been mentioned here will remain problems that need

solving

i'm just saying i think we can scale them to be far more capable than they are today with

just engineering

they still won't be able to do what you'd ask of a human

and just to get quite common i was gonna say to lend well maybe now

you realise it doesn't recognise thanks you might just sliced all the time at are

alike so

well that might become sounds such that share your children probably won't erase your children

will be figured out likely to just size l

no i

i

i think you like it slots think it

covering the cost of alright if he wants to say that way you should be

able to do

so

i think lyn is exactly right i think we only know the tip of the

iceberg in terms of how to integrate pragmatics with all this wonderful technology which

i agree is amazing and wonderful and every time i use siri even though

it doesn't do very much for me it's still amazing to me because i remember

way back when it just wasn't possible

but it really is all about both sides i think are mentioning really interesting things

about identity and audience design partner specific processing and that for me

is what we only know the tip of the iceberg about and so

a fact that a bit more tomorrow but for instance you know my story is

i have all of these devices in my hotel room and somehow i but series

by mistake

and units and might have my own in male and female voice started saying in

almost units and that there is that the right for one and then you know

how many whatever and so you know really this notion of having the same

character everywhere is an interesting idea and you are trying to go for

a coherent identity

but really it's still a real problem we don't know how to control

in context like all kinds of things can go wrong

and i mean you have certain expectations of a partner

you know thirty years ago and chai demanding brenda laurel how to handle

that i was sitting in the middle them on and they were fighting

i get fourth about

agents everywhere or agents are evil and immoral okay and so that was the point

if you thirty years the goal is a little psychologist in the ongoing an empirical

question it'll work some of the time it will work at a time

and so that was what we find that thirty years ago there is no doesn't

at least at this conference if no one voice and abilities to be alone was

extension item in around to

to throw water on our parade that

you know they're probably people out there that way then sell

i think it's really important to think about the social things and you're right in

terms of restaurant things in

certain little functions i can do with mind that

you know the big picture the annotation is wonderful but we're so far from here

then we talked a lot about children

and you know how children want to interact with systems and of course children will

adapt

to the language like we're talking about they figure out that well you don't really

just in the way

what i always used as motivation

for a vision of dialogue systems is my father

who's currently years old and

back in when i was a new once in the late to about two thousand

four two thousand five time frame i was really brought over or maybe little rebels

part of our how many hope you system and i

had my father use it and he completely destroyed it you know

this is because he doesn't he's not gonna adapt to the technology right he's

and it will be said you know this is kind of nice you know exactly

what the do when children systems are really useful really useful was back in the

forties

as a what you mean because like unigram some of the coda

and when we had a problem

let's say the refrigerator's broken you know making a sound you'd pick up the

phone

and louise would pick up and answer and she's the telephone operator she knew everybody in town

little town

you'd say louise you know my refrigerator's making this clicking sound you know what

would you do and she'd say you know

bob he services refrigerators and let me let me just connect you

bob's he's always over at joe's diner at this time let me i mean can

tell whether i and somebody might then still am is there is two thousand four

two thousand five and

and i thought you know that there's wisdom there which is that

we shouldn't have people adapt to the technology so it was no we're

gonna go build louise and you know that was the motivation

for louise which then became cortana but

this kind of using technology to meet humans where they want to be naturally

rather than forcing humans to adapt

it's funny because right here there are two threads simultaneously and there's ever been that tension

or you can have a nice debate which is we want to make technology more

human like as he said

and we want to make people talk like the machines 'cause that'll be easier for

the machines to understand and we have to decide if we're going to make

machines more human like

we can wait years till the humans change or we can integrate pragmatics

and other aspects of natural human conversation into what we teach our machines

my question is related to this i don't know very much about this and

one panelist mentioned many times that right having emotions and

is an important part of an assistant or a dialogue system and

so i'm just wondering what is the current state

of interaction between research in personal assistants and affective computing at this point

in terms of recognizing emotion from the user and from you know prosody of the

speech and other aspects and also in terms of generation

of utterances which contain emotions

and ask questions so we actually or

no i don't number in that's really we actually not

research all models i we i mean at this stage at the start off button

we learn a lot from them so for like emotion

emotion recognition or emotion generation actually we use you know

a trick

good performance comes fortunately in our task if you want to recognise emotions

or you want to you know reply with emotions you don't

actually need to do that for every

reply

so you just keep your precision high

i mean you lower the recall you just keep your precision high

by doing that in that way we can achieve like you know ninety five

percent accuracy or something like that and we also learn a lot from the research community

like doing the generative model for a chatbot

we actually a truly in a channel walter using sound you know