Alright, first let me thank you for the invitation and the opportunity to come to Olomouc. It's funny, because a friend of mine said, "Oh, you're going to the middle of nowhere," and I said, "No, I'm going to the middle of Moravia." I really enjoy coming to new places I've never been to. So I'm going to talk about a new technology trend that is really emerging and taking off, and that is this notion of anticipatory search, and how much speech can contribute to it.
Here is, sort of, our vision. Imagine you're having a conversation with a friend, and she tells me where we need to be in five minutes, and as I'm putting down the phone I look at the screen, and this is what I want to see: I basically want to have the directions to wherever it is we need to go, where we need to be in five minutes. And if you think about it, we have all the pieces already, right? We have user location, we have good maps, we have good directions, we have speech recognition, we have some reasonable understanding. So it's kind of a matter of putting it all together into one compelling application.
So that's kind of the premise. We realized that the way you find information is changing, and we're moving towards a kind of query-free search. Instead of having to be proactive when you have an information need — having to fire up a browser, find a search box, type in your query, get results — it can be much more proactive: given your context, what you've said and where you are, the information can come to you, as opposed to you having to find the information.
But of course we're not alone in this idea. Ray Kurzweil, the technologist and futurist who recently joined Google, has a pretty similar vision: that search engines shouldn't just sit and wait to be asked questions. They should listen in on our conversations — what we say, what we write, where we are — and anticipate our needs. And that's pretty much the same premise that Expect Labs was built on.
So let's look at some of the enabling trends for anticipatory search. There are mobile devices, there's AI that keeps making progress, and if you put it all together, there are applications that can take contextual information and start making good predictions about what the informational needs of the user might be.
So let's look at these in more detail. It's obviously no surprise how much data is being generated: something can happen pretty much anywhere, and a few minutes later there are a couple of videos on YouTube already about that event, and hundreds of pictures. In fact, there are technologies now that try to recreate a sort of 3D map of a scene just from the fact that you have images from different points of view.
Then there's the amazing growth of mobile devices. This is a statistic for smartphones and tablets running iOS and Android. In absolute counts, the US and China, because of their populations, have the highest numbers, but if you look at the growing markets, it's basically Southeast Asia, Latin America and other emerging markets. So we're ending up in a position where pretty much any adult is going to have a smartphone in their pocket.
And that really changes the possibilities of what you can do, because these mobile devices have a lot of sensors. Of course we have cameras, we have microphones, there's a GPS. But if you look closely, there are also gesture sensors, proximity sensors, gyroscopes, accelerometers — there's even a humidity sensor, so that if you drop your phone in the water they can void the warranty — and a barometer. So basically it turns out that these devices we carry in our pockets, to some extent, know more about where we are than we ourselves might be aware.
And there's more. We all know about Google Glass, which has a bone-conduction transducer in addition to everything else. Then there are more futuristic things: there's research that is able to do recognition based just on facial muscle activity — you have these sensors, so I could be talking without any phonation and you'd still be able to recognize it. In fact, I was talking with Mari earlier that this may be an interesting challenge for some future evaluation. Then there are these more futuristic electroencephalogram headsets; it's still not very clear what you can do with them, but they're becoming more stylish, so people might start wearing them. And then there are interesting things like this patent application from Motorola, where they basically have the idea that we'll all wear an electronic tattoo here on our necks, with a microphone that can also help with speech recognition. So there are all kinds of ideas about how to collect more data about what we do and where we are.
And then there's the progress on the back end: once we get all this information, what can we do with it? There's been some talk here about how much progress we're making. We're all familiar with this chart of the famous word error rates for different tasks — are we reaching some sort of plateau? But we know that's not the case, because there's work on dynamic speaker adaptation, there's all this work on deep neural networks that we've been talking about, and work on extremely large language models, all of which keep making recognition better. There's also work on natural language understanding, around conversation and topic modeling, and there's the knowledge graph I'll talk about in a second. If you put all of these together with some machine learning algorithms, we're getting to a point where we can start to be reasonably good at understanding a human conversation.
In this audience this is obviously very well known, but it is quite remarkable that we now have these fairly substantial improvements in word accuracy thanks to deep neural networks — there's work from Microsoft, IBM, Google, and from others in this room. Something you might not be as familiar with is the fact that deep learning is also being applied to natural language understanding. I want to make sure you're aware of the so-called Stanford Sentiment Treebank, which was recently released by Stanford University. There's a nice paper, "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank," by Richard Socher and others, from the same group as Andrew Ng and Chris Manning.
What they did is publish — make available — this corpus of over eleven thousand annotated sentences, each parsed into a binary parse tree, where every node has been annotated with a sentiment label, from very negative through neutral to very positive. And the interesting part is how they make use of multiple layers of a deep neural network to model the sentiment at every level of the parse tree, so that, composing bottom-up, they can find the sentiment value at any node.
So for example, look at the sentence "This film doesn't care about cleverness, wit or any other kind of intelligent humour." There are words like "humour," which is a very positive one, and "intelligent" as well, so that whole subtree is positive — except when you reach the negation, "doesn't care about," and the overall sentiment becomes negative. This is very powerful, because until now the traditional model has been bag-of-words in a vector space, and it's hard to model these relationships there. We all know that language has a deep, recursive structure, and there are long-distance relationships between constituents of a sentence that are hard to capture unless you really take advantage of the parse tree.
Applying this, they get gains of, what, twenty-five percent improvement in the accuracy of sentiment recognition over this corpus — which, by the way, is about movies; it's from movie reviews. So it's encouraging that this technique, which is now popular in ASR, can also be transferred to natural language understanding.
Then there's another very important trend in how, the way I see it, we can improve natural language understanding. It was said earlier today that the "U" in ASRU has kind of gone missing a bit; I think knowledge graphs are really the answer to that. And why is that? Because we can go from these kind of disembodied strings to anchored entities in the real world — there's a nice buzz-phrase that says "from strings to things."
So what is a knowledge graph? You can really think of it as a giant network where the nodes are concepts and the links relate one entity to another. For example, George Clooney appears in Ocean's Twelve — these are movies and actors, and how they relate to each other.
And the interesting part is, if you know some history, you might remember Cyc, which was an attempt — OpenCyc still exists — to create a very complex representation of all known human knowledge, especially common sense. But the problem is that it was built by hand, and they spent a lot of time deciding whether a property of an object is intrinsic or extrinsic — kind of splitting hairs over something that is not that relevant. The way these knowledge graphs are built now is different. You start with Wikipedia — there are datasets like DBpedia, a machine-readable version of Wikipedia, that you can ingest — and then you start extracting entities and relationships. There's still a certain degree of manual curation, but you can get pretty far with an automatic process.
And companies are doing this. Freebase, for example, has a knowledge graph with tens of millions of entities and relationship properties — connections. Microsoft has their own, called Satori, with some three hundred million entities. Google's has five hundred seventy million entities and eighteen billion properties. And then there are also more specialized ones, like Factual, which is a database of places, points of interest and local businesses, now at sixty-six million entries across fifty different countries. And then of course you can take social media, and see their graph of entities and relations — where the entities are people — as a version of a knowledge graph: LinkedIn is now at, what, two hundred fifty million users, and Facebook is over a billion.
So if you think carefully about this, it means that any time you refer to a concept or named entity — a place, a product, an organisation or a person — you could grab that and map it onto one of these entities. The traditional idea, more on the linguistic side, was: we do part-of-speech tagging, we find the subject and the object, we establish some relationship between them. But that's still not really grounded — it's still strings. It's a bit easier with the knowledge graph: you can infer these things and say, you're referring to this movie, you're referring to that person, and then there are all kinds of inferences and disambiguations you can do over the knowledge graph. So I think the fact that we can start to represent pretty much all human knowledge — at least in terms of concepts and entities — in a machine-readable way is very important, and it's a very big step towards real natural language understanding, because it's more grounded.
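To make the "strings to things" idea concrete, here is a minimal sketch of a knowledge graph as a set of (subject, relation, object) triples, with a toy grounding function that maps a surface string onto an entity node. The entities and relation names are made up for illustration; a real graph would also need fuzzy matching and disambiguation.

```python
# Minimal sketch of a knowledge graph: nodes are entities, edges are
# typed relations. Triples and relation names here are invented.

knowledge_graph = {
    ("George Clooney", "appears_in", "Ocean's Twelve"),
    ("Ocean's Twelve", "instance_of", "film"),
    ("George Clooney", "instance_of", "actor"),
}

def neighbors(entity):
    """All (relation, other-entity) pairs touching an entity."""
    out = []
    for s, r, o in knowledge_graph:
        if s == entity:
            out.append((r, o))
        elif o == entity:
            out.append((r, s))
    return sorted(out)

def ground(mention):
    """Map a surface string onto a graph entity (toy exact match)."""
    entities = {e for s, _, o in knowledge_graph for e in (s, o)}
    for e in entities:
        if mention.lower() == e.lower():
            return e
    return None

print(ground("george clooney"))   # the string becomes a graph node
print(neighbors("George Clooney"))
```

Once a mention is grounded to a node, everything connected to that node — films, co-stars, types — becomes available for inference, which is exactly what a bag of disembodied strings cannot give you.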
One of the usages of a knowledge graph is disambiguation. There's a classic sentence from linguistics: "I saw the man on the hill with the telescope," which can be interpreted in a variety of ways, some of which are depicted in this funny graphic. It's what linguists call a prepositional-phrase attachment problem: is "with the telescope" attached to the hill, or to the man, or to me? And "on the hill" — does it attach to the man, or to me?
Traditionally there's been really no way to solve this except through context. But imagine that you have access to my Amazon purchase history, and you saw that I just bought a telescope two weeks ago. Then you would have this idea of priors: you could have a very strong prior that it is me who is using the telescope to see the man on the hill. So it's obvious that the more context — and the more different sources of context — we have access to, the more it's going to help disambiguate natural language.
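The telescope example can be sketched as a tiny reranking step: start with flat priors over the possible attachments, multiply in an evidence-based boost from a (hypothetical) purchase history, and renormalize. All of the numbers and the evidence source are invented for illustration.

```python
# Toy sketch of context-based reranking for PP attachment in
# "I saw the man on the hill with the telescope". Priors and the
# purchase-history boost are invented numbers.

priors = {
    "speaker-has-telescope": 0.2,  # "with the telescope" modifies "saw"
    "man-has-telescope": 0.4,
    "hill-has-telescope": 0.4,
}

def rerank(priors, evidence):
    """Multiply in evidence boosts, then renormalize to sum to 1."""
    scores = {k: priors[k] * evidence.get(k, 1.0) for k in priors}
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

# Suppose purchase history shows I bought a telescope two weeks ago:
evidence = {"speaker-has-telescope": 10.0}
posterior = rerank(priors, evidence)
best = max(posterior, key=posterior.get)
print(best)  # 'speaker-has-telescope'
```

The same multiply-and-renormalize pattern works for any context source — location, calendar, social graph — each just contributes another evidence factor.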
That's context in one aspect. A different but related idea is that your intent — what you're looking for — also depends on where you are, so that's another place where contextual location is important. This is not new: there are a bunch of companies exploring the user's location for local search. Obviously, if I search for Japanese restaurants I'm going to get different results depending on where I am — on Yelp, for example. Then there are also companies, like Tempo AI, that focus on predicting what you might need based on your calendar entries; there's Cue, a startup that was recently bought by Apple, also in this space; and then there's obviously Google Now, which is able to ingest things like your email, make sense of it and understand that you have a flight or a hotel reservation, and then make use of that information to bring you relevant alerts when the time is right.
And finally, the last piece is recommender systems. We're all familiar with things like Amazon, where you get recommendations for books depending on the stuff you've bought before. The way these systems work is that they collect a lot of data about the users, then cluster the users and say: you're similar to these users, so you might also like this other book. And this is expanding everywhere: Netflix for movies, Spotify for music, LinkedIn and Facebook for people you might know, et cetera. So all these systems are using context to make predictions, or anticipate things that you might need.
So it is within this general context of the emergence of anticipatory search that we started this company. Expect Labs is a technology company based in San Francisco that we started about two and a half years ago, with the idea of creating a technology platform specially designed for real-time applications that are able to ingest a lot of updates and give you relevant contextual information.
In rough outline, the way it works is that we are able to receive real-time updates about where you are, what you might be saying, what you're reading — like a new email. And you can assign different weights to these modalities: something that I say, or something that I tweet, is going to get a higher weight than an email that I receive, which I may just skim rather than read deeply. We take all these inputs in real time, process them, and extract the important pieces of information from all the sources. That creates a dynamic model — our best representation of what the user is doing and their intent — and therefore we're able to search for information across many different data sources, to try to provide information that's going to be useful to that user at that point in time.
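The modality weighting just described can be sketched as a simple accumulator: each incoming event contributes to a term's weight in the context model according to how trustworthy its modality is. The weights and modality names here are invented, not the platform's actual values.

```python
# Sketch of weighting input modalities when building the dynamic
# user model: spoken or tweeted terms count more than terms from
# an email the user may only have skimmed. Weights are invented.

MODALITY_WEIGHT = {"speech": 1.0, "tweet": 0.9, "email": 0.3}

def build_context(events):
    """events: list of (modality, term). Returns term -> weight."""
    context = {}
    for modality, term in events:
        w = MODALITY_WEIGHT.get(modality, 0.1)
        context[term] = context.get(term, 0.0) + w
    return context

events = [
    ("speech", "japanese restaurant"),
    ("email", "quarterly report"),
    ("tweet", "japanese restaurant"),
]
ctx = build_context(events)
top = max(ctx, key=ctx.get)
print(top)  # 'japanese restaurant' outweighs the skimmed email
```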
As a showcase for this platform, we created MindMeld. MindMeld is right now an iPad app that understands your conversation and finds content as you speak. You can think of it a little bit like Skype: you can invite people and start talking, and then you'll get interesting content based on that — I'll give a demo in a second. An important aspect of the design of MindMeld is that we wanted to make it very easy to share information. If you've ever tried to have a kind of collaboration session using Skype, people quickly find — especially on the iPad — that it's difficult to share. Say you want to share an article: you have to leave Skype, find a browser, do some searches, find the URL, and then try to send the URL through the Skype IM, which may or may not be active. It's a bit cumbersome. So we wanted to make it very easy for users to discover, to navigate, and then to share information. And the stuff that you share becomes a permanent archive of the conversation that you can look back to and use.
So with that, let me give a little demo of MindMeld and see how that works.
So this is MindMeld, and you can see that I have access to some of the sessions, or conversations, that have taken place in the past. You can think of recurring meetings: every Tuesday you have your update with your colleagues, so you would join that session, because everybody is already invited, and you have all the context — the shared items, and the conversation that previously happened in that session. But for now I'm going to start a new session. I can give it a name, and I can make it friends-only, public, or invite-only.
And, if the connection works... this is now making a call to Facebook, to the Facebook API... okay, here we go. So let's say that I invite Alex.
Now, I'm the only one in the conversation; as soon as Alex joins, you would also see information about the speaker. One thing we found when talking to people is that, on some sort of conference call, people tend to google each other and find the LinkedIn profile — well, here it does that for you. And this is a discovery screen, so I'm the only one seeing this information, but if I decide to share, then everybody else in the conversation will see it. Which is why, for example, you can see it found the current location of the user — right here, in the NH Olomouc Congress hotel.
The most interesting part is when you have multiple speakers, but for now I'm just going to give a short real demo of how this looks. "Okay, MindMeld... So, I was wondering whether you saw the part about President Obama's brain mapping initiative. I saw this new technique, CLARITY, that makes brains transparent; that might be of help for this mapping initiative." So you can see that we show you ticker items here of what we recognize — we try to extract some of the key phrases — and then we do some post-processing and bring in relevant results.
Let's see what else. "Okay, MindMeld. So, we're going to have some friends over; maybe we should cook some Italian food. We could do a minestrone soup, or some pasta; maybe that would be nice." So you can see the way it works. If I like this result, for example, I can drag it and share it, and this is what becomes part of the archive, which then everybody in the conversation sees — and it also becomes a permanent archive that I can access through a browser.
Does anybody have a topic, or something they might be interested in? "Okay, MindMeld. So, Peter mentioned he was interested in deep belief neural networks; that's something that we've been talking about at this IEEE ASRU conference in Olomouc." So, one of the issues is that I think Peter and I are not connected on Facebook, because otherwise we would have found the right... let's stick to "IEEE," okay. One of the things that we do is look at the intersection of the social graphs of the different participants in the call, so that we can be better at disambiguating named entities. So if we had been connected, Peter would have shown up as a linked entity right here.
Alright. Let me go back to the presentation real quick.
So this is the platform that we've built, and if you want to dig a little bit deeper: one of the novelties, I think, is that we're combining traditional NLP with a more, let's call it, web-search-style approach. The interesting part is that we're able to model semantic relevance based on context — what the speakers have said, the user model, and also the different data sources that you have access to. So if someone says "where can we go for dinner?" and the other person says "I don't know, you like Japanese — any good place around Union Square?", we're building this incremental context about the overall intent of the conversation.
We're able to then do the usual natural language processing: part-of-speech tagging, noun phrase chunking, named entity extraction, anaphora resolution, semantic parsing, topic modeling, and some degree of discourse modeling and pragmatics. But then the other piece is the signal that we get from each of the different data sources. Think of my social graph, which I was mentioning; the local businesses that Factual or Yelp can give you; personal files, if you give us access to your Dropbox or your Google Drive, which we can treat as another data source; and then the more general web, with news, general content and videos. What's interesting is that even the responses we get when we do all these searches also inform us about what is relevant and what is not in that particular conversation. Put another way: if, for example, you were to build an application that only deals with movies and TV shows and actors, then any reference to something else would not find a match and would basically not give you results. But that also means it will be much more precise in terms of the answers it gives — the relevancy of the content.
And this is something that, because we have a very scalable and fast backend — and we have some caching as well — allows us to do multiple searches, and basically lets us compute the semantic relevance of an utterance in a very dynamic way, based on context and also based on the type of results that we obtain.
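This two-sided relevance — accumulated context on one side, strength of what the data sources actually return on the other — can be sketched as a simple product. The scoring function, the hit-count normalization and all the numbers are invented; the point is only that a term with no hits in the domain sources scores zero no matter how salient it is in the conversation.

```python
# Sketch of scoring semantic relevance both from conversational
# context and from the strength of the data-source response. The
# formula and all numbers are illustrative inventions.

def relevance(term, context, search_hits):
    """context: term -> accumulated weight; search_hits: term -> hits."""
    ctx = context.get(term, 0.0)
    hits = search_hits.get(term, 0)
    if hits == 0:
        return 0.0  # no match in the domain's data sources at all
    return ctx * min(1.0, hits / 10.0)

context = {"woody allen": 0.8, "dinner": 0.5}
movie_source_hits = {"woody allen": 25, "dinner": 0}

print(relevance("woody allen", context, movie_source_hits))  # 0.8
print(relevance("dinner", context, movie_source_hits))       # 0.0
```

This is the movies-only-application behavior from the talk: "dinner" is live in the conversation, but a movie source returns nothing for it, so nothing is shown.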
Since this is a technical conference: the ongoing R&D, as you can imagine, is quite substantial. On the speech side, we have two engines — an embedded engine that runs on the iPad, and cloud-based speech processing — and an interesting research question is how to balance the two: how to listen continuously on the one hand, but also be robust to network issues on the other. Then, in terms of practical usage, there are things you can imagine: detecting suboptimal audio conditions, like when the speaker is too far from the mic, or noisy environments; and, as we all know, heavy accents are an issue. One of the things we found is that, because it's an iPad, it's very natural for people to leave it on the table, and two things happen: they speak to each other from far away, and there can be multiple people speaking to the same device — our models try to do some speaker adaptation, and sometimes that doesn't work that well. And then there's the issue — kind of the holy grail — of whether we can detect a sequence of wrong, ungrammatical words, when the recognizer has gone off track. Of course there are techniques to do that, but we're trying to improve their accuracy.
Then, in terms of natural language processing and information retrieval, there are the classic NLP problems, like word sense disambiguation — although obviously the knowledge graph helps a lot — and anaphora resolution, some of which we do with the social graph. An important aspect is that the knowledge graph is useful, but how do you dynamically update it? How do you keep it fresh? We have some techniques for that, but it's ongoing research.
Then another very important aspect is deciding what is worth searching for. As we all know, if you leave a speech engine on... I remember an anecdote from Alex Waibel, who told me once that he had an engine running in his house, and when he was doing the dishes, with all the clinking and clanking, the engine was spouting all kinds of interesting hypotheses. As has been alluded to, of course you can have fairly robust voice activity detection, but there's always room for improvement.
The relevance model, as I mentioned, is not just about understanding that something is speech, but also about detecting how relevant something is within the context. And this brings up this other point of interruptibility. MindMeld is a bit too verbose — this is just a showcase of what you can do, and also the iPad has a lot of screen real estate, so it shows all these different articles. In practice, and through the API I'll talk about in a second, you have a lot of control over how much you want to be interrupted — when you want a search result or an article to be shown. And this is a function of at least two factors. One is how explicit the request is: how much the user wants to have certain information. And the other is what I was mentioning about the nature of the information found: how strong the signal from the data sources is about the relevancy of what I'm going to show.
What I mean by that is, think of the difference between "what is the latest movie by Woody Allen?" versus my having been talking about Woody Allen and mentioning in passing that his latest movie, et cetera. One is a direct question, where the intent is clear — more like a Siri-like application, where I'm trying to find a specific piece of information. The other is a reference made sort of in passing. So that would be this understanding of how eager I am to receive that bit of information. That's work that is ongoing — being able to model that.
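The two-factor interruptibility idea reduces to a simple gate in sketch form: show a result only when the explicitness of the request times the strength of the data-source signal clears a threshold. The threshold and the example scores are invented for illustration.

```python
# Sketch of interruptibility as a two-factor gate: explicitness of
# the request times strength of the data-source signal, against a
# threshold. All numbers here are invented.

def should_show(explicitness, source_signal, threshold=0.5):
    """explicitness: ~1.0 for a direct question ("what is the latest
    movie by Woody Allen?"), much lower for a passing reference.
    source_signal: how strongly the data sources matched."""
    return explicitness * source_signal >= threshold

# Direct, Siri-like question with a strong match: interrupt.
print(should_show(1.0, 0.9))   # True
# The same strong match, but only a passing mention: stay quiet.
print(should_show(0.3, 0.9))   # False
```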
And then, finally, we get a fair amount of feedback from this. Especially when the user shares an article — that's a pretty strong signal that it was relevant. On the negative side — I haven't shown you this — you can flick one of the entries on the left-hand side, the "eager items" as we call them, and delete them; that's a kind of negative feedback about a certain entity or key phrase that was not deemed relevant by the user. How to optimize the learning we can obtain from that user feedback is also something we're working on, especially because the decision to show a certain article is complex enough that sometimes it's hard to assign the right sort of credit or blame for how we got there.
So, just to wrap up what we're doing: there are two products we're offering. One is the MindMeld app, which is what you've seen here — and as a matter of fact, the MindMeld app is going live on the Apple App Store tonight. We've been working on it for a while, and it's finally happening, so you're welcome to try it out. I guess "tonight" will be whatever time zone your App Store is set to, so I think New Zealand users might already be able to download it, and for the US it will be in a few hours.
So that's MindMeld. The other thing is that we're also offering the same functionality through an API — a REST-based API. You're able to create sessions and users, send real-time updates, and then query for what is most relevant; you can also select the different data sources. So at any given point you can ask for what the model thinks is the most relevant set of articles, with certain parameters for ranking, et cetera.
We already have a number of interested parties — for example some of our backers, which include, by the way, Google Ventures, and also Samsung, Intel Capital and Liberty Global; they're among the backers we're trying to build some prototypes with. So I encourage you to try it out. And I was thinking that, because I'm actually going to be missing the launch party that is happening in San Francisco, I'm going to take our banquet at the Archbishop's Palace as the launch party for MindMeld. That's what I wanted to say, and we have some time for questions.
Q: I was wondering how you track the user's state. In the example — say we want to eat something — is it still sticking to the restaurant domain? In the example you showed, you're only adding information; what about when you change information that you previously used, and switch to another domain? How do you track that?
A: There are two kinds of information we use for that. One is simply time: as time passes, you decay certain previous entries. The other is some kind of topic detection and clustering that we're doing, so that sentences that still seem to relate to the same topic help ground that topic. And then there's also some user modeling about your previous sessions, so that we have certain prior weights. I'm not going to cite a specific algorithm that we use, but you can imagine there are statistical techniques to do that modeling — we're a small startup, we cannot reveal everything.
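The time-decay part of that answer can be sketched as exponential decay of context entries, so older topics fade and the model can drift to a new domain. The half-life is an invented parameter, not anything the speaker disclosed.

```python
# Sketch of decaying context entries over time so older topics
# fade: each entry's weight halves every 60 seconds (the half-life
# is an invented parameter).

HALF_LIFE_S = 60.0

def decayed_weight(weight, age_seconds):
    return weight * 0.5 ** (age_seconds / HALF_LIFE_S)

# entries: (term, original weight, age in seconds)
entries = [("italian food", 1.0, 300.0),   # five minutes old
           ("deep learning", 1.0, 10.0)]   # just mentioned

scores = {t: decayed_weight(w, age) for t, w, age in entries}
print(max(scores, key=scores.get))  # 'deep learning' now dominates
```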
Thank you very much, that's great. Another question?

Q: At one point you happened to mention "ASRU" and "Olomouc," and funnily enough it came out as "ASR," "you," "SLU" and "Columbus." It would seem that what you've shown us are ways of organizing information at the output of the recognition process. But — particularly in that example — not only does the system know exactly where you are, it's on the map, and it might even figure out that you're at this very conference; yet these things are not being reflected in the lower-level transcription process. So I was wondering — you don't have to tell us anything — whether it's possible to feed that back and constrain the recognition.
A: Well, it's obviously a research issue how you make the most of the contextual information, and unfortunately ASR — especially cloud-based ASR — at this point doesn't fully support the kind of adaptation and dynamic modification we would like to do. But that's kind of an obvious thing to do: in the same way that you construct larger contexts and find all the people you're related to, adapting to something like the location and the towns nearby would be a very natural thing to do. But we're not doing this.

Q: I have to say your recognizer handles it better than a previous one I tried, which used to get it wrong — so when you search for it now, it comes out okay. So this is better.

A: Well, the ASR is not at a hundred percent accuracy.

Q: Which engine do you use?

A: Actually, we use several, including Nuance's and Google's.
Q: Thanks for the talk. I was also wondering about privacy concerns. I was under the impression that the more I want to interact with this MindMeld assistant, the more I need to be transparent to it with my personal data.
A: Well, I actually have a philosophical reflection on that: as a society, with this technology, we are moving towards what I'm calling the "transparent brain." If you think closely about it, the better we are at collecting data about users and modeling their intentions, the closer we get to a point where you can almost auto-complete your thought — you start typing a query, and it already knows what you might want. Of course that's still a little bit of science fiction, but we're kind of getting there, and I think the way to address it is by being very transparent about this process, and giving you full control over what it is you want to share, and for how long. Because that's really the only way to modulate it. It's not enough to say "I'm going to opt out and not use any of these anticipatory-search systems," because basically they will be unavoidable. So I think what we need to do is have some clear settings about what you want to share with the app and for how long, and then ensure on the back end that that is really the only usage of that information. As an example, we're not recording the voice: the only thing that is permanent in this particular MindMeld application is the articles that you've specifically shared.
Q: Maybe if you were looking at Pedro — if you asked about Pedro and it pulled up his public record — you might see something you wouldn't want to see. So is there any way, when you're looking at your space, to have certain contexts that you're searching within when you bring information back — let's say distinguishing a social setting from some other context?
A: Yes. One of the shortcomings of this little demo is, first of all, that there was only one speaker — it's always more interesting when it's a real conversation — and second, that it wasn't really a wide-ranging conversation about a certain topic, which is where MindMeld excels. Say you want to plan a vacation with some of your friends somewhere: you say, well, we're going to go here, then you explore the different places you can stay, things you can do, and you share that. When you have a long-ranging conversation with a kind of overarching goal, that's where it works best. If you keep switching around, then it becomes more like a Siri-like search that doesn't add much.
Q: Just a quick question: how do you build your pronunciations? If you look at "ASR," you would spell it out, but "ICASSP" you actually say as a word.
A: It's mostly in the lexicon. Certain abbreviations are more typically spelled out letter by letter, like, I guess, "ASR," and some other ones, like "ICASSP," would be spoken as a word. So it goes into the pronunciation lexicon, pretty much.
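That lexicon distinction can be sketched as entries that are either a sequence of letter-name chunks or a single word-like pronunciation. The pronunciations below are rough ad-hoc spellings, not a real phone set, and the heuristic is invented for illustration.

```python
# Sketch of a pronunciation lexicon distinguishing letter-by-letter
# abbreviations ("ASR" -> "a. s. r.") from acronyms spoken as one
# word ("ICASSP"). Pronunciations are rough, not a real phone set.

lexicon = {
    "ASR":    ["ey", "eh s", "aa r"],   # spelled out, one chunk per letter
    "ICASSP": ["ay k ae s p"],          # spoken as a single word
}

def is_spelled_out(word):
    """Heuristic: spelled-out entries are stored as several spoken
    chunks; word-like acronyms as a single chunk."""
    return len(lexicon[word]) > 1

print(is_spelled_out("ASR"))     # True
print(is_spelled_out("ICASSP"))  # False
```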
Any more questions?