Alright, first let me thank you for the invitation and the opportunity to come to Olomouc. It's funny, because a friend of mine said, "Oh, you're going to the middle of nowhere," and I said, "No, I'm going to the middle of Moravia." I really enjoy coming to new places I've never been to. So I'm going to talk about a new technology trend that is really emerging and taking off, and that is this notion of anticipatory search, and how much speech can contribute to it.
Here is, sort of, our vision. Imagine you're having a conversation with a friend, and she tells me where we need to be in five minutes, and as I'm putting down the phone I look at the screen, and this is what I want to see: I basically want to have the directions to wherever it is we need to go, where we need to be in five minutes. And if you think about it, we have all the pieces already, right? We have user location, we have good maps, we have good directions, we have speech recognition, we have some reasonable understanding. So it's kind of a matter of putting it all together into one compelling application.
So that's kind of the premise. We realized that the way you find information is changing, and we're moving towards a kind of query-free search. Instead of having to be proactive when you have an information need — having to fire up a browser, find a search box, type in your query, get results — it can be much more proactive: given your context, what you've said and where you are, the information can come to you, as opposed to you having to find the information.
But of course we're not alone in this idea. Ray Kurzweil, the technologist and futurist who recently joined Google, has a pretty similar vision: that search engines shouldn't just sit and wait to be asked questions. They should listen in on our conversations — what we say, what we write, where we are — and anticipate our needs. And that's pretty much the same premise that Expect Labs was built on.
So let's look at some of the enabling trends for anticipatory search. There are mobile devices, there's AI that keeps making progress, and if you put it all together, there are applications that can take contextual information and start making good predictions about what the informational needs of the user might be.
So let's look at these in more detail. It's obviously no surprise how much data is being generated: something can happen pretty much anywhere, and a few minutes later there are a couple of videos on YouTube already about that event, and hundreds of pictures. In fact, there are technologies now that try to recreate a sort of 3D map of a scene just from the fact that you have images from different points of view.
Then there's the amazing growth of mobile devices. This is a statistic for smartphones and tablets running iOS and Android. In absolute counts, the US and China, because of their populations, have the highest numbers, but if you look at the growing markets, it's basically Southeast Asia, Latin America and other emerging markets. So we're ending up in a position where pretty much any adult is going to have a smartphone in their pocket.
And that really changes the possibilities of what you can do, because these mobile devices have a lot of sensors. Of course we have cameras, we have microphones, there's a GPS. But if you look closely, there are also gesture sensors, proximity sensors, gyroscopes, accelerometers — there's even a humidity sensor, so that if you drop your phone in the water they can void the warranty — and a barometer. So basically it turns out that these devices we carry in our pockets, to some extent, know more about where we are than we ourselves might be aware.
And there's more. We all know about Google Glass, which has a bone-conduction transducer in addition to everything else. Then there are more futuristic things: there's research that is able to do recognition based just on facial muscle activity — you have these sensors, so I could be talking without any phonation and you'd still be able to recognize it. In fact, I was talking with Mari earlier that this may be an interesting challenge for some future evaluation. Then there are these more futuristic electroencephalogram headsets; it's still not very clear what you can do with them, but they're becoming more stylish, so people might start wearing them. And then there are interesting things like this patent application from Motorola, where they basically have the idea that we'll all wear an electronic tattoo here on our necks, with a microphone that can also help with speech recognition. So there are all kinds of ideas about how to collect more data about what we do and where we are.
And then there's the progress on the back end: once we get all this information, what can we do with it? There's been some talk here about how much progress we're making. We're all familiar with this chart of the famous word error rates for different tasks — are we reaching some sort of plateau? But we know that's not the case, because there's work on dynamic speaker adaptation, there's all this work on deep neural networks that we've been talking about, and work on extremely large language models, all of which keep making recognition better. There's also work on natural language understanding, around conversation and topic modeling, and there's the knowledge graph I'll talk about in a second. If you put all of these together with some machine learning algorithms, we're getting to a point where we can start to be reasonably good at understanding a human conversation.
In this audience this is obviously very well known, but it is quite remarkable that we now have these fairly substantial improvements in word accuracy thanks to deep neural networks — there's work from Microsoft, IBM, Google, and from others in this room. Something you might not be as familiar with is the fact that deep learning is also being applied to natural language understanding. I want to make sure you're aware of the so-called Stanford Sentiment Treebank, which was recently released by Stanford University. There's a nice paper, "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank," by Richard Socher and others, from the same group as Andrew Ng and Chris Manning.
What they did is publish — make available — this corpus of over eleven thousand annotated sentences, each parsed into a binary parse tree, where every node has been annotated with a sentiment label, from very negative through neutral to very positive. And the interesting part is how they make use of multiple layers of a deep neural network to model the sentiment at every level of the parse tree, so that, composing bottom-up, they can find the sentiment value at any node.
So for example, look at the sentence "This film doesn't care about cleverness, wit or any other kind of intelligent humour." There are words like "humour," which is a very positive one, and "intelligent" as well, so that whole subtree is positive — except when you reach the negation, "doesn't care about," and the overall sentiment becomes negative. This is very powerful, because until now the traditional model has been bag-of-words in a vector space, and it's hard to model these relationships there. We all know that language has a deep, recursive structure, and there are long-distance relationships between constituents of a sentence that are hard to capture unless you really take advantage of the parse tree.
Applying this, they get gains of, what, twenty-five percent improvement in the accuracy of sentiment recognition over this corpus — which, by the way, is about movies; it's from movie reviews. So it's encouraging that this technique, which is now popular in ASR, can also be transferred to natural language understanding.
Then there's another very important trend in how, the way I see it, we can improve natural language understanding. It was said earlier today that the "U" in ASRU has kind of gone missing a bit; I think knowledge graphs are really the answer to that. And why is that? Because we can go from these kind of disembodied strings to anchored entities in the real world — there's a nice buzz-phrase that says "from strings to things."
So what is a knowledge graph? You can really think of it as a giant network where the nodes are concepts and the links relate one entity to another. For example, George Clooney appears in Ocean's Twelve — these are movies and actors, and how they relate to each other.
And the interesting part is, if you know some history, you might remember Cyc, which was an attempt — OpenCyc still exists — to create a very complex representation of all known human knowledge, especially common sense. But the problem is that it was built by hand, and they spent a lot of time deciding whether a property of an object is intrinsic or extrinsic — kind of splitting hairs over something that is not that relevant. The way these knowledge graphs are built now is different. You start with Wikipedia — there are datasets like DBpedia, a machine-readable version of Wikipedia, that you can ingest — and then you start extracting entities and relationships. There's still a certain degree of manual curation, but you can get pretty far with an automatic process.
And companies are doing this. Freebase, for example, has a knowledge graph with tens of millions of entities and relationship properties — connections. Microsoft has their own, called Satori, with some three hundred million entities. Google's has five hundred seventy million entities and eighteen billion properties. And then there are also more specialized ones, like Factual, which is a database of places, points of interest and local businesses, now at sixty-six million entries across fifty different countries. And then of course you can take social media, and see their graph of entities and relations — where the entities are people — as a version of a knowledge graph: LinkedIn is now at, what, two hundred fifty million users, and Facebook is over a billion.
So if you think carefully about this, it means that any time you refer to a concept or named entity — a place, a product, an organisation or a person — you could grab that and map it onto one of these entities. The traditional idea, more on the linguistic side, was: we do part-of-speech tagging, we find the subject and the object, we establish some relationship between them. But that's still not really grounded — it's still strings. It's a bit easier with the knowledge graph: you can infer these things and say, you're referring to this movie, you're referring to that person, and then there are all kinds of inferences and disambiguations you can do over the knowledge graph. So I think the fact that we can start to represent pretty much all human knowledge — at least in terms of concepts and entities — in a machine-readable way is very important, and it's a very big step towards real natural language understanding, because it's more grounded.
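To make the "strings to things" idea concrete, here is a minimal sketch of a knowledge graph as a set of (subject, relation, object) triples, with a toy grounding function that maps a surface string onto an entity node. The entities and relation names are made up for illustration; a real graph would also need fuzzy matching and disambiguation.

```python
# Minimal sketch of a knowledge graph: nodes are entities, edges are
# typed relations. Triples and relation names here are invented.

knowledge_graph = {
    ("George Clooney", "appears_in", "Ocean's Twelve"),
    ("Ocean's Twelve", "instance_of", "film"),
    ("George Clooney", "instance_of", "actor"),
}

def neighbors(entity):
    """All (relation, other-entity) pairs touching an entity."""
    out = []
    for s, r, o in knowledge_graph:
        if s == entity:
            out.append((r, o))
        elif o == entity:
            out.append((r, s))
    return sorted(out)

def ground(mention):
    """Map a surface string onto a graph entity (toy exact match)."""
    entities = {e for s, _, o in knowledge_graph for e in (s, o)}
    for e in entities:
        if mention.lower() == e.lower():
            return e
    return None

print(ground("george clooney"))   # the string becomes a graph node
print(neighbors("George Clooney"))
```

Once a mention is grounded to a node, everything connected to that node — films, co-stars, types — becomes available for inference, which is exactly what a bag of disembodied strings cannot give you.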
One of the usages of a knowledge graph is disambiguation. There's a classic sentence from linguistics: "I saw the man on the hill with the telescope," which can be interpreted in a variety of ways, some of which are depicted in this funny graphic. It's what linguists call a prepositional-phrase attachment problem: is "with the telescope" attached to the hill, or to the man, or to me? And "on the hill" — does it attach to the man, or to me?
Traditionally there's been really no way to solve this except through context. But imagine that you have access to my Amazon purchase history, and you saw that I just bought a telescope two weeks ago. Then you would have this idea of priors: you could have a very strong prior that it is me who is using the telescope to see the man on the hill. So it's obvious that the more context — and the more different sources of context — we have access to, the more it's going to help disambiguate natural language.
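The telescope example can be sketched as a tiny reranking step: start with flat priors over the possible attachments, multiply in an evidence-based boost from a (hypothetical) purchase history, and renormalize. All of the numbers and the evidence source are invented for illustration.

```python
# Toy sketch of context-based reranking for PP attachment in
# "I saw the man on the hill with the telescope". Priors and the
# purchase-history boost are invented numbers.

priors = {
    "speaker-has-telescope": 0.2,  # "with the telescope" modifies "saw"
    "man-has-telescope": 0.4,
    "hill-has-telescope": 0.4,
}

def rerank(priors, evidence):
    """Multiply in evidence boosts, then renormalize to sum to 1."""
    scores = {k: priors[k] * evidence.get(k, 1.0) for k in priors}
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

# Suppose purchase history shows I bought a telescope two weeks ago:
evidence = {"speaker-has-telescope": 10.0}
posterior = rerank(priors, evidence)
best = max(posterior, key=posterior.get)
print(best)  # 'speaker-has-telescope'
```

The same multiply-and-renormalize pattern works for any context source — location, calendar, social graph — each just contributes another evidence factor.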
That's context in one aspect. A different but related idea is that your intent — what you're looking for — also depends on where you are, so that's another place where contextual location is important. This is not new: there are a bunch of companies exploring the user's location for local search. Obviously, if I search for Japanese restaurants I'm going to get different results depending on where I am — on Yelp, for example. Then there are also companies, like Tempo AI, that focus on predicting what you might need based on your calendar entries; there's Cue, a startup that was recently bought by Apple, also in this space; and then there's obviously Google Now, which is able to ingest things like your email, make sense of it and understand that you have a flight or a hotel reservation, and then make use of that information to bring you relevant alerts when the time is right.
And finally, the last piece is recommender systems. We're all familiar with things like Amazon, where you get recommendations for books depending on the stuff you've bought before. The way these systems work is that they collect a lot of data about the users, then cluster the users and say: you're similar to these users, so you might also like this other book. And this is expanding everywhere: Netflix for movies, Spotify for music, LinkedIn and Facebook for people you might know, et cetera. So all these systems are using context to make predictions, or anticipate things that you might need.
So it is within this general context of the emergence of anticipatory search that we started this company. Expect Labs is a technology company based in San Francisco that we started about two and a half years ago, with the idea of creating a technology platform specially designed for real-time applications that are able to ingest a lot of updates and give you relevant contextual information.
In rough outline, the way it works is that we are able to receive real-time updates about where you are, what you might be saying, what you're reading — like a new email. And you can assign different weights to these modalities: something that I say, or something that I tweet, is going to get a higher weight than an email that I receive, which I may just skim rather than read deeply. We take all these inputs in real time, process them, and extract the important pieces of information from all the sources. That creates a dynamic model — our best representation of what the user is doing and their intent — and therefore we're able to search for information across many different data sources, to try to provide information that's going to be useful to that user at that point in time.
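The modality weighting just described can be sketched as a simple accumulator: each incoming event contributes to a term's weight in the context model according to how trustworthy its modality is. The weights and modality names here are invented, not the platform's actual values.

```python
# Sketch of weighting input modalities when building the dynamic
# user model: spoken or tweeted terms count more than terms from
# an email the user may only have skimmed. Weights are invented.

MODALITY_WEIGHT = {"speech": 1.0, "tweet": 0.9, "email": 0.3}

def build_context(events):
    """events: list of (modality, term). Returns term -> weight."""
    context = {}
    for modality, term in events:
        w = MODALITY_WEIGHT.get(modality, 0.1)
        context[term] = context.get(term, 0.0) + w
    return context

events = [
    ("speech", "japanese restaurant"),
    ("email", "quarterly report"),
    ("tweet", "japanese restaurant"),
]
ctx = build_context(events)
top = max(ctx, key=ctx.get)
print(top)  # 'japanese restaurant' outweighs the skimmed email
```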
As a showcase for this platform, we created MindMeld. MindMeld is right now an iPad app that understands your conversation and finds content as you speak. You can think of it a little bit like Skype: you can invite people and start talking, and then you'll get interesting content based on that — I'll give a demo in a second. An important aspect of the design of MindMeld is that we wanted to make it very easy to share information. If you've ever tried to have a kind of collaboration session using Skype, people quickly find — especially on the iPad — that it's difficult to share. Say you want to share an article: you have to leave Skype, find a browser, do some searches, find the URL, and then try to send the URL through the Skype IM, which may or may not be active. It's a bit cumbersome. So we wanted to make it very easy for users to discover, to navigate, and then to share information. And the stuff that you share becomes a permanent archive of the conversation that you can look back to and use.
So with that, let me give a little demo of MindMeld and see how that works.
So this is MindMeld, and you can see that I have access to some of the sessions, or conversations, that have taken place in the past. You can think of recurring meetings: every Tuesday you have your update with your colleagues, so you would join that session, because everybody is already invited, and you have all the context — the shared items, and the conversation that previously happened in that session. But for now I'm going to start a new session. I can give it a name, and I can make it friends-only, public, or invite-only.
And, if the connection works... this is now making a call to Facebook, to the Facebook API... okay, here we go. So let's say that I invite Alex.
Now, I'm the only one in the conversation; as soon as Alex joins, you would also see information about the speaker. One thing we found when talking to people is that, on some sort of conference call, people tend to google each other and find the LinkedIn profile — well, here it does that for you. And this is a discovery screen, so I'm the only one seeing this information, but if I decide to share, then everybody else in the conversation will see it. Which is why, for example, you can see it found the current location of the user — right here, in the NH Olomouc Congress hotel.
The most interesting part is when you have multiple speakers, but for now I'm just going to give a short real demo of how this looks. "Okay, MindMeld... So, I was wondering whether you saw the part about President Obama's brain mapping initiative. I saw this new technique, CLARITY, that makes brains transparent; that might be of help for this mapping initiative." So you can see that we show you ticker items here of what we recognize — we try to extract some of the key phrases — and then we do some post-processing and bring in relevant results.
Let's see what else. "Okay, MindMeld. So, we're going to have some friends over; maybe we should cook some Italian food. We could do a minestrone soup, or some pasta; maybe that would be nice." So you can see the way it works. If I like this result, for example, I can drag it and share it, and this is what becomes part of the archive, which then everybody in the conversation sees — and it also becomes a permanent archive that I can access through a browser.
Does anybody have a topic, or something they might be interested in? "Okay, MindMeld. So, Peter mentioned he was interested in deep belief neural networks; that's something that we've been talking about at this IEEE ASRU conference in Olomouc." So, one of the issues is that I think Peter and I are not connected on Facebook, because otherwise we would have found the right... let's stick to "IEEE," okay. One of the things that we do is look at the intersection of the social graphs of the different participants in the call, so that we can be better at disambiguating named entities. So if we had been connected, Peter would have shown up as a linked entity right here.
Alright. Let me go back to the presentation real quick.
So this is the platform that we've built, and if you want to dig a little bit deeper: one of the novelties, I think, is that we're combining traditional NLP with a more, let's call it, web-search-style approach. The interesting part is that we're able to model semantic relevance based on context — what the speakers have said, the user model, and also the different data sources that you have access to. So if someone says "where can we go for dinner?" and the other person says "I don't know, you like Japanese — any good place around Union Square?", we're building this incremental context about the overall intent of the conversation.
We're able to then do the usual natural language processing: part-of-speech tagging, noun phrase chunking, named entity extraction, anaphora resolution, semantic parsing, topic modeling, and some degree of discourse modeling and pragmatics. But then the other piece is the signal that we get from each of the different data sources. Think of my social graph, which I was mentioning; the local businesses that Factual or Yelp can give you; personal files, if you give us access to your Dropbox or your Google Drive, which we can treat as another data source; and then the more general web, with news, general content and videos. What's interesting is that even the responses we get when we do all these searches also inform us about what is relevant and what is not in that particular conversation. Put another way: if, for example, you were to build an application that only deals with movies and TV shows and actors, then any reference to something else would not find a match and would basically not give you results. But that also means it will be much more precise in terms of the answers it gives — the relevancy of the content.
And this is something that, because we have a very scalable and fast backend — and we have some caching as well — allows us to do multiple searches, and basically lets us compute the semantic relevance of an utterance in a very dynamic way, based on context and also based on the type of results that we obtain.
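This two-sided relevance — accumulated context on one side, strength of what the data sources actually return on the other — can be sketched as a simple product. The scoring function, the hit-count normalization and all the numbers are invented; the point is only that a term with no hits in the domain sources scores zero no matter how salient it is in the conversation.

```python
# Sketch of scoring semantic relevance both from conversational
# context and from the strength of the data-source response. The
# formula and all numbers are illustrative inventions.

def relevance(term, context, search_hits):
    """context: term -> accumulated weight; search_hits: term -> hits."""
    ctx = context.get(term, 0.0)
    hits = search_hits.get(term, 0)
    if hits == 0:
        return 0.0  # no match in the domain's data sources at all
    return ctx * min(1.0, hits / 10.0)

context = {"woody allen": 0.8, "dinner": 0.5}
movie_source_hits = {"woody allen": 25, "dinner": 0}

print(relevance("woody allen", context, movie_source_hits))  # 0.8
print(relevance("dinner", context, movie_source_hits))       # 0.0
```

This is the movies-only-application behavior from the talk: "dinner" is live in the conversation, but a movie source returns nothing for it, so nothing is shown.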
Since this is a technical conference: the ongoing R&D, as you can imagine, is quite substantial. On the speech side, we have two engines — an embedded engine that runs on the iPad, and cloud-based speech processing — and an interesting research question is how to balance the two: how to listen continuously on the one hand, but also be robust to network issues on the other. Then, in terms of practical usage, there are things you can imagine: detecting suboptimal audio conditions, like when the speaker is too far from the mic, or noisy environments; and, as we all know, heavy accents are an issue. One of the things we found is that, because it's an iPad, it's very natural for people to leave it on the table, and two things happen: they speak to each other from far away, and there can be multiple people speaking to the same device — our models try to do some speaker adaptation, and sometimes that doesn't work that well. And then there's the issue — kind of the holy grail — of whether we can detect a sequence of wrong, ungrammatical words, when the recognizer has gone off track. Of course there are techniques to do that, but we're trying to improve their accuracy.
Then, in terms of natural language processing and information retrieval, there are the classic NLP problems, like word sense disambiguation — although obviously the knowledge graph helps a lot — and anaphora resolution, some of which we do with the social graph. An important aspect is that the knowledge graph is useful, but how do you dynamically update it? How do you keep it fresh? We have some techniques for that, but it's ongoing research.
Then another very important aspect is deciding what is worth searching for. As we all know, if you leave a speech engine on... I remember an anecdote from Alex Waibel, who told me once that he had an engine running in his house, and when he was doing the dishes, with all the clinking and clanking, the engine was spouting all kinds of interesting hypotheses. As has been alluded to, of course you can have fairly robust voice activity detection, but there's always room for improvement.
The relevance model, as I mentioned, is not just about understanding that something is speech, but also about detecting how relevant something is within the context. And this brings up this other point of interruptibility. MindMeld is a bit too verbose — this is just a showcase of what you can do, and also the iPad has a lot of screen real estate, so it shows all these different articles. In practice, and through the API I'll talk about in a second, you have a lot of control over how much you want to be interrupted — when you want a search result or an article to be shown. And this is a function of at least two factors. One is how explicit the request is: how much the user wants to have certain information. And the other is what I was mentioning about the nature of the information found: how strong the signal from the data sources is about the relevancy of what I'm going to show.
What I mean by that is, think of the difference between "what is the latest movie by Woody Allen?" versus my having been talking about Woody Allen and mentioning in passing that his latest movie, et cetera. One is a direct question, where the intent is clear — more like a Siri-like application, where I'm trying to find a specific piece of information. The other is a reference made sort of in passing. So that would be this understanding of how eager I am to receive that bit of information. That's work that is ongoing — being able to model that.
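The two-factor interruptibility idea reduces to a simple gate in sketch form: show a result only when the explicitness of the request times the strength of the data-source signal clears a threshold. The threshold and the example scores are invented for illustration.

```python
# Sketch of interruptibility as a two-factor gate: explicitness of
# the request times strength of the data-source signal, against a
# threshold. All numbers here are invented.

def should_show(explicitness, source_signal, threshold=0.5):
    """explicitness: ~1.0 for a direct question ("what is the latest
    movie by Woody Allen?"), much lower for a passing reference.
    source_signal: how strongly the data sources matched."""
    return explicitness * source_signal >= threshold

# Direct, Siri-like question with a strong match: interrupt.
print(should_show(1.0, 0.9))   # True
# The same strong match, but only a passing mention: stay quiet.
print(should_show(0.3, 0.9))   # False
```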
And then, finally, we get a fair amount of feedback from this. Especially when the user shares an article — that's a pretty strong signal that it was relevant. On the negative side — I haven't shown you this — you can flick one of the entries on the left-hand side, the "eager items" as we call them, and delete them; that's a kind of negative feedback about a certain entity or key phrase that was not deemed relevant by the user. How to optimize the learning we can obtain from that user feedback is also something we're working on, especially because the decision to show a certain article is complex enough that sometimes it's hard to assign the right sort of credit or blame for how we got there.
So, just to wrap up what we're doing: there are two products we're offering. One is the MindMeld app, which is what you've seen here — and as a matter of fact, the MindMeld app is going live on the Apple App Store tonight. We've been working on it for a while, and it's finally happening, so you're welcome to try it out. I guess "tonight" will be whatever time zone your App Store is set to, so I think New Zealand users might already be able to download it, and for the US it will be in a few hours.
So that's MindMeld. The other thing is that we're also offering the same functionality through an API — a REST-based API. You're able to create sessions and users, send real-time updates, and then query for what is most relevant; you can also select the different data sources. So at any given point you can ask for what the model thinks is the most relevant set of articles, with certain parameters for ranking, et cetera.
We already have a number of interested parties — for example some of our backers, which include, by the way, Google Ventures, and also Samsung, Intel Capital and Liberty Global; they're among the backers we're trying to build some prototypes with. So I encourage you to try it out. And I was thinking that, because I'm actually going to be missing the launch party that is happening in San Francisco, I'm going to take our banquet at the Archbishop's Palace as the launch party for MindMeld. That's what I wanted to say, and we have some time for questions.
Q: I was wondering how you track the user's state. In the example — say we want to eat something — is it still sticking to the restaurant domain? In the example you showed, you're only adding information; what about when you change information that you previously used, and switch to another domain? How do you track that?
A: There are two kinds of information we use for that. One is simply time: as time passes, you decay certain previous entries. The other is some kind of topic detection and clustering that we're doing, so that sentences that still seem to relate to the same topic help ground that topic. And then there's also some user modeling about your previous sessions, so that we have certain prior weights. I'm not going to cite a specific algorithm that we use, but you can imagine there are statistical techniques to do that modeling — we're a small startup, we cannot reveal everything.
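The time-decay part of that answer can be sketched as exponential decay of context entries, so older topics fade and the model can drift to a new domain. The half-life is an invented parameter, not anything the speaker disclosed.

```python
# Sketch of decaying context entries over time so older topics
# fade: each entry's weight halves every 60 seconds (the half-life
# is an invented parameter).

HALF_LIFE_S = 60.0

def decayed_weight(weight, age_seconds):
    return weight * 0.5 ** (age_seconds / HALF_LIFE_S)

# entries: (term, original weight, age in seconds)
entries = [("italian food", 1.0, 300.0),   # five minutes old
           ("deep learning", 1.0, 10.0)]   # just mentioned

scores = {t: decayed_weight(w, age) for t, w, age in entries}
print(max(scores, key=scores.get))  # 'deep learning' now dominates
```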
Thank you very much, that's great. Another question?

Q: At one point you happened to mention "ASRU" and "Olomouc," and funnily enough it came out as "ASR," "you," "SLU" and "Columbus." It would seem that what you've shown us are ways of organizing information at the output of the recognition process. But — particularly in that example — not only does the system know exactly where you are, it's on the map, and it might even figure out that you're at this very conference; yet these things are not being reflected in the lower-level transcription process. So I was wondering — you don't have to tell us anything — whether it's possible to feed that back and constrain the recognition.
A: Well, it's obviously a research issue how you make the most of the contextual information, and unfortunately ASR — especially cloud-based ASR — at this point doesn't fully support the kind of adaptation and dynamic modification we would like to do. But that's kind of an obvious thing to do: in the same way that you construct larger contexts and find all the people you're related to, adapting to something like the location and the towns nearby would be a very natural thing to do. But we're not doing this.

Q: I have to say your recognizer handles it better than a previous one I tried, which used to get it wrong — so when you search for it now, it comes out okay. So this is better.

A: Well, the ASR is not at a hundred percent accuracy.

Q: Which engine do you use?

A: Actually, we use several, including Nuance's and Google's.
Q: Thanks for the talk. I was also wondering about privacy concerns. I was under the impression that the more I want to interact with this MindMeld assistant, the more I need to be transparent to it with my personal data.
A: Well, I actually have a philosophical reflection on that: as a society, with this technology, we are moving towards what I'm calling the "transparent brain." If you think closely about it, the better we are at collecting data about users and modeling their intentions, the closer we get to a point where you can almost auto-complete your thought — you start typing a query, and it already knows what you might want. Of course that's still a little bit of science fiction, but we're kind of getting there, and I think the way to address it is by being very transparent about this process, and giving you full control over what it is you want to share, and for how long. Because that's really the only way to modulate it. It's not enough to say "I'm going to opt out and not use any of these anticipatory-search systems," because basically they will be unavoidable. So I think what we need to do is have some clear settings about what you want to share with the app and for how long, and then ensure on the back end that that is really the only usage of that information. As an example, we're not recording the voice: the only thing that is permanent in this particular MindMeld application is the articles that you've specifically shared.
Q: Maybe if you were looking at Pedro — if you asked about Pedro and it pulled up his public record — you might see something you wouldn't want to see. So is there any way, when you're looking at your space, to have certain contexts that you're searching within when you bring information back — let's say distinguishing a social setting from some other context?
A: Yes. One of the shortcomings of this little demo is, first of all, that there was only one speaker — it's always more interesting when it's a real conversation — and second, that it wasn't really a wide-ranging conversation about a certain topic, which is where MindMeld excels. Say you want to plan a vacation with some of your friends somewhere: you say, well, we're going to go here, then you explore the different places you can stay, things you can do, and you share that. When you have a long-ranging conversation with a kind of overarching goal, that's where it works best. If you keep switching around, then it becomes more like a Siri-like search that doesn't add much.
Q: Just a quick question: how do you build your pronunciations? If you look at "ASR," you would spell it out, but "ICASSP" you actually say as a word.
A: It's mostly in the lexicon. Certain abbreviations are more typically spelled out letter by letter, like, I guess, "ASR," and some other ones, like "ICASSP," would be spoken as a word. So it goes into the pronunciation lexicon, pretty much.
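That lexicon distinction can be sketched as entries that are either a sequence of letter-name chunks or a single word-like pronunciation. The pronunciations below are rough ad-hoc spellings, not a real phone set, and the heuristic is invented for illustration.

```python
# Sketch of a pronunciation lexicon distinguishing letter-by-letter
# abbreviations ("ASR" -> "a. s. r.") from acronyms spoken as one
# word ("ICASSP"). Pronunciations are rough, not a real phone set.

lexicon = {
    "ASR":    ["ey", "eh s", "aa r"],   # spelled out, one chunk per letter
    "ICASSP": ["ay k ae s p"],          # spoken as a single word
}

def is_spelled_out(word):
    """Heuristic: spelled-out entries are stored as several spoken
    chunks; word-like acronyms as a single chunk."""
    return len(lexicon[word]) > 1

print(is_spelled_out("ASR"))     # True
print(is_spelled_out("ICASSP"))  # False
```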
Any more questions?