So, my name is Daniel, I'm a PhD student at the Technical University of Munich, and I am going to present to you the joint work of my colleagues and me about natural language understanding services and their evaluation. This work is part of a bigger project, a cooperation between our chair and the corporate technology department of Siemens. The project is concerned with social software and is, I would say, very much driven by technology: we try out a lot of new technologies, tools, libraries, and so on, and we also do a lot of prototyping. One of these prototypes happened to be a chatbot, because that's what you do these days if you want to be cool as a corporation.
So this is, on a very abstract level, the architecture we chose for our chatbot. I don't want to go into detail on every point, but I want to highlight two things. The first is that, as you can see, contextual information plays quite an important role in our chatbot. This is because it is also one of the focuses of the project: we are trying to build a context broker which stores, processes, and distributes context information among the different sources and applications. This can be everything like user profiles, information about hardware, preferences, and so on. And why do we think it's important for a chatbot?
Well, the slide shows the pipeline with its three steps, and we think context information can be very helpful in every one of these steps. For example, for the request interpretation: you get a question like "How can I get home from the airport?", and obviously, in order to generate a query out of this, you first have to replace "home" with information like an address or a city. This would be one example where contextual information could be useful.

For me, home is Munich, and from here to Munich you have a lot of different options: you can fly, take the train, or drive. So how do you decide which of these options you want to take? You can always choose the cheapest one, or you can take user preferences into account: maybe I'm afraid of flying, so the chatbot shouldn't suggest a flight, or I don't even have a car, so it shouldn't suggest driving. That's another point where contextual information could be useful.

The same holds for the message generation: on a very high level, in which language do I want to have the output, or on which device am I receiving the message? If it's a watch, the answer has to be very short, and so on.
So contextual information plays a very important role, but actually that's not what I want to talk about today. Today I want to focus on this part: how can I analyse incoming requests? Here we have an example: "How can I get from Munich to the airport?". What do we actually want to extract from this? That would be the first question. I think what we first need is: what is the user actually talking about, what is the task? In this case that would be "find a connection". The other important pieces of information are that I want to start somewhere, in this case Munich, and I want to travel to somewhere, in this case the airport.

When we map this to the concepts of natural language understanding services, nearly all of them use intents and entities as their concepts. An intent is basically a label for a whole message; in this case, the intent would be "find connection". Entities are labels for parts of the message: that can be a word, a character, multiple words, multiple characters, whatever.
I can then define different entity types. For this example, I could define an entity type "start" and an entity type "destination", and what I would want from a natural language understanding service is: when I put in something like this, I get this information out, the intent and the entities.

That's actually how all of these services work. You can train all of them through a web interface, where you basically do what you can see here: you mark the words, select the intent, and so on. But if you want to train with a lot of data, you obviously don't want to do all of this through the web interface, so most of them also offer a batch import function. What you see here is actually the data format of Microsoft LUIS, but they all look kind of similar.
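To make the idea concrete, here is a rough sketch of what one training example in such a batch-import format looks like. The field names and the inclusive character offsets below loosely follow the LUIS convention, but they are only illustrative:

```python
# Illustrative training example in a LUIS-like batch-import format:
# one intent label for the whole utterance, plus entity spans given as
# inclusive character offsets into the text.
example = {
    "text": "how can i get from munich to the airport",
    "intent": "FindConnection",
    "entities": [
        {"entity": "StationStart", "startPos": 19, "endPos": 24},  # "munich"
        {"entity": "StationDest",  "startPos": 33, "endPos": 39},  # "airport"
    ],
}

def entity_text(ex, ent):
    # Offsets are inclusive on both ends in this convention, hence the +1.
    return ex["text"][ent["startPos"]:ent["endPos"] + 1]
```

Calling `entity_text(example, example["entities"][0])` gives back the substring `"munich"`, which is a quick sanity check that the offsets in an imported corpus actually line up with the text.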
Okay, I already mentioned Microsoft LUIS, and there are a lot of other popular services out there; I think these are probably the most popular ones at the moment. When we started to implement our prototype, we asked ourselves: which of these should we use? Has anybody here ever used one of them? Okay, and has anyone ever tried multiple of them? And how did you decide which one to use?

We didn't know how to choose, so the first thing we did was look into recent publications, because quite a few people are using these services these days. From this year alone you can find quite a few papers using one of them, but none of these papers actually says "we chose this one because of ..."; they just say "we use this". And we wanted to know why.

We also asked our industry partner; they also use different services, in different divisions, for all kinds of tasks. Their answer was usually "well, we have a contract with this company anyway" or "we got it for free, so we are using it". These may be valid reasons, but still, we thought that's not enough. We wanted to know which service is better, which service has the better classification quality, to make a more educated decision about which service to use. So what we wanted to do is compare all of them, and how do you do that? You train them all with the same data and test them all with the same data.
Unfortunately, we were not able to compare all of them. When we started, Amazon Lex was still in closed beta; I don't know, maybe that has changed by today, but at that point in time they didn't offer an import function, so you had to mark everything in the web interface, and we couldn't, or didn't want to, do that. wit.ai offers a batch import function, but it was not working with external data: you could only re-import data that had been exported from wit.ai before. According to their issue tracker it's a known bug, although I'm not sure if it's really a bug or a feature to lock people in.
I already said that they all have kind of similar-looking data formats, but of course they are still somewhat different: some use just one file, some distribute the information over different files, some denote entity positions by character offsets, some by words, and so on. Because we wanted to automate the process as much as possible, we implemented a small converter which takes a generic representation that we use for our corpora and converts it to the different import formats.
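The converter idea can be sketched roughly like this. The generic input representation (half-open character spans), both target schemas, and all field names are assumptions for illustration, not the services' actual formats, which differ in exactly this kind of detail:

```python
# Sketch of the corpus converter: one generic representation in,
# several service-specific flavours out.

def to_char_offset_format(example):
    """Target format with inclusive character offsets (LUIS-like)."""
    return {
        "text": example["text"],
        "intent": example["intent"],
        "entities": [
            {"entity": e["type"], "startPos": e["start"], "endPos": e["end"] - 1}
            for e in example["entities"]
        ],
    }

def to_token_index_format(example):
    """Target format that labels whole tokens instead of characters."""
    text = example["text"]
    tokens = text.split()
    starts, pos = [], 0            # character start offset of every token
    for tok in tokens:
        pos = text.index(tok, pos)
        starts.append(pos)
        pos += len(tok)

    def token_at(char_pos):        # index of the token containing a character offset
        return max(i for i, s in enumerate(starts) if s <= char_pos)

    return {
        "text": text,
        "intent": example["intent"],
        "entities": [
            {"type": e["type"],
             "first_token": token_at(e["start"]),
             "last_token": token_at(e["end"] - 1)}
            for e in example["entities"]
        ],
    }
```

The interesting part is `to_token_index_format`: converting between character-based and word-based entity positions is exactly the kind of mismatch that makes the service formats incompatible with each other.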
One thing that is maybe also interesting: out of these services, there are three which are free, API.ai, wit.ai, and RASA. The first two are free as in free of charge, and RASA is free as in freedom, because it's open-source software. Another nice thing about RASA is that it works with the import formats of all the other services. That means when you switch from one of the commercial services to RASA, you don't have to do any conversion work; you can just copy all your data over.
Once we had converted the data into the right formats, we used the APIs of the services to train them. For the commercial services, that takes just five or ten minutes, and you can do it entirely through the API; for RASA you have to do it on the command line, and for roughly four hundred training instances you can assume it takes about one hour on a reasonable desktop machine.
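As an illustration of how little glue code the training step needs: a training call might look like the sketch below. The endpoint, payload shape, and authentication header are purely hypothetical placeholders; every real service defines its own API, and RASA was trained from the command line instead.

```python
# Hypothetical training call over HTTP. The URL, payload, and auth header
# are illustrative placeholders only, not any real service's API.
import json
import urllib.request

def build_training_payload(examples):
    # Pure helper, so the payload can be inspected without a network call.
    return json.dumps({"examples": examples}).encode("utf-8")

def train_service(base_url, api_key, examples):
    req = urllib.request.Request(
        f"{base_url}/train",                      # hypothetical endpoint
        data=build_training_payload(examples),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:     # blocks until training is accepted
        return json.load(resp)
```

Keeping the payload construction separate from the network call makes it easy to run the same converted corpus against several such wrappers, one per service.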
Then we did the same in the other direction: we took the test data from our corpus, sent it to all the different APIs, stored the resulting annotations, and then compared them to our gold standard.
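This test direction can be sketched as a small loop. Here `query` stands in for whatever per-service API call returns the annotations, and comparing only the intent by exact match is a simplification of the full label comparison:

```python
# Sketch of the test direction: send held-out utterances to a service,
# store the returned annotations, and compare them to the gold standard.
# `query` is a hypothetical callable wrapping one service's API.

def evaluate_intents(query, test_set):
    """Return intent accuracy plus the stored annotations."""
    stored = []
    correct = 0
    for example in test_set:
        predicted = query(example["text"])   # one API call per utterance
        stored.append(predicted)             # kept for later error analysis
        if predicted.get("intent") == example["intent"]:
            correct += 1
    return correct / len(test_set), stored
```

Storing every returned annotation, not just the score, is what later makes the per-domain error analysis (for example Watson's from/to confusion) possible.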
As for the corpora, we used two of them. One was obtained through the chatbot that we built before: it was a working Telegram chatbot for public transport in Munich, and the data was manually checked by us. So we had 206 requests from the chatbot, with two different intents and five entity types, which means we have a lot of data per intent and less per entity type. This data was interesting because it's very natural: real users used the chatbot, so it is hopefully linguistically comparable to what you would receive with a chatbot. But in terms of the domain, Siemens was more interested in a technical domain, and that's why we had a second corpus,
which we collected from StackExchange. All programmers probably know StackOverflow, and StackExchange has a bunch of different platforms for different topics. We took questions from their platform for Web Applications and from another platform called Ask Ubuntu, which is about questions about Ubuntu. These were tagged with Amazon Mechanical Turk, and the StackExchange corpus is available online; you can find it there.

In the corpus you can also find the answers to these questions, because we only took questions which have an accepted answer. We are not using these utterances for our evaluation, but they might be useful for somebody else in the future. We also took the highest-ranked questions, because we assume that they have somewhat good quality.
How did we do the annotation on Mechanical Turk? Well, we basically modelled the interface that all these services offer: we presented a sentence, and the workers could highlight different parts as entities and choose from a predefined list of intents. We also asked them to rate how confident they were about their annotation, and we only took into account annotations which were at least somewhat confident and for which we could find an inter-annotator agreement of more than sixty percent.
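As a simplification, if "inter-annotator agreement" is read as the share of annotators who picked the majority label (the actual measure used may differ), the filter looks like this:

```python
# Sketch of the agreement filter, assuming agreement = fraction of
# annotators who chose the most common label for an item.
from collections import Counter

def majority_agreement(labels):
    """Fraction of annotators agreeing with the most common label."""
    (label, count), = Counter(labels).most_common(1)
    return count / len(labels)

def keep_annotation(labels, threshold=0.6):
    # Keep only items where agreement exceeds the threshold.
    return majority_agreement(labels) > threshold
```

With three annotators, a 2-vs-1 split (agreement 0.67) passes the sixty-percent threshold, while a 1-vs-1 split between two annotators (0.5) does not.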
This is what we got out of it: the distribution of intents and entities. The actual numbers are not so important, but if you look at them, you can see that there are entities with more training data and entities with less training data, so we have some variety in there, although of course in total it is still a rather small dataset.
Before we started our evaluation, we had three main hypotheses. The first one might sound obvious, but it was still the reason why we did all this: we assume that you should think about which of these services you choose, and not just because of pricing, but because of the quality of the annotations. We also assumed that the commercial products would overall perform better; after all, they probably have hundreds of thousands of users feeding them with data. Therefore we also thought that especially for entities and intents with not much training data, they should be better: RASA uses MITIE as its machine-learning backend, which comes with about three hundred megabytes of initial data, so you would assume that, where little training data is provided, LUIS, Watson, and API.ai have a lot more data to start with. And we also thought that the quality of the labels is influenced by the domain: if one service is good on the corpus about public transport, that doesn't necessarily mean it is also good on the other corpora.
This is, on a very high level, the result of our evaluation. What you can see here is the blue bar, which is LUIS. The measure is the F-score across all labels, so intents and entities combined; in the paper you can find a broken-down version of it.
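For reference, a combined score of this kind can be computed as a micro-averaged F1 over the pooled intent and entity labels. This is a minimal sketch of that idea, not necessarily the paper's exact evaluation code:

```python
# Minimal micro-averaged F1 over all labels (intents and entities pooled).
# gold/predicted: lists of label sets, one set per utterance.

def micro_f1(gold, predicted):
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels found in both gold and prediction
        fp += len(p - g)   # predicted labels not in the gold standard
        fn += len(g - p)   # gold labels the service missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Micro-averaging pools all label decisions before computing precision and recall, so labels with many examples dominate the score; the broken-down version in the paper shows the per-label behaviour that this single number hides.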
So, for the guys from Microsoft, congratulations: LUIS was best on every domain. What was surprising for us is that RASA came second; across all the domains it has the second-best performance, which we really didn't expect.

If you look into the details, you can also find some interesting reasons why on some domains some services do worse. For example, Watson was very bad, compared to the others, on the public transport data. We only used examples containing "from" and "to", and obviously the same station names can occur after "from" and after "to". Watson was the only service that was not able to distinguish between the two: whether you write "from Munich to the airport" or "from the airport to Munich", Watson always gave both station names both labels, "from" and "to". So this is one example of a reason why we see different performances on different domains.
So what are the key findings of our evaluation? Well, as I said, LUIS performs best in all the domains we tested, and RASA is second best. An interesting point: if you look at intents and entities with not much training data, there is no difference, so RASA is not better or worse on them than the commercial services. It seems that the initial training data which is already there has no big influence. And you see that the domain matters, but the question is how much, because LUIS still performs best in all domains.

That's kind of the question: can we now say "okay, you should always use LUIS"? I would say no. You still have to try it with your domain, with your data, to find out which service is the best for you. Also, services might change without you noticing, and that's why I think it is very useful to automate this pipeline with scripts, as we did: then you can run it on all the services and even redo it regularly to find out which service is the best for you.
One interesting question which arose from these findings is whether the commercial services really benefit that much from user data, because when we talked with industry partners, that was one of their main concerns: we pay them in money and we pay them in data. I'm not really sure about this, at least for the user-defined entities. If I define my own entity called "start" and I label a thousand examples with it, how is that useful for any of these services? It's my user-defined label; what are they able to extract from it? Maybe that's the reason why we don't see what we expected, namely that on entity types and intents with little training data they do not perform significantly better.

Thank you.
Okay, so we have about five minutes for questions.

Q: The experiments were great. Full disclosure, I'm one of the creators of RASA, so I'm slightly biased. Did you go and tweak any of the hyperparameters in RASA, or did you just use the defaults?
A: No, we used the defaults.
Q: I think you could maybe squeeze out some more performance.
A: Sure.
Q: Thanks for the talk. This is more of a comment than a question: it seems there's almost a baseline lacking, something like a PhD student spending a week trying to get the accuracy with a standard approach, because these services are really designed for people who are not that technical. Maybe you could take a slightly more standard machine-learning setup and see how well you can do without these services, to see how much they are actually helping you; right now you can compare them against each other, but not against the accuracy you should be able to get.
A: Yes, that's a fair point.
Q: I absolutely loved this, and I'm very appreciative that some independent party is taking the time to evaluate these services. Some of them, like LUIS and possibly the others, have something like active learning: they'll suggest utterances you might want to go and label once you've collected some utterances. If I understood the evaluation correctly, you haven't done that here; you have a fixed training set. I'm curious whether you have looked at that aspect of the services. Any comments?
A: There are a lot of other aspects which we didn't look at, and this is one of them. Another point is that a lot of these services also have built-in entity types, so you get pre-trained entity types for locations, phone numbers, and so on, and I think that's also something you can benefit a lot from when you use them. For Siemens we also did a comparison of the functionalities, since some of them already include giving responses, canned responses, and so on. But here we really just had a fixed dataset and only did this evaluation on it, because if you do it with the suggestions, you have to do it through the web interface, and that means labelling five hundred utterances on all systems. That is something that might be interesting in the future, but it takes more time.
Do we have any other questions? We have about two minutes left.

Q: Okay, I have a question. This is a chatbot session, so could you elaborate on the relationship between this work and chatbots?
A: Well, as I said, I think this is one of the parts, or can be one of the useful parts, if you want to develop a chatbot. And what we see in typical work is that people use all these different services, and if you just evaluate your chatbot as a whole, end to end, then you might be influenced by these results without knowing it: your chatbot might perform better just because you changed your natural language understanding service. So I think it is important to know about these things, to think about them, and, if you do an evaluation of a chatbot as a whole system, to take them into account. I also think that, from an industry perspective, these services are one of the reasons why chatbots became so popular recently, because it is really easy: there are other services, not quite as popular, which let you click together a whole chatbot without programming a single line of code, and here you can at least build one without having any knowledge about language processing or machine learning whatsoever. So I think it is especially important for this type of development, and it influences it a lot.

Okay, our time is up, so let's thank the speaker again.