So, my name is Daniel, I'm a PhD student at the Technical University of Munich, and I am going to present to you the joint work of my colleagues and me about natural language understanding services and their evaluation. This work is part of a bigger project, a cooperation between our chair and the corporate technology department of Siemens. The project is concerned with social software and is, I would say, very much driven by technology: we try out a lot of new technologies, tools, libraries, and so on, and we also do a lot of prototyping. One of these prototypes happened to be a chatbot, because that's what you do these days if you want to be cool as a corporation.
So this is, on a very abstract level, the architecture we chose for our chatbot. I don't want to go into detail on every point, but I want to highlight two things. The first is that, as you can see, contextual information plays quite an important role in our chatbot. This is because it is also one of the focuses of the project: we are trying to build a context broker which stores, processes, and distributes context information among the different sources and applications. This can be everything like user profiles, information about hardware, preferences, and so on. And why do we think it's important for a chatbot?
Well, the slide shows the pipeline with its three steps, and we think context information can be very helpful in every one of these steps. For example, for the request interpretation: you get a question like "How can I get home from the airport?", and obviously, in order to generate a query out of this, you first have to replace "home" with information like an address or a city. This would be one example where contextual information could be useful.

For me, home is Munich, and from here to Munich you have a lot of different options: you can fly, take the train, or drive. So how do you decide which of these options you want to take? You can always choose the cheapest one, or you can take user preferences into account: maybe I'm afraid of flying, so the chatbot shouldn't suggest a flight, or I don't even have a car, so it shouldn't suggest driving. That's another point where contextual information could be useful.

The same holds for the message generation: on a very high level, in which language do I want to have the output, or on which device am I receiving the message? If it's a watch, the answer has to be very short, and so on.
So contextual information plays a very important role, but actually that's not what I want to talk about today. Today I want to focus on this part: how can I analyse incoming requests? Here we have an example: "How can I get from Munich to the airport?". What do we actually want to extract from this? That would be the first question. I think what we first need is: what is the user actually talking about, what is the task? In this case that would be "find a connection". The other important pieces of information are that I want to start somewhere, in this case Munich, and I want to travel to somewhere, in this case the airport.

When we map this to the concepts of natural language understanding services, nearly all of them use intents and entities as their concepts. An intent is basically a label for a whole message; in this case, the intent would be "find connection". Entities are labels for parts of the message: that can be a word, a character, multiple words, multiple characters, whatever.
I can then define different entity types. For this example, I could define an entity type "start" and an entity type "destination", and what I would want from a natural language understanding service is: when I put in something like this, I get this information out, the intent and the entities.

That's actually how all of these services work. You can train all of them through a web interface, where you basically do what you can see here: you mark the words, select the intent, and so on. But if you want to train with a lot of data, you obviously don't want to do all of this through the web interface, so most of them also offer a batch import function. What you see here is actually the data format of Microsoft LUIS, but they all look kind of similar.
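To make the idea concrete, here is a rough sketch of what one training example in such a batch-import format looks like. The field names and the inclusive character offsets below loosely follow the LUIS convention, but they are only illustrative:

```python
# Illustrative training example in a LUIS-like batch-import format:
# one intent label for the whole utterance, plus entity spans given as
# inclusive character offsets into the text.
example = {
    "text": "how can i get from munich to the airport",
    "intent": "FindConnection",
    "entities": [
        {"entity": "StationStart", "startPos": 19, "endPos": 24},  # "munich"
        {"entity": "StationDest",  "startPos": 33, "endPos": 39},  # "airport"
    ],
}

def entity_text(ex, ent):
    # Offsets are inclusive on both ends in this convention, hence the +1.
    return ex["text"][ent["startPos"]:ent["endPos"] + 1]
```

Calling `entity_text(example, example["entities"][0])` gives back the substring `"munich"`, which is a quick sanity check that the offsets in an imported corpus actually line up with the text.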
Okay, I already mentioned Microsoft LUIS, and there are a lot of other popular services out there; I think these are probably the most popular ones at the moment. When we started to implement our prototype, we asked ourselves: which of these should we use? Has anybody here ever used one of them? Okay, and has anyone ever tried multiple of them? And how did you decide which one to use?

We didn't know how to choose, so the first thing we did was look into recent publications, because quite a few people are using these services these days. From this year alone you can find quite a few papers using one of them, but none of these papers actually says "we chose this one because of ..."; they just say "we use this". And we wanted to know why.

We also asked our industry partner; they also use different services, in different divisions, for all kinds of tasks. Their answer was usually "well, we have a contract with this company anyway" or "we got it for free, so we are using it". These may be valid reasons, but still, we thought that's not enough. We wanted to know which service is better, which service has the better classification quality, to make a more educated decision about which service to use. So what we wanted to do is compare all of them, and how do you do that? You train them all with the same data and test them all with the same data.
Unfortunately, we were not able to compare all of them. When we started, Amazon Lex was still in closed beta; I don't know, maybe that has changed by today, but at that point in time they didn't offer an import function, so you had to mark everything in the web interface, and we couldn't, or didn't want to, do that. wit.ai offers a batch import function, but it was not working with external data: you could only re-import data that had been exported from wit.ai before. According to their issue tracker it's a known bug, although I'm not sure if it's really a bug or a feature to lock people in.
I already said that they all have kind of similar-looking data formats, but of course they are still somewhat different: some use just one file, some distribute the information over different files, some denote entity positions by character offsets, some by words, and so on. Because we wanted to automate the process as much as possible, we implemented a small converter which takes a generic representation that we use for our corpora and converts it to the different import formats.
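The converter idea can be sketched roughly like this. The generic input representation (half-open character spans), both target schemas, and all field names are assumptions for illustration, not the services' actual formats, which differ in exactly this kind of detail:

```python
# Sketch of the corpus converter: one generic representation in,
# several service-specific flavours out.

def to_char_offset_format(example):
    """Target format with inclusive character offsets (LUIS-like)."""
    return {
        "text": example["text"],
        "intent": example["intent"],
        "entities": [
            {"entity": e["type"], "startPos": e["start"], "endPos": e["end"] - 1}
            for e in example["entities"]
        ],
    }

def to_token_index_format(example):
    """Target format that labels whole tokens instead of characters."""
    text = example["text"]
    tokens = text.split()
    starts, pos = [], 0            # character start offset of every token
    for tok in tokens:
        pos = text.index(tok, pos)
        starts.append(pos)
        pos += len(tok)

    def token_at(char_pos):        # index of the token containing a character offset
        return max(i for i, s in enumerate(starts) if s <= char_pos)

    return {
        "text": text,
        "intent": example["intent"],
        "entities": [
            {"type": e["type"],
             "first_token": token_at(e["start"]),
             "last_token": token_at(e["end"] - 1)}
            for e in example["entities"]
        ],
    }
```

The interesting part is `to_token_index_format`: converting between character-based and word-based entity positions is exactly the kind of mismatch that makes the service formats incompatible with each other.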
One thing that is maybe also interesting: out of these services, there are three which are free, API.ai, wit.ai, and RASA. The first two are free as in free of charge, and RASA is free as in freedom, because it's open-source software. Another nice thing about RASA is that it works with the import formats of all the other services. That means when you switch from one of the commercial services to RASA, you don't have to do any conversion work; you can just copy all your data over.
Once we had converted the data into the right formats, we used the APIs of the services to train them. For the commercial services, that takes just five or ten minutes, and you can do it entirely through the API; for RASA you have to do it on the command line, and for roughly four hundred training instances you can assume it takes about one hour on a reasonable desktop machine.
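As an illustration of how little glue code the training step needs: a training call might look like the sketch below. The endpoint, payload shape, and authentication header are purely hypothetical placeholders; every real service defines its own API, and RASA was trained from the command line instead.

```python
# Hypothetical training call over HTTP. The URL, payload, and auth header
# are illustrative placeholders only, not any real service's API.
import json
import urllib.request

def build_training_payload(examples):
    # Pure helper, so the payload can be inspected without a network call.
    return json.dumps({"examples": examples}).encode("utf-8")

def train_service(base_url, api_key, examples):
    req = urllib.request.Request(
        f"{base_url}/train",                      # hypothetical endpoint
        data=build_training_payload(examples),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:     # blocks until training is accepted
        return json.load(resp)
```

Keeping the payload construction separate from the network call makes it easy to run the same converted corpus against several such wrappers, one per service.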
Then we did the same in the other direction: we took the test data from our corpus, sent it to all the different APIs, stored the resulting annotations, and then compared them to our gold standard.
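This test direction can be sketched as a small loop. Here `query` stands in for whatever per-service API call returns the annotations, and comparing only the intent by exact match is a simplification of the full label comparison:

```python
# Sketch of the test direction: send held-out utterances to a service,
# store the returned annotations, and compare them to the gold standard.
# `query` is a hypothetical callable wrapping one service's API.

def evaluate_intents(query, test_set):
    """Return intent accuracy plus the stored annotations."""
    stored = []
    correct = 0
    for example in test_set:
        predicted = query(example["text"])   # one API call per utterance
        stored.append(predicted)             # kept for later error analysis
        if predicted.get("intent") == example["intent"]:
            correct += 1
    return correct / len(test_set), stored
```

Storing every returned annotation, not just the score, is what later makes the per-domain error analysis (for example Watson's from/to confusion) possible.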
As for the corpora, we used two of them. One was obtained through the chatbot that we built before: it was a working Telegram chatbot for public transport in Munich, and the data was manually checked by us. So we had 206 requests from the chatbot, with two different intents and five entity types, which means we have a lot of data per intent and less per entity type. This data was interesting because it's very natural: real users used the chatbot, so it is hopefully linguistically comparable to what you would receive with a chatbot. But in terms of the domain, Siemens was more interested in a technical domain, and that's why we had a second corpus,
which we collected from StackExchange. All programmers probably know StackOverflow, and StackExchange has a bunch of different platforms for different topics. We took questions from their platform for Web Applications and from another platform called Ask Ubuntu, which is about questions about Ubuntu. These were tagged with Amazon Mechanical Turk, and the StackExchange corpus is available online; you can find it there.

In the corpus you can also find the answers to these questions, because we only took questions which have an accepted answer. We are not using these utterances for our evaluation, but they might be useful for somebody else in the future. We also took the highest-ranked questions, because we assume that they have somewhat good quality.
How did we do the annotation on Mechanical Turk? Well, we basically modelled the interface that all these services offer: we presented a sentence, and the workers could highlight different parts as entities and choose from a predefined list of intents. We also asked them to rate how confident they were about their annotation, and we only took into account annotations which were at least somewhat confident and for which we could find an inter-annotator agreement of more than sixty percent.
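As a simplification, if "inter-annotator agreement" is read as the share of annotators who picked the majority label (the actual measure used may differ), the filter looks like this:

```python
# Sketch of the agreement filter, assuming agreement = fraction of
# annotators who chose the most common label for an item.
from collections import Counter

def majority_agreement(labels):
    """Fraction of annotators agreeing with the most common label."""
    (label, count), = Counter(labels).most_common(1)
    return count / len(labels)

def keep_annotation(labels, threshold=0.6):
    # Keep only items where agreement exceeds the threshold.
    return majority_agreement(labels) > threshold
```

With three annotators, a 2-vs-1 split (agreement 0.67) passes the sixty-percent threshold, while a 1-vs-1 split between two annotators (0.5) does not.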
This is what we got out of it: the distribution of intents and entities. The actual numbers are not so important, but if you look at them, you can see that there are entities with more training data and entities with less training data, so we have some variety in there, although of course in total it is still a rather small dataset.
Before we started our evaluation, we had three main hypotheses. The first one might sound obvious, but it was still the reason why we did all this: we assume that you should think about which of these services you choose, and not just because of pricing, but because of the quality of the annotations. We also assumed that the commercial products would overall perform better; after all, they probably have hundreds of thousands of users feeding them with data. Therefore we also thought that especially for entities and intents with not much training data, they should be better: RASA uses MITIE as its machine-learning backend, which comes with about three hundred megabytes of initial data, so you would assume that, where little training data is provided, LUIS, Watson, and API.ai have a lot more data to start with. And we also thought that the quality of the labels is influenced by the domain: if one service is good on the corpus about public transport, that doesn't necessarily mean it is also good on the other corpora.
This is, on a very high level, the result of our evaluation. What you can see here is the blue bar, which is LUIS. The measure is the F-score across all labels, so intents and entities combined; in the paper you can find a broken-down version of it.
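For reference, a combined score of this kind can be computed as a micro-averaged F1 over the pooled intent and entity labels. This is a minimal sketch of that idea, not necessarily the paper's exact evaluation code:

```python
# Minimal micro-averaged F1 over all labels (intents and entities pooled).
# gold/predicted: lists of label sets, one set per utterance.

def micro_f1(gold, predicted):
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels found in both gold and prediction
        fp += len(p - g)   # predicted labels not in the gold standard
        fn += len(g - p)   # gold labels the service missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Micro-averaging pools all label decisions before computing precision and recall, so labels with many examples dominate the score; the broken-down version in the paper shows the per-label behaviour that this single number hides.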
So, for the guys from Microsoft, congratulations: LUIS was best on every domain. What was surprising for us is that RASA came second; across all the domains it has the second-best performance, which we really didn't expect.

If you look into the details, you can also find some interesting reasons why on some domains some services do worse. For example, Watson was very bad, compared to the others, on the public transport data. We only used examples containing "from" and "to", and obviously the same station names can occur after "from" and after "to". Watson was the only service that was not able to distinguish between the two: whether you write "from Munich to the airport" or "from the airport to Munich", Watson always gave both station names both labels, "from" and "to". So this is one example of a reason why we see different performances on different domains.
So what are the key findings of our evaluation? Well, as I said, LUIS performs best in all the domains we tested, and RASA is second best. An interesting point: if you look at intents and entities with not much training data, there is no difference, so RASA is not better or worse on them than the commercial services. It seems that the initial training data which is already there has no big influence. And you see that the domain matters, but the question is how much, because LUIS still performs best in all domains.

That's kind of the question: can we now say "okay, you should always use LUIS"? I would say no. You still have to try it with your domain, with your data, to find out which service is the best for you. Also, services might change without you noticing, and that's why I think it is very useful to automate this pipeline with scripts, as we did: then you can run it on all the services and even redo it regularly to find out which service is the best for you.
One interesting question which arose from these findings is whether the commercial services really benefit that much from user data, because when we talked with industry partners, that was one of their main concerns: we pay them in money and we pay them in data. I'm not really sure about this, at least for the user-defined entities. If I define my own entity called "start" and I label a thousand examples with it, how is that useful for any of these services? It's my user-defined label; what are they able to extract from it? Maybe that's the reason why we don't see what we expected, namely that on entity types and intents with little training data they do not perform significantly better.

Thank you.
Okay, so we have about five minutes for questions.

Q: The experiments were great. Full disclosure, I'm one of the creators of RASA, so I'm slightly biased. Did you go and tweak any of the hyperparameters in RASA, or did you just use the defaults?
A: No, we used the defaults.
Q: I think you could maybe squeeze out some more performance.
A: Sure.
Q: Thanks for the talk. This is more of a comment than a question: it seems there's almost a baseline lacking, something like a PhD student spending a week trying to get the accuracy with a standard approach, because these services are really designed for people who are not that technical. Maybe you could take a slightly more standard machine-learning setup and see how well you can do without these services, to see how much they are actually helping you; right now you can compare them against each other, but not against the accuracy you should be able to get.
A: Yes, that's a fair point.
Q: I absolutely loved this, and I'm very appreciative that some independent party is taking the time to evaluate these services. Some of them, like LUIS and possibly the others, have something like active learning: they'll suggest utterances you might want to go and label once you've collected some utterances. If I understood the evaluation correctly, you haven't done that here; you have a fixed training set. I'm curious whether you have looked at that aspect of the services. Any comments?
A: There are a lot of other aspects which we didn't look at, and this is one of them. Another point is that a lot of these services also have built-in entity types, so you get pre-trained entity types for locations, phone numbers, and so on, and I think that's also something you can benefit a lot from when you use them. For Siemens we also did a comparison of the functionalities, since some of them already include giving responses, canned responses, and so on. But here we really just had a fixed dataset and only did this evaluation on it, because if you do it with the suggestions, you have to do it through the web interface, and that means labelling five hundred utterances on all systems. That is something that might be interesting in the future, but it takes more time.
Do we have any other questions? We have about two minutes left.

Q: Okay, I have a question. This is a chatbot session, so could you elaborate on the relationship between this work and chatbots?
A: Well, as I said, I think this is one of the parts, or can be one of the useful parts, if you want to develop a chatbot. And what we see in typical work is that people use all these different services, and if you just evaluate your chatbot as a whole, end to end, then you might be influenced by these results without knowing it: your chatbot might perform better just because you changed your natural language understanding service. So I think it is important to know about these things, to think about them, and, if you do an evaluation of a chatbot as a whole system, to take them into account. I also think that, from an industry perspective, these services are one of the reasons why chatbots became so popular recently, because it is really easy: there are other services, not quite as popular, which let you click together a whole chatbot without programming a single line of code, and here you can at least build one without having any knowledge about language processing or machine learning whatsoever. So I think it is especially important for this type of development, and it influences it a lot.

Okay, our time is up, so let's thank the speaker again.