0:00:16 | so my name is Daniel and I'm a PhD student at the Technical University of Munich |
---|
0:00:21 | and I am going to present the joint work of my colleagues and me |
---|
0:00:25 | about natural language understanding services and their evaluation |
---|
0:00:30 | and this work is part of a bigger project, a cooperation between our chair and |
---|
0:00:34 | the corporate technology department of Siemens |
---|
0:00:37 | and the project is about social software and, I would say, it is very much |
---|
0:00:42 | driven by technology, so we try out a lot of |
---|
0:00:46 | new technologies |
---|
0:00:48 | new libraries and so on, and we also do a lot of prototyping, and one |
---|
0:00:53 | of these prototypes happened to be a chatbot, because |
---|
0:00:56 | that's what you do these days |
---|
0:00:58 | if you want to be cool as a corporation |
---|
0:01:01 | so this is, on a very abstract level, the architecture we chose for our |
---|
0:01:07 | chatbot, and I don't want to go into detail on every point, but I want |
---|
0:01:12 | to highlight two things. The first one is, you can see that contextual |
---|
0:01:15 | information |
---|
0:01:17 | plays quite an important role in our chatbot |
---|
0:01:20 | this is because it is also one of the focuses of the project |
---|
0:01:25 | because we also try to build |
---|
0:01:26 | a context server which |
---|
0:01:29 | stores, processes, and distributes |
---|
0:01:31 | context information among different sources and applications, and this can be everything like user |
---|
0:01:39 | profiles |
---|
0:01:41 | information about hardware, or preferences, and so on |
---|
0:01:45 | and why do we think it's important for chatbots as well? |
---|
0:01:48 | well, here you see the pipeline with the three steps |
---|
0:01:52 | and we think |
---|
0:01:54 | context information can be very helpful in every one of these steps. So, for example, for |
---|
0:01:59 | the request interpretation |
---|
0:02:01 | you get a question like 'how can I get home from |
---|
0:02:05 | the airport' |
---|
0:02:06 | and then obviously in order to generate a query out of this you first |
---|
0:02:11 | have to replace 'home' with information like an address or city, so this would be |
---|
0:02:17 | one example where contextual information could be useful |
---|
0:02:21 | then also |
---|
0:02:22 | so for me, home is Munich |
---|
0:02:24 | so from the airport to Munich you have a lot of different options: you can |
---|
0:02:28 | fly, take the train, |
---|
0:02:30 | or you can drive |
---|
0:02:31 | and so how do you select which of these options you want to take? |
---|
0:02:38 | and well, how do you decide which of these options you want to take? |
---|
0:03:05 | so you have a lot of options, and how do you choose? |
---|
0:03:09 | you could always just choose the cheapest one |
---|
0:03:12 | or you can take into account user preferences: maybe I'm afraid of flying, |
---|
0:03:17 | so the chatbot shouldn't suggest a flight, or |
---|
0:03:22 | I don't even have a car, so it shouldn't suggest driving |
---|
0:03:25 | and that's just another point where contextual information could be useful |
---|
0:03:29 | and the same holds for the message generation: on a very high level, in which language |
---|
0:03:33 | do I want to have the output, or on which device |
---|
0:03:36 | am I receiving the message, or language style; so if it's a watch, the message has to |
---|
0:03:41 | be very short, and so on |
---|
0:03:42 | so contextual information plays a very important role, but actually that's not what I want |
---|
0:03:47 | to talk about today. Today I want to focus on this |
---|
0:03:51 | part: |
---|
0:03:53 | how can I analyse incoming requests? |
---|
0:03:58 | so here we have an example: 'how can I get from Munich to the |
---|
0:04:01 | airport' |
---|
0:04:02 | so, what do we actually want to extract from this? That would be the first question |
---|
0:04:07 | so |
---|
0:04:08 | I think what would be useful is, we first need to know somehow |
---|
0:04:13 | what the user is actually talking about, what the task is |
---|
0:04:17 | and this would be 'find connection' |
---|
0:04:20 | and then the other important things are, I want to start somewhere |
---|
0:04:25 | in this case Munich, and I want to travel to somewhere |
---|
0:04:28 | and this is something like |
---|
0:04:31 | a concept. So when we map this to the concepts of natural language understanding services, |
---|
0:04:37 | nearly all of them use intents and entities as those concepts. An intent is basically |
---|
0:04:43 | a label for a whole message |
---|
0:04:46 | in this case the intent would be |
---|
0:04:48 | 'find connection', and entities are labels for parts of the message; it can be a |
---|
0:04:54 | word, it can be a character, multiple words, multiple characters, |
---|
0:04:57 | whatever |
---|
0:04:59 | and then I can define different entity types |
---|
0:05:02 | so for this example I could |
---|
0:05:05 | define |
---|
0:05:06 | an entity type 'start' and an entity type 'destination', and what I would want to have |
---|
0:05:12 | from a natural language understanding service is, when I put |
---|
0:05:16 | in something like this |
---|
0:05:18 | I get this information |
---|
0:05:20 | the intent and the entities |
---|
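The intent and entity output described above can be sketched as a small example. This is only an illustration: the names `FindConnection`, `StartLocation`, and `Destination`, and the dict layout, are assumptions, not any particular service's actual response schema.

```python
# Hypothetical NLU result for the example utterance; the intent and entity
# names and the dict layout are illustrative, not a real service's schema.
message = "how can I get from Munich to the airport"

nlu_result = {
    "text": message,
    "intent": "FindConnection",   # one label for the whole message
    "entities": [                 # labels for parts of the message
        {"type": "StartLocation", "value": "Munich"},
        {"type": "Destination", "value": "the airport"},
    ],
}

# A chatbot would then read out the slots it needs to build a query:
start = next(e["value"] for e in nlu_result["entities"]
             if e["type"] == "StartLocation")
print(start)  # Munich
```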
0:05:24 | and that's actually how all of them work, so |
---|
0:05:28 | you can train all of them through a web interface and |
---|
0:05:31 | you basically do what you can see here: you mark the words, you |
---|
0:05:34 | select the intent, and so on |
---|
0:05:36 | you also have a more automated option, so |
---|
0:05:41 | if you want to train with a lot of data, you obviously don't want to do |
---|
0:05:46 | all of this through the web interface, so most of them also offer a batch |
---|
0:05:51 | import function, and this is actually the data format of Microsoft LUIS |
---|
0:05:59 | but they all look kind of similar |
---|
0:06:03 | okay, so I already mentioned Microsoft LUIS, and there are a lot of other |
---|
0:06:10 | popular services out there; I think these are probably the most popular ones at the moment |
---|
0:06:15 | so when we started to implement our prototype we asked ourselves |
---|
0:06:21 | which of these should we use |
---|
0:06:23 | and has anybody here ever used one of them? |
---|
0:06:29 | okay, so has anyone ever tried multiple of them? |
---|
0:06:35 | and |
---|
0:06:36 | maybe, how did you decide which one to use? |
---|
0:06:40 | okay so |
---|
0:06:43 | so we didn't know how to choose, so the first thing we did was look into |
---|
0:06:48 | recent publications because actually |
---|
0:06:51 | quite a few people are using them |
---|
0:06:54 | these days. So from this year alone you can find quite some papers using one of |
---|
0:06:59 | them |
---|
0:07:00 | but none of these papers actually says 'okay, we chose this service because of these reasons'; they |
---|
0:07:05 | just say we use this |
---|
0:07:07 | and we wanted to know why |
---|
0:07:09 | so we also asked our industry partner, and they also use, in |
---|
0:07:13 | different |
---|
0:07:14 | divisions, different services, and we also asked |
---|
0:07:16 | other industry partners |
---|
0:07:18 | and their answer was usually |
---|
0:07:20 | well we have a contract with this company anyway or we got it for free |
---|
0:07:24 | so we are using it |
---|
0:07:26 | and well |
---|
0:07:28 | those were valid reasons, but still we thought |
---|
0:07:32 | that's not enough |
---|
0:07:34 | we wanted to know which service is better |
---|
0:07:38 | which service has the better classification |
---|
0:07:40 | to make a more educated decision about which service we want to use. So what we |
---|
0:07:44 | want to do is compare all of them |
---|
0:07:48 | and how do you do that? You train them all with the same data and test |
---|
0:07:52 | them all |
---|
0:07:52 | with the same data |
---|
0:07:54 | so unfortunately |
---|
0:07:57 | we were not able to compare all of them |
---|
0:08:00 | because, when we started, Amazon Lex was actually in closed beta |
---|
0:08:05 | I don't know, maybe that changed by today, but at this point in time they didn't |
---|
0:08:09 | offer a batch import function, so you had to mark everything in the web interface |
---|
0:08:15 | and we |
---|
0:08:17 | couldn't, or we didn't want to, do that |
---|
0:08:19 | wit.ai offered a batch import function, but it was not working |
---|
0:08:23 | with external data, so you could only export data from wit.ai and re-import that |
---|
0:08:30 | according to the issue tracker it's |
---|
0:08:32 | a known bug |
---|
0:08:33 | although I'm not sure if it's really a bug, or a feature to lock people in, actually |
---|
0:08:42 | so i already said that |
---|
0:08:44 | they all have kind of similar-looking |
---|
0:08:48 | data formats |
---|
0:08:49 | but still, of course, they are all somewhat different: some use just one file, some |
---|
0:08:54 | distribute information |
---|
0:08:56 | across different files |
---|
0:08:59 | some denote the position |
---|
0:09:00 | by character, some by words, and so on. So what we did, |
---|
0:09:05 | because we wanted to automate |
---|
0:09:07 | this process as much as possible, |
---|
0:09:10 | we implemented a small converter which is able to take the generic |
---|
0:09:17 | representation that we use for our corpora |
---|
0:09:21 | and convert it to the different import formats |
---|
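Such a converter can be sketched roughly as follows. The generic representation and the character-offset target format shown here are simplified stand-ins, not the services' real import schemas; they only illustrate the word-position versus character-position translation just mentioned.

```python
# Sketch: convert a generic annotation (word-index spans) into a target
# format that denotes entity positions by character offsets. Both formats
# are simplified stand-ins, not any service's actual import schema.

generic = {
    "text": "from Munich to the airport",
    "intent": "FindConnection",
    "entities": [  # positions as word indices in this generic format
        {"type": "StartLocation", "start_word": 1, "end_word": 1},
        {"type": "Destination", "start_word": 3, "end_word": 4},
    ],
}

def word_span_to_chars(text, start_word, end_word):
    """Translate a word-index span into character offsets (end exclusive)."""
    words = text.split()
    start = len(" ".join(words[:start_word])) + (1 if start_word else 0)
    end = len(" ".join(words[:end_word + 1]))
    return start, end

def to_char_format(example):
    """Emit a single-file, character-offset style training example."""
    out = {"text": example["text"], "intent": example["intent"], "entities": []}
    for e in example["entities"]:
        start, end = word_span_to_chars(example["text"],
                                        e["start_word"], e["end_word"])
        out["entities"].append({"entity": e["type"], "start": start, "end": end})
    return out

converted = to_char_format(generic)
print(converted["entities"][0])  # {'entity': 'StartLocation', 'start': 5, 'end': 11}
```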
0:09:23 | and actually |
---|
0:09:25 | one thing that is |
---|
0:09:27 | maybe also interesting |
---|
0:09:29 | out of these, there are three |
---|
0:09:33 | services which are free |
---|
0:09:34 | so api.ai, wit.ai, and Rasa |
---|
0:09:38 | are free as in 'free beer', so they are free of charge |
---|
0:09:42 | and Rasa is free as in 'freedom', because it's open source software |
---|
0:09:46 | and |
---|
0:09:48 | another nice thing about Rasa is that it works with the |
---|
0:09:53 | import formats |
---|
0:09:55 | from all the other services, so that means |
---|
0:09:57 | when you switch from one of the commercial services to Rasa |
---|
0:10:01 | you don't have to do any work, you can just copy all your data, and |
---|
0:10:04 | that's it |
---|
0:10:06 | so what we then did: |
---|
0:10:08 | once we had the data |
---|
0:10:10 | converted |
---|
0:10:11 | into the right format, we used the APIs of the services to train them |
---|
0:10:16 | for the commercial services |
---|
0:10:19 | it just takes like five or ten minutes, and you can do it via the |
---|
0:10:23 | API. For Rasa you have to do it on the command line |
---|
0:10:30 | so, roughly, |
---|
0:10:31 | for a few hundred instances that you're training, you can |
---|
0:10:36 | assume it takes about one hour on a reasonable desktop machine |
---|
0:10:43 | and then |
---|
0:10:44 | we did, in other words, |
---|
0:10:46 | the same |
---|
0:10:47 | only in the other direction |
---|
0:10:50 | we again took our corpus, took test data from it, |
---|
0:10:54 | sent it to all the different APIs, |
---|
0:10:57 | stored the resulting annotations, and then compared them to our |
---|
0:11:01 | gold standard |
---|
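The test half of that pipeline can be sketched like this. `annotate` is only a placeholder for a real API client (a real implementation would call each service's HTTP API), and the gold examples are invented for illustration.

```python
# Sketch of the test loop: send every test utterance to a service, store the
# returned annotations, and compare them against the gold standard.
# `annotate` is a placeholder, not a real SDK call.

def annotate(service, text):
    # Placeholder: a real implementation would call the service's HTTP API.
    return {"intent": "FindConnection", "entities": []}

gold = [  # invented gold-standard examples
    {"text": "from Munich to the airport", "intent": "FindConnection"},
    {"text": "when does the next train leave", "intent": "DepartureTime"},
]

def evaluate(service, gold_standard):
    predictions = []
    correct = 0
    for example in gold_standard:
        pred = annotate(service, example["text"])
        predictions.append(pred)  # store the raw annotation for later analysis
        if pred["intent"] == example["intent"]:
            correct += 1
    return correct / len(gold_standard), predictions

accuracy, preds = evaluate("some-service", gold)
print(accuracy)  # 0.5 with this placeholder annotator
```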
0:11:04 | so, about the corpora: we used two of them |
---|
0:11:08 | one was |
---|
0:11:11 | obtained |
---|
0:11:12 | through a chatbot that we built before, so it was a working Telegram chatbot |
---|
0:11:16 | for public transport in Munich, and it was manually checked by us |
---|
0:11:22 | and so we had two hundred and six |
---|
0:11:25 | questions, requests, from the chatbot, and they had |
---|
0:11:30 | two different intents and five entity types, so we have a lot of data |
---|
0:11:34 | for one intent and just a few for the other |
---|
0:11:37 | this data was interesting because it's very natural and it was |
---|
0:11:42 | real users used the chatbot, so it's kind of |
---|
0:11:48 | hopefully comparable to |
---|
0:11:50 | linguistically, in its form, to what you would receive with |
---|
0:11:54 | a chatbot |
---|
0:11:55 | but regarding the domain, obviously, Siemens was more interested in |
---|
0:12:01 | a technical domain; that's why we had a second corpus |
---|
0:12:04 | which we |
---|
0:12:06 | collected from StackExchange. All programmers |
---|
0:12:10 | probably know StackOverflow, and they have a bunch of |
---|
0:12:13 | different platforms for different topics |
---|
0:12:17 | and we took questions from |
---|
0:12:20 | their platform for web applications, and another platform |
---|
0:12:24 | called Ask Ubuntu, which is about questions |
---|
0:12:28 | about Ubuntu |
---|
0:12:30 | and these were tagged with Amazon Mechanical Turk |
---|
0:12:34 | and the StackExchange corpus is available online |
---|
0:12:37 | you can find it |
---|
0:12:38 | as detail |
---|
0:12:40 | so |
---|
0:12:43 | and |
---|
0:12:44 | in the corpus you can also find the answers to these questions, because |
---|
0:12:49 | we only took |
---|
0:12:50 | questions which have an accepted answer, although we are not using these answers for our |
---|
0:12:56 | evaluation |
---|
0:12:57 | but it might be useful for somebody else in the future |
---|
0:13:02 | and also we took the highest ranked questions |
---|
0:13:05 | because we assumed that they have a somewhat good quality |
---|
0:13:12 | how did we do it on Mechanical Turk then? Well, we basically modelled |
---|
0:13:16 | the interface that all these services also offer, so we presented a sentence, and the |
---|
0:13:22 | annotators |
---|
0:13:24 | could |
---|
0:13:25 | highlight the different parts that are entities |
---|
0:13:29 | and they could choose from a predefined list of intents |
---|
0:13:34 | and we also asked them to rate how confident they are |
---|
0:13:37 | about their annotation |
---|
0:13:39 | and we only took into account annotations |
---|
0:13:43 | which were |
---|
0:13:45 | at least somewhat confident |
---|
0:13:46 | and for which we could find inter-annotator agreement |
---|
0:13:50 | of more than sixty percent |
---|
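That filtering step might look roughly like the following. The 1-to-3 confidence scale, the data layout, and the intent strings are assumptions for illustration; only the "somewhat confident" cut-off and the sixty percent agreement threshold come from the talk.

```python
# Sketch: keep a crowd label only if (a) the vote is self-rated at least
# "somewhat confident" and (b) more than 60% of confident annotators agree.
# The confidence scale and data layout are assumptions for illustration.
from collections import Counter

MIN_CONFIDENCE = 2   # "somewhat confident" on an assumed 1-3 scale
MIN_AGREEMENT = 0.6

def accepted_label(votes):
    """Return the majority intent if it clears both thresholds, else None."""
    confident = [v for v in votes if v["confidence"] >= MIN_CONFIDENCE]
    if not confident:
        return None
    label, count = Counter(v["intent"] for v in confident).most_common(1)[0]
    if count / len(confident) > MIN_AGREEMENT:
        return label
    return None

votes = [  # one entry per annotator for the same utterance
    {"intent": "FindConnection", "confidence": 3},
    {"intent": "FindConnection", "confidence": 2},
    {"intent": "DepartureTime", "confidence": 3},
]
print(accepted_label(votes))  # FindConnection (2 of 3 confident votes agree)
```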
0:13:54 | so this is what we got out: the distribution of intents and entities |
---|
0:14:00 | so the actual numbers are not so important, but |
---|
0:14:04 | if you look at it you can see that there are |
---|
0:14:06 | entities with more training data and less training data |
---|
0:14:10 | so we have some variety in there |
---|
0:14:13 | although of course, in total, it is still a rather small dataset |
---|
0:14:19 | and then, before we started our evaluation, we had three main hypotheses |
---|
0:14:24 | so the first one might sound obvious but it was still the reason why we |
---|
0:14:30 | did all this because we assume that |
---|
0:14:33 | you should think about which of these services you choose and not just because of |
---|
0:14:36 | pricing, but because of the quality |
---|
0:14:41 | of the annotations |
---|
0:14:44 | we also assumed that commercial products would overall perform better |
---|
0:14:48 | after all, they probably have hundreds of thousands of users feeding them with data |
---|
0:14:54 | and therefore we also thought that, especially for |
---|
0:14:58 | entities and intents where there's not much training data, |
---|
0:15:02 | they should be |
---|
0:15:03 | better, because, so, Rasa for example uses the |
---|
0:15:08 | machine learning backend MITIE, which comes with |
---|
0:15:12 | three hundred megabytes of initial data, so you would assume, if there's not much training |
---|
0:15:16 | data provided, that |
---|
0:15:20 | LUIS, Watson, and so on have |
---|
0:15:22 | a lot more data to start with |
---|
0:15:26 | and we also thought that the quality of the labels is influenced by the domain |
---|
0:15:31 | so if one service is |
---|
0:15:33 | good on the corpus about public transport, it doesn't necessarily mean that it is also good |
---|
0:15:38 | on the other corpora |
---|
0:15:41 | so these are, on a very high level, the |
---|
0:15:44 | results of our evaluation |
---|
0:15:47 | what you can see is |
---|
0:15:48 | the blue bar, which is LUIS |
---|
0:15:51 | so this is the F-score |
---|
0:15:53 | across all labels, so intents and entities combined; in the paper you can find a |
---|
0:15:58 | broken-down version of it |
---|
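One common way to get a single F-score across intents and entities is a micro-average: pool the true positives, false positives, and false negatives over all label types before computing precision and recall. The sketch below uses toy gold/predicted pairs, not the actual results, and the micro-averaging choice is an assumption about how such a combined score could be computed.

```python
# Sketch of a micro-averaged F-score over all labels (intents and entities
# pooled together). The gold/predicted pairs below are toy data.

def micro_f1(pairs):
    """pairs: list of (gold_labels, predicted_labels), each a set per utterance."""
    tp = fp = fn = 0
    for gold, pred in pairs:
        tp += len(gold & pred)   # labels the service got right
        fp += len(pred - gold)   # labels it predicted but shouldn't have
        fn += len(gold - pred)   # labels it missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pairs = [
    ({("intent", "FindConnection"), ("entity", "Munich")},
     {("intent", "FindConnection"), ("entity", "Munich")}),
    ({("intent", "DepartureTime")},
     {("intent", "FindConnection")}),
]
print(round(micro_f1(pairs), 2))  # 0.67
```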
0:16:00 | but so, for the guys from Microsoft: congratulations, LUIS was best on every domain |
---|
0:16:08 | actually, what was surprising for us is that Rasa came second |
---|
0:16:13 | so across all the domains it has the second best performance |
---|
0:16:17 | which was quite surprising for us |
---|
0:16:20 | if you look into the details you can also find quite some interesting reasons why, on |
---|
0:16:26 | some domains, some services perform worse; for example, Watson |
---|
0:16:30 | was very bad, compared to the others, on the public transport data, because it |
---|
0:16:37 | confused them, |
---|
0:16:39 | it ignored the distinction |
---|
0:16:40 | so we used our example with 'from' and 'to' |
---|
0:16:43 | and |
---|
0:16:44 | you can have the same words for 'from' and 'to', obviously, all the time |
---|
0:16:48 | and |
---|
0:16:50 | Watson was the only service that was not able to distinguish between 'from' and |
---|
0:16:54 | 'to' |
---|
0:16:54 | so |
---|
0:16:55 | if you write 'from Munich to the airport' or 'from the airport to |
---|
0:17:00 | Munich' |
---|
0:17:01 | Watson always gave |
---|
0:17:03 | both words the labels 'from' and 'to' |
---|
0:17:06 | so this is for example one reason |
---|
0:17:09 | why we see different |
---|
0:17:11 | performances on different domains |
---|
0:17:16 | so what are the key findings of our evaluation? |
---|
0:17:20 | well, as I said, LUIS performs best in all the domains we tested, |
---|
0:17:24 | Rasa second best |
---|
0:17:27 | an interesting point: if you look at intents and entities with |
---|
0:17:33 | not much training data, there's no difference, so Rasa is not |
---|
0:17:39 | better or worse on them than the commercial services |
---|
0:17:42 | so it seems that there is no big influence |
---|
0:17:47 | of the initial training set |
---|
0:17:48 | that is already there |
---|
0:17:51 | and, well, you see that the domain matters, but the question is how much, since |
---|
0:17:57 | LUIS still performs best in all domains |
---|
0:18:01 | so that's kind of the question: |
---|
0:18:03 | can we now say, okay, you should always use LUIS? |
---|
0:18:07 | and i would say no |
---|
0:18:09 | you still have to try it with your domain, with your data, |
---|
0:18:14 | to find out which service is the best for you |
---|
0:18:18 | also, services might change without you noticing, so |
---|
0:18:25 | that's why I think it is very useful to automate this pipeline with the |
---|
0:18:31 | scripts, like we did, because then you can do it on all the |
---|
0:18:34 | services, and even redo it constantly, to find out |
---|
0:18:39 | which service is |
---|
0:18:42 | the best for you, and one |
---|
0:18:44 | interesting question which arose from |
---|
0:18:47 | these findings |
---|
0:18:49 | is whether the commercial services really |
---|
0:18:52 | benefit that much from user data, because when we talked with industry partners |
---|
0:18:58 | that was one of their main concerns still: |
---|
0:19:00 | we pay them money, and we pay them in data |
---|
0:19:03 | and |
---|
0:19:05 | so |
---|
0:19:06 | I'm not really sure about this, at least for the user-defined entities. So |
---|
0:19:10 | if I define my entity called 'start' |
---|
0:19:14 | and I label one thousand datasets, |
---|
0:19:18 | how is that useful for |
---|
0:19:21 | any of these services? Because |
---|
0:19:23 | it's my user-defined label |
---|
0:19:26 | and what are they able to extract from it? |
---|
0:19:30 | maybe that's the reason why we don't see what we expected when it comes |
---|
0:19:35 | to |
---|
0:19:38 | entity types and intents with |
---|
0:19:40 | less training data: that they do not perform |
---|
0:19:44 | significantly better |
---|
0:19:46 | thank you |
---|
0:19:53 | okay, so we have about five minutes for questions |
---|
0:20:05 | the experiments were great, so, |
---|
0:20:08 | full disclosure, I'm one of the creators of Rasa, so I'm slightly biased |
---|
0:20:11 | and |
---|
0:20:12 | did you go and |
---|
0:20:14 | tweak any of the hyperparameters |
---|
0:20:16 | in the Rasa configuration |
---|
0:20:19 | the hyperparameters, did you just use the defaults, or did you tweak them? |
---|
0:20:23 | no, we used the defaults. I think you could maybe squeeze out more performance |
---|
0:20:26 | sure |
---|
0:20:32 | thanks for the very nice talk. This question is more a comment, for some more context, which is that |
---|
0:20:36 | it seems that there's almost a baseline lacking, which is, like, |
---|
0:20:41 | maybe a PhD student for a week spending time trying to get the accuracy of |
---|
0:20:45 | something, because these services are really designed for people who aren't technical. I think |
---|
0:20:48 | for this kind of comparison it would also be nice to see, |
---|
0:20:53 | maybe, you know, what happens if you just try to take something like a slightly |
---|
0:20:56 | more standard approach, and just see how well you can do without these |
---|
0:20:59 | services, whether these services are helping you or not, because I think |
---|
0:21:02 | you could say, well, you can use them very well that way |
---|
0:21:05 | if you want, but if you want to really get the accuracy, you should |
---|
0:21:08 | get into the details |
---|
0:21:10 | results |
---|
0:21:22 | (inaudible) |
---|
0:21:25 | I absolutely loved this, I |
---|
0:21:31 | I'm very appreciative that some independent party is taking the time to evaluate, independently, |
---|
0:21:37 | some services. LUIS, and possibly the others too, have something like active learning: they'll suggest |
---|
0:21:44 | utterances you might want to go and label, once you've collected some utterances |
---|
0:21:49 | if I read the evaluation correctly, you haven't done that here; you have a fixed |
---|
0:21:52 | training set |
---|
0:21:54 | I'm curious, have you looked at that aspect of the services at all? Any comments? |
---|
0:21:59 | so, I mean, there are a lot of other aspects which we didn't look at |
---|
0:22:02 | so this is one point; another point is also |
---|
0:22:05 | that a lot of these services also have, like, |
---|
0:22:10 | built-in entity types already |
---|
0:22:12 | so you have fixed |
---|
0:22:15 | pre-trained entity types for locations, phone numbers, and so on |
---|
0:22:19 | and I think that's also something you can benefit a lot from if you use them |
---|
0:22:25 | and |
---|
0:22:26 | but so, we didn't look at that here. We also, so for Siemens, we did also |
---|
0:22:33 | a comparison about |
---|
0:22:35 | the functionalities; some of them include |
---|
0:22:39 | already giving responses, canned responses, and so on |
---|
0:22:43 | but so, really, we just used this dataset and we only did this evaluation |
---|
0:22:49 | on these things, because, again, if you do it with the suggestions, you |
---|
0:22:54 | have to do it through the web interface, and this means that you have to label |
---|
0:22:58 | five hundred utterances on all systems |
---|
0:23:04 | that is something that might be interesting in the future but takes more time |
---|
0:23:15 | do you have any other questions? We have about two minutes left |
---|
0:23:21 | okay, I have a question, so |
---|
0:23:24 | so, this is a chatbot session, so could you elaborate on the |
---|
0:23:28 | relationship between this work and chatbots |
---|
0:23:31 | well, as I said, so I think this is one of the parts, |
---|
0:23:37 | or this can be one useful part, if you want to develop a chatbot, and |
---|
0:23:41 | what we saw |
---|
0:23:43 | in typical work is, so, I mean, you use one of these services, and |
---|
0:23:48 | if you just evaluate your chatbot as a whole, end to end, |
---|
0:23:53 | then |
---|
0:23:54 | you might be influenced by these results without knowing it so |
---|
0:23:58 | your chatbot might perform |
---|
0:24:00 | better just because you changed your natural language understanding service, so I think |
---|
0:24:06 | it is important |
---|
0:24:08 | to know about these things and to think about it and also if you do |
---|
0:24:12 | an evaluation of a chatbot as a whole system, to take into account |
---|
0:24:17 | these things. And I also think, from an industry perspective, |
---|
0:24:22 | these services are one of the reasons why |
---|
0:24:24 | chatbots became so popular recently |
---|
0:24:27 | because it is really easy so |
---|
0:24:30 | you have other services, which are not as popular, which really offer you to |
---|
0:24:36 | click together a whole chatbot without programming a single line of code |
---|
0:24:41 | and here you can at least do it without having any knowledge about language processing or machine learning |
---|
0:24:46 | whatsoever |
---|
0:24:48 | and I think therefore it's especially |
---|
0:24:51 | important for this type of chatbot development, and influences it a lot |
---|
0:24:56 | (inaudible) |
---|
0:25:00 | okay |
---|
0:25:16 | okay |
---|
0:25:17 | so, it's about time, so let's thank the speaker |
---|