0:00:17 | the final talk for naacl two thousand nineteen is by chris hidey |
---|
0:00:24 | entitled discourse relation prediction revisiting word pairs with convolutional networks |
---|
0:00:34 | so i'm chris hidey and i'm presenting |
---|
0:00:37 | one of my co-authors was not able to make it to the conference |
---|
0:00:40 | and |
---|
0:00:41 | detecting the presence of a discourse relation between two text segments |
---|
0:00:45 | is important for |
---|
0:00:48 | a lot of downstream applications |
---|
0:00:50 | including text level or document level tasks such as |
---|
0:00:54 | text planning or summarization |
---|
0:00:56 | and one such resource that's labelled with discourse relations is the penn discourse treebank |
---|
0:01:03 | as was also mentioned in the previous talk and this |
---|
0:01:06 | defines the shallow discourse semantics between segments |
---|
0:01:10 | unlike a framework such as rst which builds a full |
---|
0:01:13 | parse tree over a document |
---|
0:01:15 | and at the top level there are four different classes |
---|
0:01:21 | the comparison relation which includes contrast and concession the expansion relation which might include examples |
---|
0:01:27 | contingency which includes conditional and causal statements |
---|
0:01:31 | and then temporal relations |
---|
0:01:33 | and this can be expressed either explicitly using a discourse connective or implicitly |
---|
0:01:41 | to provide an example from the pdtb the first argument is mister hahn |
---|
0:01:45 | began selling non-core businesses such as oil and gas and chemicals and the second argument is |
---|
0:01:50 | it even sold one unit that made vinyl checkbook covers |
---|
0:01:53 | and in this case |
---|
0:01:55 | this is an implicit example but this would be the expansion relation with the implicit |
---|
0:02:00 | connective in fact |
---|
0:02:03 | so |
---|
0:02:04 | i'll discuss the background on using word pairs to predict discourse relations and |
---|
0:02:09 | i'll talk about the related work on word pairs along with |
---|
0:02:13 | previous work on neural models i'll discuss our method of using convolutional networks to |
---|
0:02:18 | model word pairs |
---|
0:02:19 | and also compare to the previous work and present some analysis of the performance of |
---|
0:02:24 | our model |
---|
0:02:26 | so |
---|
0:02:29 | earlier work by marcu and echihabi |
---|
0:02:33 | looked at using word pairs to identify discourse relations and they noted that |
---|
0:02:38 | absent very good semantic parsers |
---|
0:02:41 | one way to |
---|
0:02:44 | identify the relationship between text segments is to find word pairs using a very large |
---|
0:02:49 | corpus |
---|
0:02:51 | so this is the comparison relation set |
---|
0:02:54 | and a word pair such as good and fails this wouldn't be an antonym |
---|
0:02:59 | in a |
---|
0:03:00 | resource like wordnet but we might be able to identify this from a |
---|
0:03:04 | large unlabeled corpus |
---|
0:03:06 | and so they leveraged discourse connectives to identify these word pairs |
---|
0:03:13 | and then built a model using those word pairs as features |
---|
0:03:17 | so the initial work using word pairs |
---|
0:03:20 | did that using |
---|
0:03:21 | the cross product of words on either side of a connective |
---|
0:03:24 | from some external resource |
---|
0:03:26 | and then using those identified word pairs as features for a classifier |
---|
0:03:31 | some work on the pdtb found that the top word pairs in terms of |
---|
0:03:35 | information gain are discourse connectives and functional words |
---|
0:03:39 | and this may be a product of the frequency of those words as well as |
---|
0:03:43 | the sparsity of word pairs |
---|
0:03:47 | so in order to handle the sparsity issue |
---|
0:03:50 | biran and mckeown |
---|
0:03:53 | built separate tf-idf features so they identified word pairs across each connective in the |
---|
0:03:58 | gigaword corpus |
---|
0:04:00 | and then they identified around a hundred different |
---|
0:04:03 | tf-idf vectors which gave a hundred dot products they could use as features on |
---|
0:04:07 | the labeled data |
---|
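roughly, that approach can be sketched as follows (a simplified illustration with toy counts and one toy connective, not the paper's actual connectives or weighting details):

```python
import math

def tfidf(pair_counts, doc_freq, n_docs):
    """Weight each word pair's count by its inverse document frequency."""
    return {p: c * math.log(n_docs / doc_freq[p])
            for p, c in pair_counts.items()}

def dot(u, v):
    """Dot product of two sparse tf-idf vectors (dicts)."""
    return sum(w * v.get(p, 0.0) for p, w in u.items())

# one tf-idf vector per connective, mined from unlabeled text (toy numbers)
but_vec = tfidf({("good", "fails"): 3, ("the", "the"): 50},
                doc_freq={("good", "fails"): 2, ("the", "the"): 100},
                n_docs=100)

# an unseen argument pair is scored against each connective's vector,
# giving one dot-product feature per connective on the labeled data
example = {("good", "fails"): 1.0}
feature = dot(example, but_vec)
```

note how the very frequent pair ("the", "the") is zeroed out by the idf term, which is exactly the sparsity/frequency issue the tf-idf weighting addresses.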
0:04:10 | so recently neural models have had a lot of success on the pdtb |
---|
0:04:15 | either recurrent models or cnns or more recently attention based models |
---|
0:04:19 | and one advantage of these models is that |
---|
0:04:21 | it's easier to jointly model the |
---|
0:04:25 | pdtb with other corpora either labeled or unlabeled data |
---|
0:04:31 | more recent work used adversarial learning |
---|
0:04:34 | so comparing the |
---|
0:04:37 | model given an implicit connective |
---|
0:04:39 | with the model without a connective |
---|
0:04:42 | and then |
---|
0:04:43 | very recently dai and huang |
---|
0:04:45 | used a joint approach using the full |
---|
0:04:49 | paragraph context |
---|
0:04:50 | and jointly modeling explicit and implicit relations |
---|
0:04:53 | using a bidirectional lstm and a crf |
---|
0:04:57 | so the advantage of the word pairs is that they provide an intuitive way |
---|
0:05:02 | of |
---|
0:05:02 | identifying features |
---|
0:05:04 | but |
---|
0:05:06 | this approach also tends to use noisy unlabeled external data and then the word pair |
---|
0:05:11 | representations are very sparse since it's |
---|
0:05:14 | not possible to explicitly model every word pair |
---|
0:05:17 | on the other hand the |
---|
0:05:18 | the neural models allow us to |
---|
0:05:22 | jointly model other data as well but the downside is that we have to identify |
---|
0:05:27 | a specific architecture |
---|
0:05:30 | and these models can be very complex as well |
---|
0:05:33 | so this |
---|
0:05:34 | this suggests two research questions whether we can explicitly model these word pairs |
---|
0:05:39 | using neural models |
---|
0:05:40 | and then whether we can transfer knowledge by joint learning with explicit |
---|
0:05:44 | labeled examples in the pdtb |
---|
0:05:48 | so |
---|
0:05:49 | to give an example so given a sentence i'm late for the meeting because the |
---|
0:05:54 | train was delayed |
---|
0:05:56 | we would split that into argument one and argument two where argument |
---|
0:06:01 | two begins with the explicit discourse connective |
---|
0:06:05 | and then we would take the |
---|
0:06:07 | cartesian product of the words in the two arguments and |
---|
0:06:11 | so this gives us |
---|
0:06:13 | a matrix of word pairs |
---|
0:06:15 | and then we take the same approach for implicit relations |
---|
0:06:21 | it's the same matrix minus the connective |
---|
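the grid construction just described can be sketched as follows (assuming simple whitespace tokenization; in the model each cell would hold the two words' concatenated embeddings rather than the words themselves):

```python
def word_pair_grid(arg1, arg2):
    """Cartesian product of tokens: rows index Arg1 words, columns Arg2 words."""
    return [[(w1, w2) for w2 in arg2] for w1 in arg1]

arg1 = "i'm late for the meeting".split()
arg2 = "the train was delayed".split()  # connective "because" excluded
grid = word_pair_grid(arg1, arg2)       # a 5 x 4 matrix of word pairs
```

for the implicit case the same grid is built, just without the connective token.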
0:06:26 | and so given this grid of word pairs |
---|
0:06:30 | we then take these filters |
---|
0:06:33 | of even length and we slide them over this grid |
---|
0:06:38 | so initially we take word and word pairs where we take a single |
---|
0:06:42 | word from each argument |
---|
0:06:44 | and we slide it across so that we get word pair |
---|
0:06:47 | representations |
---|
0:06:50 | we can also do the same thing |
---|
0:06:52 | where larger filter sizes essentially represent |
---|
0:06:54 | word and n-gram pairs so in this case this is a filter of size eight |
---|
0:06:58 | and it represents |
---|
0:06:59 | a word and a four gram pair |
---|
0:07:02 | from the first argument and then the second argument |
---|
0:07:05 | so |
---|
0:07:05 | we can again take this filter |
---|
0:07:08 | and slide it across the box using a stride of two |
---|
0:07:13 | and for the most part we're getting word and n-gram pairs except at row |
---|
0:07:16 | and column boundaries where we end up with multiple word pairs |
---|
0:07:22 | we again do the same thing |
---|
0:07:24 | so before we were going across the rows we again |
---|
0:07:28 | take these convolutions and slide them down the columns |
---|
0:07:31 | so we get arg two and arg one as well as arg one and arg |
---|
0:07:35 | two |
---|
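as a rough illustration of what these even-sized filters cover (the exact filter arithmetic and stride bookkeeping are in the paper; here we just enumerate the spans, assuming a filter of size f pairs one word with an f/2-gram from the other argument):

```python
def word_ngram_windows(arg1, arg2, filter_size, stride=2):
    """Enumerate the (word, n-gram) spans an even-sized filter would cover:
    each Arg1 word is paired with (filter_size // 2)-grams of Arg2, with the
    n-gram window sliding at the given stride.  This is a simplification of
    the convolution described in the talk."""
    n = filter_size // 2
    return [(w1, tuple(arg2[i:i + n]))
            for w1 in arg1
            for i in range(0, len(arg2) - n + 1, stride)]

arg1 = "i'm late".split()
arg2 = "the train was delayed".split()
windows = word_ngram_windows(arg1, arg2, filter_size=8)  # word + 4-gram pairs
```

sliding the same windows down the columns instead of across the rows yields the symmetric arg2-with-arg1 spans.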
0:07:38 | so this gives us our initial architecture where we have |
---|
0:07:42 | argument one and argument two |
---|
0:07:44 | which are passed into a cnn and we do max pooling over that to extract |
---|
0:07:47 | the features |
---|
0:07:49 | and then we do the same thing with argument two and argument one |
---|
0:07:52 | and we concatenate the |
---|
0:07:55 | resulting features and this gives us the representation for word pairs |
---|
0:08:00 | and the weights between these two cnns are shared as well |
---|
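a toy sketch of this step, with scalar features standing in for embedding vectors: the same shared filter scores both argument orders, max pooling keeps the strongest response from each, and the pooled responses are concatenated:

```python
def conv_max_pool(seq, filt):
    """1-D convolution (valid, stride 1) over a feature sequence,
    followed by max pooling over positions."""
    k = len(filt)
    return max(sum(f * x for f, x in zip(filt, seq[i:i + k]))
               for i in range(len(seq) - k + 1))

filt = [1.0, 1.0]   # one shared filter; the real model learns many
arg1 = [0.5, 2.0]   # toy scalar features for Arg1
arg2 = [1.0, 0.0]   # toy scalar features for Arg2

# both orderings are scored with the SAME filter weights, then concatenated
rep = [conv_max_pool(arg1 + arg2, filt), conv_max_pool(arg2 + arg1, filt)]
```

the two entries differ because the filter sees different adjacencies in the two orderings, which is why both directions are modeled.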
0:08:06 | so similarly |
---|
0:08:08 | we |
---|
0:08:09 | take a similar approach for the individual arguments |
---|
0:08:13 | and the reason for this is two fold the first reason is that it |
---|
0:08:17 | gives us a way to determine the effect of the word pairs and to |
---|
0:08:21 | evaluate if the word pairs are complementary to individual arguments |
---|
0:08:25 | and then the other motivation for including individual arguments |
---|
0:08:29 | is that many discourse relations |
---|
0:08:32 | contain lexical indicators |
---|
0:08:34 | absence context |
---|
0:08:36 | that are often indicative of a discourse relation so |
---|
0:08:39 | an example of that are the |
---|
0:08:42 | implicit causality verbs that might identify a contingency relation such as make or provide |
---|
0:08:49 | so |
---|
0:08:50 | we use the same architecture here where instead of |
---|
0:08:54 | the cross product of the arguments we have the individual arguments |
---|
0:08:58 | which are passed into a cnn |
---|
0:09:01 | and that gives us this |
---|
0:09:03 | feature representation for the individual arguments which we can concatenate together |
---|
0:09:09 | to obtain an argument representation |
---|
0:09:12 | so we also want to be able to model the interaction between the arguments |
---|
0:09:16 | and the way that we do that is with an additional gate layer |
---|
0:09:21 | so we concatenate argument one and argument two and pass that through a nonlinearity |
---|
0:09:28 | and then we determine how much to weight the individual features |
---|
0:09:32 | so this gives us a |
---|
0:09:34 | weighted representation of the interaction between the two arguments |
---|
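a sketch of this gating step, assuming a sigmoid nonlinearity and element-wise weighting (the toy weights and tiny dimensions are illustrative, not the paper's exact parameterization):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_interaction(arg1_rep, arg2_rep, W, b):
    """Concatenate the two argument representations, compute one sigmoid
    gate value per feature from the concatenation, and weight the
    concatenated features element-wise by those gate values."""
    h = arg1_rep + arg2_rep                       # concatenation
    g = [sigmoid(sum(w * x for w, x in zip(row, h)) + bi)
         for row, bi in zip(W, b)]                # one gate per feature
    return [gi * xi for gi, xi in zip(g, h)]

# toy dimensions: two 1-d argument representations -> 2 features, 2 gates
out = gated_interaction([1.0], [0.0],
                        W=[[0.0, 0.0], [0.0, 0.0]], b=[0.0, 0.0])
```

with zero weights every gate is sigmoid(0) = 0.5, so each feature is simply halved; trained weights would instead learn how much each feature matters given the other argument.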
0:09:41 | and then in order to model the interaction between the arguments and the word pairs |
---|
0:09:45 | we have a gate with an identical architecture |
---|
0:09:49 | where we take |
---|
0:09:51 | the output of the first gate so the argument interaction |
---|
0:09:55 | and we combine that with the word pairs and pass it through a nonlinearity |
---|
0:09:59 | and we predict how much to weight the individual features |
---|
0:10:06 | and then finally this entire architecture |
---|
0:10:09 | is shared between the implicit and explicit relations |
---|
0:10:14 | except for the final classification |
---|
0:10:17 | so |
---|
0:10:18 | for the final classification we just have separate |
---|
0:10:23 | multilayer perceptrons for |
---|
0:10:25 | explicit relations and for implicit relations |
---|
0:10:27 | and we predict the discourse relation |
---|
0:10:30 | and then we do joint learning over the pdtb |
---|
0:10:34 | to predict the discourse relation |
---|
0:10:39 | so overall this gives us features from argument one and argument two where we |
---|
0:10:43 | have word and word pairs we have word and n-gram pairs |
---|
0:10:46 | and then we have n-gram features |
---|
0:10:48 | and for the word pairs we use even sized filters of two four six and |
---|
0:10:53 | eight while for the n-grams we use filters of sizes two three and five |
---|
0:10:57 | and then we use static word embeddings so we fix them and don't update |
---|
0:11:02 | them during training |
---|
0:11:03 | we just initialise them with |
---|
0:11:05 | word2vec and we use |
---|
0:11:09 | word2vec embeddings trained on the pdtb for the out-of-vocabulary words |
---|
0:11:13 | and then finally we concatenate those with one-hot part of speech encodings |
---|
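a minimal sketch of this input construction (the tag set and vectors are toy values; per the talk, the fallback for out-of-vocabulary words would come from a word2vec model trained on the pdtb):

```python
def token_input(word, pos, embeddings, pos_tags, oov_vec):
    """Static (frozen) word vector plus a one-hot part-of-speech encoding."""
    vec = embeddings.get(word, oov_vec)   # oov words fall back to oov_vec
    one_hot = [1.0 if t == pos else 0.0 for t in pos_tags]
    return vec + one_hot                  # concatenation is the network input

pos_tags = ["NN", "VB", "JJ"]             # toy tag set
embeddings = {"train": [0.1, 0.2]}        # stands in for word2vec vectors
x = token_input("train", "NN", embeddings, pos_tags, oov_vec=[0.0, 0.0])
```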
0:11:22 | so we evaluated on two different datasets |
---|
0:11:27 | the pdtb two point oh as well as the test datasets for conll two thousand |
---|
0:11:33 | sixteen |
---|
0:11:34 | and we evaluate on three different tasks the one versus all task |
---|
0:11:39 | the four way classification task and the fifteen way classification |
---|
0:11:44 | and all of these experiments are |
---|
0:11:47 | available in the paper for this talk i'll discuss the four way classification results |
---|
0:11:53 | and |
---|
0:11:54 | we use the standard splits |
---|
0:11:56 | so that we can compare to previous work |
---|
0:12:01 | so compared to recent work |
---|
0:12:03 | we obtain improved performance |
---|
0:12:05 | in order to compare to previous work some previous work |
---|
0:12:11 | used the max of a number of different runs some used the average so |
---|
0:12:14 | we present both so that we can |
---|
0:12:16 | provide a fair comparison |
---|
0:12:18 | we primarily compare to dai and huang since they also have a joint model over implicit |
---|
0:12:23 | and explicit relations |
---|
0:12:24 | and so we find |
---|
0:12:27 | improved performance over their model on both |
---|
0:12:30 | relation types |
---|
0:12:33 | compared to other recent work |
---|
0:12:36 | we also find that the max |
---|
0:12:39 | f one and accuracy are better on implicit relations as well |
---|
0:12:44 | so in order to identify where the |
---|
0:12:48 | improved performance is coming from we conduct a number of ablation experiments |
---|
0:12:52 | so examining the full model |
---|
0:12:56 | with joint learning and comparing to the |
---|
0:13:00 | implicit only case we find that most of the improved performance is coming from expansion |
---|
0:13:04 | so there's a five point improvement on the expansion class |
---|
0:13:10 | from the joint learning and this improves the micro f one and accuracy overall |
---|
0:13:15 | so the |
---|
0:13:17 | representations learned from explicit expansion relations are helpful for implicit relations |
---|
0:13:25 | we conduct an additional experiment |
---|
0:13:29 | to determine the effect of the word pairs |
---|
0:13:31 | and so we find that compared to using individual arguments |
---|
0:13:35 | on implicit relations we obtain |
---|
0:13:37 | increasingly better performance as we |
---|
0:13:39 | increase the number of word pairs that we use |
---|
0:13:43 | so |
---|
0:13:45 | in terms of implicit relations we obtain around a two point improvement overall on |
---|
0:13:49 | both f one and accuracy |
---|
0:13:51 | on the other hand with explicit relations we don't find improved performance |
---|
0:13:56 | and part of that is probably due to the fact that the |
---|
0:14:01 | connective itself is a very strong baseline that's difficult to improve upon |
---|
0:14:05 | so even just learning a representation of the connective by itself is a |
---|
0:14:10 | pretty strong model |
---|
0:14:13 | on the other hand we don't do worse so we're still able to use this |
---|
0:14:15 | joint model for both |
---|
0:14:20 | if we examine the performance on individual classes in terms of where the word pairs |
---|
0:14:25 | help |
---|
0:14:27 | we find that |
---|
0:14:29 | using |
---|
0:14:30 | word pairs of up to length four |
---|
0:14:33 | compared to individual arguments improves |
---|
0:14:39 | the average of the f one and accuracy on the full four way task |
---|
0:14:43 | but we find that it especially helps |
---|
0:14:46 | the comparison relations so we obtain a six and a half point improvement on comparison relations |
---|
0:14:51 | and small improvements on expansion and temporal |
---|
0:14:55 | whereas for contingency we do a bit worse |
---|
0:14:59 | and |
---|
0:15:02 | so this is worth investigating further in future work so we find that |
---|
0:15:08 | three of the four high level relations are helped by word pairs but contingency |
---|
0:15:12 | is not |
---|
0:15:15 | so some speculation about why the word pairs might help |
---|
0:15:19 | with expansion and comparison they tend to have words or phrases of similar or opposite meaning |
---|
0:15:24 | and it's possible the word pair representations are capturing that |
---|
0:15:29 | whereas contingency |
---|
0:15:31 | since it does much better in the individual arguments case |
---|
0:15:35 | it might be because of these implicit causality verbs that are indicative of |
---|
0:15:42 | the contingency relation as well |
---|
0:15:47 | so we also conducted a qualitative analysis |
---|
0:15:50 | so let's look at some examples of where the word pair features are |
---|
0:15:53 | helping |
---|
0:15:55 | so |
---|
0:15:57 | we conducted an experiment where we removed all the nonlinearities after the convolutional layers so |
---|
0:16:01 | removing the gates |
---|
0:16:02 | and |
---|
0:16:04 | we only have the features extracted from the word pairs and the arguments concatenated |
---|
0:16:07 | together |
---|
0:16:09 | before |
---|
0:16:10 | making |
---|
0:16:11 | a prediction |
---|
0:16:13 | using a linear classifier |
---|
0:16:15 | and then on the average of three runs using these two different models |
---|
0:16:19 | it reduces the score by |
---|
0:16:22 | around a point or so |
---|
0:16:23 | and so this shows both that the gates help |
---|
0:16:28 | with modeling discourse relations |
---|
0:16:29 | but also that this is a reasonable approximation to what the model is learning |
---|
0:16:34 | so we then take the argmax of these feature maps instead |
---|
0:16:38 | of doing max pooling |
---|
0:16:40 | and then we map those counts back to the original word pairs or n-gram |
---|
0:16:44 | features |
---|
0:16:45 | and we identify examples that are recovered by the full model and not by the |
---|
0:16:50 | implicit model |
---|
0:16:51 | only |
---|
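the argmax read-off can be sketched like this, with scalar toy features; mapping the returned window indices back to tokens recovers the word pairs or n-grams responsible for each filter's pooled activation:

```python
def pooled_window_indices(seq, filters):
    """For each filter, return the window index with the highest activation,
    i.e. the position that max pooling would have selected."""
    picks = []
    for filt in filters:
        k = len(filt)
        scores = [sum(f * x for f, x in zip(filt, seq[i:i + k]))
                  for i in range(len(seq) - k + 1)]
        picks.append(max(range(len(scores)), key=scores.__getitem__))
    return picks

seq = [0.0, 1.0, 3.0, 0.0]   # toy per-position features
picks = pooled_window_indices(seq, [[1.0], [1.0, 1.0]])
```

counting these picked positions over many examples, and comparing the full model's counts against the implicit-only model's, surfaces the examples discussed next.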
0:16:53 | so this is the comparison example the company said it plans to use a microprocessor |
---|
0:16:58 | and it declined to discuss its plans |
---|
0:17:02 | so one of the top word pair features that the model learns in this case |
---|
0:17:05 | is plans and declined to discuss its plans |
---|
0:17:09 | so here the model |
---|
0:17:12 | it seems like it's able to learn that this is a |
---|
0:17:15 | word and a phrase with opposing meaning |
---|
0:17:20 | we also provide an expansion example it allows most of them to |
---|
0:17:24 | get around campaign spending limits |
---|
0:17:26 | he can spend the legal maximum for his campaign |
---|
0:17:29 | and |
---|
0:17:30 | again one of the top word pair features learned is spending limits and maximum so |
---|
0:17:35 | it seems like it's learning that these |
---|
0:17:37 | are important features because |
---|
0:17:41 | they have similar meaning |
---|
0:17:45 | so finally we conduct an experiment to compare our model to the previous work in |
---|
0:17:51 | terms of running time and |
---|
0:17:53 | the number of parameters |
---|
0:17:56 | and we find that compared to a bidirectional lstm crf model |
---|
0:18:01 | we have around half the number of parameters |
---|
0:18:04 | and then we also ran the model |
---|
0:18:06 | three times for four or five epochs each |
---|
0:18:10 | using pytorch and on the same gpu |
---|
0:18:14 | and we find that our model runs in around half the running time |
---|
0:18:17 | so |
---|
0:18:20 | using a less complex model we're able to obtain similar or better performance |
---|
0:18:26 | so overall we find that word pairs are complementary to individual arguments |
---|
0:18:32 | both |
---|
0:18:33 | overall |
---|
0:18:35 | and on |
---|
0:18:40 | three of the four top level classes |
---|
0:18:43 | we also find |
---|
0:18:45 | that |
---|
0:18:46 | joint learning improves the model |
---|
0:18:48 | indicating some shared properties between the implicit and explicit |
---|
0:18:51 | discourse relations |
---|
0:18:52 | in particular for the |
---|
0:18:54 | expansion class |
---|
0:18:56 | and for future work we would like to evaluate the impact of contextual embeddings such |
---|
0:19:03 | as bert |
---|
0:19:04 | so instead of |
---|
0:19:06 | using just word embeddings to see if we can obtain improved performance |
---|
0:19:11 | but also to evaluate whether these properties transfer to other corpora as well either |
---|
0:19:16 | external labeled datasets |
---|
0:19:18 | or unlabeled datasets across |
---|
0:19:21 | explicit connectives |
---|
0:19:26 | so if there are any questions |
---|
0:19:28 | feel free to |
---|
0:19:29 | email us |
---|
0:19:33 | and our code is available at the following link |
---|
0:19:44 | so we have time for questions |
---|
0:19:50 | thanks for the talk so you talked about word pairs but actually you showed |
---|
0:19:56 | word to n-gram combinations so |
---|
0:20:01 | why is the length of the n-gram fixed |
---|
0:20:06 | a priori |
---|
0:20:07 | it could be unlimited i mean |
---|
0:20:09 | within |
---|
0:20:11 | the limits of the longest sentence so why did you do that and did you |
---|
0:20:16 | try with experimentation to what you limit the n and did you just try |
---|
0:20:24 | the word pairs the actual word pairs and what happened |
---|
0:20:30 | so we did try just word pairs |
---|
0:20:34 | and so we found that improved performance but then |
---|
0:20:38 | modeling the word and n-gram pairs |
---|
0:20:43 | identified better features |
---|
0:20:48 | so |
---|
0:20:51 | i can show you |
---|
0:20:52 | so here |
---|
0:20:55 | so the wp one in this case is just the individual word |
---|
0:20:59 | pairs |
---|
0:21:01 | so |
---|
0:21:03 | the word pairs themselves improve |
---|
0:21:08 | overall |
---|
0:21:09 | but |
---|
0:21:10 | not as much as when we include the |
---|
0:21:12 | word and n-gram pairs |
---|
0:21:16 | so in this case we limited it to four so that was just |
---|
0:21:20 | an experimental determination beyond four we didn't obtain any |
---|
0:21:24 | improved performance |
---|
0:21:29 | great talk i had a question about your last example |
---|
0:21:34 | i think about |
---|
0:21:35 | this one right |
---|
0:21:37 | so if you say he will spend the legal maximum for his campaign couldn't it |
---|
0:21:41 | be temporal |
---|
0:21:51 | i think it might be both |
---|
0:21:53 | so you can have multiple tags yes the pdtb allows for multiple |
---|
0:21:58 | labels for a single example |
---|
0:21:59 | okay it seems to me from your talk and also from the previous talk that the |
---|
0:22:03 | temporal relations were more difficult than the other ones is that right |
---|
0:22:06 | that's correct so why |
---|
0:22:11 | i think |
---|
0:22:11 | part of the reason is that the temporal class in the pdtb |
---|
0:22:15 | is very small |
---|
0:22:18 | and |
---|
0:22:19 | i think temporal relations are hard in general i don't know that neural models are |
---|
0:22:23 | particularly good at representing dates and times so that might be part of the reason |
---|
0:22:28 | but that's just |
---|
0:22:29 | speculation |
---|
0:22:34 | more questions |
---|
0:22:41 | there is a question |
---|
0:22:44 | is your estimator also able to identify whether there is a relation between the |
---|
0:22:51 | two arguments |
---|
0:22:54 | it seems you always assume there is either an explicit or implicit |
---|
0:22:59 | relation right so we just do the four way task |
---|
0:23:03 | so assuming there is a discourse relation |
---|
0:23:13 | all right let's thank the speaker again |
---|