So the final talk of this session at SIGDIAL 2019 is by Chris Hidey, entitled "Discourse Relation Prediction: Revisiting Word Pairs with Convolutional Networks". Go ahead.
So, I'm Chris Hidey, and I'm presenting this work. My co-author, Siddharth Varia, was not able to make it to the conference.
Detecting the presence of a discourse relation between two text segments is important for a lot of downstream applications, including text-level or document-level tasks such as text planning or summarization.
One such resource labelled with discourse relations is the Penn Discourse Treebank, as also mentioned in the previous talk. It defines the shallow discourse semantics between segments, unlike a framework such as RST, which builds a full parse tree over a document. At the top level there are four different classes: the Comparison relation, which includes contrast and concession; Expansion, which might include examples; Contingency, which includes conditional and causal statements; and then Temporal relations.
These relations can be expressed either explicitly, using a discourse connective, or implicitly. To give an example from the PDTB: the first argument is "Mr. Hahn began selling non-core businesses, such as oil and gas and chemicals" and the second argument is "He even sold one unit that made vinyl checkbook covers". This is an implicit example, and it would be the Expansion relation with the implicit connective "in fact".
so
I'll discuss the background on using word pairs to predict discourse relations, talk about related work on word pairs along with previous work on neural models, describe our method of using convolutional networks to model word pairs, and then compare to previous work and provide some analysis of our model's performance.
so
Earlier work by Marcu and Echihabi looked at using word pairs to identify discourse relations. They noted that, absent very good semantic parsers, one way to identify the relationship between text segments is to define word pairs using a very large corpus. So for the Comparison relation, a word pair such as "good" and "fails" wouldn't be an antonym pair in a resource like WordNet, but we might be able to identify it from a large unlabeled corpus. They leveraged discourse connectives to identify these word pairs and then built a model using those word pairs as features.
The initial work using word pairs took the cross product of words on either side of a connective from some external resource and then used those identified word pairs as features for a classifier. Some work on the PDTB found that the top word pairs in terms of information gain are discourse connectives and functional words, and this may be a product of the frequency of those words as well as the sparsity of word pairs.
In order to handle the sparsity issue, Biran and McKeown built aggregated TF-IDF features: they identified word pairs across each connective in the Gigaword corpus and built around a hundred different TF-IDF vectors, one per connective, which gave around a hundred dot products they could use as features on the labeled data.
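To make that concrete, here is a rough sketch of aggregated TF-IDF word-pair features in that spirit. This is only an illustration, not the exact setup of that prior work; the data format, weighting scheme, and function names are my assumptions.

```python
from collections import Counter
import math

def word_pairs(arg1_tokens, arg2_tokens):
    """Cross product of words on either side of the split point."""
    return [(w1, w2) for w1 in arg1_tokens for w2 in arg2_tokens]

def build_connective_vectors(instances):
    """instances: iterable of (connective, arg1_tokens, arg2_tokens) mined from an
    unlabeled corpus. Each connective is treated as one 'document' of word pairs."""
    per_connective = {}
    for connective, arg1, arg2 in instances:
        per_connective.setdefault(connective, Counter()).update(word_pairs(arg1, arg2))
    n = len(per_connective)
    df = Counter()
    for counts in per_connective.values():
        df.update(counts.keys())
    # tf-idf weight each word pair within each connective's vector
    return {
        c: {p: tf * math.log(n / df[p]) for p, tf in counts.items()}
        for c, counts in per_connective.items()
    }

def instance_features(arg1, arg2, connective_vectors):
    """One feature per connective: the dot product between the instance's
    word-pair counts and that connective's tf-idf vector."""
    counts = Counter(word_pairs(arg1, arg2))
    return [
        sum(counts[p] * w for p, w in vec.items() if p in counts)
        for vec in connective_vectors.values()
    ]
```

The point is that each labeled instance ends up with roughly one feature per connective, which keeps the feature space small compared to raw word pairs.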
Recently, neural models have had a lot of success on the PDTB: recurrent models, CNNs, or more recently attention-based models. One advantage of these models is that it's easier to jointly model the PDTB with other corpora, either labeled or unlabeled data. More recent work has used adversarial learning, training a model that is given the implicit connective alongside a model without the connective. And very recently, Dai and Huang used a joint approach with the full paragraph context, jointly modeling explicit and implicit relations using a bidirectional LSTM and a CRF.
The advantage of word pairs is that they provide an intuitive way of identifying features, but they also tend to rely on noisy unlabeled external data, and the word pair representations are very sparse, since it's not possible to explicitly model every word pair. On the other hand, the neural models allow us to jointly model other data as well, but the downside is that we have to identify a specific architecture, and these models can be very complex.
This suggests two research questions: whether we can explicitly model these word pairs using neural models, and whether we can transfer knowledge by joint learning with the explicit labeled examples in the PDTB.
so
To give an example: given the sentence "I'm late for the meeting because the train was delayed", we would split it into argument one and argument two, where argument two starts with the explicit discourse connective. We then take the Cartesian product of the words on either side of the argument boundary, and this gives us a matrix of word pairs. We take the same approach for implicit relations; it's the same matrix, minus the connective. So given this grid of word pairs, we take filters of even length and slide them over the grid. Initially we take word-word pairs, where we take a single word from either side of the argument, and we slide the filter across so that we get word pair representations.
We can also do the same thing with larger filter sizes, which essentially represent word and n-gram pairs. In this case, a filter of size eight represents a pair of a word and a four-gram from the first argument and the second argument. We again take this filter and slide it across the grid using a stride of two, and for the most part we are getting word and n-gram pairs, except at row and column boundaries, where we end up with multiple word pairs. We then do the same thing in the other direction: instead of going across the rows, we take these convolutions and slide them down the columns, so we get (Arg2, Arg1) pairs as well as (Arg1, Arg2) pairs.
This gives us our initial architecture: argument one and argument two are passed into a CNN, and we do max pooling over that to extract the features. Then we do the same thing for argument two and argument one, and we concatenate the resulting features, which gives us the representation for word pairs. The weights between these two CNNs are shared as well.
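A minimal sketch of how this word-pair convolution could be implemented, under my reading of the slides (interleaving the fixed Arg1 word with the Arg2 words so that an even-width filter with stride two always spans whole pairs). Module names, dimensions, and filter counts are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class WordPairCNN(nn.Module):
    """Convolutions over the (Arg1 x Arg2) word-pair grid.

    Each row of the grid fixes one Arg1 word and pairs it with every Arg2 word.
    We flatten a row into an interleaved sequence [a1_i, a2_1, a1_i, a2_2, ...],
    so a filter of width 2 (stride 2) sees a word-word pair and a filter of
    width 8 sees a word paired with a 4-gram.
    """

    def __init__(self, emb_dim, n_filters=50, widths=(2, 4, 6, 8)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=w, stride=2) for w in widths]
        )

    def forward(self, arg1, arg2):
        # arg1: (batch, len1, emb_dim), arg2: (batch, len2, emb_dim)
        # assumes arguments are padded to at least 4 tokens for the widest filter
        b, n1, d = arg1.shape
        n2 = arg2.size(1)
        # Build interleaved rows: (batch, n1, 2 * n2, emb_dim)
        a1 = arg1.unsqueeze(2).expand(b, n1, n2, d)
        a2 = arg2.unsqueeze(1).expand(b, n1, n2, d)
        rows = torch.stack([a1, a2], dim=3).reshape(b, n1, 2 * n2, d)
        # Treat every row as one sequence for the 1-d convolutions
        rows = rows.reshape(b * n1, 2 * n2, d).transpose(1, 2)   # (b*n1, d, 2*n2)
        feats = []
        for conv in self.convs:
            h = torch.relu(conv(rows))               # (b*n1, n_filters, positions)
            h = h.max(dim=2).values                  # max pooling over positions
            feats.append(h.reshape(b, n1, -1).max(dim=1).values)  # pool over rows
        return torch.cat(feats, dim=1)               # word-pair representation

# The same module, with shared weights, would also be applied to (arg2, arg1),
# and the two outputs concatenated.
```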
Similarly, we take the same approach for the individual arguments, and the reason for this is twofold. The first reason is that it's a way to determine the effect of the word pairs, that is, to evaluate whether the word pairs are complementary to the individual arguments. The other motivation for including the individual arguments is that many discourse relations contain lexical indicators that, even absent context, are often indicative of a discourse relation; an example is the implicitly causal verbs that might identify a Contingency relation, such as "make" or "provide". We use the same architecture here, where instead of the cross product of the arguments we have the individual arguments, which are passed into a CNN, giving a feature representation for each argument, and we concatenate these together to obtain the argument representation.
We also want to be able to model the interaction between the arguments, and the way we do that is with an additional gate layer. We concatenate argument one and argument two, pass that through a nonlinearity, and determine how much to weight the individual features. This gives us a weighted representation of the interaction between the two arguments. Then, in order to model the interaction between the arguments and the word pairs, we have another gate with an identical architecture: we take the output of the first gate, the argument interaction, combine it with the word pairs, pass it through a nonlinearity, and predict how much to weight the individual features.
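A rough sketch of such a gate: a learned sigmoid weighting over the concatenated features. The exact nonlinearity and dimensions are assumptions on my part.

```python
import torch
import torch.nn as nn

class GatedCombination(nn.Module):
    """Combine two feature vectors and learn how much to weight each feature."""

    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.gate = nn.Linear(dim_a + dim_b, dim_a + dim_b)

    def forward(self, a, b):
        combined = torch.cat([a, b], dim=-1)
        weights = torch.sigmoid(self.gate(combined))  # how much to weight each feature
        return weights * combined

# First gate: interaction between the two argument representations.
# Second gate (same architecture): combine that output with the word-pair features.
```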
Finally, this entire architecture is shared between the implicit and explicit relations, except for the final classification, where we have separate multilayer perceptrons for explicit relations and for implicit relations, and we predict the discourse relation. We then do joint learning over the PDTB to predict the discourse relation.
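Putting it together, the shared encoder with separate classification heads might look roughly like this. The hidden sizes, head depth, and training details are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class JointDiscourseClassifier(nn.Module):
    """Shared encoder, with separate MLP heads for explicit and implicit relations."""

    def __init__(self, encoder, feat_dim, n_classes=4, hidden=128):
        super().__init__()
        self.encoder = encoder  # shared CNN + gating layers described above
        self.explicit_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )
        self.implicit_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, arg1, arg2, is_explicit):
        features = self.encoder(arg1, arg2)
        head = self.explicit_head if is_explicit else self.implicit_head
        return head(features)

# Joint training mixes batches of explicit and implicit PDTB examples,
# with a standard cross-entropy loss on whichever head applies.
```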
Overall this gives us features from argument one and argument two: word-word pairs, word and n-gram pairs, and then n-gram features. For the word pairs we use even-sized filters of two, four, six, and eight, while for the n-grams we use filters of sizes two, three, and five. We use static word embeddings, so we fix them and don't update them during training; we initialise them with word2vec, and we use word2vec embeddings trained on the PDTB for the out-of-vocabulary words. Finally, we concatenate those with one-hot part-of-speech encodings, and this is the input into the network.
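As a sketch, assembling one token's input vector might look like this, assuming w2v and backoff_w2v are plain word-to-vector dictionaries; the embedding dimension and POS tag set shown here are placeholders.

```python
import numpy as np

POS_TAGS = ["NN", "VB", "JJ", "RB", "IN", "DT"]  # placeholder tag set
POS_INDEX = {t: i for i, t in enumerate(POS_TAGS)}

def token_vector(word, pos, w2v, backoff_w2v, dim=300):
    """Static word2vec embedding (no fine-tuning) concatenated with a one-hot POS."""
    if word in w2v:
        emb = w2v[word]
    elif word in backoff_w2v:        # word2vec trained on the PDTB for OOV words
        emb = backoff_w2v[word]
    else:
        emb = np.zeros(dim)
    pos_onehot = np.zeros(len(POS_TAGS))
    if pos in POS_INDEX:
        pos_onehot[POS_INDEX[pos]] = 1.0
    return np.concatenate([emb, pos_onehot])
```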
We evaluated on two different datasets: PDTB 2.0, as well as the test datasets from the CoNLL 2016 shared task. We evaluate on three different tasks: the one-versus-all task, the four-way classification task, and fifteen-way classification. All of these experiments are available in the paper; for this talk I'll discuss the four-way classification results. We use the standard splits so that we can compare to previous work.
Compared to recent work, we obtain improved performance. To compare to previous work: some previous work uses the max over a number of different runs and some uses the average, so we present both to provide a fair comparison. We primarily compare to Dai and Huang, since they also have a joint model over implicit and explicit relations, and we find improved performance over their model on both types. Compared to other recent work, we also find that our max F1 and accuracy are better on implicit relations as well.
In order to identify where the improved performance is coming from, we conduct a number of ablation experiments. Examining the full model with joint learning compared to the implicit-only case, we find that most of the improved performance is coming from Expansion: there's a five-point improvement on the Expansion class from the joint learning, and this improves the micro F1 and accuracy overall. So the representations learned from explicit Expansion relations are helpful for implicit relations.
We conduct an additional experiment to determine the effect of the word pairs. We find that, compared to using the individual arguments alone, on implicit relations we obtain increasingly better performance as we increase the number of word pairs that we use. For implicit relations, we obtain around a two-point improvement overall on both F1 and accuracy. On the other hand, with explicit relations we don't find improved performance, and part of that is probably due to the fact that the connective itself is a very strong baseline that is difficult to improve upon; even just learning a representation of the connective by itself is a pretty strong model. Still, we don't do worse, so we're able to use this joint model for both.
If we examine the performance on individual classes in terms of where the word pairs help, we find that using word pairs of up to length four, compared to individual arguments, improves the average F1 and accuracy on the full four-way task. It especially helps the Comparison relations: we obtain a six-and-a-half-point improvement on Comparison, along with small improvements on Expansion and Temporal, whereas for Contingency we do a bit worse. This is worth investigating further in future work: three of the four high-level relations are helped by word pairs, but Contingency is not.
Some speculation about why the word pairs might help: Expansion and Comparison tend to have words or phrases of similar or opposite meaning, and it's possible the word pair representations are capturing that. Contingency, on the other hand, does much better in the individual-arguments case, which might be because of the implicitly causal verbs that are indicative of the Contingency relation.
We also conducted a qualitative analysis, so let's look at some examples where the word pair features are helping. We ran an experiment in which we removed all the nonlinearities after the convolutional layers, i.e. removing the gates, so we only have the features extracted from the word pairs and the arguments, concatenated together before making a prediction with a linear classifier. Averaged over three runs with these two different models, this reduces the score by around a point or so, which shows both that the gates help with modeling discourse relations, but also that this is a reasonable approximation to what the model is learning.
We then take the argmax of these feature maps instead of doing max pooling, map those back to the original word pairs or n-gram features, and identify examples that are recovered by the full model but not by the implicit-only model.
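A rough sketch of that analysis step, reusing the interleaved-row layout from the word-pair CNN sketch above; the helper name and exact indexing are assumptions.

```python
def strongest_pair(conv, row, arg1_word, arg2_tokens):
    """Map the strongest activation of one convolution back to the tokens it covered.

    row: (1, emb_dim, 2 * len(arg2_tokens)) interleaved input for one Arg1 word,
    as built in the WordPairCNN sketch above.
    """
    fmap = conv(row).squeeze(0)                  # (n_filters, positions)
    flat_idx = fmap.argmax().item()
    position = flat_idx % fmap.size(1)           # argmax position along the row
    start = position * conv.stride[0]
    width = conv.kernel_size[0]
    # Odd offsets in the interleaved row hold the Arg2 words
    arg2_ngram = arg2_tokens[start // 2 : (start + width) // 2]
    return arg1_word, arg2_ngram                 # e.g. ('plans', ['declined', 'to', ...])
```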
This is a Comparison example: "Alliant said it plans to use a microprocessor" and "It declined to discuss its plans". One of the top word-pair features that the model learns in this case is "plans" paired with "declined to discuss its plans", so it seems the model is able to learn that this is a word and a phrase with opposing meaning. We also provide an Expansion example: "It allows Mr. Van de Kamp to get around campaign spending limits" and "He can spend the legal maximum for his campaign". Again, one of the top word-pair features learned is "spending limits" paired with "maximum", so it seems it's learning that these are important features because they have similar meaning.
Finally, we conducted an experiment to compare our model to previous work in terms of running time and the number of parameters. We find that, compared to a bidirectional LSTM-CRF model, we have around half the number of parameters. We also ran each model three times, for four or five epochs each, using PyTorch on the same GPU, and we find that our model runs in around half the time. So, using a less complex model, we are able to obtain similar or better performance.
Overall, we find that word pairs are complementary to individual arguments, both overall and on three of the four top-level classes. We also find that joint learning improves the model, indicating some shared properties between the implicit and explicit discourse relations, in particular for the Expansion class.
For future work, we would like to evaluate the impact of contextual embeddings such as BERT, instead of using just word embeddings, to see if we can obtain improved performance. We would also like to evaluate whether these properties transfer to other corpora, either external labeled datasets or unlabeled datasets across explicit connectives.
If there are any questions, feel free to email us, and our code is available at the following link. So, are there any questions?
Thanks for the talk. So you talked about word pairs, but actually you showed word-to-n-gram combinations, with the length of the n-gram being, a priori, anything you need, right? I mean, within the limits of the longest sentence. So why did you do that, and did you try, in your experiments, limiting it to just the word pairs, the actual word pairs, and what happened?
So we did try just word pairs, and we found that they improve performance, but modeling the word and n-gram pairs identified better features. I can show you: here, WP1 in this case is just the individual word pairs. The word pairs themselves improve things overall, but not as much as when we include the word and n-gram pairs. In this case we limited it to four, and that was just an experimental determination: beyond four we didn't obtain any improved performance.
Excellent talk. I had a question about your last example. I think it's this one, right? So if you say "He will spend the legal maximum for his campaign", comparing with the PDTB example, I think it might be both.
So you can have multiple... yes, the PDTB allows for multiple labels for a single instance.
Okay. It seems to me, from your talk and also from the previous talk, that the temporal relations were more difficult than the other ones. Is that right?
That's correct.
And so, why?
I think part of the reason is that the Temporal class in the PDTB is very small. I also think temporal relations are hard in general; I don't know that neural models are particularly good at representing dates and times, so that might be part of the reason, but that's just speculation.
More questions? There is a question.
Is your system also able to identify that there is no relation between the two arguments, or do you always assume there is either an explicit or implicit relation?
Right, so we just do the four-way task, so we assume there is a discourse relation.
All right, let's thank the speaker again.