So the final talk of this session at SIGDIAL 2019 is by Chris Hidey, entitled "Discourse Relation Prediction: Revisiting Word Pairs with Convolutional Networks". Go ahead.
So, I'm Chris Hidey, and I'm presenting this work. My co-author, Siddharth Varia, was not able to make it to the conference.
Detecting the presence of a discourse relation between two text segments is important for a lot of downstream applications, including text-level or document-level tasks such as text planning or summarization.
One such resource labelled with discourse relations is the Penn Discourse Treebank, as also mentioned in the previous talk. It defines the shallow discourse semantics between segments, unlike a framework such as RST, which builds a full parse tree over a document. At the top level there are four different classes: the Comparison relation, which includes contrast and concession; Expansion, which might include examples; Contingency, which includes conditional and causal statements; and then Temporal relations.
These relations can be expressed either explicitly, using a discourse connective, or implicitly. To give an example from the PDTB: the first argument is "Mr. Hahn began selling non-core businesses, such as oil and gas and chemicals" and the second argument is "He even sold one unit that made vinyl checkbook covers". This is an implicit example, and it would be the Expansion relation with the implicit connective "in fact".
so
I'll discuss the background on using word pairs to predict discourse relations, talk about related work on word pairs along with previous work on neural models, describe our method of using convolutional networks to model word pairs, and then compare to previous work and provide some analysis of our model's performance.
so
Earlier work by Marcu and Echihabi looked at using word pairs to identify discourse relations. They noted that, absent very good semantic parsers, one way to identify the relationship between text segments is to define word pairs using a very large corpus. So for the Comparison relation, a word pair such as "good" and "fails" wouldn't be an antonym pair in a resource like WordNet, but we might be able to identify it from a large unlabeled corpus. They leveraged discourse connectives to identify these word pairs and then built a model using those word pairs as features.
The initial work using word pairs took the cross product of words on either side of a connective from some external resource and then used those identified word pairs as features for a classifier. Some work on the PDTB found that the top word pairs in terms of information gain are discourse connectives and functional words, and this may be a product of the frequency of those words as well as the sparsity of word pairs.
In order to handle the sparsity issue, Biran and McKeown built aggregated TF-IDF features: they identified word pairs across each connective in the Gigaword corpus and built around a hundred different TF-IDF vectors, one per connective, which gave around a hundred dot products they could use as features on the labeled data.
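To make that concrete, here is a rough sketch of aggregated TF-IDF word-pair features in that spirit. This is only an illustration, not the exact setup of that prior work; the data format, weighting scheme, and function names are my assumptions.

```python
from collections import Counter
import math

def word_pairs(arg1_tokens, arg2_tokens):
    """Cross product of words on either side of the split point."""
    return [(w1, w2) for w1 in arg1_tokens for w2 in arg2_tokens]

def build_connective_vectors(instances):
    """instances: iterable of (connective, arg1_tokens, arg2_tokens) mined from an
    unlabeled corpus. Each connective is treated as one 'document' of word pairs."""
    per_connective = {}
    for connective, arg1, arg2 in instances:
        per_connective.setdefault(connective, Counter()).update(word_pairs(arg1, arg2))
    n = len(per_connective)
    df = Counter()
    for counts in per_connective.values():
        df.update(counts.keys())
    # tf-idf weight each word pair within each connective's vector
    return {
        c: {p: tf * math.log(n / df[p]) for p, tf in counts.items()}
        for c, counts in per_connective.items()
    }

def instance_features(arg1, arg2, connective_vectors):
    """One feature per connective: the dot product between the instance's
    word-pair counts and that connective's tf-idf vector."""
    counts = Counter(word_pairs(arg1, arg2))
    return [
        sum(counts[p] * w for p, w in vec.items() if p in counts)
        for vec in connective_vectors.values()
    ]
```

The point is that each labeled instance ends up with roughly one feature per connective, which keeps the feature space small compared to raw word pairs.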
Recently, neural models have had a lot of success on the PDTB: recurrent models, CNNs, or more recently attention-based models. One advantage of these models is that it's easier to jointly model the PDTB with other corpora, either labeled or unlabeled data. More recent work has used adversarial learning, training a model that is given the implicit connective alongside a model without the connective. And very recently, Dai and Huang used a joint approach with the full paragraph context, jointly modeling explicit and implicit relations using a bidirectional LSTM and a CRF.
The advantage of word pairs is that they provide an intuitive way of identifying features, but they also tend to rely on noisy unlabeled external data, and the word pair representations are very sparse, since it's not possible to explicitly model every word pair. On the other hand, the neural models allow us to jointly model other data as well, but the downside is that we have to identify a specific architecture, and these models can be very complex.
This suggests two research questions: whether we can explicitly model these word pairs using neural models, and whether we can transfer knowledge by joint learning with the explicit labeled examples in the PDTB.
so
To give an example: given the sentence "I'm late for the meeting because the train was delayed", we would split it into argument one and argument two, where argument two starts with the explicit discourse connective. We then take the Cartesian product of the words on either side of the argument boundary, and this gives us a matrix of word pairs. We take the same approach for implicit relations; it's the same matrix, minus the connective. So given this grid of word pairs, we take filters of even length and slide them over the grid. Initially we take word-word pairs, where we take a single word from either side of the argument, and we slide the filter across so that we get word pair representations.
We can also do the same thing with larger filter sizes, which essentially represent word and n-gram pairs. In this case, a filter of size eight represents a pair of a word and a four-gram from the first argument and the second argument. We again take this filter and slide it across the grid using a stride of two, and for the most part we are getting word and n-gram pairs, except at row and column boundaries, where we end up with multiple word pairs. We then do the same thing in the other direction: instead of going across the rows, we take these convolutions and slide them down the columns, so we get (Arg2, Arg1) pairs as well as (Arg1, Arg2) pairs.
This gives us our initial architecture: argument one and argument two are passed into a CNN, and we do max pooling over that to extract the features. Then we do the same thing for argument two and argument one, and we concatenate the resulting features, which gives us the representation for word pairs. The weights between these two CNNs are shared as well.
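A minimal sketch of how this word-pair convolution could be implemented, under my reading of the slides (interleaving the fixed Arg1 word with the Arg2 words so that an even-width filter with stride two always spans whole pairs). Module names, dimensions, and filter counts are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class WordPairCNN(nn.Module):
    """Convolutions over the (Arg1 x Arg2) word-pair grid.

    Each row of the grid fixes one Arg1 word and pairs it with every Arg2 word.
    We flatten a row into an interleaved sequence [a1_i, a2_1, a1_i, a2_2, ...],
    so a filter of width 2 (stride 2) sees a word-word pair and a filter of
    width 8 sees a word paired with a 4-gram.
    """

    def __init__(self, emb_dim, n_filters=50, widths=(2, 4, 6, 8)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=w, stride=2) for w in widths]
        )

    def forward(self, arg1, arg2):
        # arg1: (batch, len1, emb_dim), arg2: (batch, len2, emb_dim)
        # assumes arguments are padded to at least 4 tokens for the widest filter
        b, n1, d = arg1.shape
        n2 = arg2.size(1)
        # Build interleaved rows: (batch, n1, 2 * n2, emb_dim)
        a1 = arg1.unsqueeze(2).expand(b, n1, n2, d)
        a2 = arg2.unsqueeze(1).expand(b, n1, n2, d)
        rows = torch.stack([a1, a2], dim=3).reshape(b, n1, 2 * n2, d)
        # Treat every row as one sequence for the 1-d convolutions
        rows = rows.reshape(b * n1, 2 * n2, d).transpose(1, 2)   # (b*n1, d, 2*n2)
        feats = []
        for conv in self.convs:
            h = torch.relu(conv(rows))               # (b*n1, n_filters, positions)
            h = h.max(dim=2).values                  # max pooling over positions
            feats.append(h.reshape(b, n1, -1).max(dim=1).values)  # pool over rows
        return torch.cat(feats, dim=1)               # word-pair representation

# The same module, with shared weights, would also be applied to (arg2, arg1),
# and the two outputs concatenated.
```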
Similarly, we take the same approach for the individual arguments, and the reason for this is twofold. The first reason is that it's a way to determine the effect of the word pairs, that is, to evaluate whether the word pairs are complementary to the individual arguments. The other motivation for including the individual arguments is that many discourse relations contain lexical indicators that, even absent context, are often indicative of a discourse relation; an example is the implicitly causal verbs that might identify a Contingency relation, such as "make" or "provide". We use the same architecture here, where instead of the cross product of the arguments we have the individual arguments, which are passed into a CNN, giving a feature representation for each argument, and we concatenate these together to obtain the argument representation.
We also want to be able to model the interaction between the arguments, and the way we do that is with an additional gate layer. We concatenate argument one and argument two, pass that through a nonlinearity, and determine how much to weight the individual features. This gives us a weighted representation of the interaction between the two arguments. Then, in order to model the interaction between the arguments and the word pairs, we have another gate with an identical architecture: we take the output of the first gate, the argument interaction, combine it with the word pairs, pass it through a nonlinearity, and predict how much to weight the individual features.
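A rough sketch of such a gate: a learned sigmoid weighting over the concatenated features. The exact nonlinearity and dimensions are assumptions on my part.

```python
import torch
import torch.nn as nn

class GatedCombination(nn.Module):
    """Combine two feature vectors and learn how much to weight each feature."""

    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.gate = nn.Linear(dim_a + dim_b, dim_a + dim_b)

    def forward(self, a, b):
        combined = torch.cat([a, b], dim=-1)
        weights = torch.sigmoid(self.gate(combined))  # how much to weight each feature
        return weights * combined

# First gate: interaction between the two argument representations.
# Second gate (same architecture): combine that output with the word-pair features.
```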
Finally, this entire architecture is shared between the implicit and explicit relations, except for the final classification, where we have separate multilayer perceptrons for explicit relations and for implicit relations, and we predict the discourse relation. We then do joint learning over the PDTB to predict the discourse relation.
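Putting it together, the shared encoder with separate classification heads might look roughly like this. The hidden sizes, head depth, and training details are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class JointDiscourseClassifier(nn.Module):
    """Shared encoder, with separate MLP heads for explicit and implicit relations."""

    def __init__(self, encoder, feat_dim, n_classes=4, hidden=128):
        super().__init__()
        self.encoder = encoder  # shared CNN + gating layers described above
        self.explicit_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )
        self.implicit_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, arg1, arg2, is_explicit):
        features = self.encoder(arg1, arg2)
        head = self.explicit_head if is_explicit else self.implicit_head
        return head(features)

# Joint training mixes batches of explicit and implicit PDTB examples,
# with a standard cross-entropy loss on whichever head applies.
```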
Overall this gives us features from argument one and argument two: word-word pairs, word and n-gram pairs, and then n-gram features. For the word pairs we use even-sized filters of two, four, six, and eight, while for the n-grams we use filters of sizes two, three, and five. We use static word embeddings, so we fix them and don't update them during training; we initialise them with word2vec, and we use word2vec embeddings trained on the PDTB for the out-of-vocabulary words. Finally, we concatenate those with one-hot part-of-speech encodings, and this is the input into the network.
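As a sketch, assembling one token's input vector might look like this, assuming w2v and backoff_w2v are plain word-to-vector dictionaries; the embedding dimension and POS tag set shown here are placeholders.

```python
import numpy as np

POS_TAGS = ["NN", "VB", "JJ", "RB", "IN", "DT"]  # placeholder tag set
POS_INDEX = {t: i for i, t in enumerate(POS_TAGS)}

def token_vector(word, pos, w2v, backoff_w2v, dim=300):
    """Static word2vec embedding (no fine-tuning) concatenated with a one-hot POS."""
    if word in w2v:
        emb = w2v[word]
    elif word in backoff_w2v:        # word2vec trained on the PDTB for OOV words
        emb = backoff_w2v[word]
    else:
        emb = np.zeros(dim)
    pos_onehot = np.zeros(len(POS_TAGS))
    if pos in POS_INDEX:
        pos_onehot[POS_INDEX[pos]] = 1.0
    return np.concatenate([emb, pos_onehot])
```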
We evaluated on two different datasets: PDTB 2.0, as well as the test datasets from the CoNLL 2016 shared task. We evaluate on three different tasks: the one-versus-all task, the four-way classification task, and fifteen-way classification. All of these experiments are available in the paper; for this talk I'll discuss the four-way classification results. We use the standard splits so that we can compare to previous work.
Compared to recent work, we obtain improved performance. To compare to previous work: some previous work uses the max over a number of different runs and some uses the average, so we present both to provide a fair comparison. We primarily compare to Dai and Huang, since they also have a joint model over implicit and explicit relations, and we find improved performance over their model on both types. Compared to other recent work, we also find that our max F1 and accuracy are better on implicit relations as well.
In order to identify where the improved performance is coming from, we conduct a number of ablation experiments. Examining the full model with joint learning compared to the implicit-only case, we find that most of the improved performance is coming from Expansion: there's a five-point improvement on the Expansion class from the joint learning, and this improves the micro F1 and accuracy overall. So the representations learned from explicit Expansion relations are helpful for implicit relations.
We conduct an additional experiment to determine the effect of the word pairs. We find that, compared to using the individual arguments alone, on implicit relations we obtain increasingly better performance as we increase the number of word pairs that we use. For implicit relations, we obtain around a two-point improvement overall on both F1 and accuracy. On the other hand, with explicit relations we don't find improved performance, and part of that is probably due to the fact that the connective itself is a very strong baseline that is difficult to improve upon; even just learning a representation of the connective by itself is a pretty strong model. Still, we don't do worse, so we're able to use this joint model for both.
If we examine the performance on individual classes in terms of where the word pairs help, we find that using word pairs of up to length four, compared to individual arguments, improves the average F1 and accuracy on the full four-way task. It especially helps the Comparison relations: we obtain a six-and-a-half-point improvement on Comparison, along with small improvements on Expansion and Temporal, whereas for Contingency we do a bit worse. This is worth investigating further in future work: three of the four high-level relations are helped by word pairs, but Contingency is not.
Some speculation about why the word pairs might help: Expansion and Comparison tend to have words or phrases of similar or opposite meaning, and it's possible the word pair representations are capturing that. Contingency, on the other hand, does much better in the individual-arguments case, which might be because of the implicitly causal verbs that are indicative of the Contingency relation.
We also conducted a qualitative analysis, so let's look at some examples where the word pair features are helping. We ran an experiment in which we removed all the nonlinearities after the convolutional layers, i.e. removing the gates, so we only have the features extracted from the word pairs and the arguments, concatenated together before making a prediction with a linear classifier. Averaged over three runs with these two different models, this reduces the score by around a point or so, which shows both that the gates help with modeling discourse relations, but also that this is a reasonable approximation to what the model is learning.
We then take the argmax of these feature maps instead of doing max pooling, map those back to the original word pairs or n-gram features, and identify examples that are recovered by the full model but not by the implicit-only model.
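A rough sketch of that analysis step, reusing the interleaved-row layout from the word-pair CNN sketch above; the helper name and exact indexing are assumptions.

```python
def strongest_pair(conv, row, arg1_word, arg2_tokens):
    """Map the strongest activation of one convolution back to the tokens it covered.

    row: (1, emb_dim, 2 * len(arg2_tokens)) interleaved input for one Arg1 word,
    as built in the WordPairCNN sketch above.
    """
    fmap = conv(row).squeeze(0)                  # (n_filters, positions)
    flat_idx = fmap.argmax().item()
    position = flat_idx % fmap.size(1)           # argmax position along the row
    start = position * conv.stride[0]
    width = conv.kernel_size[0]
    # Odd offsets in the interleaved row hold the Arg2 words
    arg2_ngram = arg2_tokens[start // 2 : (start + width) // 2]
    return arg1_word, arg2_ngram                 # e.g. ('plans', ['declined', 'to', ...])
```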
This is a Comparison example: "Alliant said it plans to use a microprocessor" and "It declined to discuss its plans". One of the top word-pair features that the model learns in this case is "plans" paired with "declined to discuss its plans", so it seems the model is able to learn that this is a word and a phrase with opposing meaning. We also provide an Expansion example: "It allows Mr. Van de Kamp to get around campaign spending limits" and "He can spend the legal maximum for his campaign". Again, one of the top word-pair features learned is "spending limits" paired with "maximum", so it seems it's learning that these are important features because they have similar meaning.
Finally, we conducted an experiment to compare our model to previous work in terms of running time and the number of parameters. We find that, compared to a bidirectional LSTM-CRF model, we have around half the number of parameters. We also ran each model three times, for four or five epochs each, using PyTorch on the same GPU, and we find that our model runs in around half the time. So, using a less complex model, we are able to obtain similar or better performance.
Overall, we find that word pairs are complementary to individual arguments, both overall and on three of the four top-level classes. We also find that joint learning improves the model, indicating some shared properties between the implicit and explicit discourse relations, in particular for the Expansion class.
For future work, we would like to evaluate the impact of contextual embeddings such as BERT, instead of using just word embeddings, to see if we can obtain improved performance. We would also like to evaluate whether these properties transfer to other corpora, either external labeled datasets or unlabeled datasets across explicit connectives.
If there are any questions, feel free to email us, and our code is available at the following link. So, are there any questions?
Thanks for the talk. So you talked about word pairs, but actually you showed word-to-n-gram combinations, with the length of the n-gram being, a priori, anything you need, right? I mean, within the limits of the longest sentence. So why did you do that, and did you try, in your experiments, limiting it to just the word pairs, the actual word pairs, and what happened?
So we did try just word pairs, and we found that they improve performance, but modeling the word and n-gram pairs identified better features. I can show you: here, WP1 in this case is just the individual word pairs. The word pairs themselves improve things overall, but not as much as when we include the word and n-gram pairs. In this case we limited it to four, and that was just an experimental determination: beyond four we didn't obtain any improved performance.
Excellent talk. I had a question about your last example. I think it's this one, right? So if you say "He will spend the legal maximum for his campaign", comparing with the PDTB example, I think it might be both.
So you can have multiple... yes, the PDTB allows for multiple labels for a single instance.
Okay. It seems to me, from your talk and also from the previous talk, that the temporal relations were more difficult than the other ones. Is that right?
That's correct.
And so, why?
I think part of the reason is that the Temporal class in the PDTB is very small. I also think temporal relations are hard in general; I don't know that neural models are particularly good at representing dates and times, so that might be part of the reason, but that's just speculation.
More questions? There is a question.
Is your system also able to identify that there is no relation between the two arguments, or do you always assume there is either an explicit or implicit relation?
Right, so we just do the four-way task, so we assume there is a discourse relation.
All right, let's thank the speaker again.