Okay, so my talk is on discourse relation annotation. Empirical research on modeling discourse relations relies on corpora annotated with such relations. For example, we have the RST-DT corpus based on RST, the Penn Discourse Treebank (PDTB) based on D-LTAG, and the ANNODIS corpus based on SDRT. There are other corpora as well, in other frameworks; I'm not covering all of them here.
The Penn Discourse Treebank, which is the focus of my talk, is a large-scale annotated corpus, annotated over a one-million-word Wall Street Journal corpus. It's been used widely in the community for a lot of experimental work, and its framework has now been applied to annotate other texts, including other genres and languages.
However, the current version of the corpus, PDTB-2, does not provide complete coverage of its source texts. There is ongoing work to address these gaps, and the next version of the corpus, PDTB-3, will do so. Current work addressing the gaps focuses on intra-sentential relations, which are relations whose arguments are in the same sentence, along with some modifications to existing annotations, most of which involve modifications resulting in the sense hierarchy that I'll show later on. My talk focuses on a critical kind of gap in the class of inter-sentential relations, which are relations whose arguments are in different sentences.
So, just a very quick overview of the annotation framework for those of you who are not familiar with it. The PDTB follows a lexically grounded but theory-neutral approach to the representation of discourse relations, which means that the annotation is shallow, without committing to dependencies or structures beyond individual relations. Discourse relations hold between two abstract-object arguments that are named Arg1 and Arg2 using syntactic conventions.
In the examples that you see here, the Arg1 is in italics and the Arg2 is in bold. Relations are triggered either by explicit connectives, in which case they get the relation type label Explicit. So in this example, "but" is the explicit discourse connective that relates these two sentences in a relation of Contrast, and that's because the two attorneys' offices mentioned average different numbers of criminal cases.
When relations are not triggered by explicit connectives but rather by adjacency between sentences, multiple things can happen. First, we may infer a discourse relation for which we can insert a connective, such that the resulting text sounds reasonably readable and coherent. The relation type label for this is Implicit. So in this example here, they're talking about new issues being a hit with investors, and then the second, Arg2, sentence talks about how this company's offer of debentures was oversubscribed. The annotator inferred what we call the Instantiation relation, for which they inserted the connective "for example"; it sounds reasonably readable and coherent.
In other cases, we infer a discourse relation, but inserting a connective leads to redundancy, and that's because the relation is already expressed in some other manner, not through a connective. The relation type here is labeled AltLex. So in this example below, we have a subject-verb sequence, "that prompted", expressing the relation of Result between the two sentences. Basically, one company's plant has been destroyed, and people at this other company worry that the same thing is going to happen to them; the Arg2 sentence then talks about what they are going to do as a result of that worry.
Two other relation types can be assigned in these contexts: EntRel and NoRel. Later on I'll talk about how we've made some revisions to these two labels. But basically, EntRels are entity-based relations, which means that you cannot insert any explicit connective, and the sentences are related by virtue of some anaphoric reference, or just a link to some entity across the two sentences. Some of these relations actually do involve coherence relations, although the PDTB doesn't regard them as such; they involve a Background relation or a Continuation relation. So in this example, the first EntRel is actually a sort of Background, where you have the demonstrative "these" as the anaphoric link for the Arg2 sentence, which gives some background about the entities mentioned in the Arg1. And lastly, we get a NoRel when neither a discourse relation nor an entity-based relation holds. There have been some changes to how these labels are finally assigned, which I'll come back to.
With respect to the arguments of relations: how arguments can be annotated depends upon the type of relation. The Arg2 of Explicit relations is always some part of the sentence or clause containing the connective, but the Arg1 can be anywhere in the prior text. All the other relation types, like I said, are only annotated when the arguments are adjacent. Arguments can be extended to include additional clauses and sentences in all cases except NoRel, but there's a strong minimality constraint that advises inclusion of only the text minimally necessary to interpret the relation.
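To make these conventions concrete, here is a minimal sketch, not from the PDTB distribution itself, of what one relation token carries under the scheme just described; all field and method names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # character offsets into the source text

@dataclass
class Relation:
    rel_type: str              # "Explicit", "Implicit", "AltLex", "EntRel", or "NoRel"
    sense: Optional[str]       # e.g. "Comparison.Contrast"; None for EntRel/NoRel
    connective: Optional[str]  # explicit or inserted connective, if any
    arg1_spans: List[Span] = field(default_factory=list)  # spans may be discontinuous
    arg2_spans: List[Span] = field(default_factory=list)

    def adjacent(self, text: str) -> bool:
        """Crude adjacency test: only whitespace between Arg1's end and Arg2's start."""
        gap = text[self.arg1_spans[-1][1]:self.arg2_spans[0][0]]
        return gap.strip() == ""
```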
Finally in this section, the sense hierarchy. In the work that we did, we used the modified sense hierarchy for PDTB-3, which was presented at LAW last year. I'm not going to go into the details; if you want more, I have some slides for later on. But basically, the four top-level classes from PDTB-2 remain: Temporal, Comparison, Contingency, and Expansion. There have been changes at Level 2 and Level 3. Most of these changes involve a one-to-one mapping from PDTB-2 to PDTB-3, which we have implemented automatically; others are reviewed and annotated manually. In this work, we came up with two new senses that we found evidence for: one is Hypophora, for question-answer pairs, and the other is the introduction of Level-3 senses for the asymmetric Instantiation relation.
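As a rough illustration of the automatic part of that conversion, the one-to-one cases reduce to a lookup table, with anything absent from the table queued for manual review. The two entries below only show the shape of such a table; they are not the official mapping.

```python
from typing import Optional

# Illustrative entries only, not the official PDTB-2 -> PDTB-3 mapping table.
PDTB2_TO_PDTB3 = {
    "Expansion.List": "Expansion.Conjunction",
    "Expansion.Restatement.Specification": "Expansion.Level-of-detail.Arg2-as-detail",
}

def convert_sense(pdtb2_sense: str) -> Optional[str]:
    """Return the PDTB-3 sense where the mapping is one-to-one; None means manual review."""
    return PDTB2_TO_PDTB3.get(pdtb2_sense)
```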
Okay, so back to the focus of this talk, which, as I said, is a critical gap in the class of inter-sentential relations. If you look at the current version of the corpus, you'll find that all sentences containing an explicit connective that relates that sentence to something in the prior text have been annotated; well, almost all, there are some gaps. Within paragraphs, all the sentences without such a connective have also been annotated. But the first sentence of each paragraph, across the paragraph boundary, remains unannotated in the current corpus.
So in this example here, which shows the first six sentences of an article, the last one, S6, has an explicit connective at a paragraph boundary; the empty lines indicate paragraph boundaries. That one has been annotated: the Arg1 is not shown, but the sense Contrast indicates the annotation in the corpus. If you look at the third paragraph, the internal implicit relations, between S3 and S4 (Conjunction) and between S4 and S5 (Conjunction), have also been annotated. What's not annotated are S2 and S3, and that's because they are at paragraph boundaries. There are more than twelve thousand such unannotated tokens in the current version of the corpus, out of a total of almost forty thousand tokens in the corpus. These unannotated tokens constitute thirty percent of all inter-sentential discourse contexts, and eighty-seven percent of all cross-paragraph inter-sentential contexts, the remaining thirteen percent being cross-paragraph explicit relations.
So why worry about this? The first reason is automatic prediction: there's been some work showing that we can get improvements on the very hard task of implicit relation sense classification with a sequence model, and also other work that incorporates features of neighboring relations. But there's also the goal of understanding global discourse structure. The shallow analysis of the PDTB is also in service of an emergent global discourse structure, which you can get by combining the individual relations together. But in order to do that, we need the complete sequence of relations over texts, which is not in the corpus currently.
So our goals are to identify the challenges and explore the feasibility of annotating these cross-paragraph implicit relations on a large scale, and to produce a set of guidelines for annotating such relations reliably, along with a representative subset of PDTB texts annotated with complete sequences of inter-sentential relations. This can be done by merging the existing inter-sentential relations in the PDTB with the cross-paragraph implicits that we are currently annotating, as in the sketch below.
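A hypothetical sketch of that merging step, reusing the Relation sketch from earlier; the function name and the ordering key are assumptions, not the project's actual tooling.

```python
def merge_sequences(existing, new_cross_paragraph):
    """Combine existing inter-sentential relations with the newly annotated
    cross-paragraph implicits into one per-text sequence, ordered by where
    each relation's Arg2 begins in the text."""
    merged = existing + new_cross_paragraph
    merged.sort(key=lambda r: r.arg2_spans[0][0])
    return merged
```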
In our experiments, we selected fifty-four texts from the PDTB corpus to cover a range of sub-genres and lengths. They contain four hundred and forty paragraph-initial sentences, which we call current paragraph's first sentences, CPFS for short, that are not already related to the prior text by an inter-sentential explicit connective. The experiments were spread over three phases; that's just how things happened, we didn't plan it that way.
In Phase 1, we studied texts to develop an initial understanding of the task. Two expert annotators, which is basically myself and Kate Forbes-Riley, the second author, worked together to discuss and annotate ten texts containing a hundred and thirty tokens. We did not enforce the PDTB adjacency constraint for implicits, because we wanted to explore the full complexity of the task. Each token was annotated for its relation type, sense, and minimal argument spans.
What we found in Phase 1 was that fifty-two percent of the paragraph-initial sentences took their prior Arg1 arguments from an adjacent unit involving the prior paragraph's last sentence, PLS for short. The remaining forty-eight percent formed non-adjacent relations. This argument distribution is similar to that of cross-paragraph explicits, which are also non-adjacent roughly half the time. Whether this would be shown more generally was something that we wanted to explore with further annotation in the next phase.
We also found that, working together, we could isolate and agree upon the Arg1 of not only the adjacent relations but also the non-adjacent ones. So a second hypothesis to explore was whether both adjacent and non-adjacent relations could be annotated reliably on a large scale. This led us to pick out another hundred and three tokens over ten texts, on which we did double-blind annotation; that would give us the results to understand whether this would be achievable at large scale. We annotated these tokens regardless of whether their arguments were adjacent or non-adjacent.
Here are the results from Phase 2. The first thing to note is that agreement on whether an annotated relation is adjacent or non-adjacent, just that binary decision, was reasonably high, at seventy-six percent. But when we looked into each of these groups, the ones we agreed to be adjacent and the ones we agreed to be non-adjacent, we found that exact-match agreement, in which the tokens must agree on type, sense, and argument spans, was low for both, which shows the general difficulty of the task of annotating these cross-paragraph implicits.
We then relaxed the argument matching, relaxing the minimality constraint. We did two kinds of relaxation on the argument match. One was sentence-level match: if we disagreed below the sentence level on some part of a span, we still allowed that to count as agreement. Relaxing that even further, to allow for supra-sentential overlap, led to a further boost; both of these relaxations boosted agreement. But what's interesting is that agreement was much worse for the non-adjacent relations than for the adjacent relations: the non-adjacent were at forty-seven percent, and the adjacent relations were at sixty-one percent.
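Here is my reconstruction, not the paper's scoring code, of the three matching levels just described; `sentences` is an assumed helper mapping a span to the set of sentence ids it touches, and the tokens are Relation objects as in the earlier sketch.

```python
def exact_match(a, b):
    # Strictest level: type, sense, and exact argument spans must all agree.
    return (a.rel_type == b.rel_type and a.sense == b.sense
            and a.arg1_spans == b.arg1_spans and a.arg2_spans == b.arg2_spans)

def sentence_level_match(a, b, sentences):
    # Relaxation 1: spans agree if they cover the same set of sentences,
    # even when the sub-sentential extents differ.
    def ids(spans):
        return {s for sp in spans for s in sentences(sp)}
    return (ids(a.arg1_spans) == ids(b.arg1_spans)
            and ids(a.arg2_spans) == ids(b.arg2_spans))

def overlap_match(a, b, sentences):
    # Relaxation 2: spans agree if their sentence sets merely overlap
    # (supra-sentential overlap).
    def ids(spans):
        return {s for sp in spans for s in sentences(sp)}
    return (bool(ids(a.arg1_spans) & ids(b.arg1_spans))
            and bool(ids(a.arg2_spans) & ids(b.arg2_spans)))
```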
Also, when we discussed the disagreements, we found that while it was almost always possible to reach consensus, the time and effort required to adjudicate the non-adjacent relations was twice that required to adjudicate the adjacent relations. This led us to conclude that annotating, that is identifying, the Arg1 of non-adjacent implicits was, with the current state of the guidelines and the way we were doing things, prohibitive for large-scale annotation. Therefore, for now, a decision was made to maintain the PDTB adjacency constraint, which keeps us consistent with the existing constraints on adjacency, and to focus on full annotation of only the adjacent relations.
But we also wanted to annotate the very presence of a non-adjacent implicit relation, which is not recorded right now in the PDTB, with some kind of underspecified marking; we use a label indicating that the Arg1 lies somewhere else in the prior text. This led us to go back and re-evaluate the way the labels EntRel and NoRel are assigned in the current version of the PDTB.
In the current assignments, we get an EntRel if there is an entity-based coherence relation holding between Arg1 and Arg2, where the discourse is expanded around some entity in Arg2, either by continuing the narrative around it or by supplying background about it. But we also currently get an EntRel if such an entity-based coherence relation of Background or Continuation does not hold, and there is just some entity coreference between the two arguments; and this is the case even if Arg2 forms a non-adjacent implicit relation. We get a NoRel if neither an entity-based nor any other discourse relation holds, and this is the case even if Arg2 is also part of a non-adjacent implicit relation. And we get a NoRel when Arg2 is not part of the discourse at all; this happens with bylines, which tend to give author information, or at the start of a new article within a single Wall Street Journal file, which can happen sometimes.
So given our goal to encode the presence of non-adjacent implicit relations, the current assignments are a problem, because this information is spread across both labels: the presence of an implicit non-adjacent relation is scattered across EntRel and NoRel, so we cannot identify it unambiguously. The current assignments also confound the presence of a semantic entity-based coherence relation with the mere presence of coreference, and that's the problem within EntRel. So what we want to do is unambiguously identify non-adjacent implicit relations, just their presence, for which we use the new underspecified label. This also allows us to capture the semantic entity-based coherence relations unambiguously, and to unambiguously identify the two cases of NoRel, in which Arg2 is not related to anything in the prior text.
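A hypothetical decision sketch of these revised assignments; the predicate names and the stand-in label string are illustrative assumptions, since the exact label name is not spelled out here in the talk.

```python
def assign_label(token):
    # Precedence reflects the revised scheme: first record non-adjacent
    # implicits, then genuine entity-based coherence, then residual NoRels.
    if token.has_nonadjacent_implicit_arg1:   # hypothetical predicate
        return "NONADJ"                       # illustrative stand-in for the new label
    if token.has_entity_coherence:            # Background/Continuation around an entity
        return "EntRel"
    return "NoRel"                            # Arg2 unrelated to anything in the prior text
```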
Here is an example of an underspecified non-adjacent implicit relation, but if I start to talk about it, it will use up the five minutes I have left.
So, employing the decisions and enhancements made in Phase 2, in Phase 3 the remaining two hundred and seven cross-paragraph tokens from thirty-four texts were double-blind annotated against the enhanced guidelines.
These are the results from Phase 3; in order to do a consistency comparison and see the differences, I've given the Phase 2 results here as well. The first thing to note is that agreement on whether the relation was adjacent or non-adjacent, that binary decision, was approximately the same, which is good. The second thing is that, over the agreed tokens, the proportion of non-adjacent relations was also approximately the same as in Phase 2. This supports the hypothesis about the high frequency of non-adjacent implicits, and therefore suggests that they are worth annotating.
Overall agreement with the most relaxed metric over argument spans was higher in Phase 3, at sixty-two percent, than in Phase 2, where it was forty-three percent. This is partly because of the back-off to underspecified annotation of the non-adjacent relations, but also because we have higher agreement on the annotation of the adjacent relations, which is sixty-nine percent, up from sixty-one percent in Phase 2. That is partly due to our enhanced guidelines for annotating the semantic EntRel relations.
The improvement also better reflects argument agreement: there's an increase in exact match to forty-two percent, from twenty-four percent in Phase 2, and there is less agreement that rests only on supra-sentential argument overlap, a reduction to thirteen percent from thirty percent in Phase 2. There is more disagreement at the sentence level, fourteen percent, up from seven percent in Phase 2, but when we looked at these closely, they showed only minor syntactic differences; for example, one annotator included or excluded an adjunct or an attribution phrase where the other didn't.
So that's not such a major semantic difference. For the final distributions over all the fifty-four texts, we also went back to the Phase 1 and Phase 2 texts and revalidated them against the enhanced guidelines; they're shown in the table there. As you can see, the final gold data shows an equal proportion of adjacent and non-adjacent relations, again supporting the hypothesis about the distribution. The senses show that forty percent of these cross-paragraph implicits have elaboration relations, Arg2-as-detail. Forty-five percent are spread over five senses with greater than five percent frequency each, and the remaining fifteen percent, senses with less than five percent frequency, are spread across nine different senses.
In conclusion: adjacent implicit discourse relations across paragraphs can be annotated reliably. Our gold-standard sense distribution, together with the frequency of the semantic EntRels, suggests that cross-paragraph implicit relations carry varied semantic content in substantial proportions, and are therefore worth annotating.
The current goal is to annotate approximately two hundred PDTB texts, which is about seven hundred tokens across the two hundred texts, with these guidelines. We have estimated this to require three minutes per token on average, which is approximately thirty-five hours of annotation time per annotator. The annotations will be distributed publicly via GitHub, hopefully by the end of this month.
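Just to spell out that arithmetic:

```python
tokens = 700             # estimated cross-paragraph tokens in ~200 PDTB texts
minutes_per_token = 3    # estimated average annotation time
hours = tokens * minutes_per_token / 60
print(hours)             # -> 35.0 hours of annotation time per annotator
```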
Most of the texts in the subset are also annotated in the RST-DT corpus, so this will allow for useful comparisons of relation structures across the two frameworks. Future goals include studying the distribution of senses and the patterns of senses in the texts, along the lines of previous work, but now enabled by full-text relation sequences.
We also want to develop guidelines for identifying the Arg1s of the more difficult non-adjacent implicit relations, to ensure that this can be done reliably and efficiently. To this end, we're looking at enhancements to the PDTB annotation tool to allow better visualization, which is not currently possible in the tool, of all the inter-sentential relations and their arguments in the text. We also want to explore a two-pass annotation methodology that would allow the more difficult cross-paragraph non-adjacent relations to be annotated in a second pass, using the sequences of inter-sentential relations from the first pass, the adjacent ones, and their derivable structures to inform the second-pass annotation.
Thank you.
Thank you very much. Are there any questions?
I'll start. So, you said that annotating non-adjacent relations is a very difficult task. If I want to build a model trained on this data, given that these relations have distinct properties, would this model be able to accurately predict these non-adjacent relations?
Right. So in these sequence-model kinds of approaches, they try to do joint modeling, where they're trying to predict entire sequences, so the contextual information, the neighboring relations, would be a very important feature in the prediction of these non-adjacent implicit relations. Although it's not the case all of the time, in many of these cases you get non-adjacent relations where the intervening material is just elaborations of what's annotated as the non-adjacent Arg1. So if you can get the structure of the relation labels correct for that intervening material, then when you get to the next sentence, that itself gives you the information to sort of coerce it to the next higher level. That's one of the things. And then there's another very useful feature: anaphora. There's a lot of discourse deixis that appears in these non-adjacent contexts, because when you want to refer to an event or eventuality that's non-adjacent, you end up using definite descriptions that have an anaphoric, deictic nature.
Thank you.
Okay, let's thank the speaker.