0:00:16 Okay, so my talk is on discourse relation annotation. Much research on modeling discourse relations relies on corpora annotated with such relations. For example, we have the RST-DT corpus based on RST, the Penn Discourse Treebank (PDTB) based on D-LTAG, and the ANNODIS corpus based on SDRT. There are other corpora as well, in other frameworks; I'm not covering all of them here.
0:00:45 The Penn Discourse Treebank, which is the focus of my talk, is a large-scale annotated corpus, annotated over a one-million-word Wall Street Journal corpus. It's been used widely in the community for a lot of experimental work, as well as providing the framework now applied to annotate other texts, including other genres and languages.
0:01:08 However, the current version of the corpus, PDTB-2, does not provide full-text annotation of its source texts. There's work ongoing to address these gaps in the next version of the corpus, PDTB-3. Much of the current work addressing the gaps focuses on intra-sentential relations, which are relations with arguments in the same sentence, along with some modifications to existing annotations, most of which involve modifications resulting in the sense hierarchy that I'll show later on. My talk focuses on a critical kind of gap in the class of inter-sentential relations, which are relations with arguments in different sentences.
0:01:53 So, just a very quick overview of the annotation framework for those of you who are not familiar with it. The PDTB follows a lexically grounded but theory-neutral approach to the representation of discourse relations, which means that the annotation is shallow, without committing to dependencies or structures beyond individual relations. Discourse relations hold between two abstract-object arguments, named Arg1 and Arg2 by syntactic convention.
0:02:19 In the example that you see here, and in the other examples, the Arg1 is in italics and the Arg2 is in bold. Relations are triggered either by explicit connectives, in which case they get the relation type label Explicit. So in this example, "but" is the explicit discourse connective that relates these two sentences in a relation of Contrast, and that's because the two attorneys' offices mentioned average different numbers of criminal cases.
0:02:52 When relations are not triggered by explicit connectives but rather by adjacency between sentences, multiple things can happen. First, we may infer a discourse relation for which we can insert a connective, such that the resulting text sounds reasonably readable and coherent; the relation type label for this is Implicit. So in this example here, they're talking about the issues in question being a hit with investors, and then the second, Arg2, sentence talks about how this company's offer for one of its ventures was oversubscribed. The annotator inferred what we call the Instantiation relation, for which they inserted the connective "for example"; it sounds reasonably readable and coherent.
0:03:34 In other cases we infer a discourse relation, but inserting a connective leads to redundancy, and that's because the relation is being expressed in some other manner, not through a connective. The relation type here is labeled AltLex. So in this example below, we have a subject-verb sequence, "that prompted", expressing the relation of Result between the two sentences. Basically, a plant has been destroyed, and people at this other company worry that the same thing's going to happen to them; then the Arg2 sentence talks about what they are going to do as a result of that worry.
0:04:13 Two other relation types can be assigned in these contexts: EntRel and NoRel. Actually, later on I'm going to talk about how we've made some revisions to these two labels, but basically, EntRels are entity-based relations, which means that you cannot insert any explicit connective, and the sentences are related by virtue of some anaphoric reference, or just a link to some entity across the two sentences. Some of these relations actually do involve coherence relations, although the PDTB doesn't regard them as such: they involve a Background relation or a Continuation relation. So in this example, the first EntRel is actually sort of a Background, where you have the demonstrative "these" as the anaphoric link in the Arg2 sentence, giving some background about the entities introduced in Arg1. And lastly, we get a NoRel where neither a semantic nor an entity-based relation holds, though, as I said, there are some changes to how these labels are finally assigned.
0:05:14 With respect to the arguments of relations: how arguments are annotated depends upon the type of relation. The Arg2 of Explicit relations is always some part of the sentence or clause containing the connective, but the Arg1 can be anywhere in the prior text. All the other relation types, like I said, are only annotated between adjacent sentences. Arguments can be extended to include additional clauses and sentences in all cases except NoRel, but there's a strong minimality constraint that licenses inclusion of only the minimally necessary text needed to interpret the relation.
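To make the scheme concrete, here is a minimal sketch in Python of how one might represent a relation token as just described; the class and field names are my own invention, not the PDTB file format:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A character-offset span in the source text: (start, end).
Span = Tuple[int, int]

@dataclass
class RelationToken:
    rel_type: str              # "Explicit", "Implicit", "AltLex", "EntRel", "NoRel"
    sense: Optional[str]       # e.g. "Comparison.Contrast"; None for EntRel/NoRel
    connective: Optional[str]  # the explicit or inserted connective, if any
    arg1_spans: List[Span]     # minimally necessary text, possibly discontinuous
    arg2_spans: List[Span]     # for Explicit: within the connective's sentence/clause

# The Contrast example from the talk, with made-up offsets:
contrast = RelationToken(
    rel_type="Explicit",
    sense="Comparison.Contrast",
    connective="but",
    arg1_spans=[(0, 80)],
    arg2_spans=[(85, 160)],
)
```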
0:05:48 Finally, the sense hierarchy. In the work that we did, we used the modified sense hierarchy being developed for PDTB-3, which was presented at LAW last year. I'm not going into all the details; if you want more, I have some slides for later on. Basically, at the top level, the four classes from PDTB-2 are retained: Temporal, Comparison, Contingency, and Expansion. There have been changes at Level 2 and Level 3. Most of these changes involve a one-to-one mapping from PDTB-2 to PDTB-3, which we have implemented automatically; others are reviewed and annotated manually. In this work we also came up with two new senses that we got evidence for: one is Hypophora, for question-answer pairs, and the other introduces Level-3 senses for the asymmetric Instantiation relation.
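The automatic part of the PDTB-2 to PDTB-3 remapping can be pictured as a simple lookup; a hedged sketch, where the table entries are illustrative examples I am supplying, not the project's actual mapping file:

```python
# Illustrative PDTB-2 -> PDTB-3 sense remapping (example entries only).
SENSE_MAP = {
    "Expansion.Restatement.Specification": "Expansion.Level-of-detail.Arg2-as-detail",
    "Comparison.Contrast": "Comparison.Contrast",  # unchanged at this level
}

def map_sense(pdtb2_sense: str) -> str:
    # One-to-one cases are remapped automatically; anything not covered
    # by the table is flagged for manual review, as described in the talk.
    return SENSE_MAP.get(pdtb2_sense, "NEEDS_MANUAL_REVIEW")
```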
0:06:42 Okay, so back to the focus of this talk: as I said, there's a critical gap in the class of inter-sentential relations. If you look at the current version of the corpus, you'll find that all sentences containing an explicit connective that relates that sentence to something in the prior text have been annotated; well, almost all, there are some gaps. And within paragraphs, all the sentences without such a connective have also been annotated. But the first sentence of each paragraph, across the paragraph boundary, remains unannotated in the current corpus.
0:07:22 So in this example here, which shows the first six sentences of an article, the last one, S6, has an explicit connective at a paragraph boundary; the empty lines indicate paragraph boundaries. That has been annotated: its Arg1 is not shown, but the sense Contrast indicates the annotation in the corpus. If you look at the third paragraph, the internal implicit relations, between S3 and S4 (Conjunction) and between S4 and S5 (Conjunction), have also been annotated. What's not annotated are the relations for S2 and S3, and that's because they are at paragraph boundaries.
0:07:57 There are more than twelve thousand such unannotated tokens in the current version of the corpus, against a total of almost forty thousand tokens in the corpus. These unannotated tokens constitute thirty percent of all inter-sentential discourse contexts, and eighty-seven percent of all cross-paragraph inter-sentential contexts, the remaining thirteen percent being cross-paragraph explicit relations.
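These proportions are simple arithmetic over the approximate counts just quoted; as a quick sanity check:

```python
unannotated = 12_000   # cross-paragraph tokens still unannotated (approx.)
total = 40_000         # tokens in the corpus overall (approx.)
print(f"{unannotated / total:.0%}")  # -> 30%
```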
0:08:20 So why worry about these? The first reason is automatic prediction: there's been some work showing that we can get improvements in the very hard task of implicit relation sense classification with a sequence model, and also other work that incorporates features from neighboring relations. But there's also the goal of understanding global discourse structure. The shallow analysis of the PDTB is also in service of an emergent global discourse structure, which you can get by combining the individual relations together. But in order to do that, we need the complete sequence of relations over texts, which is not in the corpus currently.
0:08:57 So our goals are to identify the challenges and explore the feasibility of annotating these cross-paragraph implicit relations on a large scale, to produce a set of guidelines for annotating such relations reliably, and also a representative subset of PDTB texts annotated with complete sequences of inter-sentential relations. This can be done by merging the existing inter-sentential relations in the PDTB with the cross-paragraph implicits that we are currently annotating.
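Conceptually, that merge is just an interleaving of the two relation lists by document position. A minimal sketch, assuming each relation object carries a document id and argument spans (hypothetical attributes, not a real PDTB API):

```python
def merge_sequences(existing_relations, new_cross_paragraph):
    """Interleave existing PDTB inter-sentential relations with newly
    annotated cross-paragraph implicits into one document-order sequence.
    Assumes each relation exposes doc_id and arg2_spans for ordering."""
    merged = list(existing_relations) + list(new_cross_paragraph)
    merged.sort(key=lambda r: (r.doc_id, r.arg2_spans[0][0]))
    return merged
```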
0:09:26 In our experiments, we selected fifty-four texts from the PDTB corpus to cover a range of sub-genres and lengths. They contain four hundred and forty paragraph-initial sentences, which we call current paragraph first sentences (CPFS), that are not already related to the prior text by an inter-sentential explicit connective.
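The token selection can be thought of as a filter over paragraph-initial sentences; a hypothetical sketch, with assumed corpus objects rather than actual PDTB tooling:

```python
def candidate_tokens(paragraphs, explicit_relations):
    """Yield paragraph-initial sentences (the CPFS) not already linked to
    the prior text by an inter-sentential explicit connective.
    `paragraphs` and `explicit_relations` are assumed corpus objects."""
    linked = {rel.arg2_sentence_id
              for rel in explicit_relations
              if rel.is_inter_sentential}
    for para in paragraphs[1:]:   # the article-initial sentence has no prior text
        first = para.sentences[0]
        if first.sentence_id not in linked:
            yield first
```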
0:09:46 The experiments were spread over three phases; that's just how things happened, we didn't plan it that way.
0:09:54 In Phase 1, we studied texts to develop an initial understanding of the task. Two expert annotators, basically myself and Kate Forbes-Riley, the second author, worked together to discuss and annotate ten texts containing a hundred and thirty tokens. We did not enforce the PDTB adjacency constraint for implicits, because we wanted to explore the full complexity of the task. Each token was annotated for its relation type, sense, and minimal argument spans.
0:10:26 What we found in Phase 1 was that fifty-two percent of the paragraph-initial sentences took their prior Arg1 arguments from an adjacent unit involving the prior paragraph's last sentence, or PLS for short. The remaining forty-eight percent formed non-adjacent relations. This argument distribution is similar to that of cross-paragraph explicits, which are also non-adjacent roughly half the time. So whether this would hold more generally was something that we wanted to explore with further annotation in the next phase. We also found that, working together, we could isolate and agree upon the Arg1 of not only the adjacent relations but also the non-adjacent ones. So a second hypothesis to explore was whether both adjacent and non-adjacent relations could be annotated reliably on a large scale.
0:11:18 This led us to pick out another hundred and three tokens over ten texts, on which we did double-blind annotation; that would give us the results to understand whether this would be feasible at large scale. We annotated these tokens regardless of whether the arguments were adjacent or non-adjacent.
0:11:43 Here are the results from Phase 2. The first thing to note is that the agreement on whether an annotated relation is adjacent or non-adjacent, just that binary decision, was reasonably high, at seventy-six percent. But when we looked within each of these groups, the ones on which we agreed to be adjacent and the ones on which we agreed to be non-adjacent, we found that exact-match agreement, in which the tokens agree fully on type, sense, and argument spans, was low for both, which shows the general difficulty of the task of annotating cross-paragraph implicits.
0:12:25 When we relaxed the argument matching, that is, relaxed the minimality constraint, we did two kinds of relaxation on the argument-span match. One was a sentence-level match: if we disagreed below the sentence level on some part of a span, we still counted that as agreement. The other relaxed matching even further, to allow for supra-sentential overlap. Both of these led to a further boost in agreement. But what's interesting is that agreement was much worse for the non-adjacent relations than for the adjacent relations: the non-adjacent were at forty-seven percent and the adjacent relations were at sixty-one percent.
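The three levels of matching can be made concrete with a small comparison function; this is my reconstruction of the metrics described, not the authors' evaluation code:

```python
def sentences_of(spans, sent_index):
    """Map character spans to the set of sentence ids they touch.
    sent_index maps a sentence id to its (start, end) character span."""
    return {sid for sid, (s, e) in sent_index.items()
            if any(a < e and s < b for a, b in spans)}

def agreement_level(a, b, sent_index):
    """Compare two annotations of the same token, strictest to loosest:
    'exact' (type, sense, and spans all match), 'sentence' (same sentences,
    sub-sentential differences forgiven), 'overlap' (supra-sentential
    overlap allowed), else 'disagree'."""
    if a.rel_type != b.rel_type or a.sense != b.sense:
        return "disagree"
    if a.arg1_spans == b.arg1_spans and a.arg2_spans == b.arg2_spans:
        return "exact"
    a1, b1 = sentences_of(a.arg1_spans, sent_index), sentences_of(b.arg1_spans, sent_index)
    a2, b2 = sentences_of(a.arg2_spans, sent_index), sentences_of(b.arg2_spans, sent_index)
    if a1 == b1 and a2 == b2:
        return "sentence"
    if a1 & b1 and a2 & b2:
        return "overlap"
    return "disagree"
```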
0:13:11 Also, when we discussed the disagreements, we found that while it was almost always possible to reach consensus, the time and effort required for adjudicating the non-adjacent relations was twice as great as for adjudicating the adjacent relations. This led us to conclude that identifying the Arg1 of non-adjacent relations was, with the current state of the guidelines and the way we were doing things, prohibitive for large-scale annotation. Therefore, for now, a decision was made to maintain the PDTB adjacency constraint unchanged, consistent with the existing constraints for adjacency, and to focus on full annotation of only adjacent relations. But we also wanted to annotate the presence of a non-adjacent implicit relation, which is not there right now in the PDTB, with some kind of underspecified marking, and we use a new label for that, indicating that the Arg1 is somewhere else in the prior text.
0:14:12 This led us to go back and reconsider the way the labels EntRel and NoRel are assigned in the current version of the PDTB. In the current assignments, we get an EntRel if there is an entity-based coherence relation holding between Arg1 and Arg2, and the discourse is expanded around some entity in Arg2, either by continuing the narrative around it or by supplying background about it. But we also get an EntRel currently when such an entity-based coherence relation of Background or Continuation does not hold, but there is just some entity coreference between the two arguments; and this is the case even if Arg2 also forms a non-adjacent implicit relation. We get a NoRel if neither an EntRel nor a discourse relation holds, and this is the case even if Arg2 is also part of a non-adjacent implicit relation. And we get a NoRel when Arg2 is not part of the discourse at all; this happens with bylines that give author-related information, or at the start of a new article within a single Wall Street Journal file, which can happen sometimes.
0:15:27 So, given our goal to encode the presence of non-adjacent implicit relations, the current assignments are a problem, because this information is spread across both labels. The presence of an implicit non-adjacent relation is scattered across EntRel and NoRel, so we cannot identify it unambiguously. The current assignments also confound the presence of a semantic entity-based coherence relation with the presence of mere coreference, and that's the problem within EntRel. So what we want to do is unambiguously identify non-adjacent implicit relations, just the presence of them, with the new underspecified label. This also allows us to capture the semantic entity-based coherence relations unambiguously, and also to unambiguously identify the true cases of NoRel, in which Arg2 is not related to anything in the prior text. Here is an example of an underspecified non-adjacent implicit relation, but if I start to talk about it, it'll use up all of my five minutes.
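Putting the revised scheme together, the assignment logic might be sketched as follows; the predicates on the token are assumed helpers, and since the exact name of the new label wasn't clear from the audio, NONADJ_IMPLICIT is a placeholder:

```python
def assign_label(token):
    """Sketch of the revised label assignment for paragraph-initial tokens
    whose Arg1 is not in the adjacent sentence."""
    if token.has_entity_based_coherence:      # Background/Continuation around an entity
        return "EntRel"                       # now reserved for semantic entity relations
    if token.has_nonadjacent_implicit_arg1:   # a relation holds, but Arg1 is elsewhere
        return "NONADJ_IMPLICIT"              # underspecified marker; no sense or Arg1 given
    return "NoRel"                            # Arg2 truly unrelated to the prior text
```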
0:16:32 So, employing the decisions and enhancements made in Phase 2, in Phase 3 the remaining two hundred and seven cross-paragraph tokens from thirty-four texts were double-blind annotated with the enhanced guidelines.
0:16:48 These are the results from Phase 3; in order to do a consistency comparison and see the differences, I've given the Phase 2 results here as well. The first thing to note is that the agreement on whether the relation was adjacent or non-adjacent, that binary decision, was approximately the same, which is good.
0:17:06 The second thing is that, over the agreed tokens, the proportion of non-adjacent relations was also approximately the same as in Phase 2. This supports the hypothesis about the high frequency of non-adjacent implicits, and therefore suggests that they're worth annotating.
0:17:22 Overall agreement with the most relaxed metric over argument spans was higher in Phase 3, at sixty-two percent, than in Phase 2, where it was forty-three percent. This is partly because of the back-off to underspecified annotation of non-adjacent relations, but also because we have higher agreement on the sense annotation of the adjacent relations, which is sixty-nine percent in Phase 3, up from sixty-one percent in Phase 2. That is partly due to our enhanced guidelines for annotating the senses of these relations.
0:17:52 The improvements are also reflected in better argument agreement: there's an increase in exact match to forty-two percent, from twenty-four percent in Phase 2. Less of the agreement is due to supra-sentential argument overlap: a reduction to thirteen percent, from thirty percent in Phase 2. There is more disagreement at the sentence level, fourteen percent, up from seven percent in Phase 2, but these cases are not worrying: they mostly showed minor syntactic differences, for example where one annotator included or excluded an adjunct or an attribution phrase while the other didn't.
0:18:31 So these are not major semantic differences. Turning to the final distributions over all the fifty-four texts: we also went back to the Phase 1 and Phase 2 texts and re-annotated them with the enhanced guidelines, and they're additionally included in the table there. As you can see, the final gold data shows an equal proportion of adjacent and non-adjacent relations, again supporting the hypothesis about the distribution. The senses show that forty percent of these cross-paragraph implicits have elaboration relations that expand with detail; forty-five percent fall on five senses with greater than five percent frequency; and the remaining fifteen percent, senses with less than five percent frequency, are spread across nine different senses.
0:19:14 In conclusion: adjacent implicit discourse relations across paragraphs can be annotated reliably. Our gold-standard sense distribution, together with the frequency of the semantic EntRels, suggests that cross-paragraph implicit relations carry varied semantic content in substantial proportions, and are therefore worth annotating.
0:19:35 The current goal is to annotate approximately two hundred PDTB texts, which is about seven hundred tokens over the two hundred texts, with these guidelines. We have estimated that this requires three minutes per token on average, which is approximately thirty-five hours of annotation time per annotator.
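That estimate is straightforward arithmetic:

```python
tokens = 700                 # estimated cross-paragraph tokens in ~200 texts
minutes_per_token = 3        # estimated annotation effort per token
print(tokens * minutes_per_token / 60)  # -> 35.0 hours per annotator
```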
0:19:53 The annotations will be distributed publicly via GitHub, hopefully by the end of this month.
0:20:00 Most of the texts in the subset are also annotated in the RST-DT corpus, so this will allow for useful comparisons of relation structures across the two frameworks.
0:20:10 Future goals include studying the distribution of senses and the patterns of senses in the texts, along the lines of previous work, but now with full-text relation sequences. We also want to develop guidelines for identifying the Arg1s of the more difficult non-adjacent implicit relations, to ensure that this can be done reliably and efficiently. To this end, we're looking at enhancements to the PDTB annotation tool to allow for better visualization, which is not possible currently in the tool, of all these inter-sentential relations and their arguments in the text. We also want to explore a two-pass annotation methodology that would allow the more difficult cross-paragraph non-adjacent relations to be annotated in a second pass, because the sequences of inter-sentential relations from the first pass, the adjacent ones, can then reveal systematic structures to inform the second-pass annotation.
0:21:05 Thank you.
0:21:12 Thank you very much. Do we have any questions?
0:21:20 I'll start. So, you mentioned that annotating a non-adjacent relation is a very difficult task. So, you see, if I want to build a model trained on this data, with relations having such distinct properties, this model would have to be able to accurately predict these non-adjacent relations?
0:21:55 Right, so these sequence-model kinds of approaches try to do joint modeling, where they're trying to predict entire sequences. So the contextual information, the neighboring relations, would be a very important feature in the prediction of these non-adjacent implicit relations. Although it's not the case all of the time, in many of these cases you get non-adjacent relations where the intervening material is just an elaboration of what's annotated as the non-adjacent Arg1. So if you can get the structure and the relation labels correct for that intervening material, then when you get to the next sentence, that itself gives you the information to, sort of, attach it to the next higher level. That's one of the things. And then there's a very useful feature, anaphora: there's a lot of discourse deixis that appears in these non-adjacent contexts, because when you want to refer to an event or eventuality that's non-adjacent, you end up using definite descriptions that are deictic in nature.
0:23:19 Thank you.
0:23:27 Okay, let's thank the speaker.