0:00:35Okay.
0:00:37I'm Marilyn Walker, and the work I'm presenting is PhD work by my student Amita Misra, who couldn't be here.
0:00:48I'm going to talk about summarizing dialogic arguments in social media, and the first thing I guess I want to say, since this is the negotiation session, is that it's not clear how much negotiation is actually carried on in these argumentative dialogues, although they definitely seem to be negotiating something.
0:01:08So, the current state of the art in summarizing argumentative dialogues is really human: websites have curators who manually create argument summaries. Lots of different debate websites have curated arguments: iDebate has these kinds of points for and points against, and ProCon.org has the top ten pro and con arguments. On these websites they kind of summarize the repeated arguments that people make about a particular social issue; the examples here are one about gay marriage and another about gun control.
0:01:52And when you look at the natural human dialogues where people discuss the same kinds of issues, it's really striking how difficult it would be to actually produce a summary of these dialogues. I'll give you a minute to look at this one, though I know you're going to read it anyway. People are very emotional, they're not necessarily logical, they make fun of each other, they're sarcastic; there's all kinds of stuff going on in these dialogues.
0:02:27They don't really fit in with your notion of what should go into a summary of an argument, especially when you compare them to the curated arguments produced by professionals. So the first question we had was: obviously it would be great if you could actually summarize the whole bunch of conversations out there in social media. What is it that the person on the street is saying about gay marriage? What is the person on the street saying about gun control, or abortion, or even evolution, or any kind of issue that's being constantly debated on these social media websites?
0:03:01And I would claim that you're interested not just in the kinds of arguments that a lawyer or a constitutional expert would actually make; you're actually interested to know what it is that people are saying. Everybody can vote these days, right, whether or not you're in the top one percent of the population that's actually been educated in how to argue logically. So it would be a good thing to actually know what it is that people are saying, what kinds of arguments they're making.
0:03:30And when you look at these, the first thing is: what should the summary contain? What kind of information should we pull out of these conversations in order to make a summary? The conversants don't agree, so it seems like you would at least need to represent both sides of the argument; that would maybe be a first criterion: you want to represent the opposing stances.
0:03:57Then, do you want to include some kind of emotional information? Do you want to include the socio-emotional relationship, like the fact that the second speaker is making fun of the first speaker, or that they're being sarcastic? Should that kind of information go into a summary? Or do you want to take the philosophical, logical-argumentation view and say, well, I'm going to consider all of this to be just flaming or trolling or whatever; I'm not really interested in any part of this argument that doesn't actually fit in with the logical view of argumentation?
0:04:32There has been previous work on dialogue summarization, but there hasn't been any work on summarizing argumentative dialogues automatically. All the other dialogue summarization corpora out there, some of which I think were created by people in this room, have very different properties, and they're not nearly as noisy as these argumentative dialogues are.
0:05:01So our goal is to automatically produce summaries of argumentative dialogues. We're taking an extractive summarization perspective at this point, although it would clearly be nice if we could do abstractive summarization. The step we're trying to take in this paper is to identify and extract the most important arguments on each side of an issue.
0:05:27Our initial starting point is that, as I pointed out on the previous slides, it's actually really difficult to figure out what information these summaries should contain, so we start from the standpoint that summarization is something that any native speaker knows how to do; they don't need any training. So our initial step is that we collect summaries that humans produce of these conversations and see what people pick out. Then we take these summaries that we collected and apply the Pyramid method, which has been used in the DUC summarization tasks, and we assume that the arguments that appear in the model summaries are the most important arguments. So we're kind of applying a standard extractive summarization and evaluation approach to these argumentative dialogues.
0:06:18So we have gold-standard training data: we have collected five human summaries for each of about fifty dialogues on the topics of gay marriage, gun control, and abortion. A lot of this, what the summaries look like and what their properties are, is described in more detail in our 2015 paper. Then we trained undergraduate linguists to use the Pyramid method to identify important arguments in the dialogues. They construct pyramids for each set of five summaries, and the idea is that the repeated elements of the summaries end up on the higher tiers of the pyramid. I'm going to give you an example in a minute, in case this is all opaque to you, so it will be clear after the next slide.
0:07:07So then we have these human dialogues, we have five summaries for each dialogue, and we have these pyramids constructed on top of those summaries, telling us which elements get repeated. Then we still have a problem: we know which are the important concepts in the dialogue, because those are the ones that appeared in the model summaries, but we have to map them back to the actual original dialogues. If we want to develop an extractive summarizer, we want to be able to operate on the original dialogue texts and not on the intermediate summary representation that we collected, right? So that's the third step, getting this mapping back, and once we have that mapping we can characterize our problem as a binary classification problem, or a ranking problem, of identifying the most important utterances in the dialogues, the ones that should go into the extractive summary.
0:08:00This is what the sample summaries look like; this one is from a gay marriage dialogue. These summaries are really good quality, and the ones for gay marriage are currently available on our website. So it's just gay marriage for now; the new ones that we collected, which are talked about in this paper, about abortion and gun control, we will be releasing soon. But if you want to see what they look like, the gay marriage summaries were released a few years ago with our previous paper.
0:08:32So this is what the data looks like. We have the summaries for about fifty different conversations for each topic. What the human does when they make the pyramid labels is they read through all the summaries and decide what the important concepts are, kind of distinct from the words that are actually used by the summarizers, and they come up with their own human label, which is a paraphrase: "No one has been able to prove that gun owners are safer than non-gun owners." Then they identify, for each summary, how that summarizer phrased that particular argument, that particular concept.
0:09:14And if a concept appears in more than one of the summaries, up to five because we have five summaries, then that means it's a very important concept, and that's represented in its tier, right. So the arguments that multiple summarizers picked out and put in their summaries end up having more contributors to their human label, and they end up being ranked as more important arguments.
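A minimal sketch of the pyramid-tier idea just described, in Python: a concept's tier is the number of distinct summaries, out of five, whose contributor phrases an annotator matched to that concept's human label. The labels and data here are hypothetical, not from the corpus.

```python
# Tier of a concept = number of distinct summaries that contribute to it.
from collections import defaultdict

def pyramid_tiers(contributions):
    """contributions: (summary_id, concept_label) pairs, one per contributor
    phrase matched to a human-written concept label."""
    summaries_per_concept = defaultdict(set)
    for summary_id, concept in contributions:
        summaries_per_concept[concept].add(summary_id)
    return {concept: len(ids) for concept, ids in summaries_per_concept.items()}

contributions = [
    (1, "gun owners are not provably safer"),
    (2, "gun owners are not provably safer"),
    (4, "gun owners are not provably safer"),
    (1, "guns deter crime"),
]
print(pyramid_tiers(contributions))
# {'gun owners are not provably safer': 3, 'guns deter crime': 1}
```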
0:09:44Okay, so now we're on step three, where we have these summary contributions, which again, as I said, are removed from the language of the original dialogues, and we have these human labels, and what we want to do is figure out which utterances in the original dialogue actually correspond to the things that ended up ranked really highly in the pyramid.
0:10:08When we collected this data, two or three years ago, we thought we were going to be able to do this mapping automatically once we had the annotations. But after multiple different attempts, we decided that we couldn't map it back automatically, because the language of the summarizers and the language of the human labels from the pyramids is too different from the language in the original dialogues. So we designed a Mechanical Turk task; actually, we didn't run it on Mechanical Turk, because we couldn't get Mechanical Turkers to do this task of mapping back from the summary labels to the original dialogues reliably. So instead we recruited two graduate linguists and two undergraduate linguists to do the mapping for us, in order to get good-quality data.
0:10:58So we presented them with the original conversations and the labels that were produced, the highest-tier labels, and we asked them, for each utterance of the conversation, to pick one or more of the labels that correspond to the content of that utterance. And again, we're only interested in the labels that have a score of three or higher, the ones considered most important by the original summarizers. And we get pretty good reliability on this once we started using our own internally trained people; we couldn't get Turkers to do it reliably.
0:11:40So at step three: we have the fifty dialogues for each topic, about fifty for each one, so two hundred and fifty summaries per topic, five for each dialogue. We pull out the important sentences and the not-important sentences for each dialogue, and we frame this as a binary classification task. Again, we could have framed it as a ranking task and just used the pyramid tier label, but we decided to frame it as binary classification. So we group the labels by tier, we compute the average tier label, and then we define any sentence with an average score of three or higher as being important.
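A minimal sketch of this labeling step, assuming each utterance has already been mapped to zero or more pyramid labels with tier scores; the threshold of three follows the tier-three-or-higher criterion mentioned above, and the data is hypothetical.

```python
# Binary label: an utterance is important if its average mapped tier
# score is at least the threshold; unmapped utterances are unimportant.
def is_important(tier_scores, threshold=3.0):
    if not tier_scores:
        return False
    return sum(tier_scores) / len(tier_scores) >= threshold

print(is_important([4, 3]))  # True: average tier 3.5
print(is_important([2, 3]))  # False: average tier 2.5
print(is_important([]))      # False: no mapped labels
```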
0:12:21So we believe that we've provided a well-motivated and theoretically grounded definition of what an important argument is by going through this whole process, and now we have this binary classification problem that we're trying to solve.
0:12:34So we have three different off-the-shelf summarizers that we apply, to see how standard summarization algorithms do. We use SumBasic, which is an algorithm by, I think, Nenkova and Vanderwende; we use KL-divergence summarization, which is from Haghighi and Vanderwende; and we use LexRank. These are all available off-the-shelf, and they're all different kinds of algorithms; LexRank is the one that was most successful at the most recent Document Understanding Conference competition. And all of these rank utterances instead of classifying them.
0:13:11So what we did is we applied them to the dialogues, we get the ranking, and then we take the number of utterances that matches the task: we cut off the ranking at the point where the length of the extractive summary is the same as what we expect.
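A minimal sketch of this cutoff, assuming a summarizer has already produced a best-first ranking of utterance ids: the top k utterances are kept, where k matches the number of utterances marked important in the gold labels. The names and data here are hypothetical.

```python
# Turn a ranking into a binary prediction by cutting at the gold summary length.
def rank_to_binary(ranked_utterances, k):
    return set(ranked_utterances[:k])

gold_important = {"u3", "u7", "u9"}
lexrank_order = ["u7", "u1", "u3", "u9", "u2"]  # best-first ranking
predicted = rank_to_binary(lexrank_order, k=len(gold_important))
print(predicted)                        # {'u7', 'u1', 'u3'}
print(len(predicted & gold_important))  # overlap with the gold set: 2
```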
0:13:32We have a bunch of different models. We tried support vector machines with a linear kernel from scikit-learn, using cross-validation for tuning the parameters, and we also tried a combination of a bidirectional LSTM with a convolutional neural net. And we split our data into training and test sets.
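A minimal sketch of the SVM setup named above: a linear-kernel SVM from scikit-learn with the regularization parameter tuned by cross-validation on the training split. The feature matrix and labels are random placeholders, not the paper's data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X = np.random.rand(200, 50)        # placeholder utterance feature vectors
y = np.random.randint(0, 2, 200)   # placeholder important/not-important labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Tune C by 5-fold cross-validation on the training data only.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```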
0:13:53For features, we have two different kinds of word embeddings, Google word2vec and GloVe, and then we have some other things that are more linguistically motivated, which we expected might help. We have readability scores: we would expect that utterances that are more readable would be better and would be more important. We thought sentiment might be important. We thought sentence position might be important: the first sentences of the summary might be more important, or the first sentences in the dialogue. Then we have LIWC, Linguistic Inquiry and Word Count, which gives us a lot of lexical categories, with representations of the context: one in terms of LIWC and one in terms of the dialogue act classification of the previous utterances, the previous two utterances in the dialogue. And then we ran the Stanford coreference engine, which I expected not to produce anything, so that's a little foreshadowing: it actually helps, amazingly.
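A minimal sketch of this kind of feature representation: per-utterance features plus a simple, non-sequential context representation taken from the two previous utterances. LIWC is proprietary and the dialogue act tagger is unspecified here, so liwc_counts, dialogue_act, and readability are hypothetical stand-ins supplied by the caller.

```python
def utterance_features(dialogue, i, liwc_counts, dialogue_act, readability):
    """Features for utterance i: its own scores plus LIWC categories and
    dialogue act tags of the previous two utterances as context."""
    utt = dialogue[i]
    feats = {"readability": readability(utt),
             "position": i / len(dialogue)}  # relative position in dialogue
    for cat, n in liwc_counts(utt).items():
        feats["liwc_" + cat] = n
    for back in (1, 2):  # context: the two previous utterances
        if i - back >= 0:
            prev = dialogue[i - back]
            for cat, n in liwc_counts(prev).items():
                feats["ctx%d_liwc_%s" % (back, cat)] = n
            feats["ctx%d_da_%s" % (back, dialogue_act(prev))] = 1
    return feats

# Dummy stand-ins just to show the input shapes:
dialogue = ["Guns deter crime.", "No one has proven that.", "Yeah, right."]
print(utterance_features(
    dialogue, 2,
    liwc_counts=lambda u: {"negate": u.lower().count("no")},
    dialogue_act=lambda u: "statement",
    readability=lambda u: float(len(u.split())),
))
```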
0:14:55And these are our results. LexRank was our very best baseline, so I'm not going to tell you what the other baselines were. For LexRank, we're getting a weighted F-score on the test set in the upper fifties. Our very best model is the SVM using features: the SVM with just word embeddings does not do as well, but if we put in all these linguistic features, we see that for both the gun control and abortion topics, the off-the-shelf coreference engine, applied to these very noisy dialogues, actually improves performance by giving us a representation of the context.
0:15:37And we get better results for gun control than we do for gay marriage and abortion, and we've had that result repeatedly, over and over and over, and we think the reason is that the same arguments get repeated in gun control, and that's not the case to the same degree for the other topics.
0:15:59The CNN with the bi-LSTM, with just the word embeddings, gets results in the sixties, and then we get our best model using the SVM along with features. For gun control, let me remind you what these features are: LC is LIWC with a context representation that is also LIWC, R is the readability, DAC is a dialogue act score, and then there's the coref.
0:16:29So for gun control, having three different representations of context gives us the best model, and for both gay marriage and abortion as well, just having the LIWC categories of the previous utterance also gives good performance. And I think it's interesting that with a pretty simple representation of context, it's not a sequential model, we do have something that shows that the context helps.
0:16:59One minute.
0:17:00Okay.
0:17:01Why did LexRank not work very well? It didn't work very well because of all this repetition in dialogue. The assumption of LexRank, for something like newspaper corpora, is that if something gets repeated, it's important. But as you might infer from the previous speaker's talk about alignment, there's lots of repetition in conversation that doesn't indicate that the information is actually important, and since LexRank is based on lexical repetition, it doesn't really help. What's interesting about sentiment is that having positive sentiment actually turns out to be a very good predictor that an utterance is not important. And it's not necessarily for the reason you'd think: it's because sentiment classifiers think that anything that's purely conversational is positive sentiment. So sentiment just rules out a lot of stuff that's purely conversational, and that's why it helps.
0:18:02And then for LIWC, we get some LIWC categories that are different for each topic, which shows that some of what we're learning from LIWC is actually topic-specific, because the model is learning to use particular LIWC categories. Okay.
0:18:18So: we've presented a novel method for summarizing argumentative dialogues; our results beat several summarization baselines; we compared the SVM with a neural deep learning model; and we showed that the linguistic features actually really help, and that the context-based features improve over the sentence alone. And then we want to do more work exploring whether this could be topic-independent, and I would point out that our summarization baselines are all topic-independent; they don't need any training.
0:18:50Okay.
0:18:52Questions?
0:19:18That's a really good point: did we distinguish between conversations where there was more or less agreement? We haven't looked at that, and I think it would be interesting, because you would think it would be easier to summarize the conversations where people were more on the same stance side.
0:19:45Yes?
0:19:59It's in the paper.
0:20:04It seems like they would be pretty... can you rephrase that for me? For a given model, when you use the features, do you use them simultaneously? No, I don't think we tried that. We tried word2vec embeddings and then weighted GloVe embeddings; we didn't put both in.
0:20:31And we looked at both of those with respect to which features make a difference. Probably not all the results are in the paper, but there is a pretty decent set of ablation results in the paper about how much each feature contributes.
0:20:50David, do you have a quick question?
0:20:54Sorry?
0:20:57Wait.
0:21:08So, trained on abortion and tested on gun control?
0:21:14So, we have done that: we had a paper a few years back where we did some cross-domain experiments for a subset of this problem, which is just trying to identify sentences that are more likely to be understandable as good arguments out of context. In that paper, which has first author Swanson, and I can tell you about it afterwards, we did some cross-domain experiments, and of course it doesn't work as well.
0:21:46And it is interesting, because you would think, we had thought, that most of the features we're using would not be domain-specific, but every time we do the cross-domain experiment, the results are about ten percent worse.
0:22:00Okay, so the most domain-specific features are the embeddings?
0:22:05The embeddings, and also the LIWC: you give it all the LIWC features, but the ones that the model learns to pay attention to are topic-specific.
0:22:22Let's thank the speaker again.