Okay. I'm Marilyn Walker, and the work I'm presenting is my PhD student's work; she couldn't be here. I'm going to talk about summarizing dialogic arguments in social media. The first thing I want to say, given that this is the negotiation session, is that it's not clear how much negotiation is actually carried on in these argumentative dialogues, although people definitely seem to be negotiating something.
The current state of the art for summaries of argumentative dialogues is really human: websites have curators who manually curate argument summaries. Lots of different debate websites have curated arguments; iDebate has these kinds of points for and points against, and ProCon.org has the top ten pro and con arguments. On these websites they summarize the repeated arguments that people make about a particular social issue. The examples here are one about gay marriage and another about gun control.
And when you look at natural human dialogues where people are discussing the same kinds of issues, it's really striking how difficult it would be to actually produce a summary of these dialogues. I'll give you a minute to read this one; I know you're going to read it anyway. People are very emotional, they're not necessarily logical, they make fun of each other, they're sarcastic; there's all kinds of stuff going on in these dialogues.
They don't really fit your notion of what should go into a summary of an argument, especially when you compare them to the curated arguments that are produced by professionals. So the first question we had was: obviously, it would be great if you could actually summarize the whole bunch of conversations out there on social media. What is it that the person on the street is saying about gay marriage, what is it that the person on the street is saying about gun control, or abortion, or evolution, or any kind of issue that is constantly debated on social media websites?
And I would claim that you're interested not just in the kinds of arguments that a lawyer or a constitutional expert would make, but in what it is that ordinary people are saying. Everybody can vote these days, right, whether or not you're in the top one percent of the population that's actually educated in how to argue logically. So it would be a good thing to actually know what it is that people are saying, what kinds of arguments they're making.
When you look at it, even the most basic question is open: what should the summary contain, what kind of information should we pull out of these conversations in order to make a summary? The people in these conversations don't agree, so it seems like you would at least need to represent both sides of the argument; that might be a first criterion, that you want to represent the opposing stances. Then, do you want to include some kind of emotional information? Do you want to include the socio-emotional relationship, like the fact that the second speaker is making fun of the first speaker, or that they're being sarcastic? Should that kind of information go into a summary? Or do you want to take the philosophical, logical view of argumentation and say: well, I'm going to consider all of this to just be flaming or trolling or whatever, and I'm not really interested in any part of this argument that doesn't fit in with the logical view of argumentation.
There has been previous work on dialogue summarization, but there hasn't been any on summarizing argumentative dialogues automatically. All the other dialogue summarization work that's out there, some of which I think has been done by people in this room, deals with dialogues that have very different properties; they're not nearly as noisy as these argumentative dialogues are.
So our goal is to automatically produce summaries of argumentative dialogues. We're taking an extractive summarization perspective at this point, although it would clearly be nice if we could do abstractive summarization. The step we're trying to take in this paper is to identify and extract the most important arguments on each side of an issue.
Our initial starting point is that, as I pointed out on the previous slides, it's actually really difficult to figure out what information these summaries should contain. So we start from the standpoint that summarization is something that any native speaker knows how to do; they don't have to have any training. Our initial concept is that we're going to collect summaries that humans produce of these conversations and see what people pick out. Then we're going to take these summaries that we collected and apply the pyramid method, which has been used in the DUC summarization tasks, and we're going to assume that the arguments that appear in multiple model summaries are the most important arguments. So we're applying a standard extractive summarization evaluation approach to these argumentative dialogues.
So, for gold-standard training data, we have collected five human summaries for each of about fifty dialogues on the topics of gay marriage, gun control, and abortion. A lot of this is described in more detail in our 2015 paper: what the summaries look like and what their properties are. Then we trained undergraduate linguists to use the pyramid method to identify important arguments in the dialogues. They construct a pyramid for each set of five summaries, and the idea is that the repeated elements of the summaries end up on the higher tiers of the pyramid. I'm going to give you an example in a minute, so this will probably be clearer after the next slide.
So then we have these human dialogues, we have five summaries for each dialogue, and we have these pyramids constructed on top of each set of summaries, telling us which elements get repeated. But we still have a problem. We know which are the important concepts in the dialogue, because those are the ones that appeared in the model summaries, but we have to map them back to the actual original dialogues. If we want to develop an extractive summarizer, we want to be able to operate on the original dialogue texts and not on that intermediate summary representation which we collected. So that's the third step, getting this mapping back, and once we have that mapping we can characterize our problem as a binary classification problem, or a ranking problem, of identifying the most important utterances in the dialogues, the ones that should go into the extractive summary.
This is what the sample summaries look like; this one is from a gay marriage dialogue. These summaries are really good quality, and the ones for gay marriage are currently available on our website. The new ones that we collected, which are talked about in this paper, about abortion, we will be releasing soon. But if you want to see what they look like, the gay marriage ones were released a few years ago with our previous paper.
So this is what the data looks like. We have the summaries for about fifty different conversations for each topic. What the human does when they make the pyramid labels is read through all the summaries and decide what the important concepts are, somewhat distinct from the words that are actually used by the summarizers, and they make their own human label. So they come up with a human label, which is a paraphrase, such as "No one has been able to prove that gun owners are safer than non-gun owners", and then for each summary they identify how that summarizer phrased that particular argument, that particular concept. If a concept appears in more than one of the summaries, up to five because we have five summaries, then that means it's a very important concept, and that is represented in its tier. So the arguments that multiple summarizers picked out and put in their summaries have more contributors for their human label, and they end up being ranked as more important arguments.
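To make the tier idea concrete, here is a minimal sketch, not the actual annotation tooling, of how contributor counts turn into pyramid tiers; the labels and phrasings below are invented for illustration:

```python
# Illustrative sketch: a concept's pyramid tier is the number of distinct
# summaries (out of five) that contribute some phrasing of that concept.
contributors = {
    "No one has proven gun owners are safer than non-gun owners": [
        (1, "gun owners aren't shown to be safer"),
        (3, "there's no proof owning a gun makes you safer"),
        (4, "no evidence that gun owners are safer"),
    ],
    "Criminals ignore gun laws": [(2, "criminals don't follow gun laws")],
}

tiers = {label: len({summary_id for summary_id, _ in phrasings})
         for label, phrasings in contributors.items()}

# Labels on tier 3 or higher are treated as the most important arguments.
important = [label for label, tier in tiers.items() if tier >= 3]
print(tiers)
print(important)
```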
Okay, so now we're on step three. We have these summary contributors, which, as I said, are removed from the language of the original dialogues, and we have these human labels. What we want to do is figure out, in the original dialogue, which utterances actually correspond to these things that ended up highly ranked in the pyramid. When we collected this data two or three years ago, we thought we were going to be able to do this automatically once we had this data. After multiple attempts we decided that we could not in fact do it automatically, because the language of the summarizers and the language of the human labels from the pyramids is too different from the original language in the original dialogues. So we tried Mechanical Turk; actually, we didn't run it on Mechanical Turk, because we couldn't get Mechanical Turkers to do this task reliably, mapping back from the summary labels to the original dialogues.
So instead we recruited two graduate linguists and two undergraduate linguists to actually do this mapping for us, in order to get good quality data. We presented them with the original conversations along with the labels they had produced, the highest-tier labels, and we asked them, for each utterance of the conversation, to pick one or more of the labels that correspond to the content of that utterance. Again, we're only interested in the labels with a score of three or higher, the ones considered most important by the original summarizers. And we get pretty good reliability on this: we couldn't get Turkers to do it reliably, but once we started using our own internally trained people, we could.
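As an aside, here is a minimal sketch, not the authors' actual analysis, of how agreement on this kind of utterance-to-label mapping could be checked, assuming two annotators and one label choice per utterance; the label strings below are invented:

```python
# Illustrative sketch: inter-annotator reliability for mapping utterances
# to pyramid labels, using Cohen's kappa from scikit-learn.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["L1", "L1", "none", "L3", "L2"]   # hypothetical label choices
annotator_b = ["L1", "L2", "none", "L3", "L2"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```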
So at step three we have the roughly fifty dialogues for each topic, which means about two hundred and fifty summaries per topic, five for each dialogue. We pull out the important sentences and the not-important sentences for each dialogue, and we frame this as a binary classification task. Again, we could have framed it as a ranking task and just used the tier label, but we decided to frame it as binary classification. So we group the labels by tier, we compute the average tier label, and then we define any sentence whose average score is high enough as important.
We believe that by going through this whole process we have provided a well-motivated and theoretically grounded definition of what an important argument is, and now we have this binary classification problem that we're trying to solve.
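A minimal sketch of that binary framing, assuming each dialogue sentence has already been mapped to zero or more pyramid labels; the threshold value here is an illustrative assumption, not the paper's exact setting:

```python
# Illustrative sketch: turn the tier scores of the labels mapped to a
# sentence into a binary important / not-important target.
def binary_label(tier_scores, threshold=3.0):
    """tier_scores: pyramid tiers of the labels mapped to this sentence."""
    if not tier_scores:                      # the sentence matched no label
        return 0
    average = sum(tier_scores) / len(tier_scores)
    return 1 if average >= threshold else 0  # threshold chosen for illustration

print(binary_label([4, 3]), binary_label([1]), binary_label([]))  # -> 1 0 0
```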
So we have three different off-the-shelf summarizers that we apply, to see how standard summarization algorithms do. We use SumBasic, which is an algorithm by, I think, Nenkova and Vanderwende; we use KL-divergence summarization, which is from Haghighi and Vanderwende; and we use LexRank. These are all available off-the-shelf and they're all different kinds of algorithms; LexRank is the one that was most successful at the most recent Document Understanding Conference competition. All of these rank utterances instead of classifying them, so what we do is apply them to the dialogues, get the ranking, and then take the number of utterances that the task calls for: we cut the ranking off at the point where the length of the extractive summary is the same as what we expect.
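As a rough picture of what such a ranking baseline does, here is a minimal LexRank-style sketch, not the off-the-shelf implementation used in the paper, that scores utterances by centrality in a similarity graph and then cuts at a length budget:

```python
# Illustrative sketch of a LexRank-style extractive baseline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lexrank_scores(utterances, threshold=0.1, damping=0.85, iterations=50):
    """Score utterances by centrality in a cosine-similarity graph."""
    tfidf = TfidfVectorizer().fit_transform(utterances)
    sim = (tfidf @ tfidf.T).toarray()        # cosine similarity (tf-idf rows are L2-normalized)
    adj = (sim >= threshold).astype(float)   # binarize edges, as LexRank does
    adj /= adj.sum(axis=1, keepdims=True)    # row-normalize into a transition matrix
    n = len(utterances)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):              # PageRank-style power iteration
        scores = (1 - damping) / n + damping * adj.T @ scores
    return scores

def extractive_summary(utterances, budget):
    """Keep the top-`budget` utterances, in their original dialogue order."""
    top = np.argsort(-lexrank_scores(utterances))[:budget]
    return [utterances[i] for i in sorted(top)]
```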
We have a bunch of different models. We tried support vector machines with a linear kernel from scikit-learn, using cross-validation for tuning the parameters, and then we also tried a combination of a convolutional neural net with a bidirectional LSTM. We split our data into training and test sets.
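For concreteness, a minimal sketch of that kind of SVM setup in scikit-learn, with toy data standing in for the real feature matrix; the parameter grid here is an assumption for illustration:

```python
# Illustrative sketch: linear-kernel SVM, tuning C by cross-validation,
# scored with a weighted F1 as reported in the talk.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=50, random_state=0)  # stand-in features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      scoring="f1_weighted", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```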
For features, we have two different kinds of word embeddings, Google word2vec and GloVe, and then we have some other things that are more linguistically motivated, which we expected might possibly help. We have readability scores; we would expect that utterances that are more readable would be better and would be more important. We thought sentiment might be important. We thought sentence position might be important, that the first sentences in the dialogue might be more likely to end up in a summary. Then we have LIWC, Linguistic Inquiry and Word Count, which gives us a lot of lexical categories, along with three different representations of the context: one in terms of LIWC, and one in terms of the dialogue act classification of the previous utterances, the previous two utterances in the dialogue. And then we ran Stanford coreference, which I expected to not produce anything, and that's a little foreshadowing: it actually helps, amazingly.
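As a sketch of how utterance-level features of those kinds might be stacked together, here is one illustrative way to do it; the embedding table, the word-length proxy for readability, and the externally supplied sentiment score are all assumptions, not the paper's exact feature set:

```python
# Illustrative sketch: averaged word embedding plus a few scalar features
# (a crude readability proxy, a sentiment score, relative dialogue position).
import numpy as np

def utterance_features(tokens, sentiment_score, position, n_utterances,
                       embeddings, dim=300):
    vectors = [embeddings[w] for w in tokens if w in embeddings]
    avg_embedding = np.mean(vectors, axis=0) if vectors else np.zeros(dim)
    avg_word_len = np.mean([len(w) for w in tokens]) if tokens else 0.0  # readability proxy
    relative_position = position / max(n_utterances, 1)
    return np.concatenate([avg_embedding,
                           [avg_word_len, sentiment_score, relative_position]])

# Hypothetical usage with a tiny toy embedding table:
toy_embeddings = {"guns": np.ones(300), "kill": 0.5 * np.ones(300)}
x = utterance_features(["guns", "kill", "people"], sentiment_score=-0.3,
                       position=2, n_utterances=40, embeddings=toy_embeddings)
print(x.shape)  # (303,)
```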
These are our results. LexRank was our very best baseline, so I'm not going to tell you what the other baselines were. With LexRank we get a weighted F-score on the test set in the upper fifties. Our very best model is the SVM using features: with just word embeddings it doesn't do as well, but when we put in all these linguistic features we see that, for both the gun control and the abortion topics, the off-the-shelf coreference engine, applied to these very noisy dialogues, actually improves performance, as does having a representation of the context. And we get better results for gun control than we do for gay marriage and abortion. We got that result repeatedly, over and over, and we think the reason is that the same arguments get repeated in gun control, and that's not the case to the same degree for the other topics. The CNN with the BiLSTM, with just the word embeddings, gets results somewhere in the sixties.
Then we get our best model by adding the features. For gun control, and let me remind you what these features are: LC is LIWC with the context representation, which is also LIWC; R is the readability; DA is the dialogue act score; and then there's the coref. For gun control, having three different representations of context gives us the best model, and for both gay marriage and abortion, just having the LIWC categories of the previous utterance also gives good performance. So I think it's interesting that a pretty simple representation of context, and it's not a sequential model, still shows that the context helps.
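A minimal sketch, under assumed helper features, of what such a flat, non-sequential context representation could look like: the current utterance's own features concatenated with LIWC-category and dialogue-act features of the previous one or two utterances.

```python
# Illustrative sketch: concatenate an utterance's own feature vector with
# context features (e.g. LIWC counts, dialogue-act one-hots) of the
# previous `window` utterances, padding at the start of the dialogue.
import numpy as np

def with_context(own_feats, liwc_feats, da_feats, i, window=2):
    parts = [own_feats[i]]
    for k in range(1, window + 1):
        if i - k >= 0:
            parts += [liwc_feats[i - k], da_feats[i - k]]
        else:
            parts += [np.zeros_like(liwc_feats[0]), np.zeros_like(da_feats[0])]
    return np.concatenate(parts)
```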
[Chair] One minute.
Okay.
You might think LexRank would work very well here because of all the repetition in dialogue. The assumption behind LexRank, for something like a newspaper corpus, is that if something gets repeated, it's important. But as you might infer from the previous speaker's talk about alignment, there's lots of repetition in conversation that doesn't indicate that the information is actually important, and LexRank is based on lexical repetition, so it doesn't really help here. What's interesting about sentiment is that positive sentiment actually turns out to be a very good predictor that something is not important, and not necessarily for the reason you would think: it's because sentiment classifiers treat anything that's purely conversational, like "how are you today", as positive sentiment. So it just rules out a lot of stuff that's purely conversational, and that's why sentiment helps.
And then for the LIWC categories, we get some categories that are different for each topic, which shows that some of what we're learning with LIWC is actually topic-specific, because it's learning to use particular LIWC categories.
So, to sum up: we presented a novel method for summarizing argumentative dialogues, and our results beat several summarization baselines. We compared the SVM with a neural deep learning model, showed that the linguistic features really help, and showed that the context-based features improve over the sentence alone. We want to do more work exploring whether this could be topic-independent, and I do want to point out that our summarization baselines are all topic-independent and don't need any training. Okay, questions?
[Audience question] That's a really good point. You're asking whether we distinguished between conversations where there was more or less agreement. We haven't looked at that, and I think it would be interesting, because you would think it would be easier to summarize a conversation where people were more on the same stance side.
Yes?
[Audience question, partly inaudible] It's in the paper, I think. Can you rephrase that for me? For a given model, when you use the features, do you use them all simultaneously? No, I don't think we tried that. We tried word2vec embeddings and then GloVe embeddings; we didn't put both in, and we looked at both of those in terms of which features make a difference.
So there's a whole set of those; in fact, probably not all the results are in the paper, but there is a pretty decent set of ablation results in the paper about how much each feature contributes.
[Chair] David, do you have a quick question?
[Audience question] Sorry, wait: so did you train on abortion and test on gun control?
So, we have done some of that. We had a paper a few years back where we did some cross-domain experiments for a subset of this problem, which is just trying to identify sentences that are more likely to be understandable as good arguments out of context. In that paper, which has first author Swanson, and I can tell you about it afterwards, we did some cross-domain experiments, and of course it doesn't work as well. And it is interesting, because you would think, we had thought, that most of the features we're using would not be domain-specific, but every time we do that cross-domain thing, the results are about ten percent worse.
[Audience question] Okay, so the most domain-specific features are the embeddings? The embeddings, and also the LIWC: you give it all the LIWC features, but the ones that the model learns to pay attention to are topic-specific.
[Chair] Let's thank the speaker again.