Hi everybody.
So: "Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue."
I want to start by explaining why we study sarcasm,
and then the need for a large-scale corpus of sarcasm,
and different examples of sarcasm in the wild,
followed by how we build our corpus, some experimental results and linguistic analysis, and then
conclusions.
So why study sarcasm?
Well, as we all kind of know, it's creative, complex, and diverse. Here are
some examples:
things like this, or missing the point:
"I love it when you bash people for stating opinions and not facts, then you
turn around and do the same thing."
And, even more complex: "my pyramidal tinfoil hat is an antenna for knowledge and truth;
it reflects idiocy and such back into deep space."
As we can see,
it's very creative, it's very diverse,
and it gets more and more ambiguous and complex:
a very long-tail problem.
So further motivation: it's very prevalent, estimated at around ten percent in debate forums
dialogue, which is our domain of interest,
and this sort of dialogue is very different from traditional mediums like independent tweets or
reviews for products, things like that,
so it's very interesting to our group.
Also part of the motivation is that things like sentiment analysis systems are thwarted by
misleading sarcastic posts: people
being sarcastic, seeming to say something is really great about a product, when it's actually very misleading.
Also, for question answering systems, it's important to know when things are not sarcastic, to
be sure it's good data, right? So it's also important to differentiate between
the classes: sometimes you want to look at the not sarcastic posts, sometimes you care about
the sarcastic ones.
So, some examples of sarcasm in the wild.
Sarcasm is clearly not a unitary phenomenon. Gibbs in 2000 developed a taxonomy of
five different categories of sarcasm in conversations between friends.
He talks about sarcasm as speaking positively to convey negative intent;
this is kind of the generally accepted way
to define sarcasm.
But he also defines different categories where sarcasm is prevalent: things like rhetorical questions,
somebody asking a question implying a humorous or critical assertion;
things like hyperbole, expressing a non-literal meaning by exaggeration;
on the other side of the scale, understatement, so underplaying the reality of a
situation;
and jocularity, so humoring or teasing in humorous ways.
So this is a little bit more fine-grained
as a taxonomy for sarcasm,
and it's kind of
accepted that people use the term sarcasm to mean all of these things, like
a big umbrella for anything that could be sarcastic.
But the theoretical models posit that there is often a contrast between what is
said
and a literal description of the actual situation;
that's a very common thing that characterizes much of sarcasm across different domains.
So no previous work has really operationalized these different categories that Gibbs and
others have defined,
so that's kind of the focus of our corpus building.
We explore in great detail rhetorical questions and hyperbole as two very prevalent
subcategories of sarcasm in our online debate forums,
and they can in fact be used sarcastically or
not sarcastically, so it's an interesting binary
classification question.
To kind of showcase why that's true, here are examples of rhetorical questions,
where the one in the top row is used sarcastically and the one in the bottom row not sarcastically:
something like "then what do you call a politician who ran such measures? liberal?
yes, it's 'cause you're a Republican and you're a conservative, after all,"
versus "what, without proof? we would certainly show that the animal adapted to..." which is more of
an informative sort of thing.
So rhetorical questions exist in both categories.
Similarly for hyperbole:
something like "thank you for
making my point better than I ever do,"
or again, "I'm astonished by the fact that you think I would do this."
So there are different ways
that you can use these categories,
with sarcastic or not sarcastic intent.
So, going into why we need a large-
scale corpus of sarcasm:
first of all, as I tried to show, creativity and diversity make it difficult to model generalizations,
and subjectivity makes it very difficult to get high-agreement annotations, and we see that
in lots of previous work on sarcasm.
People often use hashtags like #sarcasm, or use positive or negative
sentiment in different mediums, to try to
highlight where sarcasm exists,
because it's very difficult to get high-agreement annotations,
and these annotations are costly and require kind of expert workers.
For example, in an out-of-the-blue context, something like "gosh, you're so right,
simply profound, I think I love you," it's hard to tell if that's really
sarcastic, right,
out of the blue.
Or something like "humans are an anomalous mammal...":
very subtle, we just don't know, right?
So it's pretty hard to ask people to do this sort of annotation; you have
to be a little bit clever about it, and that's kind of what we try to
do.
So we need a way to get more labeled data in the short term to study
sarcasm,
to allow for better linguistic generalizations
and more powerful classifiers in the long term. That's kind of the promise of our corpus
building stage.
So how do we do it?
We do bootstrapping.
We begin by replicating Lukin and Walker's bootstrapping setup from 2013,
and the idea behind this is that
you begin with a small set of annotated sarcastic and not sarcastic posts
and use some kind of linguistic pattern extractor to find
cues that you think are highly
precise indicators of sarcasm and not-sarcasm in the data.
Once you have these sorts of cues, you can go out against huge sets of
unannotated data, look for those cues,
and anything that matches we call the bootstrapped data:
drop it back into the original annotated data, and then iteratively expand your
dataset that way.
That's kind of the premise that we use.
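As a minimal sketch, the loop might look like the following in Python; the cue extractor is passed in as a hypothetical stand-in, so this is an illustration of the idea, not the actual implementation:

    def bootstrap(seed_posts, pool, extract_cues, n_iterations=5):
        """Iteratively expand a labeled seed set using high-precision cues."""
        labeled = list(seed_posts)        # (text, label) pairs
        pool = list(pool)                 # unannotated post texts
        for _ in range(n_iterations):
            # 1. learn high-precision cues from the current labeled data
            cues = extract_cues(labeled)  # [(cue_string, label), ...]
            # 2. match the cues against the large unannotated pool
            matched, remaining = [], []
            for post in pool:
                label = next((lab for cue, lab in cues if cue in post), None)
                if label:
                    matched.append((post, label))
                else:
                    remaining.append(post)
            # 3. fold the bootstrapped matches back in, iterate on the rest
            labeled.extend(matched)
            pool = remaining
        return labeled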
Well, really, the crux of this is that
to do good bootstrapping, we need this
portion right here,
the high-precision linguistic patterns, to be really good; we need really good high-
precision patterns. So we try to get them
using the linguistic pattern learner AutoSlog-TS.
AutoSlog-TS, by Riloff, 1996, is a weakly supervised pattern
learner,
and we use it to extract lexico-syntactic patterns highly associated with both sarcastic and not
sarcastic utterances.
The way that works is that it has a bunch of pattern templates that
are defined, things like
some sort of subject followed by a passive verb phrase, et cetera,
and it uses these templates to find instantiations in the text, and then ranks
these different instantiations based on probability of occurrence in a certain class and frequency of
occurrence.
So, for example, if you had in your data the sentence "there are millions of
people saying all sorts of stupid things about the president"
and you ran AutoSlog-TS over it,
it would match, for example, the noun phrase + preposition +
noun phrase pattern with
"millions of people,"
and then if this pattern was very frequent and highly
probable in the sarcastic class, it would float up to the top of our
ranked list.
So we do this,
and give each extraction pattern a frequency and a probability of association,
and we classify a post as belonging to a class if it has at least n of
those patterns.
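A minimal sketch of that scoring and the at-least-n classification rule; the threshold values here are illustrative, not the ones from the paper:

    from collections import Counter

    def score_patterns(posts):
        """posts: list of (set_of_matched_patterns, label) pairs.
        Returns {pattern: (total_frequency, P(sarcastic | pattern))}."""
        total, sarcastic = Counter(), Counter()
        for patterns, label in posts:
            for p in patterns:
                total[p] += 1
                if label == "sarcastic":
                    sarcastic[p] += 1
        return {p: (total[p], sarcastic[p] / total[p]) for p in total}

    def classify(post_patterns, scores, n=2, min_freq=3, min_prob=0.7):
        """Call a post sarcastic if it contains at least n patterns that are
        frequent and highly associated with the sarcastic class."""
        strong = [p for p in post_patterns if p in scores
                  and scores[p][0] >= min_freq and scores[p][1] >= min_prob]
        return "sarcastic" if len(strong) >= n else "unlabeled"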
So in the first round, looking at the small sample of annotated data,
here are some examples of what we observe:
things like "get over it" among the sarcastic patterns,
with these frequencies and probabilities of association,
and things like
"natural selection" and "big bang theory" with high probabilities in the not sarcastic posts.
And just to summarize,
we find that the not sarcastic class contains
a lot of very technical jargon, scientific language, topic-specific things,
and we can get
high precision when classifying posts based on just these templates,
up to about eighty percent,
whereas the sarcastic patterns, as you can see, are much more varied and not high precision,
around thirty percent.
And so it's difficult to do bootstrapping
on data where the precision of these patterns is relatively low.
So
we decided to make use of this high-precision not sarcastic set of patterns that
we can collect
to actually expand our data, trying to find posts that would be good to get
annotated,
posts that we think would have a higher probability than ten percent of being sarcastic,
based on that original estimate from a sample of the debate forums data.
Using a pool of 30K posts, we filter out posts that we think are
not sarcastic,
that is, posts containing any of those not sarcastic patterns that we identified,
and we end up with about 11K posts that we believe have a higher likelihood
of being sarcastic, and we put those out for annotation on Mechanical Turk.
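A minimal sketch of that filtering step; the pattern strings here are invented for illustration, the real ones come out of the AutoSlog-TS run:

    # illustrative high-precision not-sarcastic patterns
    not_sarcastic_patterns = ["natural selection", "big bang theory"]

    def filter_pool(pool):
        """Drop posts matching any high-precision not-sarcastic pattern,
        leaving a smaller pool with a higher expected rate of sarcasm."""
        return [post for post in pool
                if not any(pat in post.lower() for pat in not_sarcastic_patterns)]

So, for example, a ~30K pool filters down to roughly the 11K posts that go out for annotation.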
The way the annotation task looks is, annotators get a definition of
sarcasm and examples of responses that contain sarcasm
and don't contain sarcasm,
and then we show them a quote-response pair, so this is a dialogic
pair where we have a dialogic parent and the response, and we ask them to
identify sarcasm in the response.
So that's what our annotators are seeing.
Using this method we're able to skew the distribution of sarcasm from
ten percent up to thirty-one percent,
getting that pool of 11K annotated.
Depending on where we set our agreement threshold, we can skew this distribution quite
high:
here, from nineteen to twenty-three percent, using this relatively conservative
threshold of six out of nine annotators agreeing
that a post is sarcastic.
Since it's so subjective and diverse, we want to make sure that our
annotations are
clean;
that's why we use a relatively high threshold.
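The thresholding itself is simple; a minimal sketch of the six-of-nine rule:

    def gold_label(votes, threshold=6):
        """votes: binary judgments from nine annotators (1 = sarcastic).
        A post is labeled sarcastic only if at least `threshold` agree."""
        assert len(votes) == 9
        return "sarcastic" if sum(votes) >= threshold else "not_sarcastic"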
So having more data
means we can do better at the bootstrapping task, but we
still observe some of the same trends:
highly precise not sarcastic patterns, less precise sarcastic ones,
and we're still not quite at the point we want to be at for
bootstrapping.
So,
given
the diversity of the data, we decided to revisit that
categorization I talked about earlier:
sarcasm as rhetorical questions, hyperbole, understatement, jocularity.
We make the observation that some of these lexico-syntactic cues are frequently used sarcastically.
So, for example:
"oh,
well,
let's all clap for that great argument,"
"well,
then what's your plan,
how altruistic, my friend,"
"interesting, someone hijacked your account,"
and so on:
pretty funny, and really creative combinations of words.
So these kinds of terms
are pretty prevalent in sarcastic posts, and we try to make use of this observation
in our data collection.
The way we do that is,
we develop regexes to search for different patterns in our unannotated data,
and we get annotations for different cues that we think are quite prevalent in the data,
things like "oh well,"
and things like
all of these ones: "pretty much," "fantastic," et cetera,
and we find that we're able to get, again, distributions that are much higher than
ten percent by searching for posts that contain only a single cue. It's interesting to
note that just a single cue can have such a large distribution of sarcasm:
something like "oh well" is
used sarcastically forty-four percent of the time
in the posts that contain it.
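A minimal sketch of that regex-based cue search; the cue list here is illustrative, not the full set used for the corpus:

    import re

    cues = [r"\boh,? well\b", r"\bpretty much\b", r"\bfantastic\b"]
    cue_res = [re.compile(c, re.IGNORECASE) for c in cues]

    def posts_with_single_cue(pool):
        """Yield posts matching exactly one cue, so any skew in the sarcasm
        distribution can be attributed to that cue alone."""
        for post in pool:
            hits = [r.pattern for r in cue_res if r.search(post)]
            if len(hits) == 1:
                yield post, hits[0]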
Using these observations, we begin constructing our subcorpora,
one for rhetorical questions and one for hyperbole.
The way we gather more data for this is that we observe that they're
used both sarcastically and not sarcastically for argumentation,
and we use this middle-of-post heuristic to estimate whether a
question is actually used rhetorically or not.
So if a speaker
asks a question and then continues on with their turn, they're not giving
the listener a chance to actually respond, and so it's a question that, at least
in the view of the writer,
doesn't require an answer from someone else.
We do a little pilot annotation and find that seventy-five percent of the
questions that we gather in this way are in fact
annotated to be
rhetorical.
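A minimal sketch of that middle-of-post heuristic; the min_tail knob is an invented parameter for illustration:

    def has_rhetorical_question(post, min_tail=10):
        """Middle-of-post heuristic: a question mark followed by more of the
        writer's own turn suggests no answer was expected."""
        idx = post.find("?")
        return idx != -1 and len(post[idx + 1:].strip()) >= min_tail

    print(has_rhetorical_question(
        "Do you wish to not have a logical debate? Alright then, god bless you anyway."))
    # -> True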
We do annotations of these new posts, ending up with eight hundred fifty-one
posts per class. So, something like "do you wish to not have a logical debate?
alright then, god bless you anyway,"
or "proof? you can't prove that I..."
and they haven't given anything but insults, et cetera. So these are
posts where, in the same post, someone is asking questions and going on with their turn.
The second subcorpus we look at is hyperbole. So hyperbole exaggerates a situation, and we use intensifiers
to capture these sorts of instances and get more annotations. Colston
and O'Brien cite this sort of situational scale, this contrast effect I was
talking about earlier:
hyperbole can shift utterances across the scale, shifting something into extremely positive
and away from literal, and also into extremely negative and away from literal, and
intensifiers kind of serve this purpose.
So, something like "wow, I'm so amazed by your comeback skills,"
or "do go on, I'm so impressed by your intellectual argument,"
things like that.
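A minimal sketch of gathering hyperbole candidates via intensifiers; the intensifier list is illustrative:

    import re

    intensifiers = ["so", "totally", "absolutely", "astonished", "amazed"]
    intensifier_re = re.compile(
        r"\b(" + "|".join(intensifiers) + r")\b", re.IGNORECASE)

    def hyperbole_candidates(pool):
        """Keep posts containing at least one intensifier as hyperbole
        candidates to send out for annotation."""
        return [post for post in pool if intensifier_re.search(post)]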
As for the statistics of our final corpus, we get around six thousand
five hundred posts for our generic sarcasm corpus,
and then rhetorical questions and hyperbole with this distribution; more information on the dataset
is available there,
in the paper.
So, to kind of validate the quality of our corpus,
we do simple experiments using very simple features, bag-of-words features,
noting that previous work has achieved about seventy percent with more complex features,
and we end up with results that are higher than that. We
do this
segmented set of experiments where we test at different dataset sizes,
and we see that our F-measures continue to increase, with a peak right now of
seventy-four with these simple features,
so that warrants, you know, expanding our dataset even more.
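A minimal sketch of that kind of validation experiment, using scikit-learn here as an assumption about tooling, not the authors' exact setup:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def f1_at_size(posts, labels, size):
        """Mean cross-validated F-measure using only the first `size` posts,
        with plain bag-of-words features and a simple linear classifier."""
        X = CountVectorizer().fit_transform(posts[:size])
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X, labels[:size], scoring="f1", cv=5).mean()

Evaluating f1_at_size over growing sizes is what shows whether the F-measure keeps climbing as the dataset grows.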
We also run our weakly supervised experiments with AutoSlog-TS again, just to see what
sorts of precisions we can get now for bootstrapping,
and we see much higher precisions than we were getting before, at reasonable recall,
for bootstrapping, so that's good news as well.
So now we can expand our method to be weakly supervised and gather more data more
quickly.
And these are the numbers of new patterns that we learned, patterns that we
never searched for in the original data:
we're learning a lot of new patterns that we didn't originally search
for,
for all of the datasets.
And then some linguistic analysis, quickly.
We aim to characterize the differences between our datasets. So, again using some AutoSlog-TS
instantiations, in our generic data we see these
creative sorts of different instantiations for sarcastic posts, whereas, again, the not sarcastic posts have
this highly
technical, jargon sort of terminology.
For the rhetorical questions, we observe a lot of the same properties for the not
sarcastic class,
but for the sarcastic class we observe that
there's a lot of attacking of basic human abilities in this debate forum dialogue:
people say things like "can you read," "can you write,"
"do you understand."
So we went through looking at some of the dependency parses of these
sorts of questions
and found a lot of things that really relate to basic human ability, so
people are attacking people,
not really attacking their argument; that's very prevalent in our debate forums data.
And finally, for hyperbole, we find that the adjective and adverb patterns are really common,
even though we didn't originally search for these in our regex experiments,
and things like contrast by exclusion turn up in the
examples of hyperbole that we pick up, which are really interesting.
In conclusion, we develop a large-scale, highly reliable corpus of sarcasm. We reduce annotation
cost and effort by skewing the distribution, avoiding having to annotate huge pools of
data,
we operationalize lexico-syntactic cues for rhetorical questions and hyperbole,
and we verify the quality of our corpus empirically and qualitatively.
For future directions, we want to do more feature engineering and model selection based on our
linguistic observations,
develop more generalizable models of the different categories of sarcasm that we haven't looked at,
and explore characteristics of our lower-agreement data, to see if there's anything interesting there as
well.
Thanks.
Questions?
So first of all, we began with not looking at those categories,
right, we started with this really generic sarcasm, so definitely it's kind of
a long tail, right, there's a lot of different exaggerations;
that's definitely a problem.
We began initially with just sarcasm in general, but it's kind of interesting
to get into the more refined categories and look at how those are different, and
yes, there are also different sorts of things that we could look at; understatement is
quite prevalent as well.
It doesn't only exist in the debate forums, it's just quite pronounced in the
forums, so
it would be good to look at that.
Right, so the question is about the word2vec features:
do we train them,
do we train the word2vec model on our corpus, or do we use an existing model?
We've done both. The results that we're reporting are actually with the Google
News trained vectors, which kind of
correlate with our data as well,
the debate forums.
We have used our own trained model; it didn't perform as well, probably
because of the smaller amount of data,
and the Google News model is trained on a huge amount of data, so that's definitely
worth exploring in the future as well.
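The talk doesn't name the tooling, so as an assumption, here's how those two options might look with gensim; the stand-in training data and variable names are just for illustration:

    import gensim.downloader as api
    from gensim.models import Word2Vec

    # pretrained vectors from a huge general corpus (Google News)
    pretrained = api.load("word2vec-google-news-300")

    # versus vectors trained on a (much smaller) in-domain corpus
    tokenized_debate_posts = [["example", "debate", "post"]]  # stand-in data
    own = Word2Vec(tokenized_debate_posts, vector_size=300, min_count=1).wv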
Right, so actually I didn't mention the numbers here; there's more detail in
our paper, but our levels of agreement were about seventy percent for each of
the tasks, and they were actually better for the smaller tasks,
where, compared to generic sarcasm, it's a little bit more constrained.
No, that's actually agreement with the majority label.
So it's actually better for the subcategories, in fact, than for the generic
sarcasm task; it's pretty hard to
get high-agreement annotations.
So I was wondering about the idea of contrast. You said
somewhere that it highlights the fact that there is some contrast
between the literal thing and what you think of that element,
and also this idea that the sarcastic meaning is non-
literal, right? Yes, so I was thinking about a possible connection with metaphor, and
with the task of metaphor detection, right. So here you are focusing on trying
to find patterns that characterize sarcasm,
but for instance in some work in metaphor detection the goal is to
capture contrast, right, what makes a particular use different from the literal use.
So by looking at how the sarcastic intended meaning can actually be far from the regular
use, I was wondering, and it's a very open question,
whether you had thought about the task in
these terms.
That's really interesting. So looking at, kind of,
maybe trying to measure how far away something is on a sort of contrast scale, that would
definitely be interesting. We haven't
done that explicitly, but I mean,
the different intensifiers can have different affect, so it's kind of
trying to map it across the scale.
Other questions?
Yes, a question.
When you're doing the mining of the data
and you're identifying different
phrases that are more associated with sarcasm and non-sarcasm,
did you do things to make sure that the dataset was not biased, you know,
toward utilizing those kinds of phrases?
So that if, later, someone wanted to build an automated system to detect
sarcasm and non-sarcasm, they wouldn't just
read your paper and go after these phrases, because this was used to
construct the corpus.
Right. So for our generic sarcasm corpus, that was a random sample,
so none of that is sampled in any biased way. For the rhetorical questions and hyperbole,
we would select those posts, but
the posts actually contain all sorts of other cues, and it's important to note that
if we ever selected a cue, it would exist in both the sarcastic and not sarcastic
classes.
So it's not like you would only find it in one, and that's kind of
what made it interesting, that you can see those cues used in both sorts of
situations, so it wouldn't be biased that way.