Hi everybody.
So: "Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue."
I want to start by explaining why we study sarcasm,
and then the need for a large-scale corpus of sarcasm,
and different examples of sarcasm in the wild,
followed by how we build our corpus, some experimental results and linguistic analysis, and then
conclusions.
So why study sarcasm?
Well, as we all kind of know, it's creative, complex, and diverse. Here are
some examples:
things like this, or missing the point:
"I love it when you bash people for stating opinions and not facts, then you
turn around and do the same thing."
And, even more complex: "my pyramidal tinfoil hat is an antenna for knowledge and truth;
it reflects idiocy and such back into deep space."
As we can see,
it's very creative, it's very diverse,
and it gets more and more ambiguous and complex:
a very long-tail problem.
So further motivation: it's very prevalent, estimated at around ten percent in debate forums
dialogue, which is our domain of interest,
and this sort of dialogue is very different from traditional mediums like independent tweets or
reviews for products, things like that,
so it's very interesting to our group.
Also part of the motivation is that things like sentiment analysis systems are thwarted by
misleading sarcastic posts: people
being sarcastic, seeming to say something is really great about a product, when it's actually very misleading.
Also, for question answering systems, it's important to know when things are not sarcastic, to
be sure it's good data, right? So it's also important to differentiate between
the classes: sometimes you want to look at the not sarcastic posts, sometimes you care about
the sarcastic ones.
So, some examples of sarcasm in the wild.
Sarcasm is clearly not a unitary phenomenon. Gibbs in 2000 developed a taxonomy of
five different categories of sarcasm in conversations between friends.
He talks about sarcasm as speaking positively to convey negative intent;
this is kind of the generally accepted way
to define sarcasm.
But he also defines different categories where sarcasm is prevalent: things like rhetorical questions,
somebody asking a question implying a humorous or critical assertion;
things like hyperbole, expressing a non-literal meaning by exaggeration;
on the other side of the scale, understatement, so underplaying the reality of a
situation;
and jocularity, so humoring or teasing in humorous ways.
So this is a little bit more fine-grained
as a taxonomy for sarcasm,
and it's kind of
accepted that people use the term sarcasm to mean all of these things, like
a big umbrella for anything that could be sarcastic.
But the theoretical models posit that there is often a contrast between what is
said
and a literal description of the actual situation;
that's a very common thing that characterizes much of sarcasm across different domains.
So no previous work has really operationalized these different categories that Gibbs and
others have defined,
so that's kind of the focus of our corpus building.
We explore in great detail rhetorical questions and hyperbole as two very prevalent
subcategories of sarcasm in our online debate forums,
and they can in fact be used sarcastically or
not sarcastically, so it's an interesting binary
classification question.
To kind of showcase why that's true, here are examples of rhetorical questions,
where the one in the top row is used sarcastically and the one in the bottom row not sarcastically:
something like "then what do you call a politician who ran such measures? liberal?
yes, it's 'cause you're a Republican and you're a conservative, after all,"
versus "what, without proof? we would certainly show that the animal adapted to..." which is more of
an informative sort of thing.
So rhetorical questions exist in both categories.
Similarly for hyperbole:
something like "thank you for
making my point better than I ever do,"
or again, "I'm astonished by the fact that you think I would do this."
So there are different ways
that you can use these categories,
with sarcastic or not sarcastic intent.
So, going into why we need a large-
scale corpus of sarcasm:
first of all, as I tried to show, creativity and diversity make it difficult to model generalizations,
and subjectivity makes it very difficult to get high-agreement annotations, and we see that
in lots of previous work on sarcasm.
People often use hashtags like #sarcasm, or use positive or negative
sentiment in different mediums, to try to
highlight where sarcasm exists,
because it's very difficult to get high-agreement annotations,
and these annotations are costly and require kind of expert workers.
For example, in an out-of-the-blue context, something like "gosh, you're so right,
simply profound, I think I love you," it's hard to tell if that's really
sarcastic, right,
out of the blue.
Or something like "humans are an anomalous mammal...":
very subtle, we just don't know, right?
So it's pretty hard to ask people to do this sort of annotation; you have
to be a little bit clever about it, and that's kind of what we try to
do.
So we need a way to get more labeled data in the short term to study
sarcasm,
to allow for better linguistic generalizations
and more powerful classifiers in the long term. That's kind of the promise of our corpus
building stage.
So how do we do it?
We do bootstrapping.
We begin by replicating Lukin and Walker's bootstrapping setup from 2013,
and the idea behind this is that
you begin with a small set of annotated sarcastic and not sarcastic posts
and use some kind of linguistic pattern extractor to find
cues that you think are highly
precise indicators of sarcasm and not-sarcasm in the data.
Once you have these sorts of cues, you can go out against huge sets of
unannotated data, look for those cues,
and anything that matches we call the bootstrapped data:
drop it back into the original annotated data, and then iteratively expand your
dataset that way.
That's kind of the premise that we use.
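As a minimal sketch, the loop might look like the following in Python; the cue extractor is passed in as a hypothetical stand-in, so this is an illustration of the idea, not the actual implementation:

    def bootstrap(seed_posts, pool, extract_cues, n_iterations=5):
        """Iteratively expand a labeled seed set using high-precision cues."""
        labeled = list(seed_posts)        # (text, label) pairs
        pool = list(pool)                 # unannotated post texts
        for _ in range(n_iterations):
            # 1. learn high-precision cues from the current labeled data
            cues = extract_cues(labeled)  # [(cue_string, label), ...]
            # 2. match the cues against the large unannotated pool
            matched, remaining = [], []
            for post in pool:
                label = next((lab for cue, lab in cues if cue in post), None)
                if label:
                    matched.append((post, label))
                else:
                    remaining.append(post)
            # 3. fold the bootstrapped matches back in, iterate on the rest
            labeled.extend(matched)
            pool = remaining
        return labeled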
Well, really, the crux of this is that
to do good bootstrapping, we need this
portion right here,
the high-precision linguistic patterns, to be really good; we need really good high-
precision patterns. So we try to get them
using the linguistic pattern learner AutoSlog-TS.
AutoSlog-TS, by Riloff, 1996, is a weakly supervised pattern
learner,
and we use it to extract lexico-syntactic patterns highly associated with both sarcastic and not
sarcastic utterances.
The way that works is that it has a bunch of pattern templates that
are defined, things like
some sort of subject followed by a passive verb phrase, et cetera,
and it uses these templates to find instantiations in the text, and then ranks
these different instantiations based on probability of occurrence in a certain class and frequency of
occurrence.
So, for example, if you had in your data the sentence "there are millions of
people saying all sorts of stupid things about the president"
and you ran AutoSlog-TS over it,
it would match, for example, the noun phrase + preposition +
noun phrase pattern with
"millions of people,"
and then if this pattern was very frequent and highly
probable in the sarcastic class, it would float up to the top of our
ranked list.
So we do this,
and give each extraction pattern a frequency and a probability of association,
and we classify a post as belonging to a class if it has at least n of
those patterns.
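A minimal sketch of that scoring and the at-least-n classification rule; the threshold values here are illustrative, not the ones from the paper:

    from collections import Counter

    def score_patterns(posts):
        """posts: list of (set_of_matched_patterns, label) pairs.
        Returns {pattern: (total_frequency, P(sarcastic | pattern))}."""
        total, sarcastic = Counter(), Counter()
        for patterns, label in posts:
            for p in patterns:
                total[p] += 1
                if label == "sarcastic":
                    sarcastic[p] += 1
        return {p: (total[p], sarcastic[p] / total[p]) for p in total}

    def classify(post_patterns, scores, n=2, min_freq=3, min_prob=0.7):
        """Call a post sarcastic if it contains at least n patterns that are
        frequent and highly associated with the sarcastic class."""
        strong = [p for p in post_patterns if p in scores
                  and scores[p][0] >= min_freq and scores[p][1] >= min_prob]
        return "sarcastic" if len(strong) >= n else "unlabeled"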
So in the first round, looking at the small sample of annotated data,
here are some examples of what we observe:
things like "get over it" among the sarcastic patterns,
with these frequencies and probabilities of association,
and things like
"natural selection" and "big bang theory" with high probabilities in the not sarcastic posts.
And just to summarize,
we find that the not sarcastic class contains
a lot of very technical jargon, scientific language, topic-specific things,
and we can get
high precision when classifying posts based on just these templates,
up to about eighty percent,
whereas the sarcastic patterns, as you can see, are much more varied and not high precision,
around thirty percent.
And so it's difficult to do bootstrapping
on data where the precision of these patterns is relatively low.
So
we decided to make use of this high-precision not sarcastic set of patterns that
we can collect
to actually expand our data, trying to find posts that would be good to get
annotated,
posts that we think would have a higher probability than ten percent of being sarcastic,
based on that original estimate from a sample of the debate forums data.
Using a pool of 30K posts, we filter out posts that we think are
not sarcastic,
that is, posts containing any of those not sarcastic patterns that we identified,
and we end up with about 11K posts that we believe have a higher likelihood
of being sarcastic, and we put those out for annotation on Mechanical Turk.
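A minimal sketch of that filtering step; the pattern strings here are invented for illustration, the real ones come out of the AutoSlog-TS run:

    # illustrative high-precision not-sarcastic patterns
    not_sarcastic_patterns = ["natural selection", "big bang theory"]

    def filter_pool(pool):
        """Drop posts matching any high-precision not-sarcastic pattern,
        leaving a smaller pool with a higher expected rate of sarcasm."""
        return [post for post in pool
                if not any(pat in post.lower() for pat in not_sarcastic_patterns)]

So, for example, a ~30K pool filters down to roughly the 11K posts that go out for annotation.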
The way the annotation task looks is, annotators get a definition of
sarcasm and examples of responses that contain sarcasm
and don't contain sarcasm,
and then we show them a quote-response pair, so this is a dialogic
pair where we have a dialogic parent and the response, and we ask them to
identify sarcasm in the response.
So that's what our annotators are seeing.
Using this method we're able to skew the distribution of sarcasm from
ten percent up to thirty-one percent,
getting that pool of 11K annotated.
Depending on where we set our agreement threshold, we can skew this distribution quite
high:
here, from nineteen to twenty-three percent, using this relatively conservative
threshold of six out of nine annotators agreeing
that a post is sarcastic.
Since it's so subjective and diverse, we want to make sure that our
annotations are
clean;
that's why we use a relatively high threshold.
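The thresholding itself is simple; a minimal sketch of the six-of-nine rule:

    def gold_label(votes, threshold=6):
        """votes: binary judgments from nine annotators (1 = sarcastic).
        A post is labeled sarcastic only if at least `threshold` agree."""
        assert len(votes) == 9
        return "sarcastic" if sum(votes) >= threshold else "not_sarcastic"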
So having more data
means we can do better at the bootstrapping task, but we
still observe some of the same trends:
highly precise not sarcastic patterns, less precise sarcastic ones,
and we're still not quite at the point we want to be at for
bootstrapping.
So,
given
the diversity of the data, we decided to revisit that
categorization I talked about earlier:
sarcasm as rhetorical questions, hyperbole, understatement, jocularity.
We make the observation that some of these lexico-syntactic cues are frequently used sarcastically.
So, for example:
"oh,
well,
let's all clap for that great argument,"
"well,
then what's your plan,
how altruistic, my friend,"
"interesting, someone hijacked your account,"
and so on:
pretty funny, and really creative combinations of words.
So these kinds of terms
are pretty prevalent in sarcastic posts, and we try to make use of this observation
in our data collection.
The way we do that is,
we develop regexes to search for different patterns in our unannotated data,
and we get annotations for different cues that we think are quite prevalent in the data,
things like "oh well,"
and things like
all of these ones: "pretty much," "fantastic," et cetera,
and we find that we're able to get, again, distributions that are much higher than
ten percent by searching for posts that contain only a single cue. It's interesting to
note that just a single cue can have such a large distribution of sarcasm:
something like "oh well" is
used sarcastically forty-four percent of the time
in the posts that contain it.
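A minimal sketch of that regex-based cue search; the cue list here is illustrative, not the full set used for the corpus:

    import re

    cues = [r"\boh,? well\b", r"\bpretty much\b", r"\bfantastic\b"]
    cue_res = [re.compile(c, re.IGNORECASE) for c in cues]

    def posts_with_single_cue(pool):
        """Yield posts matching exactly one cue, so any skew in the sarcasm
        distribution can be attributed to that cue alone."""
        for post in pool:
            hits = [r.pattern for r in cue_res if r.search(post)]
            if len(hits) == 1:
                yield post, hits[0]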
Using these observations, we begin constructing our subcorpora,
one for rhetorical questions and one for hyperbole.
The way we gather more data for this is that we observe that they're
used both sarcastically and not sarcastically for argumentation,
and we use this middle-of-post heuristic to estimate whether a
question is actually used rhetorically or not.
So if a speaker
asks a question and then continues on with their turn, they're not giving
the listener a chance to actually respond, and so it's a question that, at least
in the view of the writer,
doesn't require an answer from someone else.
We do a little pilot annotation and find that seventy-five percent of the
questions that we gather in this way are in fact
annotated to be
rhetorical.
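A minimal sketch of that middle-of-post heuristic; the min_tail knob is an invented parameter for illustration:

    def has_rhetorical_question(post, min_tail=10):
        """Middle-of-post heuristic: a question mark followed by more of the
        writer's own turn suggests no answer was expected."""
        idx = post.find("?")
        return idx != -1 and len(post[idx + 1:].strip()) >= min_tail

    print(has_rhetorical_question(
        "Do you wish to not have a logical debate? Alright then, god bless you anyway."))
    # -> True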
We do annotations of these new posts, ending up with eight hundred fifty-one
posts per class. So, something like "do you wish to not have a logical debate?
alright then, god bless you anyway,"
or "proof? you can't prove that I..."
and they haven't given anything but insults, et cetera. So these are
posts where, in the same post, someone is asking questions and going on with their turn.
The second subcorpus we look at is hyperbole. So hyperbole exaggerates a situation, and we use intensifiers
to capture these sorts of instances and get more annotations. Colston
and O'Brien cite this sort of situational scale, this contrast effect I was
talking about earlier:
hyperbole can shift utterances across the scale, shifting something into extremely positive
and away from literal, and also into extremely negative and away from literal, and
intensifiers kind of serve this purpose.
So, something like "wow, I'm so amazed by your comeback skills,"
or "do go on, I'm so impressed by your intellectual argument,"
things like that.
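A minimal sketch of gathering hyperbole candidates via intensifiers; the intensifier list is illustrative:

    import re

    intensifiers = ["so", "totally", "absolutely", "astonished", "amazed"]
    intensifier_re = re.compile(
        r"\b(" + "|".join(intensifiers) + r")\b", re.IGNORECASE)

    def hyperbole_candidates(pool):
        """Keep posts containing at least one intensifier as hyperbole
        candidates to send out for annotation."""
        return [post for post in pool if intensifier_re.search(post)]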
As for the statistics of our final corpus, we get around six thousand
five hundred posts for our generic sarcasm corpus,
and then rhetorical questions and hyperbole with this distribution; more information on the dataset
is available there,
in the paper.
So, to kind of validate the quality of our corpus,
we do simple experiments using very simple features, bag-of-words features,
noting that previous work has achieved about seventy percent with more complex features,
and we end up with results that are higher than that. We
do this
segmented set of experiments where we test at different dataset sizes,
and we see that our F-measures continue to increase, with a peak right now of
seventy-four with these simple features,
so that warrants, you know, expanding our dataset even more.
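A minimal sketch of that kind of validation experiment, using scikit-learn here as an assumption about tooling, not the authors' exact setup:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def f1_at_size(posts, labels, size):
        """Mean cross-validated F-measure using only the first `size` posts,
        with plain bag-of-words features and a simple linear classifier."""
        X = CountVectorizer().fit_transform(posts[:size])
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X, labels[:size], scoring="f1", cv=5).mean()

Evaluating f1_at_size over growing sizes is what shows whether the F-measure keeps climbing as the dataset grows.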
We also run our weakly supervised experiments with AutoSlog-TS again, just to see what
sorts of precisions we can get now for bootstrapping,
and we see much higher precisions than we were getting before, at reasonable recall,
for bootstrapping, so that's good news as well.
So now we can expand our method to be weakly supervised and gather more data more
quickly.
And these are the numbers of new patterns that we learned, patterns that we
never searched for in the original data:
we're learning a lot of new patterns that we didn't originally search
for,
for all of the datasets.
And then some linguistic analysis, quickly.
We aim to characterize the differences between our datasets. So, again using some AutoSlog-TS
instantiations, in our generic data we see these
creative sorts of different instantiations for sarcastic posts, whereas, again, the not sarcastic posts have
this highly
technical, jargon sort of terminology.
For the rhetorical questions, we observe a lot of the same properties for the not
sarcastic class,
but for the sarcastic class we observe that
there's a lot of attacking of basic human abilities in this debate forum dialogue:
people say things like "can you read," "can you write,"
"do you understand."
So we went through looking at some of the dependency parses of these
sorts of questions
and found a lot of things that really relate to basic human ability, so
people are attacking people,
not really attacking their argument; that's very prevalent in our debate forums data.
And finally, for hyperbole, we find that the adjective and adverb patterns are really common,
even though we didn't originally search for these in our regex experiments,
and things like contrast by exclusion turn up in the
examples of hyperbole that we pick up, which are really interesting.
In conclusion, we develop a large-scale, highly reliable corpus of sarcasm. We reduce annotation
cost and effort by skewing the distribution, avoiding having to annotate huge pools of
data,
we operationalize lexico-syntactic cues for rhetorical questions and hyperbole,
and we verify the quality of our corpus empirically and qualitatively.
For future directions, we want to do more feature engineering and model selection based on our
linguistic observations,
develop more generalizable models of the different categories of sarcasm that we haven't looked at,
and explore characteristics of our lower-agreement data, to see if there's anything interesting there as
well.
Thanks.
Questions?
So first of all, we began with not looking at those categories,
right, we started with this really generic sarcasm, so definitely it's kind of
a long tail, right, there's a lot of different exaggerations;
that's definitely a problem.
We began initially with just sarcasm in general, but it's kind of interesting
to get into the more refined categories and look at how those are different, and
yes, there are also different sorts of things that we could look at; understatement is
quite prevalent as well.
It doesn't only exist in the debate forums, it's just quite pronounced in the
forums, so
it would be good to look at that.
Right, so the question is about the word2vec features:
do we train them,
do we train the word2vec model on our corpus, or do we use an existing model?
We've done both. The results that we're reporting are actually with the Google
News trained vectors, which kind of
correlate with our data as well,
the debate forums.
We have used our own trained model; it didn't perform as well, probably
because of the smaller amount of data,
and the Google News model is trained on a huge amount of data, so that's definitely
worth exploring in the future as well.
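The talk doesn't name the tooling, so as an assumption, here's how those two options might look with gensim; the stand-in training data and variable names are just for illustration:

    import gensim.downloader as api
    from gensim.models import Word2Vec

    # pretrained vectors from a huge general corpus (Google News)
    pretrained = api.load("word2vec-google-news-300")

    # versus vectors trained on a (much smaller) in-domain corpus
    tokenized_debate_posts = [["example", "debate", "post"]]  # stand-in data
    own = Word2Vec(tokenized_debate_posts, vector_size=300, min_count=1).wv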
Right, so actually I didn't mention the numbers here; there's more detail in
our paper, but our levels of agreement were about seventy percent for each of
the tasks, and they were actually better for the smaller tasks,
where, compared to generic sarcasm, it's a little bit more constrained.
No, that's actually agreement with the majority label.
So it's actually better for the subcategories, in fact, than for the generic
sarcasm task; it's pretty hard to
get high-agreement annotations.
So I was wondering about the idea of contrast. You said
somewhere that it highlights the fact that there is some contrast
between the literal thing and what you think of that element,
and also this idea that the sarcastic meaning is non-
literal, right? Yes, so I was thinking about a possible connection with metaphor, and
with the task of metaphor detection, right. So here you are focusing on trying
to find patterns that characterize sarcasm,
but for instance in some work in metaphor detection the goal is to
capture contrast, right, what makes a particular use different from the literal use.
So by looking at how the sarcastic intended meaning can actually be far from the regular
use, I was wondering, and it's a very open question,
whether you had thought about the task in
these terms.
That's really interesting. So looking at, kind of,
maybe trying to measure how far away something is on a sort of contrast scale, that would
definitely be interesting. We haven't
done that explicitly, but I mean,
the different intensifiers can have different affect, so it's kind of
trying to map it across the scale.
Other questions?
Yes, a question.
When you're doing the mining of the data
and you're identifying different
phrases that are more associated with sarcasm and non-sarcasm,
did you do things to make sure that the dataset was not biased, you know,
toward utilizing those kinds of phrases?
So that if, later, someone wanted to build an automated system to detect
sarcasm and non-sarcasm, they wouldn't just
read your paper and go after these phrases, because this was used to
construct the corpus.
Right. So for our generic sarcasm corpus, that was a random sample,
so none of that is sampled in any biased way. For the rhetorical questions and hyperbole,
we would select those posts, but
the posts actually contain all sorts of other cues, and it's important to note that
if we ever selected a cue, it would exist in both the sarcastic and not sarcastic
classes.
So it's not like you would only find it in one, and that's kind of
what made it interesting, that you can see those cues used in both sorts of
situations, so it wouldn't be biased that way.