Ibrahim Abu Farha | Arabic Sarcasm Detection: A Summary of My PhD Research

Sarcasm is a form of verbal irony where the intended meaning differs from the literal meaning. Consider this tweet: “يا سلام، ما أجمل يوم اثنين تبدأ فيه بثلاث ساعات في الزحمة وبطارية الجوال ميتة” - roughly, “What a beautiful Monday, starting with three hours in traffic and a dead phone battery.” Any Arabic speaker would immediately read this as sarcastic. A sentiment classifier that relies on surface cues such as “beautiful” would likely label it as positive. This gap between the literal and intended meaning is the core challenge of sarcasm detection.

My PhD, completed at the University of Edinburgh in 2023 under the supervision of Prof. Walid Magdy and Prof. Bonnie Webber, focused on Arabic sarcasm detection. The goal was to build datasets, establish benchmarks, and understand the limits of current approaches for this task.

The gap in 2019

When I started my PhD, Arabic NLP had made meaningful progress on sentiment analysis, named entity recognition, and machine translation. However, sarcasm detection in Arabic had received very little attention. There were no publicly available Arabic sarcasm datasets and no established benchmarks. This was a significant gap, given that sarcasm is common in Arabic social media and directly affects the accuracy of sentiment analysis systems.

Building the first dataset: ArSarcasm

The first step was creating a dataset. Rather than collecting new tweets from scratch, I re-annotated two existing Arabic sentiment datasets - the SemEval 2017 Arabic dataset and ASTD - for sarcasm. This approach produced paired sarcasm and sentiment labels on the same data, which allowed us to study the relationship between the two directly.

The resulting dataset, ArSarcasm, contains 10,547 tweets labelled for sarcasm (binary), sentiment (positive/negative/neutral), and dialect (Egyptian, Levantine, Gulf, Maghrebi, MSA). The analysis showed that around 16% of the tweets in these two corpora were sarcastic. Sarcastic tweets showed a strong sentiment inversion: tweets that appeared positive on the surface were typically negative in intent. State-of-the-art sentiment classifiers showed a significant drop in performance on sarcastic content, confirming that sarcasm is a meaningful source of error for these systems.
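To make the value of paired labels concrete, here is a minimal sketch of how the sarcasm rate and the sentiment distribution of sarcastic tweets could be computed once both labels sit on the same rows. The records are toy rows invented for illustration, and the field names `sarcasm` and `sentiment` are assumptions rather than the dataset's documented schema.

```python
from collections import Counter

# Toy records invented for illustration; they are NOT taken from ArSarcasm.
# The field names below are assumptions, not the dataset's documented schema.
tweets = [
    {"sarcasm": True,  "sentiment": "negative"},
    {"sarcasm": True,  "sentiment": "negative"},
    {"sarcasm": False, "sentiment": "positive"},
    {"sarcasm": False, "sentiment": "neutral"},
    {"sarcasm": False, "sentiment": "negative"},
    {"sarcasm": True,  "sentiment": "neutral"},
    {"sarcasm": False, "sentiment": "positive"},
    {"sarcasm": False, "sentiment": "neutral"},
]


def sarcasm_rate(rows):
    """Fraction of rows labelled sarcastic (0.0 for an empty list)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["sarcasm"]) / len(rows)


def sentiment_distribution(rows, sarcastic):
    """Relative frequency of each sentiment label among rows
    whose sarcasm label equals `sarcastic`."""
    subset = [r["sentiment"] for r in rows if r["sarcasm"] == sarcastic]
    total = len(subset)
    if total == 0:
        return {}
    return {label: n / total for label, n in Counter(subset).items()}


rate = sarcasm_rate(tweets)
sarcastic_dist = sentiment_distribution(tweets, sarcastic=True)
literal_dist = sentiment_distribution(tweets, sarcastic=False)

print(f"sarcasm rate: {rate:.3f}")
print("sentiment among sarcastic tweets:", sarcastic_dist)
print("sentiment among non-sarcastic tweets:", literal_dist)
```

Comparing the two distributions is what exposes how sarcastic tweets skew towards negative sentiment. That comparison is only possible because both labels are attached to the same tweet.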

A BiLSTM baseline trained on the dataset achieved F1-sarcasm = 0.46, establishing an initial benchmark. A follow-up version, ArSarcasm-v2, extended the dataset to 15,548 tweets and was used in the WANLP 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic.
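F1-sarcasm is the F1 score computed on the sarcastic class only. The sketch below shows that computation, assuming binary labels where 1 marks a sarcastic tweet. The gold and predicted labels are invented toy values, and the function name `f1_positive` is hypothetical.

```python
def f1_positive(gold, pred, positive=1):
    """F1 score of the `positive` class only (here: the sarcastic class)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No correct sarcastic prediction: precision or recall is zero/undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration: 3 sarcastic tweets out of 10.
gold = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

score = f1_positive(gold, pred)
accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
always_negative = f1_positive(gold, [0] * len(gold))

print(f"F1-sarcasm: {score:.3f}")
print(f"accuracy:   {accuracy:.3f}")
print(f"F1-sarcasm of an always-non-sarcastic baseline: {always_negative:.3f}")
```

Reporting the class-specific F1 rather than accuracy matters because sarcastic tweets are a minority. A classifier that never predicts sarcasm can still reach high accuracy while scoring zero on this metric.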

Intended vs. perceived sarcasm: iSarcasmEval

The ArSarcasm labels reflect perceived sarcasm - annotators reading other people’s tweets and judging whether they are sarcastic. This approach has a fundamental limitation: because sarcasm is subjective, annotators can disagree, and their labels may not reflect the author’s actual intent.

To address this, I collected a dataset based on intended sarcasm, where authors label their own text. Contributors provided sarcastic sentences in their own dialect along with non-sarcastic rephrases that convey the same intended meaning. First-party annotation of this kind produces more reliable labels and enables a more controlled analysis through the paired sarcastic and non-sarcastic sentences.

This dataset became the Arabic portion of iSarcasmEval, used in SemEval-2022 Task 6 - the first shared task on intended sarcasm detection in English and Arabic. The task attracted 60 participating teams.

Modelling results

Testing state-of-the-art models on both datasets showed that intended sarcasm detection is more challenging than perceived sarcasm detection, despite its cleaner labels. A likely explanation is that perceived-sarcasm annotations tend to capture the more obvious cases - those that third-party annotators could identify. Intended-sarcasm labels also include more subtle and ambiguous cases that even careful readers may disagree on.

The best-performing models were MARBERT and AraBERT - monolingual Arabic models pre-trained on large dialectal Arabic corpora. Multilingual models such as XLM-R performed worse, suggesting that coverage of dialectal Arabic in pre-training is more important than multilingual breadth for this task.

Human vs. machine performance

To estimate human performance as a reference point for the task, we collected human annotations on a held-out portion of the intended-sarcasm data and compared them against the best models. Individual human annotators achieved F1-sarcasm around 0.525 - close to but slightly below the best model (0.563). With majority voting across multiple annotators, humans reached 0.665, outperforming all models.
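As an illustration of the aggregation step, here is a minimal sketch of majority voting over binary sarcasm judgements. The number of annotators per item and the tie-breaking rule (falling back to non-sarcastic) are assumptions made for this example, and the annotations are invented.

```python
from collections import Counter


def majority_vote(labels, tie_break=0):
    """Return the most frequent label; fall back to `tie_break`
    when the list is empty or the top two labels are tied."""
    if not labels:
        return tie_break
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


# Toy annotations invented for illustration: one inner list per sentence,
# one entry per annotator (1 = sarcastic, 0 = not sarcastic).
annotations = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 0],  # an even split, resolved by the tie-break rule
]

aggregated = [majority_vote(item) for item in annotations]
print(aggregated)
```

The aggregated labels can then be scored against the authors' own labels in the same way as a model's predictions. This puts the individual, majority-vote, and model scores on the same footing.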

The error analysis showed similar failure patterns for both humans and models:

Sarcasm type	Examples	Human errors	Model errors
Idioms	58	13	7
Proverbs	45	10	0
Referencing context / world knowledge	45	15	12
Complex metaphors	45	11	8
Dialect-specific words	21	8	1
Animal / object references	11	0	0
Words in uncommon context	8	0	0

Around 83% of model errors involved missing world knowledge or contextual information. Human errors were more concentrated in idioms and proverbs. This suggests that sarcasm detection is not purely a language understanding problem - it also requires world knowledge and cultural context that current models lack.
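Because the categories differ a lot in size (58 idiom examples versus 8 for words in uncommon context), raw error counts can be hard to compare. The sketch below normalises the counts from the table into per-category error rates. It assumes each count refers to examples within that category and makes no assumption about whether categories overlap.

```python
# (category, examples, human_errors, model_errors), copied from the table above.
rows = [
    ("Idioms", 58, 13, 7),
    ("Proverbs", 45, 10, 0),
    ("Referencing context / world knowledge", 45, 15, 12),
    ("Complex metaphors", 45, 11, 8),
    ("Dialect-specific words", 21, 8, 1),
    ("Animal / object references", 11, 0, 0),
    ("Words in uncommon context", 8, 0, 0),
]


def error_rates(table):
    """Map each category to (human_error_rate, model_error_rate),
    where each rate is errors divided by examples in that category."""
    return {name: (human / n, model / n) for name, n, human, model in table}


rates = error_rates(rows)
for name, (human_rate, model_rate) in rates.items():
    print(f"{name:<40} human {human_rate:6.1%}  model {model_rate:6.1%}")
```

These rates are per category and are not meant to sum to 100%. They describe how often each group failed within a category, which is a different question from how the total errors are distributed across categories.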

Effect of dialect familiarity

A separate study examined how dialect familiarity affects sarcasm detection. Arabic speakers from different dialect backgrounds annotated sarcasm in tweets from various dialects. The results showed that annotators perform significantly better on their own dialect or dialects they are familiar with, and that performance drops substantially on unfamiliar dialects.

This finding has implications beyond sarcasm. Any subjective annotation task in Arabic will be affected by the dialect background of the annotators. It is not sufficient to recruit Arabic speakers in general - for tasks where dialect matters, annotators should be native speakers of the relevant dialect. One additional finding was that female annotators tended to perform slightly better than male annotators on the sarcasm detection task, though the effect was modest.

Summary

This PhD produced three publicly available datasets and contributed to two shared tasks. The main findings are:

Sarcasm is common in Arabic sentiment datasets (~16% in the two corpora studied) and significantly degrades sentiment analysis performance.
First-party annotation produces more reliable labels than third-party annotation for subjective tasks like sarcasm detection.
Intended sarcasm detection is more challenging than perceived sarcasm detection, despite cleaner labels.
Dialect-aware monolingual Arabic models outperform multilingual models on this task.
Sarcasm detection requires world knowledge and cultural context beyond language understanding alone.
Dialect familiarity significantly affects annotation quality for subjective Arabic NLP tasks.

The datasets are available here:

ArSarcasm: github.com/iabufarha/ArSarcasm
ArSarcasm-v2: github.com/iabufarha/ArSarcasm-v2
iSarcasmEval: github.com/iabufarha/iSarcasmEval