When Sentiment Lies: Building the First Arabic Sarcasm Dataset
Sarcasm is one of the main challenges for sentiment analysis systems. A sarcastic tweet means the opposite of what it literally says, so classifiers that rely on surface wording assign it the wrong label. For example: “يا سلام، الخدمة كانت ممتازة، ما انتظرنا غير ساعتين” - “Oh great, the service was excellent, we only waited two hours.” Most systems would label this as positive, but the intended meaning is clearly negative.
At the time of this work, there were no publicly available Arabic sarcasm datasets. Without labelled data, it was not possible to train or evaluate sarcasm detection models for Arabic, or to measure how much sarcasm affects Arabic sentiment analysis systems.
Building ArSarcasm
Rather than collecting new tweets from scratch, I re-annotated two existing Arabic sentiment datasets - the SemEval 2017 Arabic dataset and ASTD - for sarcasm. This approach produced paired sarcasm and sentiment labels on the same data, which made it possible to directly measure the relationship between the two.
The resulting dataset, ArSarcasm, contains 10,547 tweets annotated for sarcasm, sentiment, and dialect:
| Attribute | Values |
|---|---|
| Sarcasm labels | Sarcastic / Not sarcastic |
| Sentiment labels | Positive / Negative / Neutral |
| Dialect labels | Egyptian / Levantine / Gulf / Maghrebi / MSA |
| Total tweets | 10,547 |
| Sarcasm prevalence | ~16% |
Key findings
Sarcasm prevalence: Around 16% of the tweets in these two corpora were labelled sarcastic. This is higher than we expected: in this data, sarcasm is not a rare phenomenon but a common feature of how people express themselves on Arabic social media.
Effect on sentiment: Sarcastic tweets showed a strong sentiment inversion. Tweets that appeared positive in surface form were typically negative in intent. This means that sarcasm systematically affects the accuracy of sentiment labels in existing datasets.
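Because every tweet carries both a sarcasm and a sentiment label, the inversion can be checked by comparing sentiment distributions across the two sarcasm groups. The following is a minimal sketch of that comparison; the records are invented toy values, not ArSarcasm data, and `sentiment_by_sarcasm` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Toy (is_sarcastic, sentiment) pairs, invented for illustration only.
records = [
    (True, "negative"),
    (True, "negative"),
    (True, "neutral"),
    (False, "positive"),
    (False, "neutral"),
    (False, "negative"),
    (False, "neutral"),
    (False, "positive"),
]


def sentiment_by_sarcasm(rows):
    """Return {is_sarcastic: {sentiment: share}} for paired labels."""
    counts = defaultdict(Counter)
    for is_sarcastic, sentiment in rows:
        counts[is_sarcastic][sentiment] += 1
    return {
        flag: {label: n / sum(c.values()) for label, n in c.items()}
        for flag, c in counts.items()
    }


shares = sentiment_by_sarcasm(records)
for flag in (True, False):
    name = "sarcastic" if flag else "not sarcastic"
    print(name, shares[flag])
```

On real data, a much larger negative share in the sarcastic group than in the non-sarcastic group would be the pattern described above.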
Effect on sentiment models: State-of-the-art sentiment classifiers performed significantly worse on sarcastic content, confirming that sarcasm is a meaningful source of error for these systems.
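One simple way to expose this kind of gap is to score a sentiment classifier separately on the sarcastic and non-sarcastic subsets. The sketch below assumes plain accuracy as the metric and uses invented predictions; it is not the evaluation setup from the paper, and `accuracy_by_subset` is a hypothetical helper.

```python
def accuracy(pairs):
    """Fraction of (gold, predicted) pairs that match."""
    pairs = list(pairs)
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


# Toy (is_sarcastic, gold_sentiment, predicted_sentiment) rows,
# invented for illustration only.
examples = [
    (True, "negative", "positive"),
    (True, "negative", "positive"),
    (True, "negative", "negative"),
    (False, "positive", "positive"),
    (False, "neutral", "neutral"),
    (False, "negative", "negative"),
    (False, "positive", "neutral"),
]


def accuracy_by_subset(rows):
    """Score sentiment predictions separately per sarcasm group."""
    sarcastic = [(g, p) for s, g, p in rows if s]
    literal = [(g, p) for s, g, p in rows if not s]
    return {
        "sarcastic": accuracy(sarcastic),
        "non_sarcastic": accuracy(literal),
    }


scores = accuracy_by_subset(examples)
print(scores)
```

Reporting the two numbers side by side makes the cost of sarcasm visible, whereas a single overall score would hide it.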
Annotation subjectivity: Sarcasm is inherently subjective, and inter-annotator agreement was lower than for many other NLP tasks. The labels in ArSarcasm reflect the annotators’ perception of sarcasm rather than the authors’ intent - a distinction that has important implications for dataset use and evaluation.
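Agreement between annotators is often summarised with a chance-corrected statistic. The sketch below computes Cohen's kappa for two annotators as one common choice; this post does not state which agreement measure was used for ArSarcasm, and the label lists are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(freq_a) | set(freq_b)
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy sarcasm judgements (1 = sarcastic), invented for illustration.
annotator_a = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
annotator_b = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this toy case the annotators agree on 8 of 10 items, yet kappa is noticeably lower than 0.8 because much of that agreement is expected by chance when most items are not sarcastic.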
Baseline results
We trained a BiLSTM model on the dataset as an initial baseline. It achieved an F1 score of 0.46 on the sarcastic class, which reflects the difficulty of the task and serves as a starting point for future work.
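To make the metric concrete, the sketch below computes F1 with the sarcastic class treated as the positive class, which is how we assume the score should be read here. The gold labels and predictions are invented and unrelated to the BiLSTM's actual output.

```python
def f1_positive(gold, pred, positive=1):
    """Precision, recall and F1 for a single class treated as positive."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels (1 = sarcastic), invented for illustration only.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = f1_positive(gold, pred)
overall_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={overall_accuracy:.2f}")
```

Scoring only the sarcastic class matters because sarcastic tweets are a minority: a model can reach high overall accuracy by mostly predicting "not sarcastic" while still missing many sarcastic tweets.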
The dataset is available at github.com/iabufarha/ArSarcasm. The paper was published at OSACT4, a workshop at LREC 2020.
What followed
The annotation subjectivity issue raised a more fundamental question: since third-party annotators can only judge based on what they read, their labels may not accurately reflect the author’s intent. In follow-up work, we collected sarcasm labels directly from authors - what we called intended sarcasm. This produced cleaner labels and led to the iSarcasmEval dataset, which was used in SemEval-2022 Task 6. ArSarcasm-v2, an extended version with 15,548 tweets, was also released and used in the WANLP 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic.