THE LIMITS OF LARGE LANGUAGE MODELS IN FINE-GRAINED EMOTION DETECTION: A COMPARATIVE AND ERROR ANALYSIS STUDY

Mahrukh Rafique; Ahmed Asja; Shahzad Babar; Muhammad Khan; Mukhtar Ali Soomro

Keywords:

emotion recognition, NLP, fine-grained classification, transformer models, DistilBERT, RoBERTa, LoRA, multi-label classification, error analysis, GoEmotions.

Abstract

Emotion recognition in text is an increasingly important natural language processing task, yet the extent to which transformer-based models perform reliably on fine-grained, multi-label emotion classification remains poorly understood. This paper critically evaluates the effectiveness and failure modes of large language models applied to emotion detection, focusing on how emotional granularity degrades classification performance and which structural error patterns emerge. Two benchmark datasets were used: the Emotion dataset (~20,000 Twitter posts across six coarse-grained categories) and GoEmotions (~58,000 Reddit comments across 28 fine-grained emotion categories). TF-IDF baselines with Logistic Regression and SVM were established first, followed by fine-tuning of DistilBERT on the Emotion dataset and of DistilBERT, BERT-base, and RoBERTa with Low-Rank Adaptation (LoRA) on GoEmotions. On the coarse-grained task, DistilBERT reached 92.25% accuracy and a macro-F1 of 0.87, well above the Logistic Regression baseline of 86.45% accuracy and a macro-F1 of 0.80. On GoEmotions, RoBERTa+LoRA achieved a micro-F1 of 0.61, a macro-F1 of 0.55, and a Hamming loss of 0.0338, outperforming all baselines and DistilBERT by 8.9 macro-F1 points. These scores are nevertheless substantially lower than the coarse-grained results, confirming that increased emotional granularity introduces structural difficulties that architecture alone cannot resolve. Structured error analysis identified four failure types: rare-class underperformance, universal semantic confusion across all 28 categories, over-prediction of dominant classes, and systematic under-detection of nuanced emotions. These findings argue for a diagnostic, failure-oriented evaluation framework as a professional and ethical requirement for emotion recognition research.
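The three multi-label metrics reported for GoEmotions (micro-F1, macro-F1, and Hamming loss) can be illustrated with a minimal, dependency-free sketch. This is not the study's evaluation code: the three label names and the toy predictions are hypothetical, chosen only to show how over-prediction of a dominant class and a missed rare class pull macro-F1 below micro-F1. Labels with no true or predicted positives are assumed to score an F1 of 0.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[int]]  # rows = samples, columns = labels (0 or 1)


def per_label_counts(y_true: Matrix, y_pred: Matrix) -> Tuple[List[int], List[int], List[int]]:
    """Count true positives, false positives and false negatives per label."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1
    return tp, fp, fn


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; assumed to be 0.0 when the label never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool counts over all labels, so frequent labels dominate the score."""
    tp, fp, fn = per_label_counts(y_true, y_pred)
    return f1(sum(tp), sum(fp), sum(fn))


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Average per-label F1, so every label counts equally, rare or not."""
    tp, fp, fn = per_label_counts(y_true, y_pred)
    scores = [f1(a, b, c) for a, b, c in zip(tp, fp, fn)]
    return sum(scores) / len(scores)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual (sample, label) cells predicted incorrectly."""
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / (len(y_true) * len(y_true[0]))


# Hypothetical toy data; columns are [joy, anger, grief], with grief as the rare label.
y_true = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
# The toy "model" over-predicts joy and never predicts grief.
y_pred = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0]]

print(f"micro-F1:     {micro_f1(y_true, y_pred):.4f}")
print(f"macro-F1:     {macro_f1(y_true, y_pred):.4f}")
print(f"Hamming loss: {hamming_loss(y_true, y_pred):.4f}")
```

In this toy case the missed rare label contributes an F1 of 0 to the macro average, while the micro average pools it with the frequent labels. With 28 mostly inactive labels per comment, Hamming loss can stay small even when rare emotions are rarely detected, which is why a low value should be read alongside macro-F1.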