A DOMAIN-SPECIFIC APPROACH FOR CROSS-LINGUAL EMOTION DETECTION THROUGH TEXT MINING
Keywords:
Emotion recognition, low-resource languages, Urdu, cross-lingual transfer, transformer models, preprocessing
Abstract
Emotion recognition is the branch of sentiment analysis that aims at a more nuanced understanding of the complex and diverse emotions expressed in text. Existing studies have mostly been limited to English and to general-purpose transformer models, and there is a significant lack of research on low-resource, morphologically complex languages such as Urdu. Urdu differs substantially from English, and no annotated Urdu emotion dataset is available, which makes cross-lingual emotion detection a challenging problem. To bridge this gap, the GoEmotions dataset was translated into Urdu and mapped onto seven basic emotion categories (anger, happiness, sadness, surprise, disgust, fear, and neutral). The task was defined as a multi-label emotion classification problem, and six transformer models (BERT, DistilBERT, IndicBERT, RoBERTa, XLM-R, and RemBERT) were evaluated on preprocessed and unprocessed versions of the English and Urdu data in four configurations: English-to-English, Urdu-to-Urdu, English-to-Urdu, and Urdu-to-English. In the monolingual experiments, BERT achieved the highest accuracy on English data (Acc=0.8115), XLM-R gave the best F1 score on Urdu data (F1=0.4360), and RoBERTa showed the highest accuracy on unprocessed Urdu text (Acc=0.8161). In the cross-lingual experiments, XLM-R gave the best results (Acc=0.8219, F1=0.5177), with RemBERT close behind, which demonstrates the multilingual generalization ability of these models. Moreover, preprocessing did not significantly improve performance on low-resource, morphologically rich text such as Urdu. A comparative analysis also showed that while sentiment classification in Arabic reached 90% accuracy, the Urdu experiments were limited to a maximum of 81.6%. The results of this study provide a reliable starting point for future cross-lingual sentiment analysis on low-resource languages.
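
As a rough illustration of the setup summarized above, the following Python sketch collapses fine-grained GoEmotions labels into the seven categories as a multi-hot vector and enumerates the four train-to-test language configurations. The mapping `FINE_TO_COARSE` is hypothetical and deliberately partial, since the actual label mapping is not given here, and the helper names `to_multi_hot` and `configurations` are assumptions introduced only for this example.

```python
from itertools import product

# The seven target categories named in the abstract.
CATEGORIES = ["anger", "happiness", "sadness", "surprise", "disgust", "fear", "neutral"]

# Hypothetical, partial mapping from fine-grained GoEmotions labels to the
# seven categories. This is an assumption for illustration only; the mapping
# actually used in the study is not stated in the abstract.
FINE_TO_COARSE = {
    "anger": "anger",
    "annoyance": "anger",
    "joy": "happiness",
    "amusement": "happiness",
    "sadness": "sadness",
    "grief": "sadness",
    "surprise": "surprise",
    "disgust": "disgust",
    "fear": "fear",
    "nervousness": "fear",
    "neutral": "neutral",
}


def to_multi_hot(fine_labels):
    """Collapse fine-grained labels into a 7-dimensional multi-hot vector."""
    coarse = {FINE_TO_COARSE[label] for label in fine_labels if label in FINE_TO_COARSE}
    return [1 if category in coarse else 0 for category in CATEGORIES]


def configurations(languages=("English", "Urdu")):
    """Enumerate every (training language, test language) pairing."""
    return list(product(languages, repeat=2))


if __name__ == "__main__":
    # A sample carrying two fine-grained labels that map to different categories.
    print(to_multi_hot(["annoyance", "joy"]))
    for train_lang, test_lang in configurations():
        print(f"{train_lang}-to-{test_lang}")
```

Because a sample may carry several labels, each target is a multi-hot vector rather than a single class index, which is what makes the task multi-label rather than multi-class.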












