A DOMAIN-SPECIFIC APPROACH FOR CROSS-LINGUAL EMOTION DETECTION THROUGH TEXT MINING

Faisal Shahzad; Israr Hanif

Authors

Faisal Shahzad Department of Computer Science, Bahauddin Zakariya University (BZU), Multan, Pakistan https://orcid.org/0009-0003-8413-2456
Israr Hanif Department of Computer Science, Bahauddin Zakariya University (BZU), Multan, Pakistan

Keywords:

Emotion recognition, low-resource languages, Urdu, cross-lingual transfer, transformer models, preprocessing

Abstract

Emotion recognition is the branch of sentiment analysis concerned with a more nuanced understanding of the complex and diverse emotions expressed in text. Existing studies have mostly been limited to English and to general-purpose transformer models, and there is a significant lack of research on low-resource, morphologically complex languages such as Urdu. Urdu differs substantially from English, and no annotated Urdu emotion dataset is available, which makes cross-lingual emotion detection a challenging problem. To bridge this gap, the GoEmotions dataset was translated into Urdu and mapped onto seven basic emotion categories (anger, happiness, sadness, surprise, disgust, fear, and neutral). The task was defined as a multi-label emotion classification problem, and six transformer models (BERT, DistilBERT, IndicBERT, RoBERTa, XLM-R, and RemBERT) were evaluated on preprocessed and unprocessed versions of the English and Urdu data in four configurations: English-to-English, Urdu-to-Urdu, English-to-Urdu, and Urdu-to-English. In the monolingual experiments, BERT achieved the highest accuracy (Acc=0.8115) on English data, while XLM-R gave the best F1 score (F1=0.4360) on Urdu data, and RoBERTa showed the highest accuracy (Acc=0.8161) on unprocessed Urdu text. In the cross-lingual setting, XLM-R gave the best results (Acc=0.8219, F1=0.5177), with RemBERT close behind, which indicates the multilingual generalization ability of these models. Moreover, preprocessing did not significantly improve performance on low-resource, morphologically rich text such as Urdu. A comparative analysis also showed that while sentiment classification in Arabic reached 90% accuracy, the Urdu-based experiments reached a maximum of 81.6%. The results of this study provide a reliable starting point for future cross-lingual sentiment analysis on low-resource languages.
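The label mapping and the four train/test configurations described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which fine-grained GoEmotions labels were assigned to each of the seven categories, so the grouping below is an assumption: it follows the Ekman-style grouping distributed with GoEmotions, with "joy" renamed to "happiness" to match the category names used here. The names COARSE_MAP, to_multi_hot, and CONFIGS are hypothetical and are not taken from the authors' code.

```python
from itertools import product

# ASSUMPTION: this grouping follows the Ekman-style mapping distributed with
# GoEmotions ("joy" renamed to "happiness"); the authors' exact mapping is not
# given in the abstract.
COARSE_MAP = {
    "anger": ["anger", "annoyance", "disapproval"],
    "disgust": ["disgust"],
    "fear": ["fear", "nervousness"],
    "happiness": [
        "joy", "amusement", "approval", "excitement", "gratitude", "love",
        "optimism", "relief", "pride", "admiration", "desire", "caring",
    ],
    "sadness": ["sadness", "disappointment", "embarrassment", "grief", "remorse"],
    "surprise": ["surprise", "realization", "confusion", "curiosity"],
    "neutral": ["neutral"],
}

# Invert the grouping so each fine-grained label points to its coarse category.
FINE_TO_COARSE = {
    fine: coarse for coarse, fines in COARSE_MAP.items() for fine in fines
}
CATEGORIES = list(COARSE_MAP)  # fixed column order for the multi-hot vector


def to_multi_hot(fine_labels):
    """Map a sample's fine-grained labels to a 7-dimensional multi-hot vector."""
    coarse = {FINE_TO_COARSE[label] for label in fine_labels}
    return [1 if category in coarse else 0 for category in CATEGORIES]


# The four (train language, test language) configurations:
# two monolingual (en-en, ur-ur) and two cross-lingual (en-ur, ur-en).
LANGS = ["en", "ur"]
CONFIGS = list(product(LANGS, repeat=2))
```

A multi-hot target is used because the task is defined as multi-label: a single comment annotated with both "annoyance" and "joy" activates the anger and happiness columns at the same time.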

Author Biography

Faisal Shahzad, Department of Computer Science, Bahauddin Zakariya University (BZU), Multan, Pakistan

Motivated and detail-oriented Data Science graduate (MS/MPhil) with hands-on experience in Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, Digital Image Processing, and Medical Image Processing. Adept at applying statistical and computational techniques to solve real-world problems. Passionate about domain-specific content classification and cross-lingual NLP. Strong academic foundation, freelance development experience, and a proven ability to learn and adapt. Led research and development of multilingual sentiment classification systems and medical imaging pipelines. Skilled in Python, TensorFlow, and Scikit-learn with hands-on data pipeline development in academic and freelance projects. Proficient in team collaboration, problem-solving, and delivering actionable insights. Open to challenging roles in data science leadership and enterprise-scale ML development.
Nearing completion of an MPhil in Data Science, with hands-on experience in AI-based research, student mentorship, and technical instruction. Actively mentoring junior researchers on real-world AI projects involving image-based and tabular datasets, and currently serving as a university instructor at Bahauddin Zakariya University, Multan, teaching Database Systems and Information Retrieval to undergraduate students in the Data Analytics and Artificial Intelligence programs.