BRIDGING DATA SCARCITY IN SINDHI NER USING MACHINE-LABELED CORPORA AND MULTILINGUAL TRANSFORMERS

Authors

  • Nazish Basir
  • Muhammad Suleman Memon
  • Mumtaz Qabulio
  • Danish Nazir Arain
  • Rafique Ahmed Vighio
  • Dr. AHS Bukhari

Abstract

In low-resource languages, Named Entity Recognition (NER) poses significant challenges due to the shortage of high-quality annotated corpora and the unequal distribution of entity categories. In this study, we examine the effectiveness of incorporating machine-labeled data into Sindhi NER using two multilingual transformer models, Multilingual BERT (mBERT) and XLM-RoBERTa, under two training settings: (i) direct fine-tuning on gold-standard human-annotated data and (ii) machine-labeled pre-training followed by fine-tuning. Model performance is assessed using entity-level precision, recall, and F1-score, together with learning-curve analysis and confusion matrices. The results indicate that machine-labeled pre-training improves recognition, particularly for mid-frequency and low-frequency entity categories. XLM-RoBERTa outperforms mBERT in both aggregated and entity-specific evaluations, and pre-training increases the micro-F1 score from 0.50 to 0.63 for mBERT and from 0.72 to 0.79 for XLM-RoBERTa, relative to training without the pre-training step. These findings indicate that large-scale weak supervision can mitigate data scarcity and improve contextual representation learning for Sindhi and other low-resource languages. The study provides a strong empirical baseline for Sindhi NER and highlights the practical value of machine-labeled pre-training for low-resource language processing.
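The entity-level micro-averaged metrics reported in the abstract can be illustrated with a minimal sketch. The function below computes micro precision, recall, and F1 from per-entity-type true-positive, false-positive, and false-negative counts; the entity types and counts shown are hypothetical placeholders, not figures from the paper.

```python
def micro_prf(counts):
    """Micro-averaged precision, recall, and F1 from per-entity-type
    (tp, fp, fn) counts: pool all counts across types, then compute
    the metrics once on the pooled totals."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical per-type (tp, fp, fn) counts for illustration only
counts = {
    "PER": (50, 10, 5),
    "LOC": (30, 5, 10),
    "ORG": (20, 10, 15),
}
p, r, f1 = micro_prf(counts)
```

Micro-averaging weights every entity mention equally, so frequent types dominate the score; per-type (entity-specific) evaluation, as used in the study, is what reveals gains on mid- and low-frequency categories.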

Published

2026-03-10

How to Cite

Nazish Basir, Muhammad Suleman Memon, Mumtaz Qabulio, Danish Nazir Arain, Rafique Ahmed Vighio, & Dr. AHS Bukhari. (2026). BRIDGING DATA SCARCITY IN SINDHI NER USING MACHINE-LABELED CORPORA AND MULTILINGUAL TRANSFORMERS. Spectrum of Engineering Sciences, 4(3), 179–194. Retrieved from https://www.thesesjournal.com/index.php/1/article/view/2168