BRIDGING DATA SCARCITY IN SINDHI NER USING MACHINE-LABELED CORPORA AND MULTILINGUAL TRANSFORMERS
Abstract
Named Entity Recognition (NER) in low-resource languages poses many challenges due to the shortage of high-quality annotated corpora and the unequal distribution of entity categories. In this study we examine the effectiveness of incorporating machine-labeled data into Sindhi NER using two multilingual transformer models, Multilingual BERT (mBERT) and XLM-RoBERTa, under two training settings: (i) direct fine-tuning on gold-standard human-annotated data and (ii) machine-labeled pre-training followed by fine-tuning. Model performance is assessed using entity-level precision, recall, and F1-score, together with learning-curve analysis and confusion matrices. The results indicate that machine-labeled pre-training improves recognition, particularly for mid-frequency and low-frequency entity categories. XLM-RoBERTa outperforms Multilingual BERT in both aggregated and entity-specific evaluations, and pre-training increases the micro-F1 score from 0.50 to 0.63 for mBERT and from 0.72 to 0.79 for XLM-RoBERTa, relative to fine-tuning alone. These findings indicate that large-scale weak supervision can mitigate data scarcity and improve contextual representation learning for Sindhi and other low-resource languages. The study provides a strong empirical baseline for Sindhi NER and highlights the practical value of machine-labeled pre-training for low-resource language processing.
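The entity-level evaluation mentioned above scores predictions per entity span rather than per token: a prediction counts as correct only when both the span boundaries and the entity type match the gold annotation. A minimal sketch of such a scorer for BIO-tagged sequences (function names, tag labels, and the simplified handling of stray I- tags are illustrative assumptions, not the paper's actual tooling; published results typically rely on a library such as seqeval):

```python
def extract_spans(tags):
    """Collect (start, end, type) entity spans from one BIO tag sequence.

    Simplification (assumption): stray I- tags without a matching B- are
    ignored rather than repaired into new entities.
    """
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):                      # new entity begins
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and etype == tag[2:]:
            continue                                  # entity continues
        else:                                         # O tag or type break
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                             # entity at sentence end
        spans.append((start, len(tags), etype))
    return spans

def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1.

    gold_seqs / pred_seqs: lists of BIO tag sequences, one per sentence.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(extract_spans(gold)), set(extract_spans(pred))
        tp += len(g & p)      # exact span-and-type matches
        fp += len(p - g)      # predicted entities with no gold match
        fn += len(g - p)      # gold entities the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: the model finds the PER entity but misses the LOC entity.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(entity_f1(gold, pred))  # → (1.0, 0.5, 0.666...)
```

Micro-averaging pools true/false positives across all sentences before computing the scores, so frequent entity categories dominate the aggregate figure; this is why the abstract also reports entity-specific results for the mid- and low-frequency categories.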
