HYBRID LEXICAL-SEMANTIC RETRIEVAL FOR IMPROVED ACADEMIC LITERATURE SEARCH
Abstract
The rapid growth of scientific publications has made accurate and comprehensive literature search a critical challenge for researchers. Traditional keyword-based search engines often miss relevant papers that use different terminology, while retrieval based on semantic embeddings can overlook exact matches for domain-specific terms. To address both limitations, this paper proposes a hybrid retrieval approach that combines lexical BM25 matching with dense semantic embeddings through a weighted fusion score. The hybrid method aims to improve both recall and ranking quality in academic document search. Experiments are conducted on a curated dataset of 100 computer science papers from the arXiv repository. Retrieval performance is evaluated using Recall@5, Recall@10, and nDCG@10, with BM25-only and dense-only retrieval as baselines. The hybrid approach achieves a Recall@10 of 0.85, outperforming the BM25-only (0.72) and dense-only (0.74) baselines. It also achieves the highest nDCG@10 score, 0.83, indicating better ranking quality. These findings suggest that, on this dataset, combining lexical and semantic signals improves literature search effectiveness without requiring complex multi-agent systems or citation verification. The proposed hybrid retrieval is lightweight, easy to implement, and suitable for integration into academic search engines and digital libraries.
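To make the weighted fusion score and the evaluation metrics concrete, the following is a minimal Python sketch rather than the implementation used in the experiments. The abstract does not state how the two score types are normalized, what fusion weight is used, or how relevance is graded, so the min-max normalization, the default weight alpha = 0.5, the binary relevance labels, and all function names (min_max, hybrid_rank, recall_at_k, ndcg_at_k) are assumptions made for illustration.

```python
import math


def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1] (assumed normalization).

    A map whose scores are all equal is mapped to all zeros.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(bm25_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * lexical + (1 - alpha) * semantic score.

    alpha is a hypothetical fusion weight; a document missing from one
    score map contributes 0 for that signal.
    """
    bm25_n = min_max(bm25_scores)
    dense_n = min_max(dense_scores)
    docs = set(bm25_n) | set(dense_n)
    fused = {
        doc: alpha * bm25_n.get(doc, 0.0) + (1 - alpha) * dense_n.get(doc, 0.0)
        for doc in docs
    }
    # Sort by descending fused score; break ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """nDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


if __name__ == "__main__":
    # Toy scores for one query; the values are made up for illustration.
    bm25 = {"a": 10.0, "b": 5.0, "c": 0.0}
    dense = {"a": 0.2, "b": 0.9, "d": 0.6}
    ranking = hybrid_rank(bm25, dense, alpha=0.5)
    print(ranking)
    print(recall_at_k(ranking, {"a", "c"}, k=2))
    print(ndcg_at_k(ranking, {"a"}, k=10))
```

In this sketch, setting alpha to 1.0 reduces the ranking to BM25-only and setting it to 0.0 reduces it to dense-only, so the same function can produce both baselines. Normalizing before fusion matters because raw BM25 scores are unbounded while cosine similarities lie in a fixed range; other fusion schemes, such as reciprocal rank fusion, would be an equally reasonable choice.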













