HYBRID STATISTICAL LEARNING FOR CARDIOVASCULAR DISEASE CLASSIFICATION USING LIGHTGBM WITH FEATURE OPTIMIZATION

Authors

  • Hina Zafar
  • Saba Akram
  • Muhammad Hamza Kashif
  • Syed Muhammad Junaid Hassan

Keywords:

LightGBM, cardiovascular disease classification, feature optimization, SHAP, blood pressure, hybrid statistical learning, PKCVD-633, class imbalance, Pakistani dataset

Abstract

Cardiovascular disease (CVD) remains the foremost cause of mortality worldwide, responsible for approximately 17.9 million deaths annually. While machine learning (ML) has demonstrated strong potential for early risk stratification, most studies rely on conventional classifiers such as Logistic Regression, Random Forest, or standard XGBoost, without systematically addressing feature redundancy, dataset heterogeneity, or class imbalance. This paper proposes a Hybrid Statistical Learning framework that combines rigorous statistical preprocessing, missing-data imputation, and LightGBM (a gradient-boosted decision tree algorithm optimized for speed and accuracy), enhanced by a multi-stage feature optimization pipeline. The study uses the PKCVD-633 dataset, a custom-merged Pakistani cardiovascular dataset of 633 records (effective analytical cohort: N = 333) encompassing 21 clinical, echocardiographic, and lifestyle features. Critical analysis of the dataset revealed that two structurally distinct sub-cohorts had been merged into a single file, which required targeted imputation strategies. Feature optimization via correlation filtering, SHAP-based importance ranking, and recursive feature elimination (RFE) identified a compact subset of 12 features that retained maximal predictive signal. The proposed LightGBM model with optimized features achieved superior classification performance relative to conventional baselines. The study contributes a reproducible pipeline for heterogeneous cardiovascular datasets, a formally named and documented dataset, and evidence that feature optimization substantially improves LightGBM performance on imbalanced clinical data.
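
As an illustration of one stage named in the abstract, correlation filtering, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the 0.9 threshold, the greedy keep-first rule, and the toy feature values are all assumptions made for the example, and the SHAP ranking and RFE stages are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_filter(columns, threshold=0.9):
    """Greedily keep features that are not highly correlated with any kept one.

    `columns` maps feature name -> list of values. A feature is dropped when
    its absolute correlation with an already-kept feature exceeds `threshold`.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    kept = []
    for name, values in columns.items():
        redundant = any(
            abs(pearson(values, columns[other])) > threshold for other in kept
        )
        if not redundant:
            kept.append(name)
    return kept


# Hypothetical toy data; these are not PKCVD-633 records.
toy = {
    "systolic_bp": [120, 135, 150, 128, 160, 142],
    "mean_arterial_bp": [93, 103, 113, 98, 120, 108],  # nearly collinear with systolic_bp
    "age": [45, 61, 38, 70, 52, 49],
}
selected = correlation_filter(toy)
print(selected)
```

In this toy example the second blood-pressure column is almost a linear function of the first, so the filter drops it and keeps the other two features. Which member of a correlated pair is retained depends on column order here; a real pipeline might instead keep the feature with the stronger relationship to the outcome.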

Published

2026-04-24

How to Cite

Hina Zafar, Saba Akram, Muhammad Hamza Kashif, & Syed Muhammad Junaid Hassan. (2026). HYBRID STATISTICAL LEARNING FOR CARDIOVASCULAR DISEASE CLASSIFICATION USING LIGHTGBM WITH FEATURE OPTIMIZATION. Spectrum of Engineering Sciences, 4(4), 1038–1047. Retrieved from https://www.thesesjournal.com/index.php/1/article/view/2536