VIT2D: A MULTIMODAL VISION TRANSFORMER FRAMEWORK FOR NON-INVASIVE PREDICTION OF ARTERIAL HEART DISEASE
Keywords:
Arterial Heart Disease, Coronary Artery Disease, Vision Transformer, Deep Learning, Multimodal Imaging, MRI, CTA, Predictive Analytics, Medical Image Segmentation, Plaque Detection, Transfer Learning, Automated Diagnosis

Abstract
Arterial heart disease (AHD) is among the leading causes of death and illness worldwide; it inflates healthcare expenditure and substantially reduces patients' quality of life. Conventional diagnostics such as ECG, angiography, and CT angiography (CTA) remain the standard of care, but they can be invasive, expensive, and scarce in resource-limited settings, which makes accurate, reproducible, non-invasive diagnostic technologies central to improving patient care. Deep learning has shown that useful features and reliable predictions can be extracted from large, heterogeneous clinical data: a recent study applying CNNs and Vision Transformers to heart-sound spectrograms and chest X-rays detected congenital heart disease (CHD) with average accuracies of 73.9% and 80.7%, respectively (Amangeldi et al., 2025). That work, however, was limited by poor-quality images and weak baseline transformers such as ViT-Tiny, leaving it short of clinical applicability. Motivated by these gaps, this paper proposes ViT2D, an improved Vision Transformer framework that performs multimodal imaging analysis to predict arterial heart disease more reliably. ViT2D combines magnetic resonance imaging (MRI), CT scans, and large-scale cardiovascular clinical data (CADICA, ARCADE, and Kaggle) to learn both structural and functional information. The framework uses patch embeddings and multi-head self-attention layers to capture global dependencies, together with a multimodal fusion layer that integrates imaging features with clinical risk factors. Training was performed on Google TPU v4 hardware with 5-fold cross-validation, balanced loss functions, and extensive preprocessing including normalization, augmentation, and imputation. Experimental results show that ViT2D achieves state-of-the-art accuracy (94.5% on CADICA to 99.0% on Kaggle) and high ROC-AUC (0.97 to 0.995), significantly outperforming traditional CNNs (ResNet18, InceptionV3) and baseline transformer models (ViT-Tiny, Swin-Tiny). Confusion matrices and ROC curves confirm the robustness of the predictions, while attention maps improve interpretability by localizing lesions in both MRI and CTA scans. Overall, ViT2D offers a realistic and scalable approach to non-invasive prediction of heart disease.
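To make the architecture described above concrete, the following is a minimal PyTorch sketch of a ViT-style encoder with late multimodal fusion: patch embedding, multi-head self-attention blocks, and concatenation of the image representation with a vector of clinical risk factors. All module names, dimensions, and the fusion-by-concatenation choice are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ViT2DSketch(nn.Module):
    """Illustrative ViT-style encoder with a multimodal fusion head.

    Hypothetical reconstruction from the abstract: patch embedding,
    multi-head self-attention layers, and concatenation of the image
    representation with tabular clinical risk factors.
    """

    def __init__(self, img_size=224, patch=16, dim=384, depth=6,
                 heads=6, n_clinical=12, n_classes=2):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patch embedding: non-overlapping patches projected to `dim`.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        # Multi-head self-attention encoder captures global dependencies.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Multimodal fusion: image token concatenated with clinical vector.
        self.fuse = nn.Sequential(
            nn.Linear(dim + n_clinical, dim), nn.GELU(),
            nn.Linear(dim, n_classes))

    def forward(self, image, clinical):
        b = image.size(0)
        tokens = self.patchify(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = torch.cat([self.cls.expand(b, -1, -1), tokens], dim=1) + self.pos
        encoded = self.encoder(tokens)
        img_repr = encoded[:, 0]  # [CLS] token summarizing the scan
        return self.fuse(torch.cat([img_repr, clinical], dim=1))

# Smoke test with a random MRI/CTA-sized input and 12 clinical features.
model = ViT2DSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 12))
print(logits.shape)  # torch.Size([2, 2])
```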
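Similarly, a hedged sketch of the training protocol summarized in the abstract: 5-fold cross-validation with a class-weighted ("balanced") cross-entropy loss. The random tensors stand in for the preprocessed CADICA/ARCADE/Kaggle cohorts, and the single optimization step is only a placeholder for the full TPU v4 training loop.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-ins for data produced by the paper's preprocessing
# pipeline (normalization, augmentation, imputation) -- not reproduced here.
images = torch.randn(100, 3, 224, 224)
clinical = torch.randn(100, 12)
labels = torch.randint(0, 2, (100,))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, va) in enumerate(skf.split(np.zeros(len(labels)), labels.numpy())):
    tr, va = torch.as_tensor(tr), torch.as_tensor(va)
    model = ViT2DSketch()  # the sketch defined above
    # Balanced loss: weight each class inversely to its fold frequency.
    counts = torch.bincount(labels[tr], minlength=2).float()
    criterion = nn.CrossEntropyLoss(weight=counts.sum() / (2 * counts))
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images[tr], clinical[tr]), labels[tr])
    loss.backward()
    optimizer.step()

    # Held-out fold accuracy (the paper also reports ROC-AUC per fold).
    model.eval()
    with torch.no_grad():
        acc = (model(images[va], clinical[va]).argmax(1) == labels[va]).float().mean()
    print(f"fold {fold}: val acc {acc:.3f}")
```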
