AMYLOPRED-DL: A HYBRID CNN-BILSTM-ATTENTION DEEP LEARNING FRAMEWORK INTEGRATING ESM-2 PROTEIN LANGUAGE MODEL EMBEDDINGS FOR IMPROVED PREDICTION OF AMYLOID PROTEINS
Abstract
Amyloid proteins (AMYs) are a unique class of intrinsically disordered proteins that exhibit both beneficial and harmful biological functions. They are associated with severe disorders, including neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, and Huntington's disease, as well as type II diabetes, yet they also play important roles in hormone storage, antimicrobial defense, and immune regulation. This dual functionality creates a need for reliable computational methods that can accurately identify amyloidogenic proteins from sequence data. To address this challenge, we propose AmyloPred-DL, a hybrid deep learning framework that integrates complementary protein sequence representations. The model consists of three branches: (i) multi-scale Convolutional Neural Networks (CNNs) with kernel sizes of 3, 5, and 7 to capture local amyloidogenic motifs; (ii) ESM-2 protein language model embeddings processed through stacked BiLSTM and multi-head attention layers to learn long-range sequence dependencies; and (iii) handcrafted features that combine evolutionary information from PSI-BLAST PSSM profiles with physicochemical descriptors. To mitigate class imbalance, SMOTE-Tomek resampling and focal loss were employed. The framework was trained on 571 non-redundant protein sequences and evaluated on independent validation, test, and cross-species datasets. On the independent test set, AmyloPred-DL achieved an accuracy of 96.42%, sensitivity of 94.87%, specificity of 97.18%, F1-score of 0.959, MCC of 0.92, and AUC of 0.987, outperforming existing approaches. Ablation studies demonstrated the significant contribution of ESM-2 embeddings, while cross-species evaluations confirmed strong generalization capability. Furthermore, SHAP-based interpretation revealed biologically relevant amyloidogenic motifs, indicating that the model learns meaningful sequence patterns. These results establish AmyloPred-DL as an effective and interpretable tool for amyloid protein prediction.
Keywords:
Amyloid Protein Prediction, Deep Learning, ESM-2 Protein Language Model, CNN-BiLSTM-Attention Network, Protein Sequence Analysis, Bioinformatics and Computational Biology
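
The abstract names focal loss as one of the two measures used against class imbalance. As an illustration only, the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), can be sketched in plain Python as follows. The function name `binary_focal_loss` and the default values `gamma=2.0` and `alpha=0.25` are assumptions made for this sketch; the abstract does not state the hyperparameters or the implementation used in AmyloPred-DL.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss.

    probs  : predicted probabilities of the positive (amyloid) class
    labels : ground-truth labels, 1 for positive and 0 for negative
    gamma  : focusing parameter (assumed default, not taken from the paper)
    alpha  : weight of the positive class (assumed default, not from the paper)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one example is required")

    total = 0.0
    for p, y in zip(probs, labels):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # Clamp to avoid log(0) on saturated predictions.
        p_t = min(max(p_t, eps), 1.0)
        # (1 - p_t)^gamma shrinks the loss of well-classified examples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # A confident correct prediction contributes far less than a wrong one.
    print(binary_focal_loss([0.95], [1]))
    print(binary_focal_loss([0.05], [1]))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the ordinary binary cross-entropy, which is a convenient sanity check; larger `gamma` values shift the training signal toward hard, misclassified sequences.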












