Optimizing N-linked Glycosylation Site Prediction in Human Proteins with Ensemble Stacking and Cross-Validation
- 1 Department of Computer Science and Applications, CMPICA, Charotar University of Science and Technology (CHARUSAT), CHARUSAT Campus, Changa, India
Abstract
Glycosylation is the most frequent post-translational modification of proteins across all domains of life, and it affects many biological activities. The most significant of these modifications is N-linked glycosylation, which is associated with various human diseases, including diabetes, cancer, inflammation, Alzheimer's disease, and atherosclerosis. This article illustrates how recent advances in biological knowledge are increasingly being addressed with computational methods. Moreover, identifying N-linked glycosylation sites helps in understanding human biological systems and the mechanism of glycosylation. Because experimental identification is time-consuming and costly, machine learning techniques have become very important for predicting N-linked glycosylation sites in human proteins. This article proposes an ensemble machine learning approach for N-linked glycosylation prediction that integrates updated and experimentally verified databases (UniProtKB, dbPTM, and nGlycositeAtlas) with an optimal window size of 21. MMseqs2 clustering with a threshold of 0.3 was employed to eliminate duplicate and similar protein sequences for improved dataset preparation. A total of 9040 features were extracted using various descriptors, including sequence, structural, and physicochemical features. ANOVA F-score, chi-squared (CHI2), and Mutual Information were used as ensemble feature selection techniques; combining their results yielded 182 features for the final model training. The model was then trained using cross-validation and ensemble stacking with four base classifiers: SVM, LR, XGBoost, and RF. The results show that ensemble stacking with cross-validation gives more reliable results than the individual base classifiers, achieving an Accuracy of 99.99%, Precision of 99.98%, Recall of 100%, AUC of 99.94%, MCC of 99.96%, and F-score of 99.99%.
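To make the window size of 21 concrete, the following is a minimal sketch of how fixed-length windows centered on candidate asparagine (N) residues might be cut from a protein sequence before feature extraction. It is an illustration under stated assumptions, not the authors' pipeline: the function name `extract_windows`, the use of `X` as a padding character at the termini, and the choice to treat every N as a candidate site are all hypothetical, since the abstract does not specify these details.

```python
from typing import List, Tuple

WINDOW = 21          # window size reported in the abstract
FLANK = WINDOW // 2  # 10 residues on each side of the central residue
PAD = "X"            # assumption: pad with 'X' where the window passes a terminus


def extract_windows(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, 21-residue window) for every asparagine (N).

    Hypothetical helper: treating every N as a candidate site is an
    assumption made for illustration only.
    """
    seq = sequence.upper()
    padded = PAD * FLANK + seq + PAD * FLANK
    windows = []
    for i, residue in enumerate(seq):
        if residue == "N":
            # Residue i of seq sits at index i + FLANK in the padded string,
            # so the slice [i, i + WINDOW) is centered on it.
            windows.append((i + 1, padded[i:i + WINDOW]))
    return windows


if __name__ == "__main__":
    demo = "MKTNGSAAAAAAAAAAAAAAAAAAAANLT"  # made-up sequence, not real data
    for position, window in extract_windows(demo):
        print(position, window)
```

Each window has the candidate residue at its center (index 10), so descriptors computed on it see ten residues of context on either side. In a full pipeline, windows like these would be passed to the feature descriptors and then to the feature selection and stacking stages described above.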
DOI: https://doi.org/10.3844/jcssp.2024.1753.1765
Copyright: © 2024 Mubina Malik and Jaimin Undavia. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Machine Learning
- Ensemble Stacking
- XGBoost
- Random Forest
- SVM
- Cross Validation
- Protein N-Linked Glycosylation