Optimized Ensemble Stacking Framework for N-Linked Glycosylation Site Prediction in Human Proteins Using Cross Validation Weighted Learning and Feature Augmentation
Abstract
N-linked glycosylation is a common and functionally important
post-translational modification of human proteins, influencing folding, stability,
trafficking, and cellular signaling. Identifying exactly where on the protein chain
this modification occurs, usually at asparagine (N) residues within a limited
sequence motif, is crucial for understanding protein function, for the
design of drugs, and for the interpretation of high-throughput proteomics data.
Experimental identification of glycosylation sites produces reliable results but
is expensive and time-consuming. Computational prediction offers a rapid, cost-effective
complement, yet existing predictive methods suffer from issues such
as class imbalance (far fewer glycosylated sites than non-glycosylated
ones), under-exploitation of heterogeneous features, and sensitivity to noise.
This thesis presents an Optimized Ensemble Stacking Framework for the
Prediction of N-Linked Glycosylation Sites in Human Proteins that overcomes
these limitations by combining robust model design with careful validation,
weighted learning, and rich feature augmentation for high-performance,
generalizable predictions. To obtain optimized results from the machine learning
classifiers, various datasets were explored, and three of them, namely UniProtKB,
dbPTM, and N-GlycositeAtlas, were selected and combined to form the final dataset for
the model.
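As a rough illustration of the data preparation described above, the following minimal sketch scans a sequence for candidate asparagines, merges site annotations from several sources, and derives class weights to counter the imbalance. It rests on several assumptions that the abstract does not state: the "limited sequence motif" is taken to be the canonical N-X-S/T sequon (X any residue except proline), the weighting is a simple inverse-frequency scheme, and all function names, the protein identifier, and the toy sequence are hypothetical.

```python
import re
from collections import Counter

# Assumption: canonical sequon N-X-S/T with X != P.
# The lookahead keeps the match on the asparagine and allows overlapping motifs.
SEQUON = re.compile(r"N(?=[^P][ST])")


def candidate_sites(sequence):
    """Return 1-based positions of asparagines that start a sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence)]


def merge_sources(*sources):
    """Union of (protein_id, position) pairs; duplicates across sources collapse."""
    merged = set()
    for source in sources:
        merged.update(source)
    return merged


def label_candidates(proteins, positives):
    """Label each candidate asparagine 1 if annotated as glycosylated, else 0."""
    rows = []
    for protein_id, sequence in proteins.items():
        for position in candidate_sites(sequence):
            label = 1 if (protein_id, position) in positives else 0
            rows.append((protein_id, position, label))
    return rows


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * count) for cls, count in counts.items()}


# Toy data (hypothetical): one protein, the same site reported by all three sources.
proteins = {"P1": "MNGSANPTKNATLNVSQ"}
uniprot_sites = {("P1", 10)}
dbptm_sites = {("P1", 10)}
atlas_sites = {("P1", 10)}

positives = merge_sources(uniprot_sites, dbptm_sites, atlas_sites)
rows = label_candidates(proteins, positives)
weights = balanced_class_weights([label for _, _, label in rows])
```

In this toy case the asparagine at position 6 is skipped because proline follows it, and the single positive site receives twice the weight of each negative one. Weights of this kind could be passed to a classifier as per-sample or per-class weights.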
The layers of the proposed framework form a stacked ensemble, in which
multiple diverse base learners extract complementary patterns from protein
sequence and derived features, and a powerful meta-learner synthesizes their
outputs into a final prediction. The base learners include algorithms with
different inductive biases, for instance Support Vector Machines (SVM), Logistic
Regression (LR), Random Forest (RF), and Extreme Gradient Boosting
(XGBoost), chosen to capture both nonlinear interactions and linear trends. The
meta-learner is implemented using XGBoost because it can handle
heterogeneous inputs, has resistance to overfitting through