Authors :
Paras R. Parekh; Dr. Nilofer K. Shaikh
Volume/Issue :
Volume 11 - 2026, Issue 2 - February
Google Scholar :
https://tinyurl.com/bdz9w9jj
Scribd :
https://tinyurl.com/5b8rnw8e
DOI :
https://doi.org/10.38124/ijisrt/26feb656
Abstract :
Breast cancer is a biologically heterogeneous malignancy and the leading cause of cancer-related mortality among
women worldwide. Its clinical complexity arises from distinct molecular subtypes—Luminal A, Luminal B, HER2-enriched,
and Basal-like—each exhibiting unique gene expression profiles and therapeutic responses. This study presents an
integrative bioinformatics and machine learning (ML) pipeline for subtype classification and biomarker discovery using
publicly available microarray datasets (GSE65194, GSE42568, GSE45827). The workflow incorporates R-based
preprocessing for probe-to-gene annotation, normalization, and differential expression analysis (DEA), followed by survival
modeling via Kaplan–Meier curves and protein–protein interaction (PPI) network construction using STRING. Annotated
gene features were used to train multiple ML models—Random Forest, XGBoost, Support Vector Machine (SVM), and
LASSO regression—implemented in Python. Model performance was evaluated using cross-validation and regression
metrics, achieving high predictive accuracy (R² > 0.90) across subtypes. The pipeline identified clinically relevant
biomarkers, including COL10A1, EGFR, FN1, COL1A1, BGN, ERBB2, COL5A1, COL5A2, and COL11A1, all consistent
with known subtype characteristics and survival outcomes. Its modular design ensures reproducibility, scalability, and
adaptability to other cancer types or omics platforms. By integrating statistical rigor with ML interpretability, this study
provides a biologically informed framework for precision oncology, enhancing diagnostic accuracy, patient stratification,
and targeted therapy selection in breast cancer management.
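The survival modeling step mentioned in the abstract relies on Kaplan–Meier curves. The abstract does not say which tool or settings the authors used, so the following is only a minimal pure-Python sketch of the product-limit estimator. The function name `kaplan_meier` and the follow-up data are hypothetical and chosen for illustration; they are not taken from the study.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if the event (e.g. death) was observed, 0 if censored
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    # Number of observed events at each time point.
    deaths = Counter(t for t, e in zip(durations, events) if e)
    # Every patient (event or censored) leaves the risk set at their own time.
    exits = Counter(durations)

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: S(t) = S(t-) * (1 - d_t / n_t)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical follow-up times (months) and event indicators, not study data.
durations = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(durations, events)

for time, prob in curve:
    print(f"t={time:>2}  S(t)={prob:.4f}")
```

Censored patients contribute to the risk set up to their last follow-up time but never trigger a drop in the curve, which is why the estimate only changes at observed event times. Comparing curves between subtypes or between high and low expression groups of a candidate biomarker would additionally need a test such as the log-rank test, which this sketch does not include.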
Keywords :
Breast Cancer Subtypes; Differential Gene Expression; Machine Learning; Biomarker Discovery; Survival Analysis; DBSCAN Clustering; Precision Oncology