Authors : Aman Goyal; Santushti Gandhi; Garima Arora
Volume/Issue : Volume 7 - 2022, Issue 12 - December
Google Scholar : https://bit.ly/3IIfn9N
Scribd : https://bit.ly/3v9V2FL
DOI : https://doi.org/10.5281/zenodo.7460453
Objective: To use machine learning classification
techniques to differentiate between early- and late-stage
colorectal cancer based on individuals' mRNA profiles
together with clinically recorded data.
Methods: A gene signature of 14 unique mRNAs was
identified using a benchmark dataset extracted from The
Cancer Genome Atlas (TCGA) [1]. Gene expression values
were normalized using statistical methods, and cancer
stages were grouped into early and late stages. The most
informative genes were selected using hypothesis testing.
The gene set was then validated on an independent dataset
obtained from the Gene Expression Omnibus (GEO) [2] under
accession number GSE32323. A null hypothesis was also
considered for the study.
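
The selection steps described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline. The abstract does not name the normalization method, the statistical test, or the stage cut-off, so the z-score normalization, Welch's t statistic, and the early (I-II) versus late (III-IV) split below are all assumptions. The gene names and expression values are made-up toy data.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Assumption: stages I-II count as "early", III-IV as "late".
EARLY_STAGES = {"I", "II"}

def binarize_stage(stage):
    """Return 0 for an early-stage sample and 1 for a late-stage sample."""
    return 0 if stage in EARLY_STAGES else 1

def zscore(values):
    """Standardize one gene's expression values (zero mean, unit variance)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed test)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_genes(expression, stages, top_k):
    """Rank genes by |t| between early and late samples; return top_k names."""
    labels = [binarize_stage(s) for s in stages]
    scores = {}
    for gene, values in expression.items():
        z = zscore(values)
        early = [v for v, y in zip(z, labels) if y == 0]
        late = [v for v, y in zip(z, labels) if y == 1]
        scores[gene] = abs(welch_t(early, late))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data: GENE_A differs between stage groups, GENE_B does not.
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
stages = ["I", "II", "I", "III", "IV", "III"]
top = rank_genes(expression, stages, top_k=1)
print(top)
```

Ranking by the absolute test statistic is only one possible selection rule. A p-value threshold with multiple-testing correction would be a common alternative when thousands of genes are screened.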
Results: A training accuracy of 75% and a ROC-AUC score
of 0.81 were achieved on the TCGA dataset with this gene
signature using an ensemble technique. With the same gene
signature, a testing accuracy of 74% and a ROC-AUC score
of 0.72 were achieved on the GEO dataset. The null
hypothesis was rejected in favour of the alternative
hypothesis.
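
For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and ROC-AUC can be computed from the averaged outputs of several classifiers. The abstract does not specify which ensemble technique was used, so the soft-voting (probability averaging) step is an assumption. The labels and model scores are toy values, not the study's data.

```python
def soft_vote(prob_lists):
    """Average each sample's late-stage probability across models (assumed ensemble)."""
    return [sum(p) / len(p) for p in zip(*prob_lists)]

def accuracy(y_true, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels (0 = early, 1 = late) and hypothetical per-model probabilities.
y_true = [0, 0, 1, 1, 1]
model_a = [0.2, 0.8, 0.6, 0.9, 0.3]
model_b = [0.4, 0.4, 0.8, 0.7, 0.1]

ensemble = soft_vote([model_a, model_b])
print(accuracy(y_true, ensemble), roc_auc(y_true, ensemble))
```

Accuracy depends on the chosen decision threshold, whereas ROC-AUC summarizes ranking quality across all thresholds, which is why the two numbers can differ as they do in the reported results.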
Conclusion: The study supports the alternative hypothesis
and presents a set of 14 unique mRNAs that help predict
the stage of colorectal cancer in an individual.
Keywords : Biomarker, Ensemble Technique, Hypothesis Testing, mRNA, Machine Learning.