Predictive Modeling Case Study on Statistical Difficulties in Combining Genomic and Proteomic Data


Authors : Dr. P. Umamaheswari; N. Purusothaman; P. Gayathridevi

Volume/Issue : Volume 11 - 2026, Issue 1 - January


Google Scholar : https://tinyurl.com/292j5uk5

Scribd : https://tinyurl.com/3tzau29r

DOI : https://doi.org/10.38124/ijisrt/26jan183



Abstract : A key challenge in modern biology is integrating different types of molecular data. This study examines the relationship between gene copy number and protein expression levels. Using data from the Cancer Cell Line Encyclopedia (CCLE), we find that this relationship varies substantially by gene. For the MYC oncogene, copy number is a comparatively strong predictor of protein levels (R² = 0.37, i.e., about 37% of the variance in protein abundance explained), indicating that more gene copies generally lead to more protein. For the TP53 tumor suppressor, by contrast, copy number predicts protein abundance poorly (R² = 0.08), suggesting that other regulatory mechanisms dominate. These results show that simple statistical models are often insufficient for biological data and that more advanced approaches are needed to understand complex gene-protein relationships.
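
The per-gene R² values quoted in the abstract can be illustrated with a minimal sketch. It assumes a one-predictor ordinary least-squares fit of protein abundance on copy number, which the abstract does not confirm as the authors' exact model. The data are synthetic and are not CCLE measurements; the helper names `r_squared` and `simulate_gene`, and the slope and noise values, are hypothetical and chosen only to contrast a tighter and a looser copy-number/protein relationship.

```python
import numpy as np


def r_squared(copy_number, protein):
    """R^2 of a one-predictor least-squares fit: protein ~ a + b * copy_number."""
    x = np.asarray(copy_number, dtype=float)
    y = np.asarray(protein, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)   # ordinary least-squares line
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)              # variance left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)         # total variance in protein
    return 1.0 - ss_res / ss_tot


def simulate_gene(rng, n_lines, slope, noise_sd):
    """Hypothetical generator of synthetic copy-number and protein values."""
    copy_number = rng.normal(0.0, 0.5, size=n_lines)
    protein = slope * copy_number + rng.normal(0.0, noise_sd, size=n_lines)
    return copy_number, protein


rng = np.random.default_rng(0)

# Gene A: protein tracks copy number fairly closely (assumed slope).
cn_a, prot_a = simulate_gene(rng, n_lines=300, slope=1.0, noise_sd=0.6)
# Gene B: copy number explains little; other regulation dominates (assumed slope).
cn_b, prot_b = simulate_gene(rng, n_lines=300, slope=0.3, noise_sd=0.6)

r2_a = r_squared(cn_a, prot_a)
r2_b = r_squared(cn_b, prot_b)
print(f"gene A R^2 = {r2_a:.2f}, gene B R^2 = {r2_b:.2f}")
```

With a single predictor, R² equals the squared Pearson correlation, so even a value of 0.37 leaves most of the variance in protein abundance unexplained by copy number alone.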

Keywords : Data Integration, Genomics, Proteomics, Copy Number Variation, MYC, TP53, Predictive Modeling, Statistical Analysis.

References :

  1. Ghandi, M., et al. (2019). Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature, 569(7757), 503–508.
  2. Liu, Y., Beyer, A., & Aebersold, R. (2016). On the dependency of cellular protein levels on mRNA abundance. Cell, 165(3), 535–550.
  3. Maier, T., Güell, M., & Serrano, L. (2009). Correlation of mRNA and protein in complex biological samples. FEBS Letters, 583(24), 3966–3973.
  4. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.
  5. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.
