⚠ Official Notice: www.ijisrt.com is the official website of the International Journal of Innovative Science and Research Technology (IJISRT) Journal for research paper submission and publication. Please beware of fake or duplicate websites using the IJISRT name.



Leveraging Gemma 4 Large Language Model for Protein Function Prediction and Interpretability: Application of AI Models for Protein Function Prediction from Amino Acid Sequences


Authors : Trinh Quang Minh; Ngo Thi Lan

Volume/Issue : Volume 11 - 2026, Issue 4 - April


Google Scholar : https://tinyurl.com/23mx44ea

Scribd : https://tinyurl.com/mvpne3c3

DOI : https://doi.org/10.38124/ijisrt/26apr247



Abstract : This study presents an approach to predicting protein function, expressed as Gene Ontology (GO) terms, that combines traditional machine learning techniques with the Gemma 4 large language model. We encode amino acid sequences using K-mer and TF-IDF feature extraction, then apply a multi-label classification model. The central contribution is the integration of Gemma 4 (E2B-it version) with its "Thinking Mode" mechanism, which is used not only to predict but also to explain biological results bilingually (English and Vietnamese). Experimental results on the CAFA 6 dataset show that the model achieves high accuracy and provides in-depth interpretation, helping biologists understand the reasoning behind predictions based on confidence scores. Interpretability is the main strength of the work: instead of reporting only prediction results, the model provides its inference steps (Thinking Mode) to explain why a protein has a specific function, based on confidence scores. The study rests on three pillars: a biological basis (proteins/CAFA), large language models (Gemma/LLMs), and AI explainability (interpretability).

Keywords : Protein Function Prediction, CAFA 6, Gemma 4, Gene Ontology, Machine Learning, Interpretability, K-mer.
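
The feature extraction and multi-label steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the K-mer length (k = 3), the smoothed IDF formula, the 0.5 decision threshold, the toy sequences, and the function names `kmers`, `tfidf`, and `select_terms` are all assumptions chosen for illustration, and the confidence scores shown are made-up values rather than results from the paper.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split an amino acid sequence into overlapping k-mers (k=3 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(sequences, k=3):
    """Return one {k-mer: weight} dict per sequence.

    Uses term frequency times a smoothed IDF, log((1 + n) / (1 + df)) + 1.
    The paper does not state which TF-IDF variant it uses; this is one common choice.
    """
    docs = [Counter(kmers(s, k)) for s in sequences]
    n = len(docs)
    # Document frequency: in how many sequences each k-mer occurs at least once.
    df = Counter(km for d in docs for km in d)
    vectors = []
    for d in docs:
        total = sum(d.values())
        vectors.append({
            km: (count / total) * (math.log((1 + n) / (1 + df[km])) + 1)
            for km, count in d.items()
        })
    return vectors


def select_terms(scores, threshold=0.5):
    """Multi-label decision: keep every GO term whose confidence meets the threshold."""
    return sorted(term for term, p in scores.items() if p >= threshold)


# Toy sequences, invented for illustration only.
sequences = ["MKTAYIAK", "MKTLLVAA"]
vectors = tfidf(sequences)

# Hypothetical per-term confidence scores, standing in for a classifier's output.
scores = {"GO:0005524": 0.91, "GO:0003677": 0.32, "GO:0016301": 0.67}
predicted = select_terms(scores)
```

In this sketch a K-mer shared by both sequences (such as `MKT`) receives a lower weight than one that appears in only one of them, which is the property TF-IDF is meant to provide. A protein can receive several GO terms at once because each term is thresholded independently.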
