Authors :
Trinh Quang Minh; Ngo Thi Lan
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/23mx44ea
Scribd :
https://tinyurl.com/mvpne3c3
DOI :
https://doi.org/10.38124/ijisrt/26apr247
Abstract :
This study presents an approach to predicting protein function, expressed as Gene Ontology (GO) terms, that combines
traditional machine learning techniques with the Gemma 4 large language model. We encode amino acid sequences using
K-mer and TF-IDF feature extraction, then apply a multi-label classification model. The main contribution is the
integration of Gemma 4 (E2B-it version) with its "Thinking Mode" mechanism, which not only predicts but also explains
biological results bilingually (English and Vietnamese). Experimental results on the CAFA 6 dataset show that the model
achieves high accuracy and provides in-depth interpretation, helping biologists understand the reasoning behind
predictions based on confidence scores. Interpretability is the central strength of the research: instead of returning
bare prediction results, the model provides inference steps (Thinking Mode) that explain why a protein has a specific
function, based on confidence scores. The work rests on three pillars: biological basis (Protein/CAFA), large language
models (Gemma/LLMs), and AI explainability (Interpretability).
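The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (kmers, tfidf_vectors, select_terms), the choice of k = 3, the smoothed IDF formula, the 0.5 confidence threshold, the toy sequences, and the example scores are all assumptions made for illustration. The abstract does not name the multi-label classifier, so the sketch only shows how per-term confidence scores might be turned into a set of GO labels.

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Split an amino acid sequence into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def tfidf_vectors(sequences, k=3):
    """Return one {k-mer: weight} dict per sequence (TF times smoothed IDF)."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    n_docs = len(docs)
    # Document frequency: number of sequences that contain each k-mer.
    df = Counter(kmer for doc in docs for kmer in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Assumed smoothing: ln((1 + N) / (1 + df)) + 1.
            kmer: (count / total) * (math.log((1 + n_docs) / (1 + df[kmer])) + 1.0)
            for kmer, count in doc.items()
        })
    return vectors

def select_terms(scores, threshold=0.5):
    """Keep every GO term whose confidence meets the threshold (multi-label)."""
    return sorted(term for term, p in scores.items() if p >= threshold)

# Toy sequences for illustration only, not real CAFA 6 entries.
sequences = ["MKTAYIAKQR", "MKTAYLLKQR", "GGSGGSGGSG"]
vectors = tfidf_vectors(sequences, k=3)

# Hypothetical per-term confidence scores for a single protein.
scores = {"GO:0003677": 0.91, "GO:0005515": 0.62, "GO:0016020": 0.08}
predicted = select_terms(scores, threshold=0.5)
```

In this sketch, k-mers shared by several sequences receive a lower weight than k-mers unique to one sequence, which is the usual motivation for applying TF-IDF to sequence fragments. The retained terms and their scores are the kind of input an explanation step could then describe in natural language.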
Keywords :
Protein Function Prediction, CAFA 6, Gemma 4, Gene Ontology, Machine Learning, Interpretability, K-mer.