Authors:
Rabia Bayraktar; Batuhan Sarıtürk; Merve Elmas Erdem
Volume/Issue:
Volume 9 - 2024, Issue 5 - May
Google Scholar:
https://tinyurl.com/y75jtaxs
Scribd:
https://tinyurl.com/2u33cdn7
DOI:
https://doi.org/10.38124/ijisrt/IJISRT24MAY1600
Abstract:
Training and accurately evaluating task-specific chatbots is an important research area for Large Language Models (LLMs). These models can be developed for general purposes with the ability to handle multiple tasks, or fine-tuned for specific applications such as education or customer support. In this study, the Mistral 7B, Llama-2, and Phi-2 models, which have proven successful on various benchmarks including question answering, are utilized. The models were fine-tuned using QLoRa on limited information gathered from course catalogs, and the fine-tuned models were then evaluated with various metrics, taking the responses of GPT-4 as the ground truth. The experiments revealed that Phi-2 slightly outperformed Mistral 7B, achieving scores of 0.012 BLEU, 0.184 METEOR, and 0.873 BERTScore. In light of the evaluation metrics obtained, the strengths and weaknesses of well-known LLMs, the amount of data required for fine-tuning, and the effect of the fine-tuning method on model performance are discussed.
Keywords:
LLM, Mistral, Llama, Phi, Fine-Tune, QLoRa.
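
For readers who want to see what the fine-tuning setup described in the abstract looks like in practice, the following is a minimal sketch of QLoRa fine-tuning using the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint name, LoRA hyperparameters, and target modules are illustrative assumptions, not the authors' reported configuration.

# Minimal QLoRa fine-tuning sketch (assumed setup, not the paper's exact config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "microsoft/phi-2"  # or "mistralai/Mistral-7B-v0.1", or a Llama-2 checkpoint

# The core of QLoRa: the frozen base weights are loaded in 4-bit NF4,
# while small LoRA adapters are trained in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters to the attention projections (illustrative choices).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights

From this point the adapters can be trained on the course-catalog question-answer pairs with a standard transformers Trainer or trl's SFTTrainer loop.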
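
The evaluation step can likewise be sketched with the Hugging Face evaluate package (BERTScore additionally requires the bert-score package). The example sentences below are hypothetical stand-ins for a fine-tuned model's answer and the GPT-4 response treated as the reference, as in the paper.

# Sketch of computing BLEU, METEOR, and BERTScore against GPT-4 references.
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

predictions = ["The course covers supervised learning and neural networks."]     # fine-tuned model
references = ["The course introduces supervised learning and neural networks."]  # GPT-4, ground truth

print(bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"])
print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(bertscore.compute(predictions=predictions, references=references, lang="en")["f1"])

The metric profile reported in the abstract is typical of free-form question answering: strict n-gram overlap metrics such as BLEU stay very low (0.012) when wording diverges from the reference, while the embedding-based BERTScore remains high (0.873) for semantically similar answers.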
References:
- T.F. Tan, K. Elangovan, L. Jin, Y. Jie, L. Yong, J. Lim, S. Poh, W.Y. Ng, D. Lim, Y. Ke, N. Liu, D.S.W. Ting, "Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4," arXiv preprint arXiv:2402.10083, 2024.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, ... and T. Scialom, "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
- Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y.T. Lee, "Textbooks are all you need II: phi-1.5 technical report," arXiv preprint arXiv:2309.05463, 2023.
- A.Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D.S. Chaplot, D.D.L. Casas, ... and W.E. Sayed, "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
- Hugging Face – The AI community building the future. (n.d.). https://huggingface.co/
- F. Khennouche, Y. Elmir, Y. Himeur, N. Djebari, and A. Amira, "Revolutionizing generative pre-traineds: Insights and challenges in deploying ChatGPT and generative chatbots for FAQs," Expert Systems with Applications, 246, 123224, 2024.
- M. Jovanović, K. Kuk, V. Stojanović, and E. Mehić, "Chatbot Application as Support Tool for the Learning Process of Basic Concepts of Telecommunications and Wireless Networks," Facta Universitatis, Series: Automatic Control and Robotics, 22(2), 2024, pp. 057-073.
- S. Balakrishnan, P. Jayanth, S. Parvathynathan, and R. Sivashankar, "Artificial intelligence-based vociferation chatbot for emergency health assistant," In AIP Conference Proceedings (Vol. 2742, No. 1). AIP Publishing, 2024.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, ... and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, 33, 2020, pp. 9459-9474.
- Huawei Talent. (n.d.). https://e.huawei.com/en/talent/portal/#/
- Z. Guo, R. Jin, C. Liu, Y. Huang, D. Shi, L. Yu, ... and D. Xiong, "Evaluating large language models: A comprehensive survey," arXiv preprint arXiv:2310.19736, 2023.
- H. Naveed, A.U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, ... and A. Mian, "A comprehensive overview of large language models," arXiv preprint arXiv:2307.06435, 2023.
- H.A. Alawwad, A. Alhothali, U. Naseem, A. Alkhathlan, and A. Jamal, "Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation," arXiv preprint arXiv:2402.05128, 2024.
- T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRa: Efficient finetuning of quantized LLMs," Advances in Neural Information Processing Systems, 36, 2024.
- J.C. Chow, L. Sanders, and K. Li, "Design of an educational chatbot using artificial intelligence in radiotherapy," AI, 4(1), 2023, pp. 319-332.
- N. Ghorashi, A. Ismail, P. Ghosh, A. Sidawy, R. Javan, and N.S. Ghorashi, "AI-powered chatbots in medical education: potential applications and implications," Cureus, 15(8), 2023.
- J. Wang, J. Macina, N. Daheim, S.P. Chowdhury, and M. Sachan, "Book2Dial: Generating Teacher-Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots," arXiv preprint arXiv:2403.03307, 2024.
- A. Bandi and H. Kagitha, "A Case Study on the Generative AI Project Life Cycle Using Large Language Models," Proceedings of the 39th International Conference, 98, 2024, pp. 189-199.
- A. Chen, G. Stanovsky, S. Singh, and M. Gardner, “Evaluating question answering evaluation,” in Proceedings of the 2nd workshop on machine reading for question answering, 2019, pp. 119–124.
- K. Papineni, S. Roukos, T. Ward, and W.J. Zhu, "BLEU: a method for automatic evaluation of machine translation," In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- C.Y. Lin, "ROUGE: A package for automatic evaluation of summaries," In Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81.
- S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
- T. Zhang, V. Kishore, F. Wu, K.Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” In International Conference on Learning Representations, 2020.