Authors:
Rabia Bayraktar; Batuhan Sarıtürk; Merve Elmas Erdem
Volume/Issue:
Volume 9 - 2024, Issue 5 - May
Google Scholar:
https://tinyurl.com/y75jtaxs
Scribd:
https://tinyurl.com/2u33cdn7
DOI:
https://doi.org/10.38124/ijisrt/IJISRT24MAY1600
Abstract:
Training and accurately evaluating task-specific chatbots is an important research area for Large Language Models (LLMs). These models can be developed for general purposes with the ability to handle multiple tasks, or fine-tuned for specific applications such as education or customer support. In this study, the Mistral 7B, Llama-2, and Phi-2 models, which have proven successful on various benchmarks including question answering, are utilized. The models were fine-tuned using QLoRa on limited information gathered from course catalogs, and the fine-tuned models were then evaluated with various metrics, taking the responses of GPT-4 as the ground truth. The experiments revealed that Phi-2 slightly outperformed Mistral 7B, achieving scores of 0.012 BLEU, 0.184 METEOR, and 0.873 BERTScore. In light of the evaluation metrics obtained, the strengths and weaknesses of well-known LLMs, the amount of data required for fine-tuning, and the effect of the fine-tuning method on model performance are discussed.
Keywords:
LLM, Mistral, Llama, Phi, Fine-Tune, QLoRa.
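
For readers who want to see what the fine-tuning setup described in the abstract looks like in practice, the following is a minimal sketch of QLoRa fine-tuning using the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint name, LoRA hyperparameters, and target modules are illustrative assumptions, not the authors' reported configuration.

# Minimal QLoRa fine-tuning sketch (assumed setup, not the paper's exact config).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "microsoft/phi-2"  # or "mistralai/Mistral-7B-v0.1", or a Llama-2 checkpoint

# The core of QLoRa: the frozen base weights are loaded in 4-bit NF4,
# while small LoRA adapters are trained in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters to the attention projections (illustrative choices).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights

From this point the adapters can be trained on the course-catalog question-answer pairs with a standard transformers Trainer or trl's SFTTrainer loop.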
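
The evaluation step can likewise be sketched with the Hugging Face evaluate package (BERTScore additionally requires the bert-score package). The example sentences below are hypothetical stand-ins for a fine-tuned model's answer and the GPT-4 response treated as the reference, as in the paper.

# Sketch of computing BLEU, METEOR, and BERTScore against GPT-4 references.
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

predictions = ["The course covers supervised learning and neural networks."]     # fine-tuned model
references = ["The course introduces supervised learning and neural networks."]  # GPT-4, ground truth

print(bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"])
print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(bertscore.compute(predictions=predictions, references=references, lang="en")["f1"])

The metric profile reported in the abstract is typical of free-form question answering: strict n-gram overlap metrics such as BLEU stay very low (0.012) when wording diverges from the reference, while the embedding-based BERTScore remains high (0.873) for semantically similar answers.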
References:
- T.F. Tan, K. Elangovan, L. Jin, Y. Jie, L. Yong, J. Lim, S. Poh, W.Y. Ng, D. Lim, Y. Ke, N. Liu, D.S.W. Ting, "Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4," arXiv preprint arXiv:2402.10083, 2024.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, ... and T. Scialom, "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
- Y. Li, S. Bubeck, R. Eldan, A. Del Giorno, S. Gunasekar, and Y.T. Lee, "Textbooks are all you need II: phi-1.5 technical report," arXiv preprint arXiv:2309.05463, 2023.
- A.Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D.S. Chaplot, D.D.L. Casas, ... and W.E. Sayed, "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
- Hugging Face – The AI community building the future. (n.d.). https://huggingface.co/
- F. Khennouche, Y. Elmir, Y. Himeur, N. Djebari, and A. Amira, "Revolutionizing generative pre-traineds: Insights and challenges in deploying ChatGPT and generative chatbots for FAQs," Expert Systems with Applications, 246, 123224, 2024.
- M. Jovanović, K. Kuk, V. Stojanović, and E. Mehić, "Chatbot Application as Support Tool for the Learning Process of Basic Concepts of Telecommunications and Wireless Networks," Facta Universitatis, Series: Automatic Control and Robotics, 22(2), 2024, pp. 057-073.
- S. Balakrishnan, P. Jayanth, S. Parvathynathan, and R. Sivashankar, "Artificial intelligence-based vociferation chatbot for emergency health assistant," In AIP Conference Proceedings (Vol. 2742, No. 1). AIP Publishing, 2024.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, ... and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, 33, 2020, pp. 9459-9474.
- Huawei Talent. (n.d.). https://e.huawei.com/en/talent/portal/#/
- Z. Guo, R. Jin, C. Liu, Y. Huang, D. Shi, L. Yu, ... and D. Xiong, "Evaluating large language models: A comprehensive survey," arXiv preprint arXiv:2310.19736, 2023.
- H. Naveed, A.U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, ... and A. Mian, "A comprehensive overview of large language models," arXiv preprint arXiv:2307.06435, 2023.
- H.A. Alawwad, A. Alhothali, U. Naseem, A. Alkhathlan, and A. Jamal, "Enhancing Textbook Question Answering Task with Large Language Models and Retrieval Augmented Generation," arXiv preprint arXiv:2402.05128, 2024.
- T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRa: Efficient finetuning of quantized LLMs," Advances in Neural Information Processing Systems, 36, 2024.
- J.C. Chow, L. Sanders, and K. Li, "Design of an educational chatbot using artificial intelligence in radiotherapy," AI, 4(1), 2023, pp. 319-332.
- N. Ghorashi, A. Ismail, P. Ghosh, A. Sidawy, R. Javan, and N.S. Ghorashi, "AI-powered chatbots in medical education: potential applications and implications," Cureus, 15(8), 2023.
- J. Wang, J. Macina, N. Daheim, S.P. Chowdhury, and M. Sachan, "Book2Dial: Generating Teacher-Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots," arXiv preprint arXiv:2403.03307, 2024.
- A. Bandi and H. Kagitha, "A Case Study on the Generative AI Project Life Cycle Using Large Language Models," Proceedings of the 39th International Conference, 98, 2024, pp. 189-199.
- A. Chen, G. Stanovsky, S. Singh, and M. Gardner, “Evaluating question answering evaluation,” in Proceedings of the 2nd workshop on machine reading for question answering, 2019, pp. 119–124.
- K. Papineni, S. Roukos, T. Ward, and W.J. Zhu, "BLEU: a method for automatic evaluation of machine translation," In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- C.Y. Lin, "ROUGE: A package for automatic evaluation of summaries," In Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81.
- S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
- T. Zhang, V. Kishore, F. Wu, K.Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” In International Conference on Learning Representations, 2020.