Enhancing Conversational AI for Low-Resource Languages: A Case Study on Somali


Authors : Mohamud Osman Hamud; Serpil Aydın

Volume/Issue : Volume 10 - 2025, Issue 2 - February


Google Scholar : https://tinyurl.com/2t6w6kx4

Scribd : https://tinyurl.com/3edfhfd9

DOI : https://doi.org/10.5281/zenodo.14908879


Abstract : Conversational AI has made huge strides in understanding and generating human language, yet these advances have mostly benefited high-resource languages such as English and Spanish. In contrast, languages like Somali, spoken by an estimated 20 million people, lack the abundance of annotated data needed to develop robust language models. This study focuses on practical strategies to boost Somali text and speech processing capabilities. We explore three core approaches: (1) transfer learning, (2) synthetic data augmentation, and (3) fine-tuning multilingual models. Our experiments, featuring XLM-R, mBERT, and OpenAI’s Whisper API, show that well-adapted models significantly outperform their baseline counterparts on Somali text translation and speech-to-text tasks. Beyond the numbers, our findings underscore the societal value of creating accessible AI tools for underrepresented linguistic communities and provide a template for extending these methods to other low-resource languages.
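
To make the three approaches concrete, the sketches below show what each might look like in practice. They are minimal examples under stated assumptions, not the paper's reported implementation. The first sketch covers the fine-tuning approach: adapting XLM-R to a hypothetical Somali text-classification task with Hugging Face Transformers. The CSV file names, label count, and hyperparameters are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: fine-tuning XLM-R on a small labeled Somali dataset.
# File names, label count, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "xlm-roberta-base"  # multilingual encoder whose pretraining covers Somali

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical CSVs with columns "text" (Somali sentences) and "label" (0/1).
dataset = load_dataset(
    "csv",
    data_files={"train": "somali_train.csv", "validation": "somali_dev.csv"},
)

def tokenize(batch):
    # Pad/truncate to a fixed length so the default collator can batch examples.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="xlmr-somali",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # a common starting point for XLM-R fine-tuning
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

Synthetic data augmentation for low-resource languages is often done by back-translation. The sketch below round-trips Somali sentences through English with NLLB-200, which covers Somali under the code "som_Latn". This is a common recipe assumed for illustration; the abstract does not specify the paper's exact augmentation pipeline, and the seed sentence is hypothetical.

```python
# Minimal sketch: paraphrase generation by back-translation (Somali -> English -> Somali).
from transformers import pipeline

MODEL = "facebook/nllb-200-distilled-600M"  # multilingual MT model covering Somali
so_to_en = pipeline("translation", model=MODEL, src_lang="som_Latn", tgt_lang="eng_Latn")
en_to_so = pipeline("translation", model=MODEL, src_lang="eng_Latn", tgt_lang="som_Latn")

def back_translate(somali_sentence: str) -> str:
    """Round-trip a Somali sentence through English to obtain a paraphrase."""
    english = so_to_en(somali_sentence)[0]["translation_text"]
    return en_to_so(english)[0]["translation_text"]

# Hypothetical seed sentence; a real run would loop over a monolingual corpus.
print(back_translate("Barashada luqadda Soomaaliga waa muhiim."))
```

Finally, Somali speech-to-text through OpenAI's Whisper API can be requested with a language hint ("so" is the ISO 639-1 code for Somali). The audio file name is hypothetical, and an OPENAI_API_KEY environment variable is assumed.

```python
# Minimal sketch: transcribing Somali audio with OpenAI's Whisper API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("clip.mp3", "rb") as audio_file:  # hypothetical audio clip
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="so",  # hinting the source language can improve accuracy
    )

print(transcript.text)
```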

