Authors :
Mohamud Osman Hamud; Serpil Aydın
Volume/Issue :
Volume 10 - 2025, Issue 2 - February
Google Scholar :
https://tinyurl.com/2t6w6kx4
Scribd :
https://tinyurl.com/3edfhfd9
DOI :
https://doi.org/10.5281/zenodo.14908879
Abstract :
Conversational AI has made huge strides in understanding and generating human language. However, these
advances have mostly benefited high-resource languages such as English and Spanish. In contrast, languages like Somali—
spoken by an estimated 20 million people—lack the abundance of annotated data needed to develop robust language models.
This study focuses on practical strategies to boost Somali text and speech processing capabilities. We explore three core
approaches: (1) transfer learning, (2) synthetic data augmentation, and (3) fine-tuning multilingual models. Our
experiments, featuring XLM-R, mBERT, and OpenAI’s Whisper API, show that well-adapted models significantly
outperform their baseline counterparts in Somali text translation and speech-to-text tasks. Beyond the numbers, our findings
underscore the societal value of creating accessible AI tools for underrepresented linguistic communities, providing a
template for extending these methods to other low-resource languages.
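Illustrative Example :
As a rough illustration of the fine-tuning strategy summarized in the abstract (a minimal sketch, not the authors' actual pipeline), the Python snippet below adapts XLM-R to a hypothetical Somali text-classification set using the Hugging Face transformers and datasets libraries; the file name "somali_train.csv", the label count, and the hyperparameters are placeholders rather than the study's reported configuration.

# Minimal sketch: fine-tuning XLM-R on a hypothetical Somali text dataset.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder Somali data: a CSV with "text" and "label" columns.
train_data = load_dataset("csv", data_files={"train": "somali_train.csv"})["train"]

# Load the pretrained multilingual encoder plus a fresh classification head to fine-tune.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Tokenize the Somali text; XLM-R's multilingual vocabulary already covers Latin-script Somali.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

# Standard fine-tuning loop; these hyperparameters are illustrative, not tuned values from the paper.
args = TrainingArguments(output_dir="xlmr-somali", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_data).train()

The same pattern applies to mBERT by swapping in its checkpoint name, and Somali speech-to-text with OpenAI's Whisper API follows the provider's standard transcription call with the ISO-639-1 language code "so".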
References :
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” Proceedings of the ACM Conference on Fairness, Accountability, and Transparency.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., & Askell, A. (2020). “Language Models are Few-Shot Learners.” arXiv preprint arXiv:2005.14165.
- Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., … & Stoyanov, V. (2020). “Unsupervised Cross-lingual Representation Learning at Scale.” arXiv preprint arXiv:1911.02116.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT).
- Fadaee, M., Bisazza, A., & Monz, C. (2017). “Data Augmentation for Low-Resource Neural Machine Translation.” Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL).
- Howard, J., & Ruder, S. (2018). “Universal Language Model Fine-tuning for Text Classification.” arXiv preprint arXiv:1801.06146.