AI-Based Voice Cloning System: From Text to Speech


Authors : Md. Sadik; P. Vijaya; Y. Revathi; V. Siva Naga Tanuja; B. Soudhamini; R. Vaishnavi

Volume/Issue : Volume 10 - 2025, Issue 4 - April


Google Scholar : https://tinyurl.com/4tsj4649

Scribd : https://tinyurl.com/yc2fetr2

DOI : https://doi.org/10.38124/ijisrt/25apr834


Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.

Note : Google Scholar may take 15 to 20 days to display the article.


Abstract : The rapid advancements in Artificial Intelligence and Deep Learning have significantly improved Text-to-Speech (TTS) technology, enabling more accurate and natural-sounding speech synthesis. This project presents a Voice Cloning System that leverages a Transformer-based encoder and a GAN-based vocoder to generate high-quality, natural-sounding speech from text. The system supports both standard TTS, in which textual input is rendered in a default synthesized voice, and voice cloning, in which a new voice is replicated from a short audio sample. By employing a one-shot learning approach, the system achieves speaker adaptation with minimal training data, making it efficient and scalable for real-world applications. The Transformer-based encoder captures linguistic and prosodic features, while the GAN-based vocoder enhances the realism of the generated speech by refining spectral details. The model's ability to generalize across different speakers ensures robustness even when it is trained on limited data. This project highlights the potential of deep generative models in speech synthesis and their impact on domains such as assistive technology, where the system can help individuals with speech impairments communicate more naturally; personalized virtual assistants that adapt to user preferences; and the entertainment industry, for voiceovers and character dubbing.

Keywords : Voice Cloning, Text-to-Speech (TTS), Transformer Encoder, GAN-based Vocoder, One-Shot Learning, Speech Synthesis, Deep Learning, Artificial Intelligence, Speaker Adaptation, Generative Models, Speech Conversion.
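
Illustrative Sketch : The pipeline described in the abstract (a Transformer-based text encoder, a speaker encoder for one-shot adaptation from a short reference clip, and a GAN-based vocoder that turns mel frames into a waveform) can be outlined in a few lines of PyTorch. The sketch below is a minimal illustration only: the class names (SpeakerEncoder, TransformerTTS, GANVocoder), the layer sizes, and the wiring between components are assumptions made for this example, not the authors' implementation, and duration modelling, attention alignment, and the discriminator used to train the vocoder adversarially are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    """Maps a short reference mel-spectrogram to a fixed speaker embedding (one-shot adaptation)."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, ref_mel):                      # ref_mel: (B, T_ref, n_mels)
        _, h = self.rnn(ref_mel)
        return F.normalize(h[-1], dim=-1)            # (B, emb_dim)


class TransformerTTS(nn.Module):
    """Transformer encoder over text tokens, conditioned on the speaker embedding,
    predicting a mel-spectrogram (no duration model; one frame per token for brevity)."""
    def __init__(self, vocab=128, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.spk_proj = nn.Linear(256, d_model)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, tokens, spk_emb):              # tokens: (B, T_text)
        x = self.embed(tokens) + self.spk_proj(spk_emb).unsqueeze(1)
        return self.to_mel(self.encoder(x))          # (B, T_text, n_mels)


class GANVocoder(nn.Module):
    """Generator of a GAN-style vocoder: upsamples mel frames to a raw waveform.
    The discriminators used during adversarial training are not shown."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 128, hop * 2, stride=hop, padding=hop // 2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(128, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, mel):                          # mel: (B, T, n_mels)
        return self.net(mel.transpose(1, 2)).squeeze(1)   # (B, n_samples)


# One-shot cloning at inference time: a few seconds of reference audio
# (as a mel-spectrogram) condition the synthesis of arbitrary text.
spk_enc, tts, vocoder = SpeakerEncoder(), TransformerTTS(), GANVocoder()
ref_mel = torch.randn(1, 200, 80)                    # stand-in for a short reference clip
tokens = torch.randint(0, 128, (1, 40))              # stand-in for encoded input text
with torch.no_grad():
    wav = vocoder(tts(tokens, spk_enc(ref_mel)))
print(wav.shape)                                     # e.g. torch.Size([1, 10240])

In practice, the speaker encoder is typically pre-trained on a speaker-verification task and the vocoder is trained adversarially against one or more discriminators (as in HiFi-GAN), rather than used as the plain generator shown here.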


