International Journal of Innovative Science and Research Technology

AutoVidGen: A Fully Automated Text-to-Video Generation Pipeline Using AI and Multimedia Tools

Authors : Mohammed Lubna Firdous; Thameena Nousheen; Dr. K. Rajitha; K. Shirisha; Dr. K. Sreekala

Volume/Issue : Volume 11 - 2026, Issue 5 - May

Google Scholar : https://tinyurl.com/mw6kxn39

Scribd : https://tinyurl.com/4635jfav

DOI : https://doi.org/10.38124/ijisrt/26May562

Abstract : Short-form video has become one of the most effective ways to explain ideas, promote learning material, and share information on social media platforms. However, creating even a short vertical video usually requires script writing, voice recording, subtitle preparation, stock media selection, editing, and final rendering. This paper presents AutoVidGen, a Python-based automated text-to-video generation pipeline that converts a single user-provided topic into a complete vertical video. The system combines a large language model for narration script generation, neural text-to-speech for voiceover creation, Whisper-based transcription for timed captions, language-model-assisted keyword extraction for visual search, the Pexels API for stock video retrieval, and MoviePy with FFmpeg support for final video composition. The final output is rendered as a 1080 x 1920 MP4 video at 30 frames per second, making it suitable for platforms such as YouTube Shorts, Instagram Reels, and TikTok. The system is implemented with both command-line and Streamlit interfaces, allowing users to generate videos with minimal technical effort. Experimental use of the prototype shows that the pipeline can quickly produce usable educational and informational videos, while the final quality depends on topic clarity, keyword relevance, stock footage availability, and caption rendering reliability.
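As an illustration of the timed-caption stage mentioned in the abstract, the sketch below converts transcription segments into SubRip (SRT) text. It assumes segments shaped like the entries of the `segments` list returned by the open-source Whisper package's `transcribe` call (`start` and `end` in seconds, plus `text`). The function names `format_timestamp` and `segments_to_srt` are hypothetical and are not taken from the paper, which does not state the caption format it uses.

```python
from typing import Iterable, Mapping


def format_timestamp(seconds: float) -> str:
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Build SRT text from segments shaped like {"start", "end", "text"}.

    Assumption: this mirrors the per-segment dictionaries produced by
    Whisper transcription; the paper itself does not specify the format.
    """
    blocks = []
    index = 0
    for seg in segments:
        text = str(seg["text"]).strip()
        if not text:
            continue  # skip segments that contain no caption text
        index += 1
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello world "},
        {"start": 2.5, "end": 5.0, "text": "Second line"},
    ]
    print(segments_to_srt(demo))
```

Writing captions to a standard subtitle format like this keeps the transcription step independent of whichever renderer draws the text onto the video.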

Keywords : Text-to-Video Generation; Generative AI; Large Language Model; Text-to-Speech; Whisper; Automatic Subtitles; Stock Video Retrieval; MoviePy; Short-Form Video; Content Automation.
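The abstract states that the output is a 1080 x 1920 frame, while stock footage is often landscape. One plausible way to reconcile the two is a "cover" fit: scale the clip until it fills the target frame, then center-crop the overflow. The sketch below computes that geometry with integer arithmetic only. It is an assumption about how such a step could work, not the authors' documented method, and `cover_fit` is a hypothetical helper name.

```python
def cover_fit(src_w: int, src_h: int, dst_w: int = 1080, dst_h: int = 1920):
    """Return (scaled_w, scaled_h, crop_x, crop_y) for a center-cropped cover fit.

    The source is scaled uniformly until it covers the target frame, and
    the crop offsets mark the top-left corner of the centered target window.
    Assumption: center cropping is acceptable for the footage in question.
    """
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("all dimensions must be positive")

    # Compare dst_w/src_w with dst_h/src_h by cross-multiplying,
    # which avoids floating-point rounding surprises.
    if dst_w * src_h >= dst_h * src_w:
        # Width is the binding dimension: match it, round the height up.
        scaled_w = dst_w
        scaled_h = -(-src_h * dst_w // src_w)  # ceiling division
    else:
        # Height is the binding dimension: match it, round the width up.
        scaled_h = dst_h
        scaled_w = -(-src_w * dst_h // src_h)  # ceiling division

    crop_x = (scaled_w - dst_w) // 2
    crop_y = (scaled_h - dst_h) // 2
    return scaled_w, scaled_h, crop_x, crop_y


if __name__ == "__main__":
    # A 1920 x 1080 landscape clip fitted into the vertical frame.
    print(cover_fit(1920, 1080))
```

The returned values can be passed to whatever resize-then-crop operation the video library provides. Note that a landscape clip loses roughly two thirds of its width under this scheme, which is one reason footage relevance and framing affect the final quality.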
