Authors :
Mohammed Lubna Firdous; Thameena Nousheen; Dr. K. Rajitha; K. Shirisha; Dr. K. Sreekala
Volume/Issue :
Volume 11 - 2026, Issue 5 - May
Google Scholar :
https://tinyurl.com/mw6kxn39
Scribd :
https://tinyurl.com/4635jfav
DOI :
https://doi.org/10.38124/ijisrt/26May562
Abstract :
Short-form video has become one of the most effective ways to explain ideas, promote learning material, and
share information on social media. Producing even a short vertical video, however, usually involves script writing,
voice recording, subtitle preparation, stock media selection, editing, and final rendering. This paper presents AutoVidGen,
a Python-based automated text-to-video pipeline that converts a single user-provided topic into a complete
vertical video. The system combines a large language model for narration script generation, neural text-to-speech for
the voiceover, Whisper-based transcription for timed captions, language-model-assisted keyword extraction for
visual search, the Pexels API for stock video retrieval, and MoviePy with FFmpeg for final video composition. The
output is rendered as a 1080 × 1920 MP4 video at 30 frames per second, suitable for platforms such as
YouTube Shorts, Instagram Reels, and TikTok. Both command-line and Streamlit interfaces are provided, so users
can generate videos with minimal technical effort. Experimental use of the prototype shows that
the pipeline can quickly produce usable educational and informational videos; final quality depends on topic
clarity, keyword relevance, stock footage availability, and caption rendering reliability.
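Two of the steps named in the abstract, timed captions and the 1080 × 1920 output frame, lend themselves to a small illustration. The following is a minimal sketch only and is not taken from the AutoVidGen source. The helper names (srt_timestamp, segments_to_srt, cover_crop) are hypothetical, and the segment dictionaries are assumed to carry start, end, and text fields, which is the shape Whisper's transcription result uses for its segments. The sketch formats such segments as SubRip (SRT) captions and computes the scale factor and centered crop box that would make a stock clip fill a vertical frame.

```python
from typing import Dict, List, Tuple

# Output frame size stated in the abstract (width x height).
TARGET_W, TARGET_H = 1080, 1920


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Turn timed segments (assumed keys: start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"


def cover_crop(
    src_w: int, src_h: int, dst_w: int = TARGET_W, dst_h: int = TARGET_H
) -> Tuple[float, Tuple[int, int, int, int]]:
    """Return the scale factor and centered crop box (left, top, right, bottom)
    that make a source clip fully cover the target frame."""
    # Scale by the larger ratio so neither dimension falls short of the target.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w, scaled_h = round(src_w * scale), round(src_h * scale)
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return scale, (left, top, left + dst_w, top + dst_h)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 1.5, "text": " Short-form video "},
        {"start": 1.5, "end": 3.0, "text": "explains ideas quickly."},
    ]
    print(segments_to_srt(demo))
    # A landscape 1920 x 1080 stock clip must be upscaled and cropped at the sides.
    print(cover_crop(1920, 1080))
```

In this sketch a landscape clip is scaled until its height reaches 1920 pixels and the excess width is trimmed equally from both sides. A clip that is already 1080 × 1920 passes through with a scale of 1.0 and no crop.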
Keywords :
Text-to-Video Generation; Generative AI; Large Language Model; Text-to-Speech; Whisper; Automatic Subtitles; Stock Video Retrieval; MoviePy; Short-Form Video; Content Automation.