⚠ Official Notice: www.ijisrt.com is the official website of the International Journal of Innovative Science and Research Technology (IJISRT) Journal for research paper submission and publication. Please beware of fake or duplicate websites using the IJISRT name.



AutoVidGen: A Fully Automated Text-to-Video Generation Pipeline Using AI and Multimedia Tools


Authors : Mohammed Lubna Firdous; Thameena Nousheen; Dr. K. Rajitha; K. Shirisha; Dr. K. Sreekala

Volume/Issue : Volume 11 - 2026, Issue 5 - May


Google Scholar : https://tinyurl.com/mw6kxn39

Scribd : https://tinyurl.com/4635jfav

DOI : https://doi.org/10.38124/ijisrt/26May562

Abstract : Short-form video has become one of the most effective ways to explain ideas, promote learning material, and share information on social media platforms. However, creating even a short vertical video usually requires script writing, voice recording, subtitle preparation, stock media selection, editing, and final rendering. This paper presents AutoVidGen, a Python-based automated text-to-video generation pipeline that converts a single user-provided topic into a complete vertical video. The system combines a large language model for narration script generation, neural text-to-speech for voiceover creation, Whisper-based transcription for timed captions, language-model-assisted keyword extraction for visual search, the Pexels API for stock video retrieval, and MoviePy with FFmpeg support for final video composition. The final output is rendered as a 1080 × 1920 MP4 video at 30 frames per second, making it suitable for platforms such as YouTube Shorts, Instagram Reels, and TikTok. The system is implemented with both command-line and Streamlit interfaces, allowing users to generate videos with minimal technical effort. Experimental use of the prototype indicates that the pipeline can quickly produce usable educational and informational videos, although the final quality depends on topic clarity, keyword relevance, stock footage availability, and caption rendering reliability.

Keywords : Text-to-Video Generation; Generative AI; Large Language Model; Text-to-Speech; Whisper; Automatic Subtitles; Stock Video Retrieval; MoviePy; Short-Form Video; Content Automation.
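
The stage order described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Stages` container, the function names, the choice of SRT as the caption format, and the scale-then-centre-crop strategy are all assumptions made for illustration. The stage callables are placeholders that a real system would back with the language model, the text-to-speech engine, Whisper, the Pexels API, and MoviePy. The only details taken from the abstract are the stage order and the 1080 × 1920, 30 fps output format.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Output format stated in the abstract.
TARGET_W, TARGET_H, FPS = 1080, 1920, 30

# Assumed shape of a timed transcript segment: {"start": s, "end": s, "text": str}.
Segment = Dict[str, Any]


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Segment]) -> str:
    """Turn timed transcript segments into SRT caption text (assumed format)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(float(seg["start"]))
        end = srt_timestamp(float(seg["end"]))
        blocks.append(f"{index}\n{start} --> {end}\n{str(seg['text']).strip()}")
    return "\n\n".join(blocks) + "\n"


def cover_crop(
    src_w: int, src_h: int, dst_w: int = TARGET_W, dst_h: int = TARGET_H
) -> Tuple[float, Tuple[int, int, int, int]]:
    """Scale a clip until it covers the target frame, then centre-crop.

    Returns (scale, (x1, y1, x2, y2)) where the box is in scaled-clip pixels.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w, scaled_h = round(src_w * scale), round(src_h * scale)
    x1 = (scaled_w - dst_w) // 2
    y1 = (scaled_h - dst_h) // 2
    return scale, (x1, y1, x1 + dst_w, y1 + dst_h)


@dataclass
class Stages:
    """Hypothetical stage callables; real ones would wrap external tools."""
    write_script: Callable[[str], str]             # topic -> narration script
    synthesize_voice: Callable[[str], str]         # script -> audio file path
    transcribe: Callable[[str], List[Segment]]     # audio path -> timed segments
    extract_keywords: Callable[[str], List[str]]   # script -> search terms
    fetch_clips: Callable[[List[str]], List[str]]  # search terms -> clip paths
    compose: Callable[[List[str], str, str], str]  # clips, audio, SRT -> video path


def generate_video(topic: str, stages: Stages) -> str:
    """Run the stages in the order the abstract lists them."""
    script = stages.write_script(topic)
    audio_path = stages.synthesize_voice(script)
    captions = segments_to_srt(stages.transcribe(audio_path))
    clips = stages.fetch_clips(stages.extract_keywords(script))
    return stages.compose(clips, audio_path, captions)
```

Under these assumptions, a landscape 1920 × 1080 stock clip would be scaled by a factor of about 1.78 and cropped to its central 1080-pixel-wide band. Passing the stages in as callables also lets the orchestration be exercised with stubs, without network access or API keys.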
