Authors :
K Tresha; Kavya; Medhaa PB; Pragathi T
Volume/Issue :
Volume 9 - 2024, Issue 12 - December
Google Scholar :
https://tinyurl.com/4fsceapa
Scribd :
https://tinyurl.com/25cuscwv
DOI :
https://doi.org/10.5281/zenodo.14470731
Abstract :
Text-to-video (T2V) generation is an emerging field in artificial intelligence, gaining traction with advances in deep learning models such as generative adversarial networks (GANs), diffusion models, and hybrid architectures. This paper provides a comprehensive survey of recent T2V methodologies, exploring models such as GAN-based frameworks, VQGAN-CLIP, IRC-GAN, OpenAI's Sora, and CogVideoX, which aim to transform textual descriptions into coherent video content. These models face challenges in maintaining semantic coherence, temporal consistency, and realistic motion across generated frames. We examine the architectural designs, methodologies, and applications of key models, highlighting the advantages and limitations of their approaches to video synthesis. Additionally, we discuss benchmark advancements such as T2VBench, which plays a crucial role in evaluating temporal consistency and content alignment. This review sheds light on the strengths and limitations of existing approaches and outlines ethical considerations and future directions for T2V generation in the realm of generative AI.
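For intuition, benchmarks in this space commonly reduce "content alignment" and "temporal consistency" to embedding-based similarity scores over the generated frames. The sketch below is a minimal, illustrative recipe using off-the-shelf CLIP features from the Hugging Face transformers library; it is a simplification of the general idea, not T2VBench's published protocol, and the backbone choice is an assumption.

```python
# Minimal sketch (assumed recipe, not T2VBench's actual protocol):
# score a generated clip for (a) text-video content alignment and
# (b) frame-to-frame temporal consistency using CLIP embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed backbone choice
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def score_clip(frames: list[Image.Image], prompt: str) -> tuple[float, float]:
    """Return (alignment, consistency), both mean cosine similarities in [-1, 1]."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # L2-normalize so dot products become cosine similarities.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    alignment = (img @ txt.T).mean().item()            # frame-to-prompt similarity
    consistency = torch.nn.functional.cosine_similarity(
        img[:-1], img[1:], dim=-1).mean().item()       # adjacent-frame similarity
    return alignment, consistency
```

Published benchmarks such as T2VBench go considerably further (dedicated temporal dimensions, curated prompt suites, human evaluation), but this captures the basic idea of scoring frames against both the prompt and their neighbors.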
Keywords :
Text-to-Video (T2V) Generation, Deep Learning, Generative Adversarial Networks (GANs), Diffusion Models, Hybrid Architectures, VQGAN-CLIP, IRC-GAN, OpenAI Sora, CogVideoX, Semantic Coherence, Temporal Consistency, Realistic Motion, Video Synthesis, Benchmark Advancements, T2VBench, Content Alignment, Ethical Considerations, Generative AI.
References :
- TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator, by Doyeon Kim, Donggyu Joo, and Junmo Kim. School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea.
- Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable Models, by Enis Karaarslan and Ömer Aydın.
- Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis, by Yogesh Balaji, Martin Renqiang Min, Bing Bai, Rama Chellappa, and Hans Peter Graf. University of Maryland, College Park; NEC Labs America, Princeton.
- Transforming Text into Video: A Proposed Methodology for Video Production Using the VQGAN-CLIP Image Generative AI Model, by SukChang Lee. Dept. of Digital Contents, Konyang University, Korea.
- To Create What You Tell: Generating Videos from Captions, by Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. University of Science and Technology of China, Hefei, China; Microsoft Research, Beijing, China.
- Video Generation from Text, by Yitong Li, Martin Renqiang Min, Dinghan Shen, David Carlson, and Lawrence Carin. Duke University, Durham, NC, United States; NEC Laboratories America, Princeton, NJ, United States.
- AutoLV: Automatic Lecture Video Generator, by Wenbin Wang, Yang Song, and Sanjay Jha. School of Computer Science and Engineering, University of New South Wales, Australia.
- Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation, by Jiawei Liu, Weining Wang, Sihan Chen, Xinxin Zhu, and Jing Liu.
- IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-Video Generation, by Kangle Deng, Tianyi Fei, Xin Huang, and Yuxin Peng. Institute of Computer Science and Technology, Peking University, Beijing, China.
- Sora OpenAI's Prelude: Social Media Perspectives on Sora OpenAI and the Future of AI Video Generation, by Reza Hadi Mogavi, Derrick Wang, Joseph Tu, Hilda Hadan, Sabrina A. Sgandurra, Pan Hui, and Lennart E. Nacke. Stratford School of Interaction Design and Business, University of Waterloo, Canada; Hong Kong University of Science and Technology (Guangzhou), Hong Kong SAR and Guangzhou, China.
- CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang.
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text, by Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Picsart AI Research (PAIR); UT Austin; SHI Labs @ Georgia Tech, Oregon & UIUC.
- TAVGBench: Benchmarking Text to Audible-Video Generation, by Yuxin Mao, Xuyang Shen, Jing Zhang, Zhen Qin, Jinxing Zhou, Mochu Xiang, Yiran Zhong, and Yuchao Dai. Northwestern Polytechnical University; OpenNLPLab, Shanghai AI Lab; Australian National University; TapTap; Hefei University of Technology.
- ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models, by Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, and Zhiwei Xiong. University of Science and Technology of China; Microsoft Research Asia.
- Rescribe: Authoring and Automatically Editing Audio Descriptions, by Amy Pavel, Gabriel Reyes, and Jeffrey P. Bigham.
- T2VBench: Benchmarking Temporal Dynamics for Text-to-Video Generation, by Pengliang Ji, Chuyang Xiao, Huilin Tai, and Mingxiao Huo. Carnegie Mellon University; ShanghaiTech University; McGill University.
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation, by Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Show Lab, National University of Singapore; ARC Lab, Tencent PCG.
- LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models, by Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu.
- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, by Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Tsinghua University; BAAI.