Dense Caption Imagining | International Journal of Innovative Science and Research Technology

Dense Caption Imagining

Authors : Vatsal Verma; Darpan Khanna; Gaurvi Vishnoi; Shreyas Raturi

Volume/Issue : Volume 7 - 2022, Issue 5 - May

Google Scholar : https://bit.ly/3TmGbDi

Scribd : https://tinyurl.com/ybd7hhhb

DOI : https://doi.org/10.5281/zenodo.8394992

Abstract : Much recent research has addressed computer vision and natural language processing. Our work lies at their intersection: generating images from captions. We focus on the low-data regime, using the COCO and CUB datasets, which contain 200k and 11k image-caption pairs, respectively. We use a hierarchical GAN architecture as our baseline [7][24][26]. To improve on this baseline, we try several modifications: changing the upsampling blocks and adding residual or attention-based layers. We compare the methods by inception score and also inspect qualitative results to check for mode collapse and memorization. Of all the modifications, replacing the upsampling technique with a Laplacian pyramid method built on transposed convolutional layers gives the best results, with a minimal increase in computation time and memory use.
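
The inception score used for comparison can be illustrated with a minimal sketch. It assumes the standard definition, the exponential of the mean KL divergence between each image's predicted class distribution p(y|x) and the marginal p(y) over all generated images. The function name `inception_score` and the plain-list input format are illustrative assumptions, not the authors' code; in practice the probabilities would come from a pretrained Inception classifier.

```python
import math


def inception_score(probs):
    """Compute exp(mean KL(p(y|x) || p(y))) from class probabilities.

    probs: list of per-image probability lists, each summing to 1.
    (Hypothetical input format; a real pipeline would obtain these
    from a pretrained classifier.)
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    total_kl = 0.0
    for p in probs:
        for j in range(k):
            # Terms with p[j] == 0 contribute nothing to the KL sum.
            if p[j] > 0:
                total_kl += p[j] * math.log(p[j] / marginal[j])
    return math.exp(total_kl / n)


# Confident and diverse predictions score higher than collapsed ones.
diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
collapsed = inception_score([[1.0, 0.0], [1.0, 0.0]])
```

With two classes, the diverse case reaches the maximum score of 2, while the collapsed case (every image assigned to the same class) gives 1. This is why the score alone cannot rule out memorization, and why qualitative inspection is also needed.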

Keywords : Computer Vision, Natural Language Processing, StackGAN, Image Captioning, Machine Learning, Deep Learning.
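
The Laplacian pyramid idea named in the abstract can be sketched on a one-dimensional signal. This is only an illustration of the general technique under stated assumptions: the fixed nearest-neighbor `upsample` stands in for the learned transposed convolutional layers, and all function names are hypothetical. The abstract does not give the authors' exact architecture, so this should not be read as their implementation.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs (even length assumed)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating each value.

    Assumption: stands in for a learned transposed convolution.
    """
    out = []
    for value in signal:
        out.extend([value, value])
    return out


def build_pyramid(signal, levels):
    """Return the coarsest signal and per-level residuals, finest first."""
    residuals = []
    current = signal
    for _ in range(levels):
        coarse = downsample(current)
        predicted = upsample(coarse)
        # The residual holds the detail lost by going to the coarser level.
        residuals.append([a - b for a, b in zip(current, predicted)])
        current = coarse
    return current, residuals


def reconstruct(coarse, residuals):
    """Rebuild the signal coarse-to-fine: upsample, then add the residual."""
    current = coarse
    for residual in reversed(residuals):
        current = [u + r for u, r in zip(upsample(current), residual)]
    return current


signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
coarse, residuals = build_pyramid(signal, levels=2)
restored = reconstruct(coarse, residuals)
```

In a generator built this way, each stage would upsample the previous stage's output and only have to produce the residual detail at the new resolution, rather than the full image.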
