Authors :
Vatsal Verma; Darpan Khanna; Gaurvi Vishnoi; Shreyas Raturi
Volume/Issue :
Volume 7 - 2022, Issue 5 - May
Google Scholar :
https://bit.ly/3TmGbDi
Scribd :
https://tinyurl.com/ybd7hhhb
DOI :
https://doi.org/10.5281/zenodo.8394992
Abstract :
Much recent research has focused on computer vision and
natural language processing. Our research focuses on their
intersection, specifically generating pictures from captions.
We focus on the low-data regime, using the COCO and CUB
data sets, which include 200k and 11k picture-caption pairs,
respectively. We use a hierarchical GAN architecture as our
baseline [7][24][26]. To improve on this baseline, we try
several methods that target the upsampling blocks and add
residual or attention-based layers. We compare the inception
scores of the methods to analyze our results, and we also
consider qualitative results to ensure that mode collapse and
memorization are minimal. Of all our improvements, changing
the upsampling technique to a Laplacian pyramid method with
transposed convolutional layers obtains the best results, with
a minimal increase in computation time and memory needs.
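Since the comparison rests on the inception score, a minimal sketch of how that metric is commonly computed may help. It is an illustration and not the authors' evaluation code: it assumes the class probabilities p(y|x) for each generated image have already been obtained from a pretrained classifier, and the function name `inception_score` and the toy inputs are hypothetical. The score is exp of the mean KL divergence between each p(y|x) and the marginal p(y).

```python
import math


def inception_score(probs, eps=1e-12):
    """Inception score from per-image class probabilities.

    probs: list of rows, each row a probability distribution p(y|x)
    over classes for one generated image (assumed precomputed).
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(row[c] for row in probs) / n for c in range(num_classes)]
    # Sum of KL(p(y|x) || p(y)) over images; eps guards against log(0).
    kl_total = 0.0
    for row in probs:
        kl_total += sum(
            p * (math.log(p + eps) - math.log(q + eps))
            for p, q in zip(row, marginal)
        )
    return math.exp(kl_total / n)


# Toy inputs (hypothetical): identical uniform predictions vs.
# confident predictions spread evenly over four classes.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
confident = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

score_uniform = inception_score(uniform)      # close to 1, the minimum
score_confident = inception_score(confident)  # close to 4, the class count
```

A higher score requires both confident per-image predictions and a diverse marginal, which is why it is usually read alongside qualitative checks for mode collapse.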
Keywords :
Computer Vision, Natural Language Processing, StackGAN, Image Captioning, Machine Learning, Deep Learning.