Because an image can have a variety of meanings, it is difficult to generate short descriptions of those meanings automatically. Images carry many different kinds of information at once, which makes it hard to extract context from them and use that context to construct sentences. A system that can do this allows blind people to explore their surroundings independently. Deep learning can be used to build such a system. This project uses VGG16, a state-of-the-art convolutional neural network (CNN) architecture, for image classification and feature extraction. A long short-term memory (LSTM) network with an embedding layer generates the text description. These two networks are combined into a single image caption generation network, which is then trained on the Flickr8k dataset. Finally, the model's output is converted to audio for the benefit of visually impaired users.
Keywords: Deep Learning; Recurrent Neural Network; Convolutional Neural Network; VGG16; LSTM.
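The merge architecture outlined above (a VGG16 image-feature branch combined with an embedding + LSTM text branch, followed by a softmax over the vocabulary) can be sketched in plain NumPy. All layer sizes, the random weight initialisation, and the `caption_step` helper are illustrative assumptions for this sketch, not the project's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 4096-d VGG16 fc2 feature, 256-d hidden/embedding, toy vocabulary.
FEAT, HID, EMB, VOCAB = 4096, 256, 256, 1000

# Randomly initialised parameters (in the real project these would be trained).
W_img = rng.normal(0, 0.01, (FEAT, HID))   # image branch: dense projection
E     = rng.normal(0, 0.01, (VOCAB, EMB))  # embedding layer
W_x   = rng.normal(0, 0.01, (EMB, 4 * HID))  # LSTM input weights (i, f, g, o)
W_h   = rng.normal(0, 0.01, (HID, 4 * HID))  # LSTM recurrent weights
b     = np.zeros(4 * HID)
W_out = rng.normal(0, 0.01, (HID, VOCAB))  # decoder: hidden state -> vocabulary

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    """One LSTM step: gates computed from the embedded token and hidden state."""
    z = x @ W_x + h @ W_h + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def caption_step(img_feat, token_ids):
    """Merge model: image branch + LSTM text branch -> next-word distribution."""
    img = np.tanh(img_feat @ W_img)          # image branch
    h, c = np.zeros(HID), np.zeros(HID)
    for t in token_ids:                      # text branch: embedding + LSTM
        h, c = lstm_step(E[t], h, c)
    merged = img + h                         # additive merge of the two branches
    logits = merged @ W_out
    p = np.exp(logits - logits.max())
    return p / p.sum()                       # softmax over the vocabulary

# One decoding step: given an image feature and a partial caption, score the next word.
probs = caption_step(rng.normal(size=FEAT), [1, 42, 7])
print(probs.shape)
```

In the full system, `caption_step` would be applied repeatedly (feeding the predicted word back in) until an end-of-sequence token is produced, and the resulting sentence would be passed to a text-to-speech engine.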