Automatic Image Captioning Using Encoder-Decoder Architecture

In this project, you will create a neural network architecture to automatically generate captions from images. Captioning combines the most powerful attributes of CNNs (convolutional neural networks) and RNNs (recurrent neural networks). In essence, the model we will be constructing learns from a bunch of images paired with captions that describe the contents of those images.
End to end, we want to take an image as input and output a caption for that image. The input image is processed by a CNN, and we connect the output of the CNN to an RNN that generates the caption.

Data 

The Microsoft Common Objects in Context (MS COCO) dataset is a large-scale dataset for scene understanding. The dataset is commonly used to train and benchmark object detection, segmentation, and captioning algorithms.


Loading a sample image from the dataset, with 5 of its sample captions (a minimal loading sketch follows the captions below):

a balloon elephant sits in the middle of a park area
A ceramic elephant that is standing up on fake water.
An elephant statue standing on top of a lush green park.
A sculpture of an elephant, decorated with images of other animals.
An elephant statue with an opening of various drawings on it.
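Here is a minimal sketch of how a sample image and its captions can be pulled from MS COCO with the pycocotools API. The annotation file path is an assumption; point it at your local copy of the dataset.

```python
import numpy as np
import skimage.io as io
import matplotlib.pyplot as plt
from pycocotools.coco import COCO

# path to the caption annotation file is an assumption -- point it at your copy
coco = COCO('annotations/captions_train2014.json')

# pick a random image and fetch its metadata
img_ids = coco.getImgIds()
img_info = coco.loadImgs(img_ids[np.random.randint(0, len(img_ids))])[0]

# display the image, loaded straight from its COCO URL
plt.imshow(io.imread(img_info['coco_url']))
plt.axis('off')
plt.show()

# print the captions attached to this image
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_info['id'])):
    print(ann['caption'])
```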


The architecture I will be using to build my Image Captioning model is the Encoder-Decoder architecture.

Encoder

We first feed a training image, say a picture of a man holding pizza, to a CNN: a pretrained network like VGG-16 or ResNet.
At the end of the network is a softmax classifier that outputs a vector of class scores. But we don't want to classify the image. Instead, we want a set of features that represents the spatial content of the input image. To get this information, we remove the final fully connected layer that classifies the image and use the output of the earlier layers, which carry richer information about the image.

What we have is then a CNN that we use as a feature extractor that compresses the feature information of the image into a smaller representation. This CNN is what we call the encoder because it "encodes" the content of the image into a smaller feature vector. We can then process this feature vector and use it as input to the RNN.


The encoder we will be using is the pre-trained ResNet-50 architecture (with the final fully-connected layer removed), which extracts features from a batch of pre-processed images. The output is then flattened to a vector and passed through a linear layer so that the feature vector has the same size as the word embeddings.
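Before they reach the encoder, the images are pre-processed. Below is a minimal sketch of an ImageNet-style transform pipeline; the exact crop sizes and augmentations are assumptions, not necessarily the settings used in the project.

```python
import torchvision.transforms as transforms

# Sketch of the image pre-processing pipeline for the encoder
transform_train = transforms.Compose([
    transforms.Resize(256),                      # shorter side to 256 px
    transforms.RandomCrop(224),                  # 224x224 crop for ResNet
    transforms.RandomHorizontalFlip(),           # light augmentation
    transforms.ToTensor(),                       # PIL image -> tensor in [0, 1]
    transforms.Normalize((0.485, 0.456, 0.406),  # ImageNet channel means
                         (0.229, 0.224, 0.225))  # ImageNet channel stds
])
```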

The Glue

It is sometimes useful to pass the CNN output to an additional fully connected or linear layer before passing it to the RNN. This is in fact similar to other transfer learning examples. Note that the CNN we are using is pre-trained. Adding an untrained linear layer at the end of it allows us to tweak only the final layer as we train the entire model to generate captions.
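As a rough sketch of this idea, only the parameters of the new, untrained layers are handed to the optimizer, while the pretrained ResNet weights stay frozen. The encoder.embed attribute, the Adam optimizer, and the learning rate here are assumptions that line up with the encoder and decoder classes sketched in the Code section below.

```python
import torch.optim as optim

def build_optimizer(encoder, decoder, lr=0.001):
    """Optimize only the new layers: the encoder's linear embed layer and
    the whole decoder. The pretrained ResNet inside the encoder is frozen."""
    params = list(encoder.embed.parameters()) + list(decoder.parameters())
    return optim.Adam(params, lr=lr)
```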

Decoder
The job of the RNN is to decode the processed feature vector and turn it into natural language. This portion of the network is often called the decoder.

The RNN component is trained on the captions in the COCO dataset. We want the RNN to predict the next word of a sentence based on the previous words. But this is impossible if we deal directly with string data; neural networks need well-defined numerical values to perform backpropagation.

So we tokenize the words: we transform each sentence into a list of words, and then this list of words is transformed into a list of integers, where each number maps back to a word in a vocabulary. A token is a fancy term for a symbol that corresponds to a meaning. In natural language processing, tokens are individual words, so tokenizing is simply splitting a sentence into a sequence of words. There are useful libraries for this, such as nltk in Python.
But before sending this to the RNN, we do embedding: embedding transforms each word in a caption into a vector.
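Here is a minimal sketch of those two steps with nltk and PyTorch. The toy vocabulary, the <start>/<end>/<unk> tokens, and the 256-dimensional embedding are illustrative assumptions; in the actual project the vocabulary is built from every caption in the training set.

```python
import nltk
import torch
import torch.nn as nn

nltk.download('punkt', quiet=True)  # tokenizer models, downloaded once

caption = "A sculpture of an elephant, decorated with images of other animals."

# 1) tokenize: sentence -> list of lowercase word tokens
tokens = nltk.tokenize.word_tokenize(caption.lower())

# 2) map tokens to integers via a (toy) vocabulary
word2idx = {'<start>': 0, '<end>': 1, '<unk>': 2}
for tok in tokens:
    word2idx.setdefault(tok, len(word2idx))
ids = ([word2idx['<start>']]
       + [word2idx.get(tok, word2idx['<unk>']) for tok in tokens]
       + [word2idx['<end>']])
caption_tensor = torch.tensor(ids).unsqueeze(0)   # shape: (1, seq_len)

# 3) embed: each integer id becomes a dense 256-dimensional vector
embedding = nn.Embedding(num_embeddings=len(word2idx), embedding_dim=256)
print(embedding(caption_tensor).shape)            # torch.Size([1, seq_len, 256])
```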

The RNN has 2 responsibilities:
1. Remember spatial information from the input feature vector
2. Predict the next word.

The first word produced is the <start> token. At every timestep, we take the current word as input and combine it with the hidden state of the LSTM cell to produce an output. This output is then passed to a fully connected layer that produces a distribution over the likely next words. We feed the predicted word back into the network until we reach the <end> token. The RNN uses backpropagation to update its weights until it learns to produce the correct sequence of words for the caption of the input image.
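To make that loop concrete, here is a minimal sketch of greedy decoding built around an nn.LSTMCell. The function, its arguments, and the 20-word cap are illustrative assumptions; the project's decoder wraps equivalent logic in its own sampling method.

```python
import torch

def greedy_decode(features, embed, lstm_cell, fc, word2idx, idx2word, max_len=20):
    """Illustrative greedy decoding: prime the LSTM with the image features,
    then repeatedly feed the most likely word back in until <end> appears."""
    h = torch.zeros(1, lstm_cell.hidden_size)
    c = torch.zeros(1, lstm_cell.hidden_size)

    # the image feature vector is the very first input to the LSTM cell
    h, c = lstm_cell(features, (h, c))

    word_id = word2idx['<start>']
    words = []
    for _ in range(max_len):
        x = embed(torch.tensor([word_id]))     # embed the current word, shape (1, embed)
        h, c = lstm_cell(x, (h, c))            # advance the hidden state
        scores = fc(h)                         # scores over the whole vocabulary
        word_id = scores.argmax(dim=1).item()  # greedy choice of the next word
        if idx2word[word_id] == '<end>':
            break
        words.append(idx2word[word_id])
    return ' '.join(words)
```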

The model updates its weights after each training batch, where the batch size is the number of image-caption pairs processed at once.
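A minimal sketch of one such training step, assuming the encoder and decoder sketched in the Code section below and a cross-entropy loss over the vocabulary:

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train_step(encoder, decoder, optimizer, images, captions, vocab_size):
    """One illustrative weight update on a batch of image-caption pairs.
    images: (batch, 3, 224, 224) tensors; captions: (batch, seq_len) word ids."""
    optimizer.zero_grad()
    features = encoder(images)             # (batch, embed_size)
    outputs = decoder(features, captions)  # (batch, seq_len, vocab_size)
    # compare the predicted word distribution at every timestep to the true word
    loss = criterion(outputs.reshape(-1, vocab_size), captions.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```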
Code
Import important libraries first. And here's what the encoder looks like in code:
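What follows is a minimal sketch of such an encoder in PyTorch; the class name EncoderCNN, the embed_size argument, and the exact layer choices are assumptions rather than the post's original listing.

```python
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        # pre-trained ResNet-50, frozen, with its final classification layer removed
        resnet = models.resnet50(pretrained=True)
        for param in resnet.parameters():
            param.requires_grad_(False)
        self.resnet = nn.Sequential(*list(resnet.children())[:-1])
        # linear layer maps the flattened features to the word-embedding size
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.resnet(images)                  # (batch, 2048, 1, 1)
        features = features.view(features.size(0), -1)  # flatten to (batch, 2048)
        return self.embed(features)                     # (batch, embed_size)
```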
Here's the decoder.
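Again a minimal sketch, assuming an LSTM decoder trained with teacher forcing; the class name DecoderRNN and its hyperparameters are illustrative.

```python
import torch
import torch.nn as nn


class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        # word embedding: integer word ids -> dense vectors
        self.embed = nn.Embedding(vocab_size, embed_size)
        # the LSTM reads the image features followed by the embedded caption words
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        # maps each hidden state to a score for every word in the vocabulary
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # drop the <end> token; the network learns to predict it instead
        embeddings = self.embed(captions[:, :-1])               # (batch, seq-1, embed)
        # prepend the image feature vector as the first "word" of the sequence
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)                           # (batch, seq, hidden)
        return self.fc(hiddens)                                  # (batch, seq, vocab)
```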

Results

To do a sanity check, I got an image from the dataset and fed it to the trained model. Here's what I got:
a bunch of green bananas growing in a tree
It's actually good. So I got a bunch of my own photos and tried the model on them.


So there we have it. Our image captioning model! I actually had fun doing this. Again, this is part of Udacity's nanodegree course in Computer Vision.
