My fascination with the field of natural language processing led me to pursue a personal project: a recommendation engine that, given free-text input from the user, displays images relevant to the query's meaning.
- Image Captioning Algorithm
- NoSQL database implementation
- NLP Querying using BERT
- Methods to improve the current model
The architecture consists of two branches: a CNN branch and an LSTM branch. The training set comprises 6,000 images from the Flickr8k dataset, each with multiple reference captions. Given the wide variety of images, this is a rather small training set for a model of this capacity. Training took approximately 4 to 5 hours on this dataset. Prediction accuracy would likely improve by training on a larger dataset such as COCO, which has ~100,000 images; however, that would take approximately 3 days and was not attempted due to time constraints.
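The two-branch design above can be sketched as a standard "merge" captioning model in Keras. This is a minimal illustration, not the exact project code: the 2048-dimensional image feature (as produced by a pretrained CNN encoder such as InceptionV3), the `vocab_size`, `max_len`, and the 256-unit layer widths are all assumptions for the sketch.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000  # assumption: size of the caption vocabulary
max_len = 34       # assumption: maximum caption length in tokens

# CNN branch: takes a precomputed image feature vector from a pretrained encoder
img_in = Input(shape=(2048,))
x1 = Dropout(0.5)(img_in)
x1 = Dense(256, activation="relu")(x1)

# LSTM branch: takes the partial caption as a sequence of token ids
cap_in = Input(shape=(max_len,))
x2 = Embedding(vocab_size, 256, mask_zero=True)(cap_in)
x2 = Dropout(0.5)(x2)
x2 = LSTM(256)(x2)

# Merge both branches and predict the next word of the caption
merged = add([x1, x2])
out = Dense(256, activation="relu")(merged)
out = Dense(vocab_size, activation="softmax")(out)

model = Model(inputs=[img_in, cap_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time, captions are generated word by word: the model is fed the image feature plus the caption generated so far, and the highest-probability next word is appended until an end token or `max_len` is reached.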
I took 200 images from the test dataset; these were converted into equivalent strings and stored in the NoSQL database (MongoDB), along with the caption the trained model generated for each image. New images can be added to the database continuously without retraining the model.
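One simple way to store an image as a string alongside its caption is base64 encoding. The sketch below shows a possible document schema; the helper name, field names, and connection details are hypothetical, not taken from the project.

```python
import base64

def make_image_doc(image_id, image_bytes, caption):
    """Build a MongoDB document holding a base64-encoded image and its generated caption."""
    return {
        "_id": image_id,
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "caption": caption,
    }

doc = make_image_doc("img_001", b"\x89PNG...", "a dog runs through the grass")

# With a live MongoDB server, insertion is a single call (hypothetical names):
# from pymongo import MongoClient
# MongoClient("mongodb://localhost:27017")["captions_db"]["images"].insert_one(doc)
```

Because the caption is stored next to the image, the query stage only ever reads captions from the database; the model is not needed at query time.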
Next, the input query was passed through the BERT model, which embeds a sentence into a vector space where semantically related tokens lie close together. This way, matching against the captions is oriented towards the intent of the query rather than plain word comparison: BERT can match sentences with similar semantic meaning even when they use different words. Each caption from the database was passed through BERT and its cosine similarity with the query was measured; if the similarity crossed a threshold of 0.6, the corresponding image was displayed. A web interface that connects to the database on the server and accepts user queries is currently a work in progress.
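The retrieval step can be sketched as follows. Cosine similarity is shown in plain NumPy with toy 2-D vectors; in the project, the query and caption vectors would be BERT sentence embeddings, and the 0.6 threshold is the one mentioned above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matching_caption_indices(query_vec, caption_vecs, threshold=0.6):
    """Return indices of captions whose embedding clears the similarity threshold."""
    return [i for i, v in enumerate(caption_vecs)
            if cosine_similarity(query_vec, v) >= threshold]

# Toy example: the first caption vector points nearly the same way as the query,
# the second is orthogonal, so only index 0 is returned.
query = np.array([1.0, 0.0])
captions = [np.array([0.9, 0.1]), np.array([0.0, 1.0])]
hits = matching_caption_indices(query, captions)  # -> [0]
```

The images whose captions appear in `hits` are then fetched from MongoDB and displayed to the user.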
- Train it on a larger dataset such as COCO, so it can learn a wider variety of tokens and generate much more accurate captions.
- Use attention-based models to focus on important aspects of the image.
- Modify the existing architecture, perhaps by increasing the number of parameters or adding regularization.