The dataset contains 30k images, each with 5 caption annotations. Here I have used only the first annotation of each image.
Parameters & Libraries
TensorFlow, NLTK, NumPy, and Pandas are used.
To convert text into embedding vectors, TextVectorization is used with a vocabulary size of 5000, a sequence length of 25, and an embedding dimension of 256.
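A minimal sketch of that text pipeline, using Keras `TextVectorization` and an `Embedding` layer with the stated sizes. The example captions are placeholders, not from the dataset, and the `[start]`/`[end]` markers follow the token convention used later in this README (note that the layer's default standardization strips the brackets).

```python
import tensorflow as tf

VOCAB_SIZE = 5000   # vocabulary size from the text
SEQ_LENGTH = 25     # sequence length from the text
EMBED_DIM = 256     # embedding dimension from the text

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=SEQ_LENGTH,
)

# Toy captions standing in for the Flickr annotations.
captions = tf.constant([
    "[start] a dog runs across the grass [end]",
    "[start] two children play on the beach [end]",
])
vectorizer.adapt(captions)

embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
tokens = vectorizer(captions)   # integer ids, shape (2, 25)
vectors = embedding(tokens)     # dense vectors, shape (2, 25, 256)
```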
The input image size for EfficientNet is (224, 224, 3), and the network is loaded with ImageNet weights.
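A sketch of the image feature extractor. The B0 variant is an assumption (the text only says "EfficientNet"); `weights=None` is used here so the sketch runs without downloading anything, whereas the project loads `weights="imagenet"`. The spatial feature map is flattened so the attention layer described below can attend over image regions.

```python
import tensorflow as tf

IMG_SHAPE = (224, 224, 3)  # input size from the text

# Assumption: EfficientNetB0; the project uses weights="imagenet".
cnn = tf.keras.applications.EfficientNetB0(
    include_top=False,       # drop the classification head
    weights=None,
    input_shape=IMG_SHAPE,
)
cnn.trainable = False  # keep the backbone frozen; the FC layer fine-tunes

images = tf.random.uniform((2, 224, 224, 3))
features = cnn(images)  # spatial feature map, shape (2, 7, 7, 1280)

# Flatten spatial dimensions to (batch, regions, channels) for attention.
features = tf.reshape(features, (features.shape[0], -1, features.shape[-1]))
```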
The GRU has 512 units.
Training & Evaluation
Training is done with a batch size of 64 for 25 epochs.
The encoder consists of EfficientNet followed by a fully connected (FC) layer for fine-tuning. The decoder consists of a GRU with an attention mechanism.
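The decoder side can be sketched as follows. Bahdanau-style additive attention is an assumption (the text only says "Attention Mechanism"); layer names and the concatenation of context vector and token embedding follow the standard TensorFlow image-captioning recipe, with the 512 GRU units, 256-dim embeddings, and 5000-word vocabulary stated above.

```python
import tensorflow as tf

UNITS = 512       # GRU units from the text
EMBED_DIM = 256   # embedding dimension from the text
VOCAB_SIZE = 5000 # vocabulary size from the text

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention over image regions (assumed attention variant)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, regions, dim); hidden: (batch, units)
        hidden_t = tf.expand_dims(hidden, 1)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_t))
        weights = tf.nn.softmax(self.V(score), axis=1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights

class Decoder(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.gru = tf.keras.layers.GRU(UNITS, return_sequences=True,
                                       return_state=True)
        self.fc1 = tf.keras.layers.Dense(UNITS)
        self.fc2 = tf.keras.layers.Dense(VOCAB_SIZE)
        self.attention = BahdanauAttention(UNITS)

    def call(self, token, features, hidden):
        # Attend over image features using the current hidden state,
        # then feed [context; token embedding] to the GRU.
        context, weights = self.attention(features, hidden)
        x = self.embedding(token)  # (batch, 1, EMBED_DIM)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc2(self.fc1(tf.reshape(output, (-1, output.shape[2]))))
        return x, state, weights

decoder = Decoder()
features = tf.random.uniform((4, 49, EMBED_DIM))  # projected image features
hidden = tf.zeros((4, UNITS))                     # initial decoder state
token = tf.fill((4, 1), 2)                        # assumed "[start]" id
logits, state, _ = decoder(token, features, hidden)
```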
First, the image is passed through EfficientNet to obtain an image context vector. This context vector, together with the hidden state (the initial state of the decoder), is passed to the attention layer; the attention output is then fed to the GRU along with the embedding vector of the "[start]" token.
Teacher forcing is used: during training, the embedding of the ground-truth target word is fed to the GRU at each step, rather than the model's own prediction.
During testing, the input to the GRU is the model's previous output together with the attention output.
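The two decoding loops can be contrasted as follows. A tiny `GRUCell` stands in for the full attention decoder, and the vocabulary size, token ids, and sequence are toy values chosen so the sketch stays small.

```python
import tensorflow as tf

VOCAB, UNITS, START_ID = 50, 16, 2  # toy sizes; START_ID is illustrative

embed = tf.keras.layers.Embedding(VOCAB, 8)
cell = tf.keras.layers.GRUCell(UNITS)   # stand-in for the full decoder
out = tf.keras.layers.Dense(VOCAB)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Two toy target captions: [start] ... [end]
target = tf.constant([[2, 7, 9, 3], [2, 5, 4, 3]])
batch = target.shape[0]

# --- Training with teacher forcing: feed the ground-truth previous token ---
hidden = tf.zeros((batch, UNITS))
loss = 0.0
for t in range(1, target.shape[1]):
    x = embed(target[:, t - 1])          # ground truth, not the prediction
    h, [hidden] = cell(x, [hidden])
    loss += loss_fn(target[:, t], out(h))

# --- Inference: feed back the model's own previous prediction ---
hidden = tf.zeros((1, UNITS))
token = tf.constant([START_ID])
generated = []
for _ in range(5):
    h, [hidden] = cell(embed(token), [hidden])
    token = tf.argmax(out(h), axis=-1, output_type=tf.int32)
    generated.append(int(token[0]))
```

In practice the inference loop also stops early when the "[end]" token is produced.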
The final loss obtained by the model is 0.511, and the BLEU score on the test data is 0.129.
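A sketch of how such a BLEU score can be computed with NLTK (listed among the libraries above). The captions here are toy examples, and smoothing is applied because short hypotheses often have zero higher-order n-gram overlap, which would otherwise collapse the score to 0.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference captions per image (a single reference here,
# matching the use of only the first annotation per image).
references = [
    [["a", "dog", "runs", "on", "grass"]],
]
hypotheses = [
    ["a", "dog", "runs", "on", "the", "grass"],  # model output (toy)
]

smooth = SmoothingFunction().method1
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
```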