Zheng Z, Zheng L, Garrett M, et al. Dual-path convolutional image-text embedding with instance loss. arXiv preprint arXiv:1711.05535, 2017.
1. Overview
1.1. Motivation
- existing methods use RNNs for text feature learning and employ off-the-shelf CNNs for image feature extraction
- ImageNet pre-trained models do not preserve the rich image details that are critical for matching with language
This paper proposes
- a CNN for fine-tuning both the visual and textual representations (the network contains only Conv, Pooling, ReLU, and BN layers)
- an instance loss that views each multimodal data pair as a distinct class
1.2. Contribution
- dual-path CNN model
- instance loss
- outperforms prior methods on Flickr30K, MSCOCO, and CUHK-PEDES
1.3. Related Work
1.3.1. Model for Image Recognition
- fixed CNN feature as input
1.3.2. Model for Natural Language Understanding
- word2vec
- RNN, bidirectional LSTM
- CNN for machine translation (9.3x speedup)
1.3.3. Multi-modal Learning
- class-level retrieval. leverages the class labels in the training set
- instance-level retrieval. matches image-text pairs without using any class labels
This paper focuses on instance-level retrieval and proposes the instance loss.
1.4. Dataset
- Flickr30k. 5 sentences per image, avg 10.5 words per sentence after rare-word removal
- MSCOCO. avg 5 sentences per image, avg 8.7 words after rare-word removal
- CUHK-PEDES. 2 sentences per image, avg 19.6 words after rare-word removal
2. Method
2.1. Deep Image CNN
- pre-trained models still provide a good CNN initialization
- input. 224x224
- output. 2048 dimension vector
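A minimal PyTorch sketch of the image path (the paper's implementation is in MatConvNet; ResNet-50 is assumed here as the backbone, consistent with the 2048-d output):

```python
import torch
import torchvision.models as models

# Image path sketch: a pre-trained backbone whose pooled feature is 2048-d.
# ResNet-50 is assumed here (its global-average-pooled output is 2048-d).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # drop the 1000-way ImageNet classifier

images = torch.randn(32, 3, 224, 224)  # a batch of 224x224 crops
features = backbone(images)            # -> torch.Size([32, 2048])
```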
2.2. Deep Text CNN
2.2.1. Text Processing
- convert the sentence to a one-hot code T [n x d]
- n. length of the sentence (set to fixed length)
- d. size of the dictionary
- use word2vec to filter out rare words
- pad T with zeros if the sentence has fewer than the fixed number of words
- reshape T to 1 x 32 x d (h, w, c)
- position shift (more robust). pad a random number of zeros at the beginning and the end of the sentence
- left alignment. pad only at the end of the sentence
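A minimal NumPy sketch of this text processing (`encode_sentence` and the toy vocabulary are illustrative, not from the paper):

```python
import numpy as np

def encode_sentence(words, vocab, n=32, position_shift=True, rng=None):
    """Encode a sentence as a one-hot code T of shape (1, n, d).

    words: tokens with rare words already filtered out.
    vocab: dict mapping word -> index (size d).
    n: fixed sentence length; longer sentences are truncated,
       shorter ones are zero-padded.
    position_shift: start the sentence at a random offset (zeros padded
       at the beginning and end) instead of left alignment.
    """
    rng = rng or np.random.default_rng()
    d = len(vocab)
    T = np.zeros((n, d), dtype=np.float32)
    words = words[:n]
    # choose where the sentence starts inside the fixed-length code
    start = rng.integers(0, n - len(words) + 1) if position_shift else 0
    for i, w in enumerate(words):
        T[start + i, vocab[w]] = 1.0
    return T.reshape(1, n, d)  # (h=1, w=n, c=d) as in the paper

vocab = {"a": 0, "man": 1, "rides": 2, "horse": 3}
T = encode_sentence(["a", "man", "rides", "a", "horse"], vocab)
print(T.shape)  # (1, 32, 4)
```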
2.2.2. Deep Text CNN
- input. T (1 x 32 x d)
- output. 2048 dimension vector
- the filter size of the first conv is 1 x 1 x d x 300; two methods to initialize it
- random initialization
- using the d x 300 matrix from word2vec for initialization (better)
- Conv kernel size 1x2. every two neighbouring components may form a phrase carrying content information
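A PyTorch sketch of the first 1x1 conv with word2vec initialization, followed by a 1x2 "phrase" conv (toy sizes; in practice d is the dictionary size and the word2vec matrix is loaded from pre-trained vectors):

```python
import torch
import torch.nn as nn

d, n = 4, 32  # toy vocabulary size and fixed sentence length

# First conv: 1x1 kernel mapping the d-dim one-hot channel to 300 dims.
# Initialized with a d x 300 word2vec matrix, it acts as an embedding
# lookup at the start of training.
first_conv = nn.Conv2d(in_channels=d, out_channels=300, kernel_size=1)
word2vec = torch.randn(d, 300)  # placeholder; load real vectors in practice
with torch.no_grad():
    # Conv2d weight shape is (out, in, kH, kW) = (300, d, 1, 1)
    first_conv.weight.copy_(word2vec.t().reshape(300, d, 1, 1))
    first_conv.bias.zero_()

# Later convs use 1x2 kernels so two neighbouring word positions can be
# combined into a phrase-level response.
phrase_conv = nn.Conv2d(300, 300, kernel_size=(1, 2))

T = torch.randn(8, d, 1, n)     # batch of one-hot codes, (N, C, H, W)
x = first_conv(T)               # -> (8, 300, 1, 32)
x = phrase_conv(torch.relu(x))  # -> (8, 300, 1, 31)
```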
2.3. Loss Function
2.3.1. Ranking Loss
- I. visual input
- T. text input
- I_a/T_a. the same image/text group
- I_n/T_n. negative sample
- α. margin
the convergence of the ranking loss requires both the image and text branches to converge
- may be prone to getting stuck in a local minimum
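A sketch of the bidirectional ranking loss under these definitions (cosine similarity is assumed; the rolled-batch negatives are illustrative, not the paper's sampling scheme):

```python
import torch
import torch.nn.functional as F

def ranking_loss(f_i, f_t, f_i_neg, f_t_neg, alpha=1.0):
    """Bidirectional triplet ranking loss (sketch).
    For an anchor image I_a, the paired text T_a must score higher than a
    negative text T_n by margin alpha; symmetrically for an anchor text.
    """
    sim = lambda a, b: F.cosine_similarity(a, b, dim=1)
    l_i2t = F.relu(alpha - sim(f_i, f_t) + sim(f_i, f_t_neg))  # image anchor
    l_t2i = F.relu(alpha - sim(f_t, f_i) + sim(f_t, f_i_neg))  # text anchor
    return (l_i2t + l_t2i).mean()

f_img, f_txt = torch.randn(8, 2048), torch.randn(8, 2048)
loss = ranking_loss(f_img, f_txt, f_img.roll(1, 0), f_txt.roll(1, 0))
```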
2.3.2. Instance Loss
- assumption. each image/text group is distinct
- the instance loss is a softmax loss that classifies an image/text group into one of a large number of classes
- L. loss
- P. probability
- P(c). predicted probability of the correct class c
enforce shared weights in the fully connected layer for the two modalities; otherwise the learned image and text features may exist in totally different subspaces.
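A sketch of the instance loss with the shared classifier (the class count and features are toy values; in the paper each training image/text group is its own class):

```python
import torch
import torch.nn as nn

num_instances = 10000  # one class per training image/text group (toy size)

# A single fully connected classifier shared by BOTH modalities, so the
# image and text features are pushed into the same subspace.
shared_fc = nn.Linear(2048, num_instances)
ce = nn.CrossEntropyLoss()

f_img = torch.randn(8, 2048)   # image features
f_txt = torch.randn(8, 2048)   # features of the paired sentences
labels = torch.randint(0, num_instances, (8,))  # instance (group) labels

instance_loss = ce(shared_fc(f_img), labels) + ce(shared_fc(f_txt), labels)
```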
2.3.3. Total Loss
stage-1. only the instance loss (ranking-loss weight λ_1 = 0)
- so the ranking loss can find a better optimisation for both modalities in the second stage
- using the instance loss alone can lead to a competitive result
- instance loss encourages the model to find the fine-grained differences, such as ball, stick,..
stage-2. ranking loss + instance loss
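The two stages can be summarized as one weighted objective (a sketch; setting the stage-2 weights to 1.0 is an assumption here, the paper weights the terms with λ):

```python
def total_loss(l_ranking, l_instance_img, l_instance_txt, stage):
    """Two-stage objective (sketch). lambda_1 weights the ranking loss
    and is 0 in stage-1; the stage-2 weights below are assumptions."""
    lambda_1 = 0.0 if stage == 1 else 1.0
    return lambda_1 * l_ranking + l_instance_img + l_instance_txt
```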
2.3.4. Training Stage
- stage-1. fixed pre-trained image CNN, use instance loss to tune the remaining part
- if we train the image and text CNNs simultaneously, the text CNN may compromise the pre-trained image CNN
- stage-2. instance loss + ranking loss to fine-tune the entire network
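A sketch of the stage-wise parameter freezing (the Linear modules are stand-ins for the actual CNNs):

```python
import torch.nn as nn

model = nn.ModuleDict({
    "image_cnn": nn.Linear(4, 2048),  # stand-in for the pre-trained image CNN
    "text_cnn": nn.Linear(4, 2048),   # stand-in for the text CNN
})

# Stage-1: freeze the pre-trained image CNN so the randomly initialized
# text branch cannot compromise it; train the rest with the instance loss.
for p in model["image_cnn"].parameters():
    p.requires_grad = False

# Stage-2: unfreeze and fine-tune the entire network with
# instance loss + ranking loss.
for p in model["image_cnn"].parameters():
    p.requires_grad = True
```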
3. Experiments
3.1. Metric
- Recall@K. the probability that a true match appears in the top K of the ranked list
- Median Rank. the median rank of the closest ground-truth result in the ranked list
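A small sketch of how these metrics are computed from each query's 0-indexed rank of its closest true match (toy values):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose true match appears in the top k."""
    return float(np.mean(ranks < k))

def median_rank(ranks):
    """Median 1-indexed rank of the closest ground-truth result."""
    return float(np.median(ranks) + 1)

ranks = np.array([0, 2, 5, 0, 9])  # toy 0-indexed ranks of the true matches
print(recall_at_k(ranks, 1))  # 0.4
print(recall_at_k(ranks, 5))  # 0.6
print(median_rank(ranks))     # 3.0
```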
3.2. Details
- SGD + fixed 0.9 momentum
- MatConvNet framework
- 224x224 random crop from images resized to 256 on the shorter side
- horizontal flipping for image
- position shift for text
- 0.75 dropout
- max text length 32 for Flickr30K and MSCOCO, 56 for CUHK-PEDES
- LR. 0.001
- α=1
3.3. Comparison
3.4. Ablation Study
3.4.1. Loss
- stage-1. the ranking loss focuses on inter-modal distance and may be hard to use for tuning the visual and textual features simultaneously at the beginning
- stage-1. the instance loss performs better, as it focuses more on learning intra-modal discriminative descriptors
- the instance loss helps to regularise the model
3.4.2. Fine-tune
- fine-tune in stage-2 helps
3.4.3. Initialization
- word2vec initialization helps
3.4.4. Position Shift vs Left Alignment
- position shift (more robust, see 2.2.1) performs better
3.5. Training Time
- image CNN ~119ms per image batch (32) on 1080Ti
- text CNN ~117ms per sentence batch (32)
image and text features can be computed simultaneously, so the model can run efficiently in parallel.