Zheng Z, Zheng L, Garrett M, et al. Dual-path convolutional image-text embedding with instance loss. arXiv preprint arXiv:1711.05535, 2017.
1. Overview
1.1. Motivation
- existing methods use RNNs for text feature learning and employ off-the-shelf CNNs for image feature extraction
- ImageNet pre-trained models do not preserve the rich image details that are critical for matching with language
This paper proposes
- a CNN for fine-tuning both the visual and textual representations (the network contains only Conv, Pooling, ReLU, and BN layers)
- an instance loss that views each multimodal data pair as a distinct class
1.2. Contribution
- dual-path CNN model
- instance loss
- outperforms prior methods on Flickr30K, MSCOCO, and CUHK-PEDES
1.3. Related Work
1.3.1. Model for Image Recognition
- fixed CNN feature as input
1.3.2. Model for Natural Language Understanding
- word2vec
- RNN, bidirectional LSTM
- CNN for machine translation (9.3x speedup)
1.3.3. Multi-modal Learning
- class-level retrieval. leverages the class labels in the training set
- instance-level retrieval. matches image-text pairs without using any class labels
This paper focuses on instance-level retrieval and proposes the instance loss.
1.4. Dataset
- Flickr30k. 5 sentences per image, avg 10.5 words per sentence after rare-word removal
- MSCOCO. avg 5 sentences per image, avg 8.7 words after rare-word removal
- CUHK-PEDES. 2 sentences per image, avg 19.6 words after rare-word removal
2. Method
2.1. Deep Image CNN
- pre-trained models still provide a good CNN initialization
- input. 224x224
- output. 2048 dimension vector
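A minimal PyTorch sketch of the image path (the paper's implementation is in MatConvNet; ResNet-50 is assumed here as the backbone, consistent with the 2048-d output):

```python
import torch
import torchvision.models as models

# Image path sketch: a pre-trained backbone whose pooled feature is 2048-d.
# ResNet-50 is assumed here (its global-average-pooled output is 2048-d).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # drop the 1000-way ImageNet classifier

images = torch.randn(32, 3, 224, 224)  # a batch of 224x224 crops
features = backbone(images)            # -> torch.Size([32, 2048])
```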
2.2. Deep Text CNN
2.2.1. Text Processing
- convert the sentence to a one-hot code T [n x d]
- n. length of the sentence (set to fixed length)
- d. size of the dictionary
- use word2vec to filter out rare words
- pad T with zeros if the sentence has fewer than the fixed number of words
- reshape T to 1 x 32 x d (h, w, c)
- position shift (more robust). pad a random number of zeros at the beginning and the end of the sentence
- left alignment. pad only at the end of the sentence
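A minimal NumPy sketch of this text processing (`encode_sentence` and the toy vocabulary are illustrative, not from the paper):

```python
import numpy as np

def encode_sentence(words, vocab, n=32, position_shift=True, rng=None):
    """Encode a sentence as a one-hot code T of shape (1, n, d).

    words: tokens with rare words already filtered out.
    vocab: dict mapping word -> index (size d).
    n: fixed sentence length; longer sentences are truncated,
       shorter ones are zero-padded.
    position_shift: start the sentence at a random offset (zeros padded
       at the beginning and end) instead of left alignment.
    """
    rng = rng or np.random.default_rng()
    d = len(vocab)
    T = np.zeros((n, d), dtype=np.float32)
    words = words[:n]
    # choose where the sentence starts inside the fixed-length code
    start = rng.integers(0, n - len(words) + 1) if position_shift else 0
    for i, w in enumerate(words):
        T[start + i, vocab[w]] = 1.0
    return T.reshape(1, n, d)  # (h=1, w=n, c=d) as in the paper

vocab = {"a": 0, "man": 1, "rides": 2, "horse": 3}
T = encode_sentence(["a", "man", "rides", "a", "horse"], vocab)
print(T.shape)  # (1, 32, 4)
```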
2.2.2. Deep Text CNN
- input. T (1 x 32 x d)
- output. 2048 dimension vector
- the filter size of the first conv is 1 x 1 x d x 300; two methods to initialize it
- random initialization
- using the d x 300 matrix from word2vec for initialization (better)
- Conv kernel size 1x2. every two neighbouring components may form a phrase carrying content information
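A PyTorch sketch of the first 1x1 conv with word2vec initialization, followed by a 1x2 "phrase" conv (toy sizes; in practice d is the dictionary size and the word2vec matrix is loaded from pre-trained vectors):

```python
import torch
import torch.nn as nn

d, n = 4, 32  # toy vocabulary size and fixed sentence length

# First conv: 1x1 kernel mapping the d-dim one-hot channel to 300 dims.
# Initialized with a d x 300 word2vec matrix, it acts as an embedding
# lookup at the start of training.
first_conv = nn.Conv2d(in_channels=d, out_channels=300, kernel_size=1)
word2vec = torch.randn(d, 300)  # placeholder; load real vectors in practice
with torch.no_grad():
    # Conv2d weight shape is (out, in, kH, kW) = (300, d, 1, 1)
    first_conv.weight.copy_(word2vec.t().reshape(300, d, 1, 1))
    first_conv.bias.zero_()

# Later convs use 1x2 kernels so two neighbouring word positions can be
# combined into a phrase-level response.
phrase_conv = nn.Conv2d(300, 300, kernel_size=(1, 2))

T = torch.randn(8, d, 1, n)     # batch of one-hot codes, (N, C, H, W)
x = first_conv(T)               # -> (8, 300, 1, 32)
x = phrase_conv(torch.relu(x))  # -> (8, 300, 1, 31)
```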
2.3. Loss Function
2.3.1. Ranking Loss
- I. visual input
- T. text input
- I_a/T_a. the same image/text group
- I_n/T_n. negative sample
- α. margin
the convergence of the ranking loss requires both the image and text branches to converge
- may be prone to getting stuck in a local minimum
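A sketch of the bidirectional ranking loss under these definitions (cosine similarity is assumed; the rolled-batch negatives are illustrative, not the paper's sampling scheme):

```python
import torch
import torch.nn.functional as F

def ranking_loss(f_i, f_t, f_i_neg, f_t_neg, alpha=1.0):
    """Bidirectional triplet ranking loss (sketch).
    For an anchor image I_a, the paired text T_a must score higher than a
    negative text T_n by margin alpha; symmetrically for an anchor text.
    """
    sim = lambda a, b: F.cosine_similarity(a, b, dim=1)
    l_i2t = F.relu(alpha - sim(f_i, f_t) + sim(f_i, f_t_neg))  # image anchor
    l_t2i = F.relu(alpha - sim(f_t, f_i) + sim(f_t, f_i_neg))  # text anchor
    return (l_i2t + l_t2i).mean()

f_img, f_txt = torch.randn(8, 2048), torch.randn(8, 2048)
loss = ranking_loss(f_img, f_txt, f_img.roll(1, 0), f_txt.roll(1, 0))
```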
2.3.2. Instance Loss
- assumption. each image/text group is distinct
- the instance loss is a softmax loss that classifies an image/text group into one of a large number of classes
- L. loss
- P. probability
- P(c). predicted probability of the correct class c
enforce shared weights in the fully connected layer for the two modalities; otherwise the learned image and text features may exist in totally different subspaces.
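A sketch of the instance loss with the shared classifier (the class count and features are toy values; in the paper each training image/text group is its own class):

```python
import torch
import torch.nn as nn

num_instances = 10000  # one class per training image/text group (toy size)

# A single fully connected classifier shared by BOTH modalities, so the
# image and text features are pushed into the same subspace.
shared_fc = nn.Linear(2048, num_instances)
ce = nn.CrossEntropyLoss()

f_img = torch.randn(8, 2048)   # image features
f_txt = torch.randn(8, 2048)   # features of the paired sentences
labels = torch.randint(0, num_instances, (8,))  # instance (group) labels

instance_loss = ce(shared_fc(f_img), labels) + ce(shared_fc(f_txt), labels)
```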
2.3.3. Total Loss
stage-1. only the instance loss (ranking-loss weight λ_1 = 0)
- so the ranking loss can find a better optimisation for both modalities in the second stage
- using the instance loss alone can lead to a competitive result
- instance loss encourages the model to find the fine-grained differences, such as ball, stick,..
stage-2. ranking loss + instance loss
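The two stages can be summarized as one weighted objective (a sketch; setting the stage-2 weights to 1.0 is an assumption here, the paper weights the terms with λ):

```python
def total_loss(l_ranking, l_instance_img, l_instance_txt, stage):
    """Two-stage objective (sketch). lambda_1 weights the ranking loss
    and is 0 in stage-1; the stage-2 weights below are assumptions."""
    lambda_1 = 0.0 if stage == 1 else 1.0
    return lambda_1 * l_ranking + l_instance_img + l_instance_txt
```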
2.3.4. Training Stage
- stage-1. fixed pre-trained image CNN, use instance loss to tune the remaining part
- if we train the image and text CNNs simultaneously, the text CNN may compromise the pre-trained image CNN
- stage-2. instance loss + ranking loss to fine-tune the entire network
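A sketch of the stage-wise parameter freezing (the Linear modules are stand-ins for the actual CNNs):

```python
import torch.nn as nn

model = nn.ModuleDict({
    "image_cnn": nn.Linear(4, 2048),  # stand-in for the pre-trained image CNN
    "text_cnn": nn.Linear(4, 2048),   # stand-in for the text CNN
})

# Stage-1: freeze the pre-trained image CNN so the randomly initialized
# text branch cannot compromise it; train the rest with the instance loss.
for p in model["image_cnn"].parameters():
    p.requires_grad = False

# Stage-2: unfreeze and fine-tune the entire network with
# instance loss + ranking loss.
for p in model["image_cnn"].parameters():
    p.requires_grad = True
```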
3. Experiments
3.1. Metric
- Recall@K. the probability that a true match appears in the top K of the ranked list
- Median Rank. the median rank of the closest ground-truth result in the ranked list
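A small sketch of how these metrics are computed from each query's 0-indexed rank of its closest true match (toy values):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose true match appears in the top k."""
    return float(np.mean(ranks < k))

def median_rank(ranks):
    """Median 1-indexed rank of the closest ground-truth result."""
    return float(np.median(ranks) + 1)

ranks = np.array([0, 2, 5, 0, 9])  # toy 0-indexed ranks of the true matches
print(recall_at_k(ranks, 1))  # 0.4
print(recall_at_k(ranks, 5))  # 0.6
print(median_rank(ranks))     # 3.0
```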
3.2. Details
- SGD + fixed 0.9 momentum
- MatConvNet framework
- 224x224 random crop from images resized to 256 on the shorter side
- horizontal flipping for image
- position shift for text
- 0.75 dropout
- max text length 32 for Flickr30K and MSCOCO, 56 for CUHK-PEDES
- LR. 0.001
- α=1
3.3. Comparison
3.4. Ablation Study
3.4.1. Loss
- stage-1. the ranking loss focuses on inter-modal distance and may be hard to use for tuning the visual and textual features simultaneously at the beginning
- stage-1. the instance loss performs better, as it focuses more on learning intra-modal discriminative descriptors
- the instance loss helps to regularise the model
3.4.2. Fine-tune
- fine-tune in stage-2 helps
3.4.3. Initialization
- word2vec initialization helps
3.4.4. Position Shift vs Left Alignment
- position shift (more robust, see 2.2.1) performs better
3.5. Training Time
- image CNN ~119ms per image batch (32) on 1080Ti
- text CNN ~117ms per sentence batch (32)
image and text features can be computed simultaneously, so the model can run efficiently in parallel.