Li S, Xiao T, Li H, et al. Identity-aware textual-visual matching with latent co-attention[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 1890-1899.
1. Overview
1.1. Motivation
- most existing methods tackle the textual-visual matching problem without effectively utilizing identity-level annotations
- RNNs have difficulty in remembering the complete sequential information of very long sentences
- RNNs are sensitive to variations in sentence structure
This paper proposes an identity-aware two-stage framework:
- stage-1. learns cross-modal feature embeddings with the Cross-Modal Cross-Entropy (CMCE) loss; provides the initial point for stage-2 training; screens out easy incorrect matchings
- stage-2. refine the matching results with a latent co-attention mechanism
- spatial attention. relates each word with corresponding image region
- latent semantic attention. aligns different sentence structures to make the matching results more robust to sentence structure variation; at each step of the LSTM, it learns how to weight different words’ features to be more invariant to sentence structure variations
1.2. Contribution
- identity-aware two-stage framework
- CMCE loss
- latent co-attention mechanism
1.3. Related Work
1.3.1. Visual Matching with Identity-Level Annotation
- person re-identification
- face recognition
- classify all identities simultaneously. faces challenges when the number of classes is too large
- pair-wise or triplet distance loss functions. hard negative training samples become difficult to sample as the number of training samples increases
1.3.2. Textual-Visual Matching
- image caption
- VQA
- text-image embedding
2. Methods
2.1. Stage-1 with CMCE Loss
- map image and description into a joint feature embedding space
- Cross-Modal Cross-Entropy (CMCE) to minimize intra-identity and maximize inter-identity feature distances
2.1.1. Cross-Modal Cross-Entropy Loss
- pair-wise or triplet loss. N identities yield O(N^2) training pairs, making it difficult to sample hard negatives
- CMCE. compares each identity in the mini-batch from one modality against all N identities in the other modality, so all hard negative samples are covered
- cross-modal affinity. inner products of features from the two modalities
- textual and visual feature buffers. enable efficient calculation of textual-visual affinities
- before the first iteration. if an identity has multiple descriptions or images, its stored features in the buffers are the average of the multiple samples
- in each iteration. compute the loss, back-propagate, then update the corresponding buffer rows of the sampled identities; a separate rule handles identities with multiple images or descriptions
- affinity between one image feature v and the i-th buffered textual feature S_i: p(i|v) = exp(S_i·v / σ) / Σ_j exp(S_j·v / σ)
- σ. temperature hyper-parameter controlling how peaky the probability distribution is
- affinity between one textual feature s and the k-th buffered image feature V_k: p(k|s) = exp(V_k·s / σ) / Σ_j exp(V_j·s / σ)
- the loss maximizes the probability of corresponding identity pairs in both directions (see the sketch below)
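To make the buffer mechanism concrete, here is a minimal PyTorch sketch of the CMCE loss, assuming one feature row per identity and a simple overwrite of the sampled rows after back-propagation; the class name `CMCELoss`, the method `update_buffers`, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class CMCELoss(torch.nn.Module):
    """Stage-1 CMCE loss with one buffered feature row per identity and modality."""
    def __init__(self, num_ids, feat_dim, sigma=0.04):
        super().__init__()
        self.sigma = sigma  # temperature: smaller values give a peakier softmax
        self.register_buffer("visual_buf", torch.zeros(num_ids, feat_dim))
        self.register_buffer("text_buf", torch.zeros(num_ids, feat_dim))

    def forward(self, v, s, ids):
        # v: (B, D) image features, s: (B, D) text features, ids: (B,) identity labels
        # affinity of each sample to ALL N buffered identities of the other modality
        logits_v2t = v @ self.text_buf.t() / self.sigma    # (B, num_ids)
        logits_t2v = s @ self.visual_buf.t() / self.sigma  # (B, num_ids)
        # cross-entropy maximizes the probability of the matching identity
        return F.cross_entropy(logits_v2t, ids) + F.cross_entropy(logits_t2v, ids)

    @torch.no_grad()
    def update_buffers(self, v, s, ids):
        # called after back-propagation: refresh the rows of the sampled identities
        self.visual_buf[ids] = v.detach()
        self.text_buf[ids] = s.detach()
```

A training iteration would compute the loss, back-propagate, step the optimizer, and only then call `update_buffers` for the sampled identities, matching the (calc loss)-(BP)-(update buffer) order noted above.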
2.2. Stage-2 with Latent Co-attention
- stage-1 limitation. the visual and textual feature embeddings might not be optimal, since the whole sentence is compressed into a single vector
- stage-1 limitation. sensitive to sentence structure variations
- input. a pair of text description and image
- output. matching confidence
- trained stage-1 network serves as the initial point for the stage-2 network
- only hard negative matching samples from stage-1 results are utilized for training stage-2
2.2.1. Encoder Word-LSTM with Spatial Attention
- compute an attention weight between each word and each of the L image regions
- take the weighted sum of all region features
- concatenate the word feature with the summed region feature
- word features from the LSTM. H = {h_1, …, h_T}, H ∈ R^(D_H × T)
- image features. I = {i_1, …, i_L}, I ∈ R^(D_I × L)
- D_H. dimension of hidden state
- D_I. dimension of image region feature
- T. number of words
- L. number of region
- W_I, W_H. transform the features into a K-dimensional space
- W_P. converts the feature to an affinity score
- weighted sum of the L region features according to a word
- concatenate the word feature and its weighted region feature (sketched below)
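A minimal PyTorch sketch of this spatial attention step, written under the assumption of an additive (tanh-then-linear) scoring form; the layer names follow the W_I, W_H, W_P notation above, while the exact non-linearity and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

class SpatialAttention(torch.nn.Module):
    """Attend over L image regions for one word hidden state h_t."""
    def __init__(self, d_h, d_i, k):
        super().__init__()
        self.W_H = torch.nn.Linear(d_h, k, bias=False)  # word hidden state -> K-dim
        self.W_I = torch.nn.Linear(d_i, k, bias=False)  # region feature    -> K-dim
        self.W_P = torch.nn.Linear(k, 1, bias=False)    # K-dim -> scalar affinity

    def forward(self, h_t, regions):
        # h_t: (B, D_H) hidden state of one word; regions: (B, L, D_I)
        proj = torch.tanh(self.W_I(regions) + self.W_H(h_t).unsqueeze(1))  # (B, L, K)
        alpha = F.softmax(self.W_P(proj).squeeze(-1), dim=1)               # (B, L)
        attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)              # (B, D_I)
        # concatenate the word's feature with its attended region feature
        return torch.cat([h_t, attended], dim=1)                           # (B, D_H+D_I)
```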
2.2.2. Decoder LSTM with Latent Semantic Attention
- generate weights between last hidden state and all x_t
- sum of weighted x_t
- the weighted sum is processed and fed into the next LSTM step
- a plain LSTM is not robust to sentence structure variations
- the decoder LSTM with latent semantic attention automatically aligns the sentence structure
- the M-step decoder LSTM processes the encoded features step by step while searching through the entire input sentence to align the image-word features x_t; at the m-th step:
- f. a two-layer CNN that weights the importance of the j-th word for the m-th decoding step
- c_{m-1}. hidden state of the decoder LSTM at step m-1
- x_m. the attended feature at step m, transformed by two FC layers before being fed into the decoder LSTM
- by re-weighting the source image-word features, the LSTM can focus on the relevant information, enhancing the network's robustness to sentence structure variations (a sketch follows at the end of this subsection)
- easier training samples are filtered out by the stage-1 network
- N’. the number of training samples for training the stage-2
3. Experiments
3.1. Dataset
- CUHK-PEDES. two descriptions per image
- Caltech-UCSD Birds (CUB). ten descriptions per image
- Oxford-102 Flowers. ten descriptions per image
3.2. Details
- σ=0.04
- Adam for the LSTM, SGD for the CNN (a sketch of this setup follows below)
- training and testing samples are screened by the matching results of stage-1
- for each visual or textual sample, take its 20 most similar samples from the other modality according to stage-1 and construct textual-visual pairs for stage-2 training and testing
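A brief sketch of the mixed-optimizer setup (Adam for the LSTM parameters, SGD for the CNN parameters); the toy module and the learning rates below are assumptions for illustration, not values from the paper.

```python
import torch

class ToyMatcher(torch.nn.Module):
    # stand-in model with a CNN branch and an LSTM branch,
    # only to show how the parameter groups are split
    def __init__(self):
        super().__init__()
        self.cnn = torch.nn.Conv2d(3, 16, kernel_size=3)
        self.lstm = torch.nn.LSTM(input_size=300, hidden_size=512, batch_first=True)

model = ToyMatcher()
optim_lstm = torch.optim.Adam(model.lstm.parameters(), lr=1e-4)             # lr assumed
optim_cnn = torch.optim.SGD(model.cnn.parameters(), lr=1e-3, momentum=0.9)  # lr assumed
# each iteration: zero both optimizers, back-propagate once, then step both
```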
3.3. Comparison
3.4. Ablation Study
- CMCE loss vs triplet loss. the triplet loss takes about 3× more training time than CMCE
- identity number vs counting number.
- latent semantic attention vs removing it. aligns visual and semantic concepts, mitigating sensitivity to sentence structure
- spatial attention vs simply concatenating visual and textual features.
- stage-1 vs w/o stage-1