
(CVPR 2017) Person search with natural language description

Li S, Xiao T, Li H, et al. Person search with natural language description[C]//Proc. CVPR. 2017.



1. Overview


1.1. Motivation

  • existing methods mainly focus on searching for persons with image-based or attribute-based queries, which limits their practical usage
  • no person dataset or benchmark with textual descriptions is available

This paper studies person search with natural language description:

  • proposed a Recurrent Neural Network with a Gated Neural Attention mechanism (GNA-RNN)
  • collected the CUHK Person Description Dataset (CUHK-PEDES)


1.2. Image-Based Query

  • person re-identification, which requires at least one photo of the queried person to be given

1.3. Attribute-Based Query

  • pre-defined semantic attributes have limited capability of describing persons' appearance, and labeling an exhaustive set of attributes is expensive

1.4. Contribution

  • person search with language is more practical for real-world applications
  • investigated different solutions: image captioning, VQA, visual-semantic embedding
  • proposed GNA-RNN

1.5. Related Work

1.5.1. Language Datasets for Vision

  • Flickr8K, Flickr30K
  • MS-COCO Caption
  • Visual Genome
  • Caltech-UCSD Birds
  • Oxford-102 flowers

1.5.2. Deep Language Models for Vision

  • Image Captioning. NeuralTalk
  • VQA. Stacked Attention Network
  • Visual-Semantic Embedding

1.6. CUHK-PEDES Dataset

  • 40,206 images of 13,003 persons from five existing person re-identification datasets (CUHK03, Market-1501, SSM, VIPeR, CUHK01)
  • 80,412 sentences for the 40,206 images (2 sentences/image), describing appearance, actions, poses and interactions with other objects
  • analysis of high-frequency words in the descriptions


1.6.1. User Study



  • Language vs. Attribute.
    language descriptions are much more precise and effective than attributes in describing persons (top-1: 58.7% vs 33.3%; top-5: 92.0% vs 74.7%)
  • Sentence Number and Length.
    3 sentences achieve the highest retrieval accuracy; the longer the sentences, the easier it is for users to retrieve the correct images
  • Word Types.
    nouns provide the most information, followed by adjectives, while verbs carry the least information



2. GNA-RNN




  • the key is to build word-image relations: given each word, search related regions to determine whether the word, together with its context, fits the image
  • the confidences of all relations should be weighted and then aggregated to generate the final sentence-image affinity

2.1. Visual Units

  • Input. images resized to 256x256
  • Output. 512 visual units
  • the visual CNN is pre-trained on the dataset for person classification based on person IDs
  • during joint training, only cls-fc1 and cls-fc2 are updated (see the sketch below)
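A minimal sketch of this branch in PyTorch, assuming a VGG-16 backbone; the pooling size and cls_fc1 width are assumptions, and ImageNet weights stand in for the person-ID pre-training described above:

```python
import torch
import torch.nn as nn
from torchvision import models

class VisualUnits(nn.Module):
    """Visual branch sketch: CNN features -> cls-fc1 -> cls-fc2 -> 512 visual units.
    The backbone would be pre-trained for person-ID classification (section 2.1);
    ImageNet weights here are a stand-in, and FC widths are assumptions."""
    def __init__(self, n_units=512):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.backbone = vgg.features                  # conv layers, kept frozen
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.cls_fc1 = nn.Linear(512 * 7 * 7, 4096)   # only these two FC layers
        self.cls_fc2 = nn.Linear(4096, n_units)       # are updated in joint training

    def forward(self, img):                           # img: (B, 3, 256, 256)
        with torch.no_grad():                         # frozen conv features
            x = self.pool(self.backbone(img))
        x = torch.relu(self.cls_fc1(x.flatten(1)))
        return self.cls_fc2(x)                        # (B, 512) visual units
```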

2.2. Attention over Visual Units

  • words are encoded as K-dimensional one-hot vectors, where K is the vocabulary size
  • each one-hot vector is embedded and concatenated with the image features
  • passing through LSTM, FCs and softmax generates unit-level attention at each word



  • at each word, the visual units are weighted by the unit-level attention and summed



  • the final sentence-image affinity is the summation over all T words (see the sketch below)
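A minimal sketch of the attention branch in PyTorch, assuming 512-dimensional embeddings and LSTM hidden states, and assuming the image features concatenated with each word are the 512 visual units themselves; the word-level gates of section 2.3 are omitted here:

```python
import torch
import torch.nn as nn

class UnitAttention(nn.Module):
    """Sketch of unit-level attention; sizes beyond the 512 visual units
    and the LSTM-FC-softmax pipeline are assumptions."""
    def __init__(self, vocab_size, embed_dim=512, hidden=512, n_units=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # embeds one-hot words
        self.lstm = nn.LSTM(embed_dim + n_units, hidden, batch_first=True)
        self.att_fc = nn.Linear(hidden, n_units)           # FC -> softmax over units

    def forward(self, words, v):
        # words: (B, T) word indices; v: (B, n_units) visual units
        w = self.embed(words)                              # (B, T, embed_dim)
        x = torch.cat([w, v.unsqueeze(1).expand(-1, w.size(1), -1)], dim=-1)
        h, _ = self.lstm(x)                                # (B, T, hidden)
        A = torch.softmax(self.att_fc(h), dim=-1)          # attention per word
        a_t = (A * v.unsqueeze(1)).sum(-1)                 # per-word affinity (B, T)
        return a_t.sum(-1)                                 # sentence-image affinity (B,)
```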



2.2.1. LSTM



  • the hidden state h is computed with a tanh nonlinearity: h_t = o_t * tanh(c_t)
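For reference, the standard LSTM update (the common formulation, not specific to this paper); the note above refers to the tanh in the last line:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```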

2.3. Word-Level Gates for Visual Units

  • different words carry significantly different amounts of information for obtaining the language-image affinity ("white" should be more important than "this")
  • unit-level attention cannot reflect such differences, since the softmax normalizes the attention to sum to 1 at every word
  • therefore, a word-level scalar gate is learned at each word (see the sketch below)
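A minimal sketch of the scalar gate, assuming it is computed from the LSTM hidden state at each word; gate_fc1/gate_fc2 follow the naming in section 2.5 (where gate-fc1's width differs from 512), while the hidden width 64 and the sigmoid output are assumptions:

```python
import torch
import torch.nn as nn

class WordGate(nn.Module):
    """Sketch of the word-level scalar gate; input choice, width 64
    and sigmoid output are assumptions."""
    def __init__(self, hidden=512, gate_hidden=64):
        super().__init__()
        self.gate_fc1 = nn.Linear(hidden, gate_hidden)
        self.gate_fc2 = nn.Linear(gate_hidden, 1)

    def forward(self, h):                    # h: (B, T, hidden) LSTM states
        g = torch.relu(self.gate_fc1(h))
        g = torch.sigmoid(self.gate_fc2(g))  # scalar gate in (0, 1) per word
        return g.squeeze(-1)                 # (B, T)
```

With this gate, each per-word affinity from section 2.2 is scaled by g_t before the summation over the T words.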


2.4. Loss Function
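The notes are empty here. A plausible reconstruction, consistent with the 1:3 positive-to-negative pair sampling in section 2.5, is binary cross-entropy over the sentence-image affinity a_n, with y_n = 1 for matched pairs:

```latex
E = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log a_n + (1 - y_n) \log (1 - a_n) \right]
```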



2.5. Details

  • trained with SGD
  • positive:negative sample ratio = 1:3
  • batch size 128
  • all FC layers have 512 units except gate-fc1
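A small sketch of how such a batch could be drawn; `dataset` and its two helper methods are hypothetical placeholders, not APIs from the paper:

```python
import random

def sample_batch(dataset, batch_size=128, neg_per_pos=3):
    """Sketch of 1:3 positive:negative pair sampling.
    `dataset` and its helper methods are hypothetical."""
    n_pos = batch_size // (1 + neg_per_pos)              # 32 positive pairs per batch
    batch = []
    for _ in range(n_pos):
        img, sent, pid = dataset.random_matched_pair()   # hypothetical helper
        batch.append((img, sent, 1))                     # matched pair, label 1
        for _ in range(neg_per_pos):
            neg = dataset.random_sentence_excluding(pid) # hypothetical helper
            batch.append((img, neg, 0))                  # unmatched pair, label 0
    random.shuffle(batch)
    return batch
```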



3. Experiments


3.1. Dataset

  • training set. 11,003 persons; 34,054 images; 68,108 sentence descriptions
  • testing set. 3,074 images of 1,000 persons
  • validation set. 3,078 images of 1,000 persons

3.2. Comparison



  • LSTM might have difficulty encoding complex sentences into a single feature vector
  • word-by-word processing and comparison might be more suitable for the person search problem
  • RNNs are more suitable for processing natural language data

3.3. Ablation Study



  • the initial training (pre-training the visual CNN for person-ID classification, section 2.1) strongly affects the final performance

3.4. The Number of Visual Units



  • more units might over-fit the dataset