
(CVPR 2018) Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification

Keyword [Spatiotemporal Attention]

Li S, Bak S, Carr P, et al. Diversity regularized spatiotemporal attention for video-based person re-identification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 369-378.



1. Overview


1.1. Motivation

  • most existing methods encode each video frame in its entirety and compute an aggregate representation across all frames
  • when a person is partially occluded, the remaining visible portions may still provide strong cues for re-identification
  • features generated directly from entire images can easily miss fine-grained visual cues

This paper proposes a spatiotemporal attention model:

  • multiple spatial attention models (alignment) + a diversity regularization term (Hellinger distance), so that the models do not all discover the same body part
    • align corresponding image patches across frames
    • determine whether a particular part of the body is occluded
  • temporal attention

  • automatically discovers a diverse set of distinctive body parts

  • extracts useful information from all frames without succumbing to occlusion and misalignment


1.2. Related Work

1.2.1. Image-Based Person Re-id

  • extracting discriminative features
  • learning robust metrics
    • Online Instance Matching Loss

1.2.2. Video-Based Person Re-id

(extension of image-based)

  • top-push distance
  • RNN
  • space-time

1.2.3. Attention Models for Person Re-id

  • avoid focusing on the same salient regions



2. Methods




2.1. Restricted Random Sampling

  • divide the video into N chunks of equal duration
  • randomly sample one image from each chunk
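The two steps above can be sketched as follows (function name and chunk-boundary handling are my own assumptions, not the paper's code):

```python
import random

def restricted_random_sampling(num_frames, n_chunks=6, seed=None):
    """Split a video of `num_frames` frames into `n_chunks` chunks of
    (approximately) equal duration and randomly pick one frame index
    from each chunk. Assumes num_frames >= n_chunks."""
    rng = random.Random(seed)
    chunk = num_frames / n_chunks
    indices = []
    for k in range(n_chunks):
        start = int(round(k * chunk))
        end = int(round((k + 1) * chunk))
        end = max(end, start + 1)  # guard against an empty chunk
        indices.append(rng.randrange(start, min(end, num_frames)))
    return indices
```

At test time the paper samples the first image of each chunk instead, which makes evaluation deterministic.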

2.2. Multiple Spatial Attention Models

focuses on body parts and accessories (hats, bags, …)

  • ResNet-50 backbone (1 conv layer + 4 ResBlocks), producing an 8×4 grid of spatial features


  1. L = 32: number of grid cells (8 × 4)
  2. D = 2048: feature dimension of each grid cell
  • s_{n,k,l}: attention weight of the k-th attention model on the l-th grid cell of the n-th frame



  • weighted receptive region: x_{n,k} = Σ_l s_{n,k,l} f_{n,l}



  • Enhance (appendix)
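A minimal sketch of the K spatial attention models over the grid, assuming a single linear projection per model (the paper actually uses a small conv subnetwork per model; all names here are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feats, W):
    """feats: (N, L, D) grid features for N frames; W: (K, D), one scoring
    vector per attention model (simplified stand-in for the paper's conv layers).
    Returns attention weights s (N, K, L) and region features x (N, K, D)."""
    logits = np.einsum('nld,kd->nkl', feats, W)
    s = softmax(logits, axis=-1)             # softmax over the L grid cells
    x = np.einsum('nkl,nld->nkd', s, feats)  # attention-weighted receptive regions
    return s, x
```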



2.3. Diversity Regularization

  • A_n: the collection of all K attention distributions for the n-th frame (a K × L matrix)

  1. K: the number of attention models
  2. L: the number of grid cells
  • Hellinger distance: maximize the distance between attention distributions, H(p, q) = (1/√2) ‖√p − √q‖₂



  • Regularization term: multiplied by a coefficient and added to the original OIM loss



  1. variant


2.4. Temporal Attention

Pooling features across time with a single per-frame weight (i.e., applying the same temporal attention weight to all regions of a frame) is not sufficiently robust, since some frames may still contain valuable partial information about the individual.

  • instead, learn a separate set of temporal weights across all frames for each attention region
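The per-region temporal weighting can be sketched as below (the scoring vectors `w` are an illustrative stand-in for the paper's conv + FC scoring network):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, w):
    """x: (N, K, D) per-frame region features; w: (K, D) scoring vectors.
    Each of the K regions gets its own softmax over the N frames, so an
    occluded region in one frame can be down-weighted without discarding
    the other regions of that frame."""
    logits = np.einsum('nkd,kd->kn', x, w)
    t = softmax(logits, axis=-1)           # (K, N), softmax over frames
    video = np.einsum('kn,nkd->kd', t, x)  # (K, D) per-region video feature
    return t, video
```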


2.5. Overview



  • the entire video is represented by a vector x ∈ ℝ^(K × D), the concatenation of the K region features


2.6. Re-id Loss

  • OIM
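A much-simplified sketch of the OIM (Online Instance Matching) loss, assuming only the labeled lookup table; the real loss also keeps a circular queue of unlabeled features and updates the LUT rows with a moving average after each step:

```python
import numpy as np

def oim_loss(x, labels, lut, temperature=0.1):
    """x: (B, D) L2-normalized features; labels: (B,) identity indices;
    lut: (C, D) lookup table with one stored prototype per identity.
    Classifies each feature against the prototypes with a temperature-scaled
    softmax and returns the mean cross-entropy."""
    logits = x @ lut.T / temperature  # similarity to each stored identity
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```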



3. Experiments


3.1. Details

  • N = 6
    • pretrain ResNet-50 on image-based re-identification datasets
    • with the CNN fixed, train the multiple spatial attention models (with diversity regularization)
    • finally, jointly fine-tune the whole network
  • SGD, learning rate 0.1 dropped to 0.01
  • final descriptor: 128-dimensional, L2-normalized

3.2. Ablation Study



3.2.1. Different Numbers of Spatial Attention Models



  • treating a person as a single region (K = 1) works better than splitting into two distinct body parts (K = 2)

3.3. Comparison