Xiao T, Li S, Wang B, et al. Joint detection and identification feature learning for person search[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 3376-3385.
1. Overview
1.1. Motivation
- existing methods mainly focus on matching cropped pedestrian images between queries and candidates (assuming perfect detection)
This paper proposes a framework for person search:
- jointly handle pedestrian detection and person re-identification
- proposal net. focuses more on recall than on precision
- proposal misalignments can be further adjusted by the identification net
- Online Instance Matching (OIM) loss function
- collect and annotate a large-scale benchmark dataset
1.2. Comparison of Loss Functions
- pairwise or triplet loss. the number of candidate pairs/triplets grows as O(N^2), so an efficient sampling strategy is needed but difficult to design (see the sketch after this list)
- softmax. compares each sample against all classes at the same time; as the number of classes increases, training the big softmax classifier matrix becomes much slower or may even fail to converge
- OIM.
- compare samples of mini-batch with all registered entries
- unlabeled identities can serve as negatives for labeled identities
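A quick back-of-the-envelope illustration of the O(N^2) point above (the sample counts are made up, not from the paper):

```python
from math import comb

# Hypothetical dataset sizes, just to illustrate why pairwise/triplet losses are hard
# to scale: the number of candidate pairs grows roughly as N^2 / 2.
for n in (1_000, 10_000, 100_000):
    pairs = comb(n, 2)
    print(f"N={n:>7,}  candidate pairs={pairs:>16,}")
# Most of these pairs are easy negatives, so an effective hard-example sampling
# strategy is required, and such a strategy is difficult to design.
```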
1.3. Contribution
- joint optimization of pedestrian detection and person re-identification in a single network
- OIM loss function
- dataset
1.4. Related Work
1.4.1. Person Re-identification
- manually design discriminative features
- learn feature transforms across camera views
- learning distance metrics
- CNN
- triplet samples
- classify
- on abnormal images. low-resolution and partially occluded images
1.4.2. Pedestrian Detection
- hand-crafted features. DPM, ACF and Checkerboards
1.5. Dataset
- CUHK03
- Market-1501
- Duke
2. Method
2.1. Structure
- output.
2048-d pooled feature → L2-normalized 256-d id feature → cosine similarities between query and gallery features (see the sketch below)
2048-d pooled feature → proposal refinement, so that misaligned proposals can be further adjusted
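A minimal PyTorch sketch of the id-feature branch described above; only the 2048→256 projection with L2 normalization and cosine matching is from the paper, everything else (names, toy input shapes) is my assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdFeatureHead(nn.Module):
    """Project the 2048-d pooled feature to an L2-normalized 256-d id feature."""
    def __init__(self, in_dim: int = 2048, out_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_proposals, 2048) pooled features from the identification net
        return F.normalize(self.proj(feats), dim=1)   # unit L2-norm, 256-d

# With unit-norm features, cosine similarity reduces to a dot product.
head = IdFeatureHead()
query = head(torch.randn(1, 2048))        # one query person
gallery = head(torch.randn(100, 2048))    # detected gallery proposals
scores = gallery @ query.t()              # (100, 1) cosine similarities for ranking
```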
2.2. Online Instance Matching Loss
- only the labeled and unlabeled identities are considered, while the other proposals are left untouched
the lookup table (LUT). L: the size of the table (number of labeled identities); D: feature vector dimension
forward. compute cosine similarities between the mini-batch sample and all the labeled identities.
x. the features of a labeled identity inside a mini-batch
backward. if the target class-id is t, update the t-th column of the LUT with the new feature, and then scale it to unit L2-norm
many unlabeled identities can safely be used as negative classes for all the labeled identities; their features are stored in a circular queue. Q: the size of the queue
forward. cosine similarities are also computed between the mini-batch sample and all entries of the circular queue (a sketch of both memory structures follows below)
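A rough sketch of the two memory structures (LUT and circular queue), assuming PyTorch tensors; the variable names and the momentum value are my assumptions, not from the paper:

```python
import torch
import torch.nn.functional as F
from typing import Optional

D, L, Q = 256, 5000, 5000           # feature dim, #labeled identities, queue size
lut = torch.zeros(L, D)             # lookup table: one feature per labeled identity
queue = torch.zeros(Q, D)           # circular queue for unlabeled identities

def update_memory(x: torch.Tensor, t: Optional[int]) -> None:
    """Backward-pass bookkeeping for one L2-normalized proposal feature x of shape (D,)."""
    global queue
    if t is not None:
        # labeled identity: blend x into the t-th entry, then rescale to unit L2-norm
        momentum = 0.5                                        # assumed value
        lut[t] = F.normalize(momentum * lut[t] + (1 - momentum) * x, dim=0)
    else:
        # unlabeled identity: push x into the circular queue, popping the oldest entry
        queue = torch.cat([queue[1:], x.unsqueeze(0)], dim=0)
```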
The probability of x being recognized as the labeled identity with class-id i:

$$p_i = \frac{\exp(v_i^{\top} x / \tau)}{\sum_{j=1}^{L} \exp(v_j^{\top} x / \tau) + \sum_{k=1}^{Q} \exp(u_k^{\top} x / \tau)}$$

where $v_j$ denotes the j-th LUT entry and $u_k$ the k-th entry of the circular queue
- L. the number of different target people
- Q. the size of the circular queue that stores unlabeled people
- τ. higher temperature leads to softer probability distribution
The probability of x being recognized as the i-th unlabeled identity is defined with the same denominator:

$$q_i = \frac{\exp(u_i^{\top} x / \tau)}{\sum_{j=1}^{L} \exp(v_j^{\top} x / \tau) + \sum_{k=1}^{Q} \exp(u_k^{\top} x / \tau)}$$
Maximization. The OIM objective is to maximize the expected log-likelihood $\mathcal{L} = \mathbb{E}_x[\log p_t]$; its gradient w.r.t. $x$ is

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{1}{\tau}\Big[(1 - p_t)\,v_t - \sum_{j \neq t} p_j v_j - \sum_{k=1}^{Q} q_k u_k\Big]$$

so x is effectively compared against all labeled and unlabeled entries during training
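Putting $p_i$ and $q_i$ together, a simplified sketch of the OIM forward computation (not the authors' implementation; the gradient is left to autograd here instead of the hand-derived update above):

```python
import torch
import torch.nn.functional as F

def oim_loss(x: torch.Tensor, targets: torch.Tensor,
             lut: torch.Tensor, queue: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """
    x:       (B, D) L2-normalized features of labeled proposals in the mini-batch
    targets: (B,)   class-ids in [0, L)
    lut:     (L, D) labeled-identity features; queue: (Q, D) unlabeled features
    """
    sims = x @ torch.cat([lut, queue], dim=0).t()   # (B, L+Q) cosine similarities
    logits = sims / tau                             # temperature-scaled logits
    # Softmax over all L+Q entries gives exactly p_i for the labeled part, with the
    # Q unlabeled entries acting as extra negative classes; cross_entropy then
    # maximizes log p_t for the target identity t.
    return F.cross_entropy(logits, targets)
```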
2.3. Drawback of Softmax
- the classifier matrix suffers from large variance of gradients and cannot be learned effectively
- there is a large number of identities, each with only several instances, and each image contains only a few identities
- we need to learn more than 5,000 discriminant functions simultaneously, but during each SGD iteration we only have positive samples from tens of classes
- the unlabeled identities cannot be exploited with a softmax loss
- OIM is non-parametric.
- potential drawback. being non-parametric, it may overfit more easily; the authors find that projecting features into an L2-normalized low-dimensional subspace helps reduce overfitting
2.4. Scalability
- when the number of identities increases, computing OIM over all entries could become time-consuming
- approximate it by sub-sampling the labeled and unlabeled identities (a possible realization is sketched below)
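One possible realization of the sub-sampled approximation, reusing the PyTorch sketch above; the sampling scheme (keep the targets, add random labeled/unlabeled entries) is my guess at a reasonable choice, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def subsampled_oim_loss(x, targets, lut, queue, tau: float = 0.1, m: int = 100):
    """Approximate OIM by comparing against only ~m labeled and m unlabeled entries."""
    L = lut.size(0)
    # always keep the target identities, fill up with randomly chosen labeled ones
    keep = torch.unique(torch.cat([targets, torch.randperm(L)[:m]]))
    sub_queue = queue[torch.randperm(queue.size(0))[:m]]
    logits = x @ torch.cat([lut[keep], sub_queue], dim=0).t() / tau
    # remap the original class-ids to their positions inside the sub-sampled LUT
    remapped = (targets.unsqueeze(1) == keep.unsqueeze(0)).float().argmax(dim=1)
    return F.cross_entropy(logits, remapped)
```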
3. Dataset
3.1. Come From
- street snapshots taken with hand-held cameras
- movie snapshots
3.2. Processing
- pedestrians whose heights are smaller than 50 pixels are ignored
3.3. Evaluation
- no overlapping images or labeled identities between the training and test sets
3.4. Metrics
- cumulative matching characteristics (CMC top-K)
- mean average precision (mAP); a sketch of both metrics follows below
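A small numpy sketch of both metrics for a single query, given gallery similarity scores and binary match labels (the full protocol additionally checks overlap with the ground-truth box, which is omitted here):

```python
import numpy as np

def cmc_topk(scores: np.ndarray, matches: np.ndarray, k: int = 1) -> float:
    """CMC top-k for one query: 1 if a true match is among the k highest-scoring boxes."""
    order = np.argsort(-scores)
    return float(matches[order][:k].any())

def average_precision(scores: np.ndarray, matches: np.ndarray) -> float:
    """AP for one query; mAP is the mean of AP over all queries."""
    order = np.argsort(-scores)
    m = matches[order]
    if m.sum() == 0:
        return 0.0
    precision_at_hits = np.cumsum(m) / (np.arange(len(m)) + 1)
    return float((precision_at_hits * m).sum() / m.sum())

# toy usage with made-up similarities and labels
scores = np.array([0.9, 0.2, 0.75, 0.4])
matches = np.array([0, 0, 1, 1])    # 1 = gallery box shows the query identity
print(cmc_topk(scores, matches, k=1), average_precision(scores, matches))
```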
4. Experiments
4.1. Details
- τ. 0.1
- size of circular queue. 5,000
- mini-batch. 2 images
- learning rate. 0.001, decayed to 0.0001 after 40k iterations
4.2. Detection
4.3. Search
4.4. OIM
- converges faster
- consistently improves the test performance
4.5. Sub-sample of OIM
- using a smaller sub-sample size makes training converge faster
4.6. Low-dimensional Subspace
- projecting features into a proper low-dimensional subspace is important to regularize the network training
4.7. Detection Recall
- a higher detection recall does not necessarily lead to higher person search performance, since the re-id part can still be confused by false alarms
- we should not only focus on training re-id methods with manually cropped pedestrians, but should also consider the detections jointly under the person search problem setting
4.8. Gallery Size
- larger gallery, more difficult
- all methods may suffer from some common hard samples