Mining Hard Samples Globally and Efficiently for Person Reidentification
TL;DR Summary
This study introduces a globally efficient hard sample mining system for person reidentification, utilizing a ranking list network with multiplet loss to effectively mine positive and negative samples from the entire training set, enhancing training efficiency and addressing the computational complexity that has limited global mining.
Abstract
Person reidentification (ReID) is an important application of Internet of Things (IoT). ReID recognizes pedestrians across camera views at different locations and time, which is usually treated as a ranking task. An essential part of this task is the hard sample mining. Technically, two strategies could be employed, i.e., global hard mining and local hard mining. For the former, hard samples are mined within the entire training set, while for the latter, it is done in mini-batches. In literature, most existing methods operate locally. Examples include batch-hard sample mining and semihard sample mining. The reason for the rare use of global hard mining is the high computational complexity. In this article, we argue that global mining helps to find harder samples that benefit model training. To this end, this article introduces a new system to: 1) efficiently mine hard samples (positive and negative) from the entire training set and 2) effectively use them in training. Specifically, a ranking list network coupled with a multiplet loss is proposed. On the one hand, the multiplet loss makes the ranking list progressively created to avoid the time-consuming initialization. On the other hand, the multiplet loss aims to make effective use of the hard and easy samples during training. In addition, the ranking list makes it possible to globally and effectively mine hard positive and negative samples. In the experiments, we explore the performance of the global and local sample mining methods, and the effects of the semihard, the hardest, and the randomly selected samples. Finally, we demonstrate the validity of our theories using various public data sets and achieve competitive results via a quantitative evaluation.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Mining Hard Samples Globally and Efficiently for Person Reidentification
1.2. Authors
- Hao Sheng: Member, IEEE. Affiliated with the State Key Laboratory of Software Development Environment, School of Computer Science and Engineering and Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China. Research interests include computer vision, pattern recognition, and machine learning.
- Yanwei Zheng: Corresponding author. Affiliated with the State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, China, and the School of Computer Science and Technology, Shandong University, Qingdao, China. Research interests include machine learning, computer vision, and person reidentification.
- Wei Ke: Member, IEEE. Affiliated with the School of Applied Sciences, Macao Polytechnic Institute, Macao SAR, China. Research interests include programming languages, image processing, computer graphics, and tool support for object-oriented and component-based engineering and systems.
- Dongxiao Yu: Affiliated with the School of Computer Science and Technology, Shandong University, Qingdao, China. Research interests include wireless networks, distributed computing, and graph algorithms.
- Xiuzhen Cheng: Fellow, IEEE. Affiliated with the School of Computer Science and Technology, Shandong University, Qingdao, China. Research interests include cyberphysical systems, wireless and mobile computing, sensor networking, wireless and mobile security, and algorithm design and analysis.
- Weifeng Lyu: Affiliated with the State Key Laboratory of Software Development Environment, School of Computer Science and Engineering and Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China. Research interests include massive information system and urban cognitive computing.
- Zhang Xiong: Affiliated with the School of Computer Science and Engineering, Beihang University, Beijing, China. Research interests include computer vision, information security, and data vitalization.
1.3. Journal/Conference
Published in IEEE Internet of Things Journal. The IEEE Internet of Things Journal is a highly reputable and influential publication in the fields of Internet of Things, computer science, and engineering, known for publishing high-quality research on various aspects of IoT technologies and applications.
1.4. Publication Year
2020
1.5. Abstract
Person reidentification (ReID) is a critical application in the Internet of Things (IoT) that aims to recognize pedestrians across different camera views and times, typically framed as a ranking task. A key challenge in ReID is hard sample mining. There are two main strategies: global hard mining, which considers the entire training set, and local hard mining, which operates on mini-batches. Most existing methods employ local hard mining (e.g., batch-hard, semi-hard sample mining) due to the high computational complexity of global hard mining. This article argues that global mining is superior for finding harder samples, which significantly benefits model training. To overcome the computational challenge, the paper proposes a new system called LoopNet. This system efficiently mines both hard positive and hard negative samples from the entire training set and effectively uses them in training. Specifically, LoopNet introduces a ranking list network coupled with a multiplet loss. The multiplet loss facilitates the progressive creation of the ranking list, avoiding time-consuming initialization, and effectively leverages both hard and easy samples during training. The ranking list itself enables global and effective mining of hard samples. Through extensive experiments, the paper explores the performance of global versus local sample mining methods, and the impact of semihard, hardest, and randomly selected samples. Quantitative evaluations on various public datasets demonstrate the validity of their theories and achieve competitive results.
1.6. Original Source Link
/files/papers/69533c0a0394820b7e465231/paper.pdf (This link points to a local file provided by the system, indicating it's the full content of the paper for analysis.)
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is Person Re-identification (ReID), which involves recognizing the same person across different cameras, locations, and times, often treated as a ranking task. This problem is highly important for various Internet of Things (IoT) applications, particularly in video surveillance, long-term cross-camera tracking, content-based image retrieval, and multicamera behavior analysis.
A critical component of effective ReID models, especially those based on metric learning, is hard sample mining. Hard samples are those that are difficult for the model to correctly classify or distinguish, and focusing on these samples during training can significantly improve model performance and generalization.
The paper identifies a significant challenge and gap in prior research:
- Computational Complexity of Global Mining: There are two main strategies for hard sample mining: local hard mining (within mini-batches) and global hard mining (across the entire training set). Most existing methods, such as batch-hard and semi-hard sample mining, primarily use local hard mining because global hard mining is computationally very expensive. Iterating through the entire dataset to find the hardest samples for every training step is generally considered impractical.
- Suboptimal Sample Selection: While local hard mining is more feasible, it might miss truly hard samples that exist outside the current mini-batch, potentially leading to suboptimal training and poorer generalization. The paper argues that global mining can find "harder samples" that are more beneficial for training.

The paper's entry point is to bridge this gap by proposing a novel system that makes global hard sample mining both efficient and effective, thereby leveraging its benefits for Person ReID.
2.2. Main Contributions / Findings
The paper's primary contributions are threefold:
- Proposed LoopNet with Ranking List for Global Hard Sample Mining: The authors introduce LoopNet, a listwise ranking network that maintains positive and negative ranking lists. These lists are sorted by distance (descending for positive, ascending for negative), ensuring that the hardest positive and hardest negative samples are always at the top. This mechanism enables global selection of hard samples, which is a significant improvement over local mini-batch mining in terms of sample quality. The ranking list is updated online, meaning it continuously adapts as the model learns, ensuring selected samples remain "hard."
- Introduced Multiplet Loss for Progressive Initialization and Effective Sample Usage: To address the computational burden of initializing a full ranking list, the paper proposes a multiplet loss. This loss function allows for progressive initialization of the ranking list by combining a few mined hard samples with randomly chosen ones in a mini-batch. As training progresses, more samples are added to the lists. The multiplet loss also incorporates a priority mechanism for samples, assigning greater importance (via adaptive margins) to harder samples, thereby making training more effective by focusing on challenging examples.
- Comprehensive Exploration of Sample Mining Methods: The paper conducts an extensive experimental analysis, comparing the performance of global versus local sample mining strategies. It also investigates the effects of different sample selection modes (semihard, hardest, and randomly selected samples) for both positive and negative samples. This exploration provides valuable insights into the optimal strategies for hard sample mining in ReID.
Key conclusions and findings reached by the paper are:
- Hard sample mining is generally effective in promoting ReID performance.
- For triplet loss, local sample mining within a mini-batch generally yields the best performance.
- When sample priority and effectiveness are considered (as in the multiplet loss), global hard sample mining significantly outperforms local mining methods. Specifically, global hardest sample mining combined with the multiplet loss achieves the best results.
- The computational overhead of the proposed global mining method is only marginally higher (2.2% to 2.9% more time) than local mining, making it practically feasible.
- The proposed method achieves competitive, and often state-of-the-art, results on various public ReID datasets (Market-1501, CUHK03, Duke).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner needs to grasp several fundamental concepts in computer vision and machine learning, particularly within the context of Person Re-identification (ReID).
3.1.1. Person Re-identification (ReID)
Person Re-identification (ReID) is the task of identifying the same person across different camera views, locations, and times. Imagine a surveillance system where a person walks from Camera A to Camera B. ReID aims to link the images of that person captured by Camera A with those captured by Camera B, even if their appearance changes due to varying illumination, pose, occlusion, or viewpoint. It's a challenging problem because people can look very different under these conditions, and different people might look similar (e.g., wearing similar clothes).
3.1.2. Ranking Task
The paper treats ReID as a ranking task. In this context, given a probe image (the person we want to find), the goal is to rank all gallery images (a database of images from other cameras) based on their similarity to the probe. The ideal outcome is for images of the same person as the probe to appear at the top of the ranked list, while images of different people appear lower down. This is different from a classification task (identifying a specific person from a predefined set) or verification task (determining if two images are of the same person).
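To make the ranking formulation concrete, here is a minimal sketch (the `rank_gallery` helper and the toy 2-D embeddings are illustrative, not from the paper) that orders gallery images by Euclidean distance to a probe:

```python
import numpy as np

def rank_gallery(probe, gallery):
    """Return gallery indices sorted by ascending Euclidean distance to the probe."""
    dists = np.linalg.norm(gallery - probe, axis=1)  # distance from probe to each gallery image
    return np.argsort(dists)

# Toy 2-D embeddings: gallery image 2 is nearest the probe, image 0 is farthest.
probe = np.array([0.0, 0.0])
gallery = np.array([[3.0, 4.0],   # distance 5.0
                    [1.0, 1.0],   # distance ~1.41
                    [0.1, 0.0]])  # distance 0.1
print(rank_gallery(probe, gallery))  # -> [2 1 0]
```

In a real ReID system the embeddings would come from a CNN, and images of the probe's identity should land at the front of this ordering.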
3.1.3. Hard Sample Mining
Hard sample mining is a crucial technique in machine learning, especially in tasks involving metric learning or object detection. During training, a model might encounter easy samples (those it already handles well) and hard samples (those it struggles with). Hard sample mining strategically selects these hard samples to prioritize during training. The intuition is that learning from challenging examples provides more valuable gradients and leads to a more robust and generalizable model.
- A hard positive sample for an anchor is a positive sample (same identity as the anchor) that is far from it in the feature space, making it difficult for the model to correctly group.
- A hard negative sample for an anchor is a negative sample (different identity from the anchor) that is close to it in the feature space, making it difficult for the model to correctly separate.
3.1.4. Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs) are a class of deep neural networks widely used for analyzing visual imagery. They are particularly effective for tasks like image classification, object detection, and feature extraction. In ReID, CNNs are typically used to learn robust, discriminative feature representations (embeddings) for person images. These features are then used to calculate distances or similarities between images. The paper uses ResNet50 and GoogLeNetv3 (Inception-v3) as baseline CNN architectures.
3.1.5. Metric Learning
Metric learning aims to learn a distance function or embedding space where similar samples are close together, and dissimilar samples are far apart. Triplet loss and quadruplet loss are popular metric learning objectives.
3.1.6. Triplet Loss
The triplet loss is a widely used loss function in metric learning, especially for tasks like face recognition and ReID. It operates on a triplet of samples: an anchor image ($a$), a positive image ($p$, same identity as $a$), and a negative image ($n$, different identity from $a$). The goal of triplet loss is to ensure that the distance between the anchor and positive is smaller than the distance between the anchor and negative by at least a certain margin ($\alpha$).
The standard triplet loss formula is:

$$L_{\mathrm{triplet}} = \big[ f(a, p) - f(a, n) + \alpha \big]_+$$

where:
- $a$: Anchor image.
- $p$: Positive image (same identity as $a$).
- $n$: Negative image (different identity from $a$).
- $f(\cdot, \cdot)$: A distance function (e.g., Euclidean distance) between the feature embeddings of two images.
- $\alpha$: A predefined margin that ensures a separation between positive and negative pairs.
- $[\cdot]_+ = \max(0, \cdot)$: The hinge loss function, meaning the loss is zero if the condition $f(a, p) + \alpha \leq f(a, n)$ is met; otherwise it is $f(a, p) - f(a, n) + \alpha$.
3.1.7. Quadruplet Loss
The quadruplet loss is an extension of the triplet loss that aims to further increase the inter-class variance while decreasing the intra-class variance. It uses a quadruplet of samples: an anchor ($a$), a positive ($p$), a negative ($n_1$), and a second negative ($n_2$, with an identity different from both $a$ and $n_1$). The quadruplet loss not only enforces the triplet constraint but also pushes the negative pair ($n_1$, $n_2$) farther apart than the positive pair ($a$, $p$). A common form of quadruplet loss includes two margins, $\alpha_1$ and $\alpha_2$:

$$L_{\mathrm{quad}} = \big[ f(a, p) - f(a, n_1) + \alpha_1 \big]_+ + \big[ f(a, p) - f(n_1, n_2) + \alpha_2 \big]_+$$

This encourages $a$ and $p$ to be closer, $a$ and $n_1$ to be farther apart, and $n_1$ and $n_2$ to be farther apart. This helps create a more discriminative feature space.
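A hedged illustration of the two-term quadruplet loss (the margin values 0.3 and 0.15 are arbitrary here; the second margin is chosen smaller because the negative-negative term acts as an auxiliary constraint):

```python
import numpy as np

def quadruplet_loss(a, p, n1, n2, alpha1=0.3, alpha2=0.15):
    """Two-term quadruplet loss on embedding vectors."""
    d = lambda x, y: np.linalg.norm(x - y)
    # Triplet term: positive pair closer than the anchor-negative pair.
    t1 = max(0.0, d(a, p) - d(a, n1) + alpha1)
    # Quadruplet term: positive pair closer than a negative-negative pair
    # that does not involve the anchor at all.
    t2 = max(0.0, d(a, p) - d(n1, n2) + alpha2)
    return t1 + t2

a, p = np.array([0.0, 0.0]), np.array([1.0, 0.0])
n1, n2 = np.array([3.0, 0.0]), np.array([3.0, 2.0])
print(quadruplet_loss(a, p, n1, n2))  # both constraints satisfied -> 0.0
```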
3.1.8. Mini-batch Training
In deep learning, instead of processing the entire dataset at once (which is computationally expensive and memory-intensive), training is typically done in mini-batches. A mini-batch is a small, randomly selected subset of the training data. The model computes gradients and updates its weights based on the samples in the current mini-batch. This process is repeated over many iterations or epochs until the model converges. Local hard mining methods operate within these mini-batches.
3.2. Previous Works
The paper categorizes previous ReID methods based on how many samples are considered in the loss function:
- Pointwise Approach: Treats ReID as a classification task. A CNN classifies images into person categories, and then features are extracted for similarity calculation and ranking. Examples include classification networks that learn ranking scores by combining classifier outputs [14]-[19].
- Pairwise Approach: Uses Siamese networks [20]-[24]. These networks take two images as input and output a similarity score or classify whether the pair belongs to the same or different pedestrians. The main focus is on effectively concatenating cross-corresponding pairs.
- Listwise Approach: Uses frameworks like triplet [7], [9], [11], [25], quadruplet [26], or DeepList [27]. These methods consider more than two samples to enforce relative ordering or distances.
Specific related works mentioned for Loss Function Improvement and Hard Sample Mining:
3.2.1. Loss Function Improvements (Listwise Focus)
- Triplet Loss Variations:
  - Ding et al. [9] focused on making the feature distance of matched pairs smaller than that of mismatched pairs.
  - Cheng et al. [7] designed a loss to make matched pairs closer than a predefined threshold and also closer than mismatched pairs.
  - Liu et al. [31] used comparative attention networks for part-based comparisons.
  - Wang et al. [11] combined single-image and cross-image representations and used a learned metric instead of Euclidean distance for triplet loss.
  - Zhou et al. [32] introduced a Point-to-Set (P2S) metric instead of Point-to-Point (P2P) distances, jointly minimizing intraclass and maximizing interclass distance.
- Quadruplet Loss: Chen et al. [26] designed quadruplet loss by adding a second mismatched image to the triplet to increase the distance from negative pairs, aiming for larger interclass variance and smaller intraclass variance. They also used a normalized 2-D output from a fully connected layer and Softmax as the distance metric.
- Listwise Loss: Wang et al. [27] developed a listwise loss function with an adaptive margin using ranking lists, assigning larger margins to harder negative samples. This used the Plackett-Luce permutation probability model [33] and likelihood loss from ListMLE [34], but was based on a single-shot assumption (only one gallery image matches the probe).
- Chen et al. [8] proposed a learning-to-rank loss to minimize costs for poor gallery rankings by accumulating similarity differences between positive and negative matches.
3.2.2. Hard Sample Mining
- Offline Methods: Ahmed et al. [22] initially trained with randomly selected negative samples, then used the model to select hard negative pairs for retraining, which is time-consuming.
- Online Methods (Mini-batch Based):
  - Batch Hard: Hermans et al. [29] proposed batch hard, where classes and images are randomly sampled for a batch, and the hardest positive and negative samples within that batch are selected to form triplets (illustrated in Figure 1(a)).
  - Semi-hard Samples: Schroff et al. [28] (FaceNet) selected negative samples that are farther from the anchor than the positive sample, yet still close enough to violate the margin. These are called semi-hard samples (illustrated in Figure 1(b)).
  - Threshold-based: Kong et al. [30] selected hard samples whose scores are higher than a threshold. Chen et al. [26] used a combined mean of positive and negative pair distances to set an adaptive threshold. Wang et al. [27] introduced an adaptive shifting parameter in listwise loss for larger margins on harder negative samples.
  - Wang and Gupta [35] randomly chose negative samples for initial epochs, then selected the top K negatives with the highest losses.
  - Xiao et al. [56] used margin sample mining loss, selecting the maximum positive pair distance and minimum negative pair distance in a batch.
- Set-based Methods (Reranking): Liu et al. [36] eliminated hard negative label matches based on reciprocal nearest neighbors. Zhong et al. [37] used k-reciprocal nearest neighbors to recall hard positive gallery images.
- Probability/Weight Updating: Ye et al. [38] used label reweighting to filter false positives and easy negatives. Zhou et al. [39] used local hard negative samples for tight constraints. Li et al. [20] updated selection probabilities of negative samples not chosen for a long time. Triantafyllidou et al. [40] progressively added harder positive samples based on model-produced difficulty scores. Dong et al. [41] started with easy samples and progressively selected more challenging ones as the model improved.
3.3. Technological Evolution
The field of Person ReID has evolved significantly, particularly with the advent of deep learning:
- Early Methods (Feature Engineering): Before deep learning, ReID relied heavily on hand-crafted features (e.g., color histograms, texture features) and traditional metric learning algorithms.
- Pointwise and Pairwise Deep Learning (Early CNNs): The introduction of CNNs allowed for end-to-end feature learning. Early deep ReID methods often adopted pointwise (classification-based) or pairwise (Siamese network-based) approaches to learn discriminative features.
- Listwise Deep Learning (Triplet/Quadruplet Loss): The shift to listwise approaches, notably triplet loss and its variants, marked a significant advancement. These methods directly optimized the ranking objective by considering relative distances between multiple samples, leading to more robust embeddings. Quadruplet loss further refined this by introducing additional negative constraints.
- Hard Sample Mining within Mini-batches: As triplet loss gained prominence, the importance of hard sample mining became evident. Techniques like batch hard and semi-hard sample mining emerged to efficiently select challenging examples within mini-batches, making triplet loss more effective and stable.
- This Paper's Position (Efficient Global Hard Mining): This paper builds upon the listwise approach and the concept of hard sample mining. It addresses the long-standing challenge of global hard sample mining by proposing a computationally efficient method (LoopNet with ranking lists) and an effective loss function (multiplet loss) that leverages these globally mined samples. This work makes global hard mining practical and broadly beneficial, pushing the boundary of listwise metric learning.
3.4. Differentiation Analysis
Compared to the main methods in related work, the LoopNet with multiplet loss offers several core differences and innovations:
- Global vs. Local Hard Sample Mining:
  - Prior Methods: Predominantly rely on local hard mining (e.g., batch hard [29], semi-hard in FaceNet [28]) within mini-batches. This is efficient but can miss truly hard samples that are not present in the current batch, potentially leading to suboptimal learning.
  - LoopNet Innovation: Proposes efficient global hard sample mining. It maintains dynamic ranking lists (one for positive, one for negative samples) that span the entire training set. These lists are continually updated, ensuring that the globally hardest samples are always accessible. This directly addresses the limitation of local mining by providing a wider and more accurate pool of hard samples.
- Progressive Initialization of Ranking Lists:
  - Prior Challenges: True global mining would typically require a full pass over the entire dataset to compute all distances and initialize ranking lists before training, which is computationally prohibitive.
  - LoopNet Innovation: Introduces a progressive initialization strategy for the ranking lists, enabled by the multiplet loss. Instead of a full pre-computation, the lists are gradually built and updated online during training, making global mining computationally feasible. This is a key enabler for the efficiency aspect.
- Multiplet Loss with Sample Priority:
  - Prior Loss Functions (Triplet, Quadruplet): While effective, they often treat all selected samples (within a triplet/quadruplet) equally in terms of their contribution to the loss, or use fixed margins.
  - LoopNet Innovation: The multiplet loss goes beyond triplet and quadruplet by considering multiple positive and negative samples for each anchor. Crucially, it incorporates adaptive margins ($\alpha_j$, $\beta_j$) that decrease with the rank of the sample in the ranking list (i.e., harder samples at the top get larger margins). This explicit priority mechanism allows harder samples to have a greater impact on the loss, leading to more effective training. The multiplet loss also includes a negative-negative separation component, similar to quadruplet loss, but extended.
- Loop Network Architecture:
  - Prior Architectures: Typically linear or branched, where samples are fed forward and loss is computed.
  - LoopNet Innovation: The network has a "loop" where the output (the ranking list from the ranking layer) feeds back as input to the sampling layer. This online, dynamic feedback mechanism is integral to the progressive initialization and continuous hard sample mining process, distinguishing it from static sample selection strategies.

In essence, this paper's core innovation lies in making the theoretically superior global hard sample mining practically viable and highly effective for Person ReID through the ingenious combination of a dynamic ranking list network and a prioritized multiplet loss.
4. Methodology
The LoopNet model proposed in this paper is designed to efficiently mine hard samples globally and effectively use them in training for Person Re-identification (ReID). It achieves this through a novel ranking list mechanism and a multiplet loss function, integrated into a cyclical network architecture.
4.1. Ranking List
The ranking list is the cornerstone of the LoopNet for global hard sample mining. The paper defines hard positive samples as those that are farther from the anchor (probe image) in the feature space, despite being of the same identity. Conversely, hard negative samples are those that are closer to the anchor but belong to a different identity. Given these differing definitions of "hardness," two separate ranking lists are maintained: one for positive samples and one for negative samples.
The positive ranking list for an anchor $p_i$ is sorted in descending order of distance, meaning the hardest positive samples (those farthest from $p_i$) will be at the top. The negative ranking list for an anchor $p_i$ is sorted in ascending order of distance, placing the hardest negative samples (those closest to $p_i$) at the top.
To formally define these lists, let's establish the notation:
- $P = \{p_1, p_2, \ldots\}$: A permutation of probe images.
- $G = \{g_1, g_2, \ldots\}$: A permutation of gallery images.
- $\ell(x)$: The identity (label) of image $x$.
- $f(x, y)$: The distance between the feature embeddings of image $x$ and image $y$.
4.1.1. Positive Ranking List
The positive ranking list for a probe image $p_i$ is denoted as $R_i^+$. It is defined as:

$$R_i^+ = \big( g_{\sigma_i^+(1)}, g_{\sigma_i^+(2)}, \ldots, g_{\sigma_i^+(u_i)} \big)$$

where:
- $i$: The index of the probe image in the selected probe permutation.
- $\sigma_i^+(k)$: The index of a gallery image in the selected gallery permutation that is at position $k$ in the positive ranking list for $p_i$. For example, if $\sigma_i^+(2) = 7$, then the 7th gallery image is the second sample in the list.
- $\ell(g_{\sigma_i^+(k)}) = \ell(p_i)$: This condition ensures that all gallery images in the list belong to the same identity as the probe image. This means they are positive samples.
- $1 \leq k \leq u_i$: $u_i$ is the total number of positive samples for $p_i$ in the list. The number of positive gallery images for a given probe is typically much smaller than the total number of gallery images.
- $f(p_i, g_{\sigma_i^+(k)}) \geq f(p_i, g_{\sigma_i^+(k+1)})$: This condition ensures that the positive ranking list is sorted in descending order of distance. Therefore, the samples with larger distances from $p_i$ appear earlier in the list, making $g_{\sigma_i^+(1)}$ the hardest positive sample.
4.1.2. Negative Ranking List
Similarly, the negative ranking list for a probe image $p_i$ is denoted as $R_i^-$. It is defined as:

$$R_i^- = \big( g_{\sigma_i^-(1)}, g_{\sigma_i^-(2)}, \ldots, g_{\sigma_i^-(v_i)} \big)$$

where:
- $\sigma_i^-(k)$: The index of a gallery image at position $k$ in the negative ranking list for $p_i$.
- $\ell(g_{\sigma_i^-(k)}) \neq \ell(p_i)$: This condition ensures that all gallery images in the list belong to a different identity from the probe image. These are negative samples.
- $1 \leq k \leq v_i$: $v_i$ is the total number of negative samples for $p_i$ in the list.
- $f(p_i, g_{\sigma_i^-(k)}) \leq f(p_i, g_{\sigma_i^-(k+1)})$: This condition ensures that the negative ranking list is sorted in ascending order of distance. Therefore, the samples with smaller distances from $p_i$ appear earlier in the list, making $g_{\sigma_i^-(1)}$ the hardest negative sample.

In both cases, the closer a sample is to the top of its respective list (i.e., the smaller its position $k$), the "harder" it is considered for the given probe image.
4.2. Multiplet Loss
Inspired by the quadruplet loss, the paper designs a multiplet loss to achieve a smaller intra-class variance (distances between same-identity samples) and a larger inter-class variance (distances between different-identity samples). This loss also considers the priority of samples for each image, giving harder samples more influence. Crucially, the multiplet loss supports the progressive initialization of the ranking list by allowing a mix of mined hard samples and randomly selected ones in a mini-batch.
For a single anchor (probe image) $p_i$ in a mini-batch, the multiplet loss selects $n$ positive images ($g_{i,1}^+, \ldots, g_{i,n}^+$) and $n$ negative images ($g_{i,1}^-, \ldots, g_{i,n}^-$) from the gallery set. These form a $(2n+1)$-tuple: $(p_i, g_{i,1}^+, \ldots, g_{i,n}^+, g_{i,1}^-, \ldots, g_{i,n}^-)$.
The multiplet loss for a single probe $p_i$ is defined as:

$$L_i = \sum_{j=1}^{n} \Big[ f\big(p_i, g_{i,j}^+\big) - f\big(p_i, g_{i,j}^-\big) + \alpha_j \Big]_+ + \sum_{j=1}^{n-1} \Big[ f\big(p_i, g_{i,j}^+\big) - f\big(g_{i,j}^-, g_{i,j+1}^-\big) + \beta_j \Big]_+$$

where:
- $p_i$: The probe image (anchor).
- $g_{i,j}^+$: The $j$-th positive gallery sample for $p_i$.
- $g_{i,j}^-$: The $j$-th negative gallery sample for $p_i$.
- $n$: The dimension of the multiplet loss, indicating how many positive and negative samples are considered for each anchor.
- $f(\cdot, \cdot)$: The distance function between features of two images. The paper uses Euclidean distance.
- $[\cdot]_+$: The hinge loss component, ensuring that only violations of the margin contribute to the loss.
- $\alpha_j$: The margin for the triplet-like comparison between $p_i$, $g_{i,j}^+$, and $g_{i,j}^-$.
- $\beta_j$: The margin for the quadruplet-like comparison, aiming to separate the $j$-th positive sample from consecutive negative samples ($g_{i,j}^-$ and $g_{i,j+1}^-$).

The loss function has two main parts:
- First Summation (Triplet-like Constraints): $\sum_{j=1}^{n} \big[ f(p_i, g_{i,j}^+) - f(p_i, g_{i,j}^-) + \alpha_j \big]_+$. This term ensures that for each $j$, the distance between the anchor $p_i$ and its $j$-th positive sample is smaller than the distance between $p_i$ and its $j$-th negative sample, by at least the margin $\alpha_j$. This is the standard triplet loss objective, extended to multiple positive and negative pairs. If $n = 1$, this term reduces to the triplet loss.
- Second Summation (Quadruplet-like Constraints): $\sum_{j=1}^{n-1} \big[ f(p_i, g_{i,j}^+) - f(g_{i,j}^-, g_{i,j+1}^-) + \beta_j \big]_+$. This term aims to increase the separation between consecutive negative samples ($g_{i,j}^-$ and $g_{i,j+1}^-$) relative to the anchor-positive distance. It encourages negative samples to be distinct from each other in the feature space, thereby expanding the inter-class variance. This is akin to the quadruplet loss objective, but generalized.
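Putting the two summations together, a direct NumPy transcription of the loss for one anchor (the margin values passed in below are illustrative; the paper derives them from its adaptive-margin rule):

```python
import numpy as np

def multiplet_loss(p, g_pos, g_neg, alphas, betas):
    """Multiplet loss for one anchor p.

    g_pos, g_neg: (n, dim) arrays ordered hardest-first;
    alphas: n triplet margins; betas: n-1 quadruplet margins."""
    n = len(g_pos)
    d = lambda x, y: np.linalg.norm(x - y)
    loss = 0.0
    for j in range(n):        # triplet-like part
        loss += max(0.0, d(p, g_pos[j]) - d(p, g_neg[j]) + alphas[j])
    for j in range(n - 1):    # quadruplet-like part
        loss += max(0.0, d(p, g_pos[j]) - d(g_neg[j], g_neg[j + 1]) + betas[j])
    return loss

p = np.array([0.0, 0.0])
g_pos = np.array([[1.0, 0.0], [0.5, 0.0]])  # hardest (farthest) positive first
g_neg = np.array([[2.0, 0.0], [2.0, 1.0]])  # hardest (closest) negative first
print(multiplet_loss(p, g_pos, g_neg, alphas=[0.5, 0.25], betas=[0.2]))
# -> 0.2 (only the quadruplet term is active in this toy configuration)
```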
4.2.1. Adaptive Margins
The margins $\alpha_j$ and $\beta_j$ are designed to be gradually decreasing, reflecting the priority of samples. Harder samples (those with smaller $j$, meaning they are higher in the ranking lists) are assigned larger margins, giving them more weight in the loss calculation. The adaptive margins are derived from two basic constant margins $\alpha$ and $\beta$ and decrease monotonically with the rank $j$:

$$\alpha_1 > \alpha_2 > \cdots > \alpha_n, \qquad \beta_1 > \beta_2 > \cdots > \beta_{n-1}$$

where:
- $\alpha$ and $\beta$: Two basic constant margins.
- $j$: The index of the positive/negative sample, corresponding to its "rank" within the selected samples. A smaller $j$ implies a harder sample. The condition $\beta < \alpha$ is set because the quadruplet item is considered a weaker auxiliary constraint compared to the primary triplet item. In the experiments, distances are normalized to [0, 1], and the two basic margins are set to fixed constants.
4.2.2. Ideal Sample Mining from Ranking Lists
Ideally, the $j$-th positive sample and the $j$-th negative sample for the multiplet loss would be directly selected from the top of their respective ranking lists:

$$g_{i,j}^+ = g_{\sigma_i^+(j)}, \qquad g_{i,j}^- = g_{\sigma_i^-(j)}$$

This means $g_{i,j}^+$ is the $j$-th hardest positive sample from the global list for $p_i$, and $g_{i,j}^-$ is the $j$-th hardest negative sample. However, as noted in the paper, this direct selection requires the ranking lists to be fully initialized, which is computationally expensive. The sampling layer (discussed in Section 4.5) addresses this with a progressive initialization strategy.
4.3. Backpropagation
To train the network using gradient descent, the partial derivatives of the multiplet loss with respect to the distance terms are required. Let $I\{\cdot\}$ be the indicator function, which equals 1 if the condition inside is true and 0 otherwise.
The partial derivative with respect to the anchor-positive distance $f(p_i, g_{i,j}^+)$ is:

$$\frac{\partial L_i}{\partial f\big(p_i, g_{i,j}^+\big)} = \sum_{j=1}^{n} I\big\{ f\big(p_i, g_{i,j}^+\big) - f\big(p_i, g_{i,j}^-\big) + \alpha_j > 0 \big\} + \sum_{j=1}^{n-1} I\big\{ f\big(p_i, g_{i,j}^+\big) - f\big(g_{i,j}^-, g_{i,j+1}^-\big) + \beta_j > 0 \big\}$$
Each indicator contributes 1 when its hinge term is active (i.e., greater than 0) and 0 otherwise, so the derivative counts the active triplet and quadruplet terms involving $f(p_i, g_{i,j}^+)$.
The partial derivative with respect to the anchor-negative distance is:

$$\frac{\partial L_i}{\partial f\big(p_i, g_{i,j}^-\big)} = -\sum_{j=1}^{n} I\big\{ f\big(p_i, g_{i,j}^+\big) - f\big(p_i, g_{i,j}^-\big) + \alpha_j > 0 \big\}$$

This derivative is -1 if the corresponding triplet hinge term is active, and 0 otherwise, pushing $f(p_i, g_{i,j}^-)$ to increase.
The partial derivative with respect to the negative-negative distance is:

$$\frac{\partial L_i}{\partial f\big(g_{i,j}^-, g_{i,j+1}^-\big)} = -\sum_{j=1}^{n-1} I\big\{ f\big(p_i, g_{i,j}^+\big) - f\big(g_{i,j}^-, g_{i,j+1}^-\big) + \beta_j > 0 \big\}$$

This derivative is -1 if the corresponding quadruplet hinge term is active, and 0 otherwise, pushing $f(g_{i,j}^-, g_{i,j+1}^-)$ to increase.
These derivatives are then used in backpropagation to update the network weights.
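The derivatives above can be written per distance element: each $f(p_i, g_{i,j}^+)$ appears in one triplet term and one quadruplet term, each $f(p_i, g_{i,j}^-)$ in one triplet term, and each $f(g_{i,j}^-, g_{i,j+1}^-)$ in one quadruplet term. A minimal sketch of these indicator-based subgradients, with the margins passed in as given arrays (function and argument names are illustrative):

```python
# Sketch of the multiplet-loss subgradients via indicator functions,
# written per distance element rather than in summed form.
def multiplet_grads(d_ap, d_an, d_nn, alphas, betas):
    n = len(d_ap)
    g_ap = [0.0] * n        # dL/df(p, g_j+)
    g_an = [0.0] * n        # dL/df(p, g_j-)
    g_nn = [0.0] * (n - 1)  # dL/df(g_j-, g_{j+1}-)
    for j in range(n):
        if d_ap[j] - d_an[j] + alphas[j] > 0:  # active triplet hinge
            g_ap[j] += 1.0
            g_an[j] -= 1.0
    for j in range(n - 1):
        if d_ap[j] - d_nn[j] + betas[j] > 0:   # active quadruplet hinge
            g_ap[j] += 1.0
            g_nn[j] -= 1.0
    return g_ap, g_an, g_nn
```

An anchor-positive distance that participates in both an active triplet and an active quadruplet term accumulates a gradient of 2, consistent with the two summations above.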
4.4. Pipeline of LoopNet
The LoopNet architecture is designed to integrate the ranking list and multiplet loss seamlessly. The pipeline involves a series of layers that process image data, extract features, compute distances, and update the ranking list.
The following figure (Figure 2 from the original paper) illustrates the LoopNet pipeline, depicting how sample mining and feature extraction are performed.
The figure is a schematic diagram of sample mining and feature extraction with LoopNet. It illustrates how samples are selected, how hard samples are mined, how distances are computed, and how samples are effectively used in training via the multiplet loss.
The LoopNet pipeline consists of the following steps:

1. Sampling Layer: The first layer. It simultaneously performs two crucial functions: it mines hard samples (positive and negative) from the existing ranking list (if available), and it randomly chooses samples from the shuffled samples in the dataset. It then outputs the image data and their corresponding person labels to the subsequent layers.
2. General Network: A standard convolutional neural network (CNN) architecture, such as GoogLeNetv3 (referred to as Inc in experiments) or ResNet50 (referred to as Res in experiments), is used here. This network processes the input image data to extract high-level visual features, outputting a feature map.
3. Fully Connected Layer 1: The feature map from the general network is passed through a fully connected (dense) layer with 384 output units. This layer transforms the feature map into a compact feature vector (embedding) for each input image.
4. Fully Connected Layer 2 (Classifier): The feature vectors from the first fully connected layer are fed into another fully connected layer, which acts as a classifier. This layer maps the features to the person classification space, effectively predicting the identity of the person.
5. Softmax Loss Layer: The output of the classifier (person classes) is used to compute the Softmax loss. This loss measures the error between the predicted classification and the actual person label, guiding the network to correctly classify identities.
6. Pairing Layer: The feature vectors from Fully Connected Layer 1 and the person labels are also sent to a pairing layer. This layer constructs pairs of images (e.g., probe-positive, probe-negative, negative-negative) and computes their Euclidean distances in the feature space, generating pair distances and pair labels.
7. Softmax Layer (Distance Normalization): The raw pair distances from the pairing layer are normalized to the interval [0, 1] using a Softmax layer. This ensures that distances are consistent and comparable for the loss calculation.
8. Ranking Layer: The normalized pair distances and person labels are used by the ranking layer to update and generate the ranking list. This layer sorts samples by distance and keeps the hardest samples at the top of the positive and negative lists. It also handles the progressive initialization and online updates of these lists.
9. Multiplet Loss Layer: This layer constructs multiplets from the normalized pair distances (from the Softmax layer) and measures the error against the pair labels using the multiplet loss function defined in Section 4.2.
10. Final Loss Layer: The Softmax loss (from step 5) and the multiplet loss (from step 9) are combined with equal weight ($\lambda = 0.5$) to form the final loss:

$$L = \lambda L_s + (1 - \lambda) L_m$$

where:

- $L$: the final loss used for training.
- $L_s$: the Softmax loss (for identity classification).
- $L_m$: the multiplet loss (for metric learning and hard sample mining).
- $\lambda$: a weighting parameter, set to 0.5.

A key aspect of LoopNet is its cyclical nature: the ranking list (an output of the ranking layer) is also an input to the sampling layer, forming a "loop" in the network. For backpropagation, however, this cycle is broken between the ranking list and the sampling layer. The sampling layer itself has no trainable parameters and only uses the order information from the ranking list; thus, it does not require error gradients.
The structure of the LoopNet is summarized in the following table:
The following are the results from Table I of the original paper:
| Layer | Input | Output |
|---|---|---|
| Sampling Layer | Ranking List | Image Data, Person Labels |
| General Network | Image Data | Feature Map |
| Fully Connected Layer 1 | Feature Map | Features |
| Fully Connected Layer 2 | Features | Person Classes |
| Softmax Loss Layer | Person Classes | Softmax Loss |
| Pairing Layer | Features, Person Labels | Pair Distances, Pair Labels |
| Softmax Layer | Pair Distances | Normalized Pair Distances |
| Ranking Layer | Normalized Pair Distances, Person Labels | Ranking List |
| Multiplet Loss Layer | Normalized Pair Distances | Multiplet Loss |
| Final Loss Layer | Softmax Loss, Multiplet Loss | Final Loss |
The input image size used in the experiments is fixed (height × width).
4.5. Sampling Layer
As mentioned, ideal hard sample mining (selecting samples directly from fully formed ranking lists) is computationally intensive. The sampling layer addresses this by implementing a gradual initializing and updating method for the ranking lists.
At any point during training, the positive and negative ranking lists contain some current number of images (they grow progressively). The multiplet loss requires $n$ positive and $n$ negative samples.
In each mini-batch:

1. Mining Hard Samples: Two random numbers are generated, denoting the numbers of hard positive and hard negative samples to be mined from the current ranking lists. Each number is bounded above by both the current length of the corresponding (possibly partially built) ranking list and the number of samples ($n$) required by the multiplet loss, so the layer never attempts to mine more hard samples than are available or needed.
2. Randomly Choosing Non-Hard Samples: The remaining positive and negative samples are chosen randomly from the overall gallery set for the current mini-batch.
3. Progressive Initialization: Newly selected samples (those randomly chosen and not yet in the ranking list) are added to the appropriate list in the ranking layer. This allows the ranking lists to grow and be progressively initialized over time, avoiding a full, time-consuming pre-calculation.

The gallery samples $g_{i,j}^+$ (positive) and $g_{i,j}^-$ (negative) for the $i$-th probe are selected as follows.

For positive samples:

- $g_{i,j}^+$ is the $j$-th hard positive sample from the ranking list when $j$ does not exceed the number of mined hard positives.
- Otherwise, $g_{i,j}^+$ is a randomly selected positive sample from the gallery, with a label check ensuring the random sample is indeed positive.
- An exclusion condition ensures that randomly selected samples (and already mined hard samples) are unique within the multiplet for the current probe.

For negative samples:

- $g_{i,j}^-$ is the $j$-th hard negative sample from the ranking list when $j$ does not exceed the number of mined hard negatives.
- Otherwise, $g_{i,j}^-$ is a randomly selected negative sample, with a label check ensuring the random sample is negative.
- Exclusion conditions ensure the uniqueness of negative samples within the multiplet, and also that they are of different identities.

A special case is handled when the number of positive gallery images is less than the dimension $n$ required by the multiplet loss. In this scenario, the hardest positive sample is repeated as many times as needed to fill the remaining slots; if no ranking list has been initialized yet, more random samples are drawn instead.
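A minimal sketch of the positive-sample side of this logic, under stated assumptions: `select_positives` and its arguments are illustrative names rather than the paper's API, the number of mined hard samples is drawn uniformly at random, and gallery samples are referenced by integer index.

```python
import random

# Sketch of the sampling layer's progressive mining for one probe:
# mine up to n hard positives from the current (possibly partial)
# ranking list, fill the rest randomly with uniqueness checks, and
# repeat the hardest positive when too few distinct positives exist.
# All names here are ILLUSTRATIVE, not the paper's implementation.
def select_positives(ranking_list_pos, gallery_pos, n, rng=random):
    """ranking_list_pos: hardest-first list of positive gallery indices.
    gallery_pos: all positive gallery indices for this probe.
    n: number of positives required by the multiplet loss."""
    # mine at most min(n, current list length) hard samples
    k = rng.randint(0, min(n, len(ranking_list_pos)))
    chosen = list(ranking_list_pos[:k])
    if len(gallery_pos) < n and ranking_list_pos:
        # too few distinct positives: repeat the hardest positive
        while len(chosen) < n:
            chosen.append(ranking_list_pos[0])
        return chosen
    # fill remaining slots randomly, keeping the multiplet unique
    pool = [g for g in gallery_pos if g not in chosen]
    while len(chosen) < n and pool:
        pick = rng.choice(pool)
        chosen.append(pick)
        pool.remove(pick)
    return chosen
```

The negative-sample side would follow the same pattern, with an additional identity check against the probe's label.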
4.6. Pair Images
The pairing layer (refer to step 6 in Section 4.4) is responsible for structuring the input samples for the multiplet loss calculation.
The following figure (Figure 3 from the original paper) illustrates the organization of samples in a mini-batch, how they are paired, and the resulting triplets and quadruplets for the multiplet loss.
The figure is a diagram showing the organization of positive and negative samples for person reidentification. It contains three parts: (a) the positive and negative samples for a single anchor; (b) the arrangement combining multiple anchors; and (c) the final sample arrangement.
As shown in Figure 3(a), the sampling layer organizes samples in a mini-batch such that an anchor image is followed by its $n$ positive samples and then its $n$ negative samples.
Before the multiplet loss can be computed, the pairing layer constructs the necessary image pairs and calculates their distances. Specifically, it computes:

- $f(p_i, g_{i,j}^+)$: the distance between the probe and each positive sample.
- $f(p_i, g_{i,j}^-)$: the distance between the probe and each negative sample.
- $f(g_{i,j}^-, g_{i,j+1}^-)$: the distance between consecutive negative samples (for the quadruplet-like terms).

The Euclidean distance is used as the distance function $f(\cdot, \cdot)$.
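These three groups of distances can be sketched with numpy (the function and argument names are illustrative, not the paper's API):

```python
import numpy as np

# Sketch of the pairing layer: given the probe embedding and its n
# positive and n negative gallery embeddings, compute the three groups
# of Euclidean distances consumed by the multiplet loss.
def pair_distances(probe, pos, neg):
    """probe: (d,) array; pos, neg: (n, d) arrays of embeddings."""
    d_ap = np.linalg.norm(pos - probe, axis=1)           # f(p, g_j+)
    d_an = np.linalg.norm(neg - probe, axis=1)           # f(p, g_j-)
    d_nn = np.linalg.norm(neg[1:] - neg[:-1], axis=1)    # f(g_j-, g_{j+1}-)
    return d_ap, d_an, d_nn
```

In LoopNet these raw distances would then be normalized to [0, 1] by the subsequent Softmax layer before entering the loss.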
For backpropagation, suppose the pairing layer is the $l$-th layer and the error terms (gradients) arriving from the multiplet loss layer correspond to the distance terms above. The error terms for the inputs of this layer follow from the chain rule:

- The error term for the probe image $p_i$ at layer $l$ sums the gradients received from all positive and negative pairs involving $p_i$.
- The error term for the $j$-th positive gallery sample at layer $l$ comes from its participation in the anchor-positive distance term $f(p_i, g_{i,j}^+)$.
- The error term for the $j$-th negative gallery sample at layer $l$ can come from three sources: its anchor-negative pair, its pair with $g_{i,j+1}^-$ (if it is not the last negative sample), and its pair with $g_{i,j-1}^-$ (if it is not the first negative sample). Indicator functions select which of these sources are present.
4.7. Rank Samples
The ranking layer (step 7 in Section 4.4) is responsible for maintaining and updating the positive and negative ranking lists during training. This is where the progressive initialization and online updating of the lists occur.
After each iteration, once distances for the current mini-batch have been computed:

1. Update Existing Samples: For samples already present in the ranking lists, their positions are updated based on their newly computed distances.
2. Append New Samples: Samples that were randomly chosen for the mini-batch but are not yet in the ranking lists are appended to the appropriate list. This ensures that even "easy" or previously unconsidered samples eventually get a chance to enter the global pool, potentially becoming hard samples later.
3. Sorting: Both the positive and negative ranking lists are then sorted. The positive list is sorted in descending order of distance (hardest at the top), and the negative list in ascending order of distance (hardest at the top). The paper uses a Bitonic sorting method [44] for this purpose.
4. Memory Management: To bound memory usage, especially for the typically much larger negative ranking list, the length of the lists is limited. After sorting, any samples beyond the length limit are deleted, so a newly found harder sample can displace an older, easier sample that falls out of the list's capacity.

This dynamic process, in which randomly chosen samples ensure coverage and the ranking list continuously adapts to changes in the feature space, is key to making global hard sample mining efficient and effective within LoopNet.
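The update cycle above can be sketched as follows, assuming (for illustration only) that each list is kept as an id-to-distance map and using Python's built-in sort in place of the paper's Bitonic sort:

```python
# Sketch of the ranking layer's per-iteration update for one probe's
# lists: refresh distances for known samples, append new ones, re-sort
# so the hardest samples sit at the top, and truncate to a length cap.
# Data layout and names are ASSUMPTIONS for illustration.
def update_ranking_lists(pos_list, neg_list, new_dists, max_len=100):
    """pos_list / neg_list: dicts {gallery_id: distance} (mutated in place).
    new_dists: iterable of (gallery_id, is_positive, distance) pairs
    computed in the current mini-batch."""
    for gid, is_pos, d in new_dists:
        (pos_list if is_pos else neg_list)[gid] = d  # update or append
    # hardest positive = largest distance; hardest negative = smallest
    pos_sorted = sorted(pos_list.items(), key=lambda kv: -kv[1])[:max_len]
    neg_sorted = sorted(neg_list.items(), key=lambda kv: kv[1])[:max_len]
    return dict(pos_sorted), dict(neg_sorted)
```

Truncating after the sort is what implements the memory cap: a sample that drops below rank `max_len` is forgotten until it is randomly drawn again.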
5. Experimental Setup
5.1. Datasets
The proposed method is evaluated on three widely used public Person Re-identification (ReID) datasets.
5.1.1. Market-1501
- Source: [45]
- Scale & Characteristics: Contains 32,688 bounding boxes of 1,501 unique identities (IDs). These bounding boxes were produced by a deformable part model (DPM). Each person is captured by 2-6 non-overlapping cameras. The dataset also includes 2,798 distractors (false detections by DPM) and 3,819 junk images (images with no influence on ReID accuracy) in the test set.
- Splits: 751 IDs for training/validation, 750 IDs for testing (probe and gallery).
- Why chosen: It's a large-scale and challenging dataset with significant variations due to multiple cameras, distractors, and a large number of identities, making it a standard benchmark for ReID.
5.1.2. CUHK03
- Source: [20]
- Scale & Characteristics: One of the largest early ReID datasets, featuring 1,467 IDs captured from five different pairs of cameras on a campus. The dataset provides both detected bounding boxes and manually labeled bounding boxes. The authors use both for training, with an average of 4.8 detected and manually labeled bounding boxes per view.
- Splits: 1,367 IDs for training/validation, 100 IDs for testing (probe and gallery).
- Why chosen: Represents an academic campus environment with multiple camera views and offers a comparison between automatically detected and manually annotated bounding boxes, reflecting real-world complexities.
5.1.3. DukeMTMC-ReID (Duke)
- Source: [46], a subset of DukeMTMC [47].
- Scale & Characteristics: Derived from 85-minute high-resolution videos recorded by eight different cameras. It includes 1,404 IDs appearing in more than two cameras and 408 IDs appearing in only one camera.
- Splits: 702 IDs are randomly selected for the training set, and the remaining 702 IDs form the testing set. In the test set, one query image per ID per camera is selected for the probe set, and the remaining images are used for the gallery. This results in 16,522 training images (702 IDs), 2,228 query images (702 IDs), and 17,661 gallery images.
- Why chosen: A large-scale, high-resolution, and complex dataset originating from multiple synchronized cameras, making it a robust testbed for ReID algorithms.
The following are the results from Table II of the original paper:
| Dataset | #cameras | IDs total | IDs train & val | IDs test probe | IDs test gallery | images train & val | images test probe | images test gallery |
|---|---|---|---|---|---|---|---|---|
| Market-1501 | 6 | 1,501+2 | 751 | 750 | 750+2 | 12,936 | 3,368 | 19,732 |
| CUHK03 | 10 | 1,467 | 1,367 | 100 | 100 | 26,253 | 952 | 988 |
| Duke | 8 | 1,404+408 | 702 | 702 | 702+408 | 16,522 | 2,228 | 17,661 |
| Total | — | 4,782 | 2,820 | 1,552 | 1,962 | 55,711 | 6,548 | 38,381 |
5.2. Evaluation Metrics
The paper employs two commonly used metrics to evaluate the performance of ReID methods: Single Query (SQ) accuracy and Mean Average Precision (mAP). Rank-k accuracy is also implicitly evaluated.
5.2.1. Single Query (SQ) Accuracy / Rank-k Accuracy
- Conceptual Definition: Single Query accuracy (often referred to as Rank-1 accuracy) measures whether the first retrieved gallery image (the one deemed most similar to the probe) is a correct match for the probe's identity. Rank-k accuracy generalizes this by checking whether any correct match is found within the top $k$ retrieved images. In ReID, for SQ, if multiple ground-truth matches exist in the gallery, only the first correct match counts. This metric focuses on how well the model can place a true positive at a very high rank.
- Mathematical Formula:
Let $Q$ be the set of query images and, for each query $q$, let $G_q$ be the ranked list of gallery images. The Rank-k accuracy is defined as:

$$\text{Rank-}k = \frac{1}{|Q|} \sum_{q \in Q} I\big\{ \exists\, g \in G_q : y_g = y_q \ \text{and} \ \text{rank}_q(g) \le k \big\}$$

Rank-1 accuracy is the special case $k = 1$.
- Symbol Explanation:
  - $|Q|$: the total number of query images.
  - $I\{\cdot\}$: the indicator function, which is 1 if the condition inside is true, and 0 otherwise.
  - $y_g$: the identity of gallery image $g$.
  - $y_q$: the identity of query image $q$.
  - $\text{rank}_q(g)$: the position (rank) of gallery image $g$ in the sorted list $G_q$.
  - The condition ensures that at least one gallery image with the same identity as the query is found within the top $k$ ranks.
5.2.2. Mean Average Precision (mAP)
- Conceptual Definition: Mean Average Precision (mAP) provides a more comprehensive evaluation than Rank-k accuracy, especially when there are multiple ground-truth matches for a single query in the gallery. mAP considers both the precision (how many retrieved items are relevant) and the recall (how many relevant items are retrieved) of an algorithm across different ranks. It is the mean of the Average Precision (AP) scores over all queries. AP for a single query averages the precision values at each rank where a relevant item is retrieved.
- Mathematical Formula:
For a single query $q$, Average Precision (AP) is calculated as:

$$AP(q) = \sum_{k=1}^{N} P(k)\, \Delta r(k)$$

where:
- $P(k)$: the precision at cutoff $k$ in the ranked list.
- $\Delta r(k)$: the change in recall from $k-1$ to $k$. This term is 0 if the $k$-th item is irrelevant, and $1/N_{gt}$ (with $N_{gt}$ the number of ground-truth matches) if it is relevant.

mAP is then the average of AP over all queries:

$$mAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)$$

- Symbol Explanation:
  - $N$: the total number of gallery images.
  - $P(k)$: precision at rank $k$, defined as the number of relevant items in the top $k$ divided by $k$.
  - $r(k)$: recall at rank $k$, defined as the number of relevant items in the top $k$ divided by the total number of relevant items.
  - $P(k)\,\Delta r(k)$: precision is only accumulated at ranks where a new relevant item is found.
  - $|Q|$: the total number of queries.
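The two metrics can be sketched as minimal reference implementations, assuming each query's gallery ranking is given as a list of identity labels (these helper names are illustrative):

```python
# Minimal reference implementations of Rank-k accuracy and mAP for
# ranked retrieval lists of identity labels.
def rank_k_accuracy(ranked_labels_per_query, query_labels, k):
    """Fraction of queries with a correct match in the top k positions."""
    hits = sum(
        1 for ranked, q in zip(ranked_labels_per_query, query_labels)
        if q in ranked[:k]
    )
    return hits / len(query_labels)

def mean_average_precision(ranked_labels_per_query, query_labels):
    """Mean of per-query AP = sum of precision at each relevant rank / #relevant."""
    aps = []
    for ranked, q in zip(ranked_labels_per_query, query_labels):
        n_rel = ranked.count(q)
        if n_rel == 0:
            aps.append(0.0)
            continue
        hits, ap = 0, 0.0
        for idx, label in enumerate(ranked, start=1):
            if label == q:
                hits += 1
                ap += (hits / idx) / n_rel  # P(k) * delta r(k)
        aps.append(ap)
    return sum(aps) / len(aps)
```

For a query of identity 0 with ranking `[0, 1, 0, 1]`, Rank-1 is 1.0 and AP is (1/1 + 2/3)/2 = 5/6, illustrating how mAP rewards placing all matches high, not just the first.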
5.3. Baselines
The paper compares its LoopNet method against different configurations of common backbone networks and various hard sample mining strategies.
5.3.1. Backbone Networks
- ResNet50 [43]: a deep residual network architecture, known for its ability to train very deep models by using skip connections (residual blocks) to alleviate the vanishing gradient problem. It is a powerful feature extractor widely used in computer vision. Referred to as Res in experiments.
- GoogLeNetv3 [42] (Inception-v3): an Inception architecture that uses inception modules (parallel convolutional layers with different filter sizes) to capture multi-scale features while optimizing computational efficiency. Referred to as Inc in experiments.
5.3.2. Sample Mining Strategies
The paper introduces a three-symbol sequence to denote the experimental settings for sample mining:

- First Symbol (Mining Range):
  - L: local mining in a mini-batch. Hard samples are selected only from the current batch of images.
  - G: global mining using the ranking list. Hard samples are selected from the entire training set via the maintained ranking lists.
- Second Symbol (Positive Sample Mining Mode):
  - R: randomly choose positive samples.
  - S: choose semi-hard positive samples (though semi-hard mining usually applies to negatives, the paper explores its effect for positives too).
  - H: choose the top hardest positive samples (e.g., from the ranking list or batch).
- Third Symbol (Negative Sample Mining Mode):
  - R: randomly choose negative samples.
  - S: choose semi-hard negative samples. Following FaceNet [28], a semi-hard negative is farther from the anchor than the positive sample but still lies within the margin, i.e., $f(a, p) < f(a, n) < f(a, p) + \alpha$.
  - H: choose the top hardest negative samples (e.g., from the ranking list or batch).

Examples of notations:

- LRS: local mining (in mini-batch), randomly chosen positive samples, semi-hard negative samples. This is similar to the approach in FaceNet [28] if all positive samples are used.
- GHS: global mining (from ranking list), top hardest positive samples, semi-hard negative samples.
- *RR: randomly chosen positive samples and randomly chosen negative samples. The * indicates that for purely random selection there is no distinction between local and global mining range.

The models are implemented using CAFFE [48] and trained for 120,000 iterations on each dataset.
6. Results & Analysis
The experimental results explore the performance of different sample mining ranges (local vs. global) and modes (random, semihard, hardest) for both the triplet loss (equivalent to the multiplet loss with dimension $n = 1$) and the proposed multiplet loss. The evaluation uses mAP and SQ (Rank-1) accuracy on the Market-1501 dataset for detailed comparisons, and then Rank-k and mAP across all three datasets for state-of-the-art comparisons.
6.1. Core Results Analysis
6.1.1. Comparison with Baseline (Market-1501)
The following figures and tables show the performance comparisons on the Market-1501 dataset using Inception-v3 and ResNet50 as backbones.
The following figure (Figure 4 from the original paper) displays the mAP and SQ accuracy of the Inception-v3 model on the Market-1501 dataset, showing performance variations across different multiplet dimensions and sample mining strategies.
The figure is a chart showing the mAP and SQ accuracy of the Inception-v3 model across different multiplet dimensions. Panel (a) shows mAP and panel (b) shows SQ; the vertical axis is accuracy (%) and the horizontal axis is the dimension of the multiplet. The curves show how the various methods' performance varies with dimension.
The following are the results from Table III of the original paper:
| Mode | mAP Triplet | mAP Best Multiplet | SQ Triplet | SQ Best Multiplet |
|---|---|---|---|---|
| *RR | 58.76 | 68.26 | 78.50 | — |
| LRS | 64.52 | 66.87 | 82.24 | — |
| LRH | 68.60 | 68.60 | 84.74 | 84.44 |
| LHS | 61.07 | 68.14 | 79.45 | 85.12 |
| LHH | 67.09 | 67.09 | 83.46 | 83.46 |
| GRS | 62.56 | 67.36 | 80.67 | 83.73 |
| GRH | 62.54 | 67.07 | 80.20 | — |
| GHS | 51.48 | 59.38 | 74.05 | 78.89 |
| GHH | 63.25 | 70.06 | 81.21 | 85.75 |
Triplet Loss (Inception-v3, $n = 1$):

- The *RR method (random positive, random negative) performs poorly, indicating that hard sample mining is indeed effective.
- LRH (local, random positive, hardest negative) achieves the highest performance (mAP 68.60, SQ 84.74). This suggests that for triplet loss, focusing on the hardest negative samples within a mini-batch is very effective.
- Generally, local mining methods (LRS, LRH, LHS, LHH) outperform global mining methods (GRS, GRH, GHS, GHH) when using triplet loss.
- Mining the hardest negative samples (e.g., LRH, LHH) is more effective than semi-hard negative samples (e.g., LRS, LHS) for triplet loss in ReID, which contrasts with findings in some face recognition studies.
Multiplet Loss (Inception-v3):

- The multiplet loss consistently improves performance over triplet loss across most mining modes. For instance, *RR jumps from 58.76 mAP (triplet) to 68.26 mAP (multiplet).
- GHH (global, hardest positive, hardest negative) emerges as the top performer for multiplet loss (mAP 70.06, SQ 85.75). This is a crucial finding: with the multiplet loss, global hard sample mining surpasses local mining, suggesting that the multiplet loss effectively leverages the higher quality of globally mined hard samples through its sample-priority mechanism.

The following figure (Figure 5 from the original paper) compares the mAP and SQ performance of the ResNet50 baseline on the Market-1501 dataset, illustrating the accuracy of the various methods across different multiplet dimensions.
The figure is a chart comparing the mAP and SQ performance of the ResNet50 baseline on the Market-1501 dataset. Panel (a) shows mAP across multiplet dimensions, and panel (b) shows SQ. Each curve represents a different experimental configuration and its accuracy at each dimension.
The following are the results from Table IV of the original paper:
| Mode | mAP Triplet | mAP Best Multiplet | SQ Triplet | SQ Best Multiplet |
|---|---|---|---|---|
| *RR | 38.63 | 42.03 | 56.56 | 62.32 |
| LRS | 53.37 | 53.37 | 74.23 | 74.23 |
| LRH | 52.35 | 54.33 | 72.65 | 74.67 |
| LHS | 50.23 | 50.23 | 71.05 | 71.05 |
| LHH | 54.85 | 54.85 | 74.85 | 74.85 |
| GRS | 39.66 | 53.14 | 59.47 | 72.68 |
| GRH | 38.10 | 41.78 | 57.75 | 59.53 |
| GHS | 34.28 | 46.55 | 57.07 | 69.60 |
| GHH | 41.31 | 57.56 | 60.75 | 75.92 |
Triplet Loss (ResNet50, $n = 1$):

- Similar to Inception-v3, *RR performs poorly, validating hard sample mining.
- LHH (local, hardest positive, hardest negative) achieves the best triplet loss performance (mAP 54.85, SQ 74.85). Again, local mining outperforms global mining for triplet loss.
- The effectiveness of mining the hardest samples (positive and negative) is evident over other choices.
Multiplet Loss (ResNet50):
- Again, the multiplet loss provides significant improvements over triplet loss across almost all configurations.
- GHH (global, hardest positive, hardest negative) achieves the highest performance with multiplet loss (mAP 57.56, SQ 75.92). This reinforces the finding that global hard sample mining combined with the multiplet loss is the most effective strategy.
Time Consumption:
- The paper reports that global mining adds minimal overhead: for 120,000 iterations, ResNet50 global mining takes 14.4 hours vs. 14 hours for local mining (2.9% more), and GoogLeNetv3 global mining takes 18.4 hours vs. 18 hours (2.2% more). This demonstrates that the progressive initialization and online updating strategy for global mining is computationally efficient and practical.
6.1.2. Comparison with the State of the Art
The proposed method (GHH with multiplet loss) is compared against 13 other state-of-the-art methods on Market-1501, CUHK03, and Duke datasets using Rank-1, Rank-5, Rank-10 (for SQ) and mAP.
The following are the results from Table V of the original paper:
| Method | Ref. | Market-1501 | CUHK03 | Duke | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rank-1 | Rank-5 | Rank-10 | mAP | Rank-1 | Rank-5 | Rank-10 | mAP | Rank-1 | Rank-5 | Rank-10 | mAP | ||
| LOMO+XQDA[52] | CVPR15 | 43.80 | 22.20 | 52.00 | 30.80 | 17.00 | |||||||
| APR[51] | arXiv17 | 84.29 | 93.20 | 95.19 | 64.67 | 70.69 | 51.88 | ||||||
| MSCAN[53] | CVPR17 | 80.31 | 57.53 | 67.99 | 91.04 | 95.36 | - | ||||||
| Re-ranking[37] | CVPR17 | 77.11 | 63.63 | 61.60 | 67.60 | ||||||||
| CSBT[54] | CVPR17 | 42.9 | 20.3 | 55.5 | 84.3 | ||||||||
| P2S [32] | CVPR17 | 70.72 | 44.27 | ||||||||||
| Spindle[55] | CVPR17 | 76.90 | 91.50 | 94.60 | 88.50 | 97.80 | 98.60 | ||||||
| MI[56] | CVPR17 | 82.10 | 77.70 | 68.10 |
| ACRN[57] | CVPR17 | 83.61 | 92.61 | 95.34 | 62.60 | 62.63 | 89.69 | 94.72 | 70.20 | 72.58 | 84.79 | 88.87 | 51.96 |
| LSRO[46] | ICCV17 | 78.06 | 56.23 | 73.10 | 92.70 | 96.70 | 77.40 | 67.70 | 47.10 | ||||
| SPGAN+LMP[58] | CVPR18 | 58.10 | 76.00 | 82.70 | 26.90 | 46.90 | 62.60 | 68.50 | 26.40 | ||||
| MCAM[59] | CVPR18 | 83.55 | 74.25 | 49.29 | 49.89 | ||||||||
| TJ-AIDL[60] | CVPR18 | 58.20 | 74.80 | 81.10 | 26.50 | 44.30 | 59.60 | 65.00 | 23.00 | ||||
| Our GHH | 85.36 | 94.30 | 96.23 | 70.06 | 89.08 | 94.85 | 96.74 | 86.59 | 75.63 | 87.30 | 89.99 | 57.90 | |
Overall Performance:
- Market-1501: Our GHH achieves the highest Rank-1 (85.36%), Rank-5 (94.30%), Rank-10 (96.23%), and mAP (70.06%) among all compared methods, strongly validating the effectiveness of the proposed global hard sample mining with multiplet loss.
- CUHK03: Our GHH achieves Rank-1 (89.08%) and mAP (86.59%), higher than Spindle [55] (88.50% Rank-1) despite Spindle using a larger combined training dataset. Spindle does show slightly higher Rank-5 and Rank-10 accuracies, possibly due to its extensive training data; however, Rank-1 is often considered the most critical metric for ReID.
- Duke: Our GHH also demonstrates superior performance with Rank-1 (75.63%), Rank-5 (87.30%), Rank-10 (89.99%), and mAP (57.90%), outperforming all other methods in this comparison.

These results collectively confirm the validity and competitiveness of the LoopNet model with global hardest sample mining and multiplet loss across diverse ReID benchmarks.
6.2. Ablation Studies / Parameter Analysis
While the paper doesn't explicitly present a dedicated "ablation study" section, the extensive comparison of different sample mining modes (R, S, H) and mining ranges (L, G) for both triplet ($n = 1$) and multiplet losses effectively serves as an ablation study for these components.
Key insights from this implicit ablation:

- Effectiveness of Hard Sample Mining: Comparing *RR (random) with any hard sample mining method shows a clear performance uplift for both triplet and multiplet losses across both backbones. This confirms that hard sample mining is crucial for ReID.
- Hardest vs. Semi-Hard Samples: For triplet loss, mining the hardest samples (e.g., LRH, LHH) consistently outperforms semi-hard samples (e.g., LRS, LHS) in ReID, a notable finding that differs from some prior work (e.g., FaceNet, where semi-hard mining was preferred).
- Local vs. Global Mining for Triplet Loss: When using triplet loss (equivalent to multiplet with $n = 1$), local mining methods (e.g., LRH, LHH) generally perform better than global mining methods (e.g., GRH, GHH). This suggests that triplet loss cannot effectively leverage globally harder samples without additional mechanisms.
- Impact of Multiplet Loss: The multiplet loss itself, by extending triplet/quadruplet constraints and incorporating adaptive margins, significantly boosts performance across nearly all mining strategies compared to triplet loss.
- Synergy of Global Hardest Mining and Multiplet Loss: The most critical finding is the strong synergy between global hardest sample mining (GHH) and the multiplet loss. GHH consistently achieves the best multiplet performance, demonstrating that the multiplet loss is uniquely capable of effectively utilizing the truly hardest samples identified through global mining. This synergy directly validates the paper's core hypothesis about the benefits of global mining when coupled with an appropriate loss function.
- Impact of $n$ (Multiplet Dimension): While Tables III and IV report only the "Best Multiplet" (an optimized $n$), Figures 4 and 5 show performance curves across different multiplet dimensions. These curves indicate that choosing an appropriate $n$ (beyond just the triplet, $n = 1$) is beneficial for the multiplet loss, as performance generally improves with $n$ and peaks at a certain dimension.
7. Conclusion & Reflections
7.1. Conclusion Summary
This article introduces LoopNet, a novel listwise ranking network for Person Re-identification (ReID) that addresses the challenge of efficient global hard sample mining. The core contributions include the design of positive and negative ranking lists to globally mine the hardest or semi-hard samples, which is demonstrated to be superior to local mini-batch mining when combined with the proposed multiplet loss. The multiplet loss itself is a key innovation, enabling the progressive initialization of these ranking lists to avoid computationally expensive full pre-calculations. Furthermore, the multiplet loss incorporates sample priority through adaptive margins, giving harder samples greater influence during training. Experimental results validate these theories, showing that LoopNet achieves competitive, and often state-of-the-art, performance on public ReID datasets with only a marginal increase in computational time compared to local mining methods.
Key conclusions drawn from the research are:
- Hard sample mining is undeniably effective in improving ReID performance.
- When using the standard triplet loss, local sample mining within a mini-batch tends to yield the best results.
- However, when sample priority and effectiveness are accounted for (as in the multiplet loss), global hardest sample mining significantly outperforms all other global and local mining methods, demonstrating a powerful synergy between global sample selection and a context-aware loss function.
7.2. Limitations & Future Work
The authors acknowledge a primary area for future work:
- Generalization to Other Problems: The study focuses specifically on ReID, which is a ranking problem. A remaining question is whether these conclusions regarding hard sample mining strategies and the benefits of global mining generalize to other machine learning tasks, such as classification, detection, and generation.
- Consistency of Hard Sample Distribution: Given the wide variation in applications, it is an interesting topic to investigate whether hard samples follow the same statistical distribution across different problems, and whether the observed results from hard mining methods remain consistent. This calls for further comprehensive study of the nature of hard samples in diverse contexts.
7.3. Personal Insights & Critique
This paper presents a rigorous and well-executed solution to a critical problem in Person Re-identification: efficient global hard sample mining. My personal insights and critiques are as follows:
Strengths and Inspirations:
- Addressing the "Global" Challenge: The core strength is making global hard sample mining practical. The progressive initialization strategy for ranking lists is ingenious, cleverly bypassing the prohibitive computational cost of full pre-computation. This approach could inspire similar solutions in other domains where global sample selection is theoretically superior but computationally infeasible.
- The Power of Multiplet Loss: The multiplet loss is more than just an extension; its adaptive margins introduce a crucial priority mechanism that effectively leverages the quality of globally mined hard samples. This is a sophisticated way to integrate prior knowledge (sample hardness) into the learning objective. The experimental results clearly show its synergy with global mining, outperforming the triplet loss.
- Practical Efficiency: Demonstrating that global mining can be achieved with only a marginal increase in training time is highly impactful. It removes a major barrier to adopting global strategies.
- Rigorous Experimental Design: The comprehensive comparison of different mining ranges and modes (R, S, H) for both triplet and multiplet losses provides valuable insights into the nuanced behavior of hard sample mining in ReID. The finding that the hardest negatives outperform semi-hard ones for ReID is a useful contribution to the field's understanding.
- Applicability to Other Domains: While the authors list this as future work, the core idea of a dynamically updated ranking list and a prioritized multiplet-like loss could be transferable. For example, in face recognition or object retrieval, where hard negatives are abundant, this LoopNet framework could provide similar benefits.
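As a hedged illustration of the adaptive-margin idea (the paper's exact multiplet formulation is not reproduced here; the scaling rule and all names below are assumptions), harder samples can be assigned larger margins so they exert more pull on the loss:

```python
def prioritized_margin_loss(d_ap, d_an, base_margin=0.3, priority=1.0):
    """Triplet-style hinge with a margin scaled by a hardness priority.
    d_ap / d_an: anchor-positive / anchor-negative distances.
    priority: larger for harder samples, so harder samples demand a wider
    margin. The linear scaling is an illustrative assumption, not the
    paper's exact formulation."""
    margin = base_margin * priority
    return max(0.0, d_ap - d_an + margin)

def multiplet_style_loss(d_ap_list, d_an_list, base_margin=0.3):
    """Sum prioritized hinges over ranked positives and negatives:
    index 0 is the hardest pair, so it receives the largest margin."""
    total = 0.0
    k = max(len(d_ap_list), len(d_an_list))
    for rank, (d_ap, d_an) in enumerate(zip(d_ap_list, d_an_list)):
        priority = (k - rank) / k  # hardest (rank 0) -> priority 1.0
        total += prioritized_margin_loss(d_ap, d_an, base_margin, priority)
    return total
```

The key design point is that priority enters through the margin rather than as a plain loss weight: an easy pair with a small margin can satisfy its constraint and drop out of the gradient, while the hardest pair keeps a wide margin to satisfy.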
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Robustness of Progressive Initialization: The progressive initialization relies on randomly chosen samples to gradually populate the ranking lists. While this is efficient, the initial stages might be susceptible to noisy random samples or a skewed initial distribution if the random sampling is not sufficiently diverse. The speed at which the ranking list becomes reliable could affect early training stability.
- "Repeat Hardest Sample" Strategy: When the number of positive gallery images is smaller than the required multiplet dimension, the paper resorts to repeating the hardest positive sample. While practical for meeting the multiplet dimension requirement, this could potentially lead to overfitting on that single hard sample, or reduce the diversity of positive signals if this situation occurs frequently for certain identities. An alternative could be to dynamically reduce the multiplet dimension in such cases, or to use other forms of data augmentation.
- Hyperparameter Sensitivity: The specific values for the base margins and the multiplet dimension are chosen empirically. While these values work well, their optimal configuration might be dataset-dependent. A more thorough ablation study on the sensitivity to these parameters, perhaps with an adaptive selection mechanism, could further strengthen the model.
- Definition of "Global": While the ranking lists span the entire training set, they are still a subset of all possible hard samples for a given probe (due to memory limits and progressive building). Thus, the method is "global" relative to mini-batches, but not necessarily an exhaustive global search in every iteration. The authors address memory saving by limiting the list length, which is a practical trade-off.
- Computational Cost of Bitonic Sorting: While the overhead is reported as minimal, bitonic sorting has a worst-case complexity of O(n log² n). For very large ranking lists (if memory limits were relaxed), this could become a bottleneck. The current approach with limited list lengths likely keeps this manageable.

Overall, this paper makes a significant contribution to Person ReID by providing a practical and effective solution for global hard sample mining. The LoopNet and multiplet loss framework demonstrates a deep understanding of the challenges in metric learning and offers valuable insights for future research in hard sample mining across various computer vision tasks.
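As an aside on the bitonic-sorting point above, a minimal pure-Python sketch of the classic bitonic network (the textbook algorithm, not the paper's GPU implementation) shows where the O(n log² n) compare-exchange count comes from: log n stages, each with up to log n sub-stages, each touching all n elements.

```python
def bitonic_sort(a):
    """Sort a sequence whose length is a power of two with Batcher's
    bitonic network; performs O(n log^2 n) compare-exchange operations."""
    n = len(a)
    assert n & (n - 1) == 0, "bitonic network assumes power-of-two length"
    a = list(a)
    k = 2
    while k <= n:          # stage: merge bitonic runs of length k
        j = k // 2
        while j > 0:       # sub-stage: compare-exchange with stride j
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The fixed, data-independent comparison pattern is what makes the network GPU-friendly despite its extra log-factor over O(n log n) comparison sorts; capping the ranking-list length, as the paper does, keeps n small enough that this rarely matters in practice.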