Unleashing the Full Potential of Product Quantization for Large-Scale Image Retrieval
TL;DR Summary
This paper introduces a product quantization-based deep hashing framework that addresses computational costs and accuracy issues in large-scale datasets, efficiently learning class-level codes through a differentiable PQ branch and demonstrating superior retrieval performance across datasets ranging from ImageNet100 to Glint360K.
Abstract
Due to its promising performance, deep hashing has become a prevalent method for approximate nearest neighbors search (ANNs). However, most current deep hashing methods are validated on relatively small-scale datasets, leaving potential threats when they are applied to large-scale real-world scenarios. Specifically, they can be constrained either by the computational cost due to the large number of training categories and samples, or by unsatisfactory accuracy. To tackle those issues, we propose a novel deep hashing framework based on product quantization (PQ). It uses a softmax-based differentiable PQ branch to learn a set of predefined PQ codes of the classes. Our method is easy to implement, does not involve large-scale matrix operations, and learns highly discriminative compact codes. We validate our method on multiple large-scale datasets, including ImageNet100, ImageNet1K, and Glint360K, where the category size scales from 100 to 360K and the sample number scales from 10K to 17 million, respectively. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/yuleung/FPPQ.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Unleashing the Full Potential of Product Quantization for Large-Scale Image Retrieval
1.2. Authors
The paper lists the following authors and their affiliations:
- Yu Liang (College of Computer Science and Electronic Engineering, Hunan University; Intellifusion Inc.)
- Shiliang Zhang (National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University)
- Kenli Li (College of Computer Science and Electronic Engineering, Hunan University)
- Xiaoyu Wang (The Hong Kong University of Science and Technology (Guangzhou))
1.3. Journal/Conference
The paper was published on OpenReview.net on 2023-11-02. OpenReview is a platform often used for peer review and dissemination of research papers, particularly for conferences like ICLR and NeurIPS. Its presence on OpenReview suggests it was likely submitted to or accepted by a reputable conference in machine learning or computer vision.
1.4. Publication Year
2023
1.5. Abstract
Deep hashing has emerged as a prominent method for Approximate Nearest Neighbors Search (ANNs) due to its strong performance. However, its application to large-scale, real-world datasets is often hindered by computational costs (due to numerous training categories and samples) or unsatisfactory accuracy, as most current methods are validated on smaller datasets. To address these limitations, the paper proposes a novel deep hashing framework based on Product Quantization (PQ). This framework utilizes a softmax-based differentiable PQ branch to learn a set of predefined PQ codes for classes. The method is designed to be easy to implement, avoids large-scale matrix operations, and learns highly discriminative compact codes. The authors validate their approach on multiple large-scale datasets, including ImageNet100, ImageNet1K, and Glint360K, which feature category sizes ranging from 100 to 360K and sample numbers from 10K to 17 million. Extensive experiments are presented to demonstrate the method's superiority.
1.6. Original Source Link
https://openreview.net/pdf?id=scG0cwftEe (Published on 2023-11-02T00:00:00.000Z)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the limitation of existing deep hashing methods when applied to large-scale datasets in real-world scenarios. While Approximate Nearest Neighbors Search (ANNs) is crucial for various applications like image retrieval, video search, recommendation systems, and anomaly detection, and deep hashing has significantly advanced ANNs by mapping high-dimensional data into compact binary codes, current deep hashing techniques face two major challenges:
-
Computational Cost: Many methods become prohibitively expensive due to the large number of training categories and samples in real-world large-scale datasets (e.g., millions of categories and hundreds of millions of samples in face recognition).
-
Unsatisfactory Accuracy: Performance tends to degrade significantly when these methods are scaled up, often underperforming simpler, traditional methods like Product Quantization (PQ) on large data. PQ is a widely used and efficient quantization method; however, its major drawback is that retrieval performance declines rapidly as the encoding length (number of bits) decreases, making it impractical for applications requiring very short codes (e.g., edge devices) due to increased quantization errors.
The paper's innovative idea or entry point is to "further unlock the potential of PQ" by addressing its performance decay at shorter encoding lengths. The authors hypothesize that this decay stems from insufficient clustering conditions in the sub-spaces of the original feature vector, leading to large quantization errors. They aim to obtain a feature representation that is better suited for cluster assignment in multiple sub-spaces.
2.2. Main Contributions / Findings
The paper's primary contributions are threefold:
- First-time Comprehensive Evaluation on Extremely Large Datasets: The authors comprehensively investigate the retrieval performance of popular and advanced deep hashing methods on extremely large datasets with extensive categories and samples (e.g., Glint360K with 360K classes and 17 million images). They demonstrate that many existing deep hashing methods are indeed unsuitable for such large-scale scenarios due to computational resource limitations or performance degradation.
- Novel End-to-End Deep Product Quantization Method (FPPQ): They propose a simple yet effective end-to-end deep feature quantization compression method, named FPPQ (Full Potential Product Quantization). This method is based on class-level product quantization coding supervision and achieves a significantly lower-slope decay of retrieval performance with decreasing code lengths compared to traditional PQ. It can be seamlessly integrated into traditional PQ retrieval modules.
- Extensive Experimental Validation: The method is rigorously validated through extensive experiments on diverse datasets: Glint360K (360K classes, 17M images), ImageNet1K (1K classes, 1.2M images), and ImageNet100 (100 classes, a subset of ImageNet1K). The experimental results consistently demonstrate the superiority and effectiveness of FPPQ across various scales and bit-lengths. Key findings include:
  - FPPQ significantly outperforms existing deep hashing methods and traditional PQ on Glint360K at multiple bit lengths, especially at shorter code lengths (e.g., nearly 57% increase in Top-1 performance at 32 bits over GreedyHash).
  - It maintains consistent and significant improvements across different network structures (iResnet18, iResnet50, iResnet100).
  - Visualization using t-SNE shows that FPPQ effectively learns class-level PQ labels, leading to better clustering of sub-vectors in the subspace.
  - FPPQ also performs well on smaller datasets like ImageNet100 and achieves competitive results on ImageNet1K, even in challenging scenarios with limited categories or feature discriminability.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the paper, several fundamental concepts in information retrieval, machine learning, and deep learning are essential:
-
Approximate Nearest Neighbors Search (ANNs):
- Conceptual Definition: In many applications (like image search, recommendation systems), we need to find items in a large database that are "similar" or "closest" to a given query item. When the database is huge (millions or billions of items) and items are represented by high-dimensional vectors, finding the exact nearest neighbors becomes computationally infeasible. ANNs are algorithms that aim to find neighbors that are "approximately" closest to the query, trading off some accuracy for significant speed-up.
- How it works: ANNs typically build an index structure (e.g., hash tables, tree-based structures, graph-based structures) over the dataset that allows for quick lookups, even if it doesn't guarantee finding the absolute closest neighbor every time.
-
Hashing:
- Conceptual Definition: Hashing is a technique that transforms high-dimensional data (e.g., image features represented by floating-point vectors) into compact, low-dimensional binary codes (sequences of 0s and 1s). The goal is to preserve the similarity relationships: if two original high-dimensional items are similar, their hash codes should also be similar (e.g., having a small Hamming distance).
- Why it's used:
- Storage Efficiency: Binary codes require significantly less memory than floating-point vectors.
- Retrieval Speed: Comparing binary codes (e.g., using Hamming distance) is much faster than computing Euclidean or cosine distances between high-dimensional floating-point vectors.
- Hamming Distance: The Hamming distance between two binary codes of equal length is the number of positions at which the corresponding bits differ. A smaller Hamming distance implies greater similarity between the items. (A tiny sketch follows this item.)
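The following minimal Python snippet (an illustration, not from the paper) shows the metric in practice: binary codes stored as integers are compared with XOR followed by a bit count.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

# Example: two 8-bit codes that differ in two positions.
print(hamming_distance(0b10110010, 0b10010011))  # -> 2
```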
-
Deep Hashing:
- Conceptual Definition: Traditional hashing methods often rely on hand-crafted features or data-independent projections. Deep Hashing combines the power of deep neural networks with hashing: a deep learning model (e.g., a Convolutional Neural Network, CNN) is trained to directly learn optimal feature representations and convert them into binary hash codes in an end-to-end manner.
- Advantages: Deep learning models can learn highly discriminative features from raw data, leading to more effective hash codes and better retrieval performance compared to non-deep methods.
-
Product Quantization (PQ):
- Conceptual Definition: Product Quantization is a specific and popular technique for vector quantization used in ANNs. Instead of quantizing an entire high-dimensional vector at once, it divides the vector into several independent sub-vectors (or segments). Each sub-vector is then quantized separately using its own small codebook. The final representation of the original vector is a concatenation of the indices (or IDs) of the codewords from each sub-vector's codebook.
- How it works (simplified):
  - Divide: An input $D$-dimensional feature vector is split into $M$ equal-length sub-vectors: $\mathcal{F} = [\mathcal{F}_1, \cdots, \mathcal{F}_m, \cdots, \mathcal{F}_M]$, where each $\mathcal{F}_m \in \mathbb{R}^{D/M}$.
  - Cluster (Codebook Learning): For each segment $m$, all sub-vectors from the entire dataset are clustered into $K$ centroids (codewords) using an algorithm like k-means. These centroids form the codebook for that segment.
  - Encode: For a given feature vector, each of its sub-vectors is assigned to its closest centroid in its respective segment's codebook. The index of this centroid is the PQ code for that sub-vector. The full PQ code for the original feature is the sequence of these indices $[k_1, \cdots, k_M]$.
  - Retrieval: During retrieval, distances are computed efficiently using Look-Up Tables (LUTs). The distance between a query vector and a database vector can be approximated by summing up the distances between their corresponding sub-vector codewords.
- Benefits: Reduces memory usage significantly and speeds up distance calculations by converting a high-dimensional distance computation into low-dimensional lookups and summations.
- k-means clustering: An unsupervised machine learning algorithm used to partition data points into $K$ clusters. It iteratively assigns each data point to the closest cluster centroid and then updates the centroids to be the mean of all points assigned to that cluster, minimizing the within-cluster sum of squares. (A minimal NumPy sketch of PQ training and encoding appears after the Codebook item below.)
-
Codebook: In Product Quantization, a codebook is a collection of representative vectors (called codewords or centroids). For each segment of a feature vector there is a separate codebook containing $K$ codewords, which are learned (e.g., via k-means) to best represent the sub-vectors in that segment.
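To make the divide/cluster/encode steps concrete, here is a small NumPy/scikit-learn sketch of classic PQ codebook training and encoding. It is illustrative only: the function names (`train_pq_codebooks`, `pq_encode`) are hypothetical, and it assumes the dataset contains at least `K` vectors so that k-means can form `K` clusters per segment.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq_codebooks(features, M=4, K=256, seed=0):
    """Run k-means independently in each of the M sub-spaces; returns (M, K, D/M) codebooks."""
    subdim = features.shape[1] // M
    codebooks = []
    for m in range(M):
        sub = features[:, m * subdim:(m + 1) * subdim]
        km = KMeans(n_clusters=K, n_init=4, random_state=seed).fit(sub)
        codebooks.append(km.cluster_centers_)
    return np.stack(codebooks)            # (M, K, subdim)

def pq_encode(features, codebooks):
    """Assign each sub-vector to its nearest codeword; returns (N, M) integer PQ codes."""
    M, K, subdim = codebooks.shape
    codes = np.empty((features.shape[0], M), dtype=np.int32)
    for m in range(M):
        sub = features[:, m * subdim:(m + 1) * subdim]
        # Squared Euclidean distance from every sub-vector to every codeword of segment m.
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
        codes[:, m] = dists.argmin(1)
    return codes
```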
Softmax Function:
- Conceptual Definition: The softmax function is typically used in the output layer of a neural network for multi-class classification problems. It takes a vector of real numbers (logits) and converts them into a probability distribution, where each value is between 0 and 1 and all values sum to 1.
- Mathematical Formula: For an input vector $\mathbf{z} = (z_1, \cdots, z_K)$, the softmax output is calculated as: $ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $
- Symbol Explanation:
  - $z_i$: The $i$-th element of the input vector (logit).
  - $e$: Euler's number (the base of the natural logarithm).
  - $\sum_{j=1}^{K} e^{z_j}$: The sum of the exponentials of all elements in the input vector, acting as a normalization factor.
  - $\sigma(\mathbf{z})_i$: The $i$-th element of the output probability distribution.
-
Cross-Entropy Loss:
- Conceptual Definition: Cross-entropy loss (also known as log loss) is a commonly used loss function for classification tasks, especially when the output is a probability distribution (from a softmax layer). It measures the difference between the true probability distribution (one-hot encoded ground truth) and the predicted probability distribution. The goal during training is to minimize this loss.
- Mathematical Formula (for a single sample with $K$ classes): $ L_{CE} = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) $ For one-hot encoding, where $y_k = 1$ for the true class $k$ and $y_i = 0$ for $i \neq k$: $ L_{CE} = -\log(\hat{y}_k) $
- Symbol Explanation:
  - $y_i$: The true probability for class $i$ (1 if $i$ is the ground-truth class, 0 otherwise).
  - $\hat{y}_i$: The predicted probability for class $i$ (output of the softmax function).
  - $\log$: The natural logarithm.

(A small numeric sketch combining softmax and cross-entropy follows.)
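A small NumPy illustration of the two formulas above (softmax followed by cross-entropy on a one-hot target); the numbers are arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, true_class):
    """L_CE = -log(softmax(logits)[true_class]) for a one-hot target."""
    return -np.log(softmax(logits)[true_class])

logits = np.array([2.0, 0.5, -1.0])
print(softmax(logits))              # probabilities summing to 1
print(cross_entropy(logits, 0))     # small loss: class 0 already has the largest logit
```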
-
Metric Learning Losses (e.g., CosFace):
- Conceptual Definition: These losses are designed to learn feature embeddings where semantically similar data points are mapped close to each other and dissimilar points are mapped far apart in the embedding space. This is crucial for tasks like face recognition or image retrieval, where the discriminative power of features is paramount.
- CosFace (Additive Angular Margin Loss): A specific metric learning loss that enhances feature discriminability by adding an angular margin to the cosine similarity score of the ground-truth class. This pushes features of the same class closer in angular space while pushing features of different classes further apart.
- Mathematical Formula (simplified for the paper's context, based on [39]): $ L = -\log \frac{e^{s(\cos(\theta_{k^l} + \text{margin}))}}{e^{s(\cos(\theta_{k^l} + \text{margin}))} + \sum_{k \neq k^l} e^{s \cos(\theta_k)}} $
- Symbol Explanation:
  - $\theta_{k^l}$: The angle between the feature vector and the weight vector corresponding to the ground-truth class $k^l$.
  - $\text{margin}$: A fixed positive parameter (e.g., 0.2) that controls the magnitude of the angular margin. Adding it to the angle $\theta_{k^l}$ (or equivalently tightening the target cosine) makes it harder for the model to correctly classify the target, forcing it to learn more discriminative features.
  - $s$: A scaling factor (e.g., 64) that normalizes the magnitude of the logits, making the angular separation more significant.
  - $\cos \theta_k$: The cosine similarity between the feature vector and the weight vector of class $k$.

(A hedged PyTorch sketch of such an angular-margin loss follows.)
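The sketch below shows one way an angular-margin softmax can be written in PyTorch, directly following the formula quoted above (margin added to the target angle, cosine logits scaled by s). It is a generic illustration with hypothetical names, not the paper's implementation; s=64 and margin=0.2 mirror the values mentioned in the text.

```python
import torch
import torch.nn.functional as F

def margin_softmax_loss(features, weights, labels, s=64.0, margin=0.2):
    """Cross-entropy over scaled cosine logits, with an angular margin added to the
    ground-truth class (a sketch of the formula above, not the paper's code)."""
    cos = F.normalize(features) @ F.normalize(weights).t()        # (B, num_classes)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target_logit = torch.cos(theta.gather(1, labels[:, None]) + margin)
    logits = cos.scatter(1, labels[:, None], target_logit)        # margin only on the true class
    return F.cross_entropy(s * logits, labels)

B, D, C = 8, 512, 100                        # batch, feature dim, number of classes
feats = torch.randn(B, D)
W = torch.randn(C, D, requires_grad=True)    # classification weights (one row per class)
loss = margin_softmax_loss(feats, W, torch.randint(0, C, (B,)))
loss.backward()
```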
3.2. Previous Works
The paper categorizes previous hashing methods in several ways and discusses their limitations, especially for large-scale datasets.
-
Hash Distance Calculation Methods:
- Hamming distance-based methods: Aim to learn binary codes such that similarity is measured by Hamming distance. Examples include LSH [16], GreedyHash [37], OrthoHash [18], CSQ [46], HHF [42], HashNet [7], DPN [12], DFH [25], JMLH [36], DCDH [48].
  - LSH (Locality Sensitive Hashing) [16]: A data-independent random hashing algorithm. It projects high-dimensional data onto lower dimensions using random hyperplanes or other methods, then buckets similar items into the same hash bins.
- Dictionary-based methods (quantization methods): Calculate distance by looking up a codebook. These methods typically involve clustering or learning codewords. Examples include PQ [20], AQ [3], DPQ [23], DTQ [27], PQN [45], ADSVQ [51], OPQN [49].
-
Non-deep vs. Deep Methods:
-
Non-deep methods:
- LSH [16]: As mentioned above, a foundational random projection method.
- PQ (Product Quantization) [20]: Divides vectors into sub-vectors and quantizes each independently.
- OPQ (Optimized Product Quantization) [14]: Improves PQ by applying an orthogonal transformation to the original data before quantization to reduce quantization error.
- LOPQ (Locally Optimized Product Quantization) [21]: An extension of OPQ that considers local relationships of vectors to further improve quantization.
- CQ (Composite Quantization) [50], AQ (Additive Quantization) [3]: Other quantization methods that represent vectors as sums of codewords from multiple codebooks.
- Characteristics: Generally higher efficiency but lower performance compared to deep methods.
-
Deep methods: Leverage neural networks to learn hash codes and feature representations.
- Gradient Issue: The quantization process (converting real-valued features to discrete binary or integer codes) is non-differentiable, making end-to-end training challenging. Researchers use various techniques:
  - Sigmoid or tanh functions to approximate the sign function [7].
  - Straight-Through (ST) Estimator [5]: A technique to allow gradients to flow through non-differentiable operations by treating the non-differentiable function as an identity during the backward pass. Used in DPQ [23] and GreedyHash [37].
- Examples: HashNet [7], DPQ [23], GreedyHash [37], OrthoHash [18], DQN [47], PQN [45], DTQ [27], OPQN [49], ADSVQ [51], DDH [48].
-
-
Limitations of Existing Methods on Large-Scale Datasets:
- DPQ [23]: Uses a Gini Impurity-related penalty. This constraint is difficult to apply to large-scale data and can cause network collapse, mapping many data points to the same code.
- PQN [45], DTQ [27]: Computationally expensive due to constructing triplets (combinations of anchor, positive, and negative samples), even with optimizations like dividing datasets into groups.
- DQN [47]: Requires frequent clustering of the entire training data to update hash code centers, which is inefficient for large datasets.
- ADSVQ [51], DCDH [48]: Need to calculate a square matrix of the number of categories during training, limiting scalability to hundreds of thousands or millions of classes.
- OPQN [49]: The number of bits is limited to the feature dimension, leading to significant training overhead when longer bit lengths are needed for performance.
- Predefined Hash Codes: Works like DPN [12], CSQ [46], and OrthoHash [18] use predefined hash codes (e.g., from a Hadamard matrix [41] or Bernoulli sampling) as labels. The Hadamard matrix becomes impractical for many categories due to its large dimensions, and Bernoulli sampling may struggle to achieve a near-uniform distribution of compressed hash codes as supervised targets, making convergence difficult.
3.3. Technological Evolution
Hashing technology has evolved from simple data-independent methods (like LSH in the late 90s) to data-dependent learning-based methods, and then to deep hashing in the past decade.
-
Early Hashing (e.g., LSH): Focused on randomized projections, simple, but often suboptimal performance.
-
Learning-to-Hash (Non-deep): Methods that learn hash functions from data, often using spectral methods or boosting. PQ and its variants (OPQ, LOPQ) emerged as powerful quantization-based approaches.
Deep Hashing: Integrated deep learning to learn both features and hash codes end-to-end, showing significant performance gains by extracting more discriminative features. This era saw methods addressing the non-differentiability of hashing.
-
Current Challenge: Large-Scale Data: The current frontier, which this paper addresses, is scaling deep hashing methods to real-world large-scale datasets with millions of samples and hundreds of thousands of classes. Many deep hashing methods, while effective on smaller benchmarks, struggle with the computational burden and performance degradation at this scale.

This paper's work fits within this timeline by pushing the boundaries of deep hashing to operate effectively in large-scale scenarios, specifically by enhancing Product Quantization through a deep learning framework.
3.4. Differentiation Analysis
Compared to the main methods in related work, FPPQ differentiates itself by:
- Scalability: Unlike ADSVQ, DCDH, DPQ, PQN, DTQ, and DQN, which suffer from large-scale matrix operations, high computational costs, or frequent clustering of entire datasets, FPPQ is designed to be efficient for large-scale data. It "does not involve large-scale matrix operations, converges quickly, and overcomes the limitations of previous methods for large-scale datasets."
- PQ Performance at Short Bits: FPPQ directly tackles the core weakness of traditional PQ: the rapid decline in retrieval performance at short encoding lengths. It aims for a "low-slope decay" of performance by learning features better suited for PQ. While traditional PQ is strong at long bit lengths, FPPQ enhances it significantly at short ones.
- Class-Level Supervision for PQ: Instead of relying on generic hash code supervision (like the Hadamard matrix or Bernoulli sampling used by DPN, CSQ, and OrthoHash), FPPQ generates class-level PQ labels from pre-trained features. This implicitly captures inter-class relationships in a way that is tailored to the PQ structure.
- Differentiable PQ Branch: It integrates a softmax-based differentiable PQ branch directly into the deep learning framework, enabling end-to-end learning of a codebook suitable for PQ without requiring post-training k-means for the codebook.
- Flexibility in Loss: The use of CosFace [39] (a metric learning loss) for the PQ sub-branches enhances discriminability, which is often superior to simple softmax cross-entropy for retrieval tasks.
- Robustness to Preprocessing: The method is insensitive to feature preprocessing (normalization, segment normalization) during retrieval, simplifying deployment.
4. Methodology
4.1. Principles
The original Product Quantization (PQ) method aims to minimize quantization error through clustering. However, when compression strength is increased (i.e., encoding bits are reduced), the number of segments $M$ decreases and the sub-vector dimension (D/M) increases. This leads to two main problems that degrade PQ performance:
- Increased Impact of Misallocation: As the sub-vector dimension grows, the quantization error within a segment increases. Also, with fewer segments each segment becomes more critical, so a misallocation (assigning a sub-vector to a wrong cluster center) has a larger impact on overall accuracy.
- Uniform Sub-vector Distribution: With increased sub-vector dimension and fewer segments, sub-vectors start to resemble complete features more closely, often exhibiting stronger discriminability and a more uniform distribution. When the classes of the dataset must be mapped to only $K = 2^b$ clustering centers per segment (where $b$ is the bits per segment), and the number of categories far exceeds $K$, it becomes highly probable that features of different classes are assigned to the same PQ codes, or that samples of the same class are assigned to different PQ codes. This negatively impacts retrieval accuracy. (A small numeric illustration follows this section.)

To mitigate these issues, the core idea behind FPPQ (Full Potential Product Quantization) is to train a backbone neural network to generate feature representations that are inherently better suited for short PQ encoding. Specifically, the method aims to achieve two goals:
- Class-level PQ Code Assignment: Features belonging to the same class should ideally be assigned to the same PQ code.
- Sub-vector Clustering within Sub-spaces: Within each sub-space (i.e., for each segment), the sub-vectors that belong to the same desired PQ code should cluster together tightly.

These goals are achieved by introducing a differentiable PQ branch that is supervised by predefined class-level PQ labels, guiding the feature learning process to produce more PQ-friendly embeddings.
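As a rough back-of-the-envelope illustration of the second problem (an assumption of ours, not a figure from the paper): although the combined code space $K^M$ is huge, each individual sub-space offers only $K$ centroids, so for a dataset like Glint360K many classes must share every centroid within a segment.

```python
M, b = 4, 8                     # a 32-bit code: 4 segments, 8 bits per segment
K = 2 ** b                      # 256 centroids per segment
class_num = 360_000             # e.g., Glint360K

print(K ** M)                   # 4,294,967,296 distinct 32-bit PQ codes in total
print(class_num / K)            # ~1406 classes share each centroid within any one segment
```

This is why FPPQ focuses on making sub-vectors of the same class collapse onto the same, well-separated codeword in every sub-space.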
4.2. Core Methodology In-depth (Layer by Layer)
The FPPQ framework operates in an end-to-end manner, as illustrated in Figure 1. During the training phase it consists of two parallel branches:
- A one-hot classification branch for the entire feature vector.
- A PQ classification branch that constrains the sub-vectors of the feature.

Figure 1: The overall flow of our framework. It consists of two branches during the training phase: a one-hot classification branch for the entire feature and another PQ classification branch that constrains sub-vectors of the feature. We use the predefined class-level PQ labels to guide the learning of the PQ branch and improve feature separation in sub-spaces. During the retrieval phase, the only difference with traditional PQ is that our framework does not require clustering for the codebook. Instead, we directly use the learned codebook obtained from the PQ branch, which comprises the fully connected weights of multiple sub-branches in the PQ branch.

Training Phase:
Given an input image, a backbone neural network extracts a deep feature representation $\mathcal{F} \in \mathbb{R}^D$. This feature vector is then fed into two parallel branches:
4.2.1. One-Hot Classification Branch
This is a standard classification branch that takes the entire feature $\mathcal{F}$ and uses a fully connected (FC) layer followed by a softmax activation to predict the class label. It is optimized using a standard classification loss (e.g., softmax cross-entropy loss), denoted as $\mathcal{L}_{cls}$. This branch ensures that the backbone learns generally discriminative features for overall classification. (A minimal sketch of this branch follows.)
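A minimal PyTorch sketch of this branch (the class name is hypothetical; the dimensions follow the Glint360K setup described later: 512-dimensional features, roughly 360K classes):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Plain one-hot classification branch: FC layer + softmax cross-entropy (a sketch)."""
    def __init__(self, feat_dim=512, num_classes=360_000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)
        self.criterion = nn.CrossEntropyLoss()   # applies log-softmax internally

    def forward(self, features, labels):
        return self.criterion(self.fc(features), labels)   # L_cls
```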
4.2.2. PQ Classification Branch
This is the novel component designed to optimize features specifically for Product Quantization.
- Feature Segmentation: The feature vector $\mathcal{F}$ is first divided into $M$ equal-length segments (sub-vectors), consistent with the PQ encoding process: $ \mathcal{F} = [ \mathcal{F}_1, \cdots, \mathcal{F}_m, \cdots, \mathcal{F}_M ] $ where each sub-vector $\mathcal{F}_m \in \mathbb{R}^{D/M}$.
- Multiple Sub-Branches: For each of the $M$ segments there is a separate PQ sub-branch. Each PQ sub-branch is essentially a fully connected (FC) layer that maps the corresponding sub-vector to $K$ output units. Here, $K$ is the number of clusters (or codewords) in each segment's codebook (typically $K = 2^b$, where $b$ is the number of bits per segment, e.g., $K = 256$ for 8 bits).
- Softmax-based Loss: Each PQ sub-branch applies a softmax function to its $K$ outputs and is trained to maximize the posterior probability of the ground-truth PQ code for that segment. The loss for the PQ branch is defined as: $ L_{pq} = - \frac{1}{B M} \sum_{b=1}^{B} \sum_{m=1}^{M} \log \frac{e^{z_{mk^l}}}{\sum_{k=1}^{K} e^{z_{mk}}} $
  - Symbol Explanation:
    - $B$: The batch size, representing the number of samples processed in one training iteration.
    - $M$: The number of PQ code segments (sub-vectors) into which the main feature is divided.
    - $K$: The number of clusters (codewords) in each segment's codebook, typically $2^b$ for $b$ bits per segment.
    - $z_{mk}$: The logit (raw output before softmax) of the $k$-th unit in the $m$-th PQ sub-branch.
    - $k^l$: The ground-truth code value (index) for the $m$-th segment, derived from the predefined class-level PQ label.
- Optimizing Sub-branches with Angle (using Metric Learning Principles): To enhance the discriminability of each sub-branch's output, FPPQ leverages metric learning principles. The output $z_{mk}$ from an FC layer can be seen as a dot product between the input sub-vector $\mathcal{F}_m$ and the weight vector $w_{mk}$ of the $k$-th unit in the $m$-th sub-branch.
  - Euclidean Distance Interpretation: The Euclidean distance between the sub-vector $\mathcal{F}_m$ and a classification weight $w_{mk}$ is: $ D_{Euclidean}\langle \mathcal{F}_m, w_{mk} \rangle = \sqrt{ \|\mathcal{F}_m\|^2 + \|w_{mk}\|^2 - 2 \|\mathcal{F}_m\| \|w_{mk}\| \cos \theta_{mk} } $
    - Symbol Explanation:
      - $\mathcal{F}_m$: The $m$-th sub-vector of the input feature.
      - $w_{mk}$: The weight vector of the $k$-th unit in the $m$-th PQ sub-branch; it effectively acts as a codeword for the $k$-th cluster in the $m$-th segment.
      - $\|\mathcal{F}_m\|^2$, $\|w_{mk}\|^2$: The squared Euclidean norms (magnitudes) of the sub-vector and weight vector.
      - $\cos \theta_{mk}$: The cosine of the angle between $\mathcal{F}_m$ and $w_{mk}$.
  - Normalization for Angular Optimization: By normalizing the weights such that $\|w_{mk}\| = 1$, Equation (3) simplifies to: $ D_{Euclidean} \big\langle \mathcal{F}_m, w_{mk} \big\rangle = \sqrt{ \|\mathcal{F}_m\|^2 + 1 - 2 \|\mathcal{F}_m\| \cos \theta_{mk} } $ This shows that, for a given $\mathcal{F}_m$, the ranking of Euclidean distances (and thus the choice of the closest codeword) depends solely on the angles $\theta_{mk}$, because $\|\mathcal{F}_m\|$ is constant. To further stabilize training and accelerate convergence, the authors:
    - Omit the bias term in the FC layer.
    - Normalize $\mathcal{F}_m$ such that $\|\mathcal{F}_m\| = 1$, so the output becomes: $ z_{mk} = \|\mathcal{F}_m\| \|w_{mk}\| \cos \theta_{mk} = \cos \theta_{mk} $, i.e., the cosine similarity between the normalized sub-vector and the normalized weight vector.
- Incorporating Metric Learning Loss (CosFace): A simple softmax cross-entropy loss might not provide sufficient discriminability for retrieval tasks. Therefore, FPPQ integrates CosFace [39] (an additive angular margin loss) into the PQ branch loss to enhance the separation between classes in the angular space. The modified $L_{pq}$ becomes: $ L_{pq} = - \frac{1}{B M} \sum_{b=1}^{B} \sum_{m=1}^{M} \log \frac{e^{s(\cos(\theta_{mk^l} + \text{margin}))}}{e^{s(\cos(\theta_{mk^l} + \text{margin}))} + \sum_{k=1, k \neq k^l}^{K} e^{s \cos(\theta_{mk})}} $
  - Symbol Explanation:
    - $\text{margin}$: A fixed parameter (e.g., 0.2) controlling the magnitude of the margin, making the decision boundary stricter for the ground-truth class.
    - $s$: A scaling factor (e.g., 64) applied to the logits to emphasize angular differences.
    - $\theta_{mk^l}$: The angle between the $m$-th sub-vector and the weight vector of its ground-truth PQ code.
    - $\theta_{mk}$: The angle between the $m$-th sub-vector and the weight vector of any other PQ code.

(A PyTorch-style sketch of this branch follows.)
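The sketch below shows one plausible way to implement the PQ branch described above: M bias-free sub-classifiers with normalized weights and sub-vectors, trained with a scaled margin softmax against class-level PQ labels. Class and variable names are hypothetical, and implementation details (initialization, exact margin form) may differ from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PQBranch(nn.Module):
    """Sketch of the PQ branch: M bias-free sub-classifiers over normalized sub-vectors,
    trained with a scaled, margin-based softmax against class-level PQ labels."""
    def __init__(self, feat_dim=512, M=4, K=256, s=64.0, margin=0.2):
        super().__init__()
        assert feat_dim % M == 0
        self.M, self.K, self.s, self.margin = M, K, s, margin
        # One weight matrix per segment; each row acts as a codeword of that segment.
        self.weights = nn.Parameter(torch.randn(M, K, feat_dim // M) * 0.01)

    def forward(self, features, pq_labels):
        """features: (B, D); pq_labels: (B, M) long tensor of per-segment code indices."""
        B = features.size(0)
        subs = F.normalize(features.view(B, self.M, -1), dim=-1)   # ||F_m|| = 1
        codewords = F.normalize(self.weights, dim=-1)              # ||w_mk|| = 1
        cos = torch.einsum("bmd,mkd->bmk", subs, codewords)        # cos(theta_mk), (B, M, K)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = pq_labels.unsqueeze(-1)                           # (B, M, 1)
        margin_cos = torch.cos(theta.gather(2, target) + self.margin)
        logits = self.s * cos.scatter(2, target, margin_cos)       # margin only on the true code
        return F.cross_entropy(logits.reshape(B * self.M, self.K),
                               pq_labels.reshape(B * self.M))      # L_pq, averaged over B*M
```

At retrieval time, `F.normalize(self.weights, dim=-1)` can be exported directly as the PQ codebook, as described in Section 4.2.5.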
4.2.3. Final Optimization Objective
The overall training objective combines the classification loss for the full feature and the PQ branch loss:
$ \mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{pq} $
- Symbol Explanation:
  - $\mathcal{L}$: The total loss function to be minimized during training.
  - $\mathcal{L}_{cls}$: The classification loss (e.g., softmax cross-entropy) for the main classification branch, which helps learn a generally discriminative backbone.
  - $\mathcal{L}_{pq}$: The PQ branch loss (incorporating CosFace), which specifically optimizes the sub-vectors for Product Quantization.

This combined loss enables end-to-end optimization, where the backbone learns features that are both generally discriminative and specifically well-suited for PQ encoding. (A short training-step sketch follows.)
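A short sketch of one training step combining the two losses; `backbone`, `cls_head` (the classification-head sketch above), and `pq_branch` (the PQBranch sketch above) are assumed to exist and are hypothetical names.

```python
def training_step(images, class_labels, pq_labels, backbone, cls_head, pq_branch, optimizer):
    features = backbone(images)                       # (B, D) deep features
    loss_cls = cls_head(features, class_labels)       # L_cls on the whole feature
    loss_pq = pq_branch(features, pq_labels)          # L_pq on the M sub-vectors
    loss = loss_cls + loss_pq                         # L = L_cls + L_pq
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```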
4.2.4. Predefined Class-Level PQ Label
A crucial part of FPPQ is the generation of class-level PQ labels to supervise the PQ branch. Unlike prior works that use generic hash codes (e.g., Hadamard matrix or Bernoulli sampling), FPPQ creates PQ codes that implicitly capture inter-class relationships derived from the data itself.
- Process (a sketch follows this list):
  - Obtain Class Average Features: A pre-trained model (whose classifier is assumed to be discarded, a common scenario in retrieval) is used to extract features. For each class, the average feature is computed by averaging the features of all samples belonging to that class.
  - PQ Encoding on Average Features: Traditional PQ encoding is then performed on these class average features. This involves:
    - Dividing each average feature into $M$ segments.
    - Performing k-means clustering in each of the $M$ sub-spaces across all class average features.
    - Assigning each sub-vector of each class average feature to its closest centroid to generate a PQ code for that segment.
  - Construct Class-Level PQ Labels: The resulting M-tuple of segment indices forms the class-level PQ label for each class: $ PQ_{label} = PQ(\{ \mathcal{F}_{avg} \}) = \{ [k_1, \cdots, k_m, \cdots, k_M]_{cls} ; cls \in [1, class\_num] \} $
    - Symbol Explanation:
      - $PQ_{label}$: The set of Product Quantization labels for all classes.
      - $\mathcal{F}_{avg}$: The average feature vector for a given class.
      - $[k_1, \cdots, k_m, \cdots, k_M]_{cls}$: The M-tuple PQ code assigned to class $cls$, where $k_m$ is the index of the codeword for the $m$-th segment.
      - $class\_num$: The total number of classes in the dataset.
  - Handle Duplication: In scenarios with many data categories and short encoding bits, PQ code duplication can occur (i.e., different classes might end up with the same PQ code). To ensure unique and meaningful relationships, FPPQ removes duplicates by assigning them to the position with the lowest quantization error, provided that position is not already occupied by another PQ code.
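A hedged sketch of how class-level PQ labels could be produced (the function name is hypothetical). It assumes the number of classes is at least K per segment — exactly the condition that fails on ImageNet100, as discussed in Section 6.3.1 — and it omits the duplicate-resolution step.

```python
import numpy as np
from sklearn.cluster import KMeans

def class_level_pq_labels(features, class_ids, M=4, K=256, seed=0):
    """Sketch: average features per class, run k-means per sub-space on the class
    averages, and record each class's nearest-centroid index for every segment."""
    classes = np.unique(class_ids)
    avg = np.stack([features[class_ids == c].mean(0) for c in classes])   # (C, D)
    subdim = avg.shape[1] // M
    labels = np.empty((len(classes), M), dtype=np.int32)
    for m in range(M):
        sub = avg[:, m * subdim:(m + 1) * subdim]
        km = KMeans(n_clusters=K, n_init=4, random_state=seed).fit(sub)
        labels[:, m] = km.labels_
    # The paper additionally reassigns duplicated M-tuples to the lowest-quantization-error
    # unoccupied code; that step is omitted here for brevity.
    return dict(zip(classes.tolist(), labels.tolist()))
```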
4.2.5. Encode and Retrieval
After the training phase is complete, the FPPQ framework is used for encoding and retrieval:
- Feature Extraction: The trained backbone is used to extract feature representations for all input images in the gallery set (database) and any query images.
- Codebook from PQ Branch Weights: Crucially, the weights of the FC layers in the PQ branch are directly used as the codebook for the PQ algorithm. The weights $w_{mk}$ are normalized to form codewords $\mathcal{C}_{mk} = w_{mk} / \|w_{mk}\|$, and the complete codebook is the collection of these codewords over all segments $m \in [1, M]$ and indices $k \in [1, K]$. At this point, the PQ algorithm no longer requires its own clustering step; it uses the learned codebook directly.
- Encoding Gallery Samples: Each sample $\mathcal{F}_{\mathcal{G}}$ in the gallery set is encoded into an M-tuple PQ code: $ \overline{PQ(\mathcal{F}_{\mathcal{G}})} = \{ [k_1, \cdots, k_m, \cdots, k_M]_i ; i \in [1, num\_images] \} $ where $k_m$ is the index of the codeword closest to the $m$-th sub-vector of $\mathcal{F}_{\mathcal{G}}$.

Product Quantization supports two main retrieval methods:
- Symmetric Retrieval (SDC - Symmetric Distance Computation):
  - Conceptual Definition: In symmetric retrieval, both the query vector and the gallery vectors are first quantized into their PQ codes. The distance between a query and a gallery item is then computed as the sum of distances between their corresponding quantized representations (i.e., their codewords).
  - Mathematical Formula: The distance between a query (whose $m$-th segment is represented by codeword $\mathcal{C}_{mk^*}$) and the $i$-th gallery sample (whose $m$-th segment is represented by codeword $\mathcal{C}_{mk^i}$) is: $ \mathcal{D}_{sym} = \sum_{m=1}^{M} \langle \mathcal{C}_{mk^*}, \mathcal{C}_{mk^i} \rangle $
    - Symbol Explanation:
      - $\mathcal{D}_{sym}$: The symmetric distance between the query and gallery sample.
      - $M$: Number of segments.
      - $\langle \cdot, \cdot \rangle$: A distance metric (e.g., Euclidean or cosine distance) between two vectors.
      - $k^*$: The index of the codeword in the $m$-th segment's codebook that is closest to the query's $m$-th sub-vector $\mathcal{F}_m$.
      - $\mathcal{C}_{mk^*}$: The codeword (quantized representation) for the $m$-th segment of the query sample.
      - $\mathcal{C}_{mk^i}$: The codeword (quantized representation) for the $m$-th segment of the $i$-th gallery sample.
  - Codeword Selection: For the query, $k^*$ is determined by finding the codeword that minimizes the distance to the query's $m$-th sub-vector $\mathcal{F}_m$: $ k^* = \underset{k \in [0, K-1]}{\arg\min} \ \big\langle \mathcal{F}_m, \mathcal{C}_{mk} \big\rangle $
    - Symbol Explanation:
      - $K$: Number of codewords per segment.
      - $\mathcal{F}_m$: The query's $m$-th sub-vector.
- Asymmetric Retrieval (ADC - Asymmetric Distance Computation):
  - Conceptual Definition: In asymmetric retrieval, the query vector is used in its original, unquantized (floating-point) form, while the gallery vectors are represented by their PQ codes (quantized form). The distance is computed between the original query sub-vectors and the codeword representations of the gallery sub-vectors. This typically offers higher accuracy than symmetric retrieval because it avoids quantizing the query.
  - Mathematical Formula: The distance between a query $\mathcal{F}^{query}$ and the $i$-th gallery sample (whose $m$-th segment is represented by codeword $\mathcal{C}_{mk^i}$) is: $ \mathcal{D}_{asym} = \sum_{m=1}^{M} \big\langle \mathcal{F}_m^{query}, \mathcal{C}_{mk^i} \big\rangle $
    - Symbol Explanation:
      - $\mathcal{D}_{asym}$: The asymmetric distance between the query and gallery sample.
      - $\mathcal{F}_m^{query}$: The original, unquantized $m$-th sub-vector of the query sample.
      - $\mathcal{C}_{mk^i}$: The codeword for the $m$-th segment of the $i$-th gallery sample.
- Feature Preprocessing for Retrieval: During training, segment normalization (setting $\|\mathcal{F}_m\| = 1$) is performed. For retrieval, however, FPPQ found that applying segment normalization or full feature normalization to the query feature has little impact on the ultimate retrieval performance, thanks to the learned compactness of the encoding. Thus, FPPQ "does not need to perform any additional processing on the features compared with PQ algorithm" during retrieval. (A minimal retrieval sketch follows.)
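A minimal NumPy sketch of asymmetric (ADC) retrieval with a learned codebook: a per-segment look-up table is built once per query, and each gallery item's distance is the sum of M table entries selected by its stored code. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def adc_search(query, codes, codebooks, topk=20):
    """Asymmetric distance computation: a float query is compared against the
    PQ-encoded gallery through per-segment look-up tables (a sketch)."""
    M, K, subdim = codebooks.shape            # e.g., (4, 256, 128) for 32-bit codes
    # Look-up table: squared distance from each query sub-vector to every codeword.
    lut = np.stack([
        ((query[m * subdim:(m + 1) * subdim][None, :] - codebooks[m]) ** 2).sum(-1)
        for m in range(M)
    ])                                        # (M, K)
    # Distance of each gallery item = sum of the table entries selected by its code.
    dists = lut[np.arange(M)[:, None], codes.T].sum(0)   # codes: (num_gallery, M)
    return np.argsort(dists)[:topk]
```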
5. Experimental Setup
5.1. Datasets
The authors validate their method on a variety of datasets, emphasizing large-scale scenarios.
- Glint360K [2]:
  - Source & Scale: One of the largest publicly available face datasets. It contains over 360,000 unique identities (IDs) and approximately 17 million images.
  - Characteristics & Domain: A large-scale facial recognition dataset.
  - Usage: All available data from Glint360K is used to train the hashing methods. Performance is then evaluated on unseen retrieval tasks using the MegaFace and FaceScrub datasets, ensuring no overlap.
- MegaFace [22] and FaceScrub [33]:
  - Source & Scale: MegaFace contains one million images of over 690,000 different individuals; FaceScrub comprises over 100,000 photos of 530 celebrities.
  - Characteristics & Domain: Both are standard datasets for evaluating face recognition performance.
  - Usage: MegaFace is used as the gallery set (database) for evaluation, and FaceScrub is used to construct the query set. Following standard practice [22], a subset of FaceScrub with 80 individuals (40 females, 40 males), each having more than 50 images, is used. After noise removal and de-duplication (following insightface project recommendations), the final query set consists of 3529 face images from these 80 individuals.
  - Evaluation Protocol: For each query individual, one image is taken as a query and the remaining images of that individual are incorporated into the gallery set to test retrieval.
- ImageNet1K [35]:
  - Source & Scale: A widely used large-scale visual recognition challenge dataset with 1000 classes and over 1.2 million images.
  - Characteristics & Domain: General object recognition dataset.
  - Usage: The training set of ImageNet1K is used to train the hashing methods, and the validation set (50,000 images) is used for evaluation.
    - Query Set: The top 5 images from each of the 1000 classes in the validation set, totaling 5000 images.
    - Gallery Set: The remaining 45,000 images from the validation set.
- ImageNet100 [7, 35]:
  - Source & Scale: A subset of ImageNet1K consisting of 100 classes.
  - Characteristics & Domain: Smaller-scale general object recognition.
  - Usage: Popularly used in previous deep hashing methods for benchmarking.
5.2. Evaluation Metrics
The retrieval performance is assessed using standard metrics relevant to image retrieval tasks.
- Top-K Accuracy (used for Glint360K, MegaFace, FaceScrub):
  - Conceptual Definition: Measures how often the correct item (i.e., an image of the same individual as the query) is found among the top retrieved results. If a ground-truth item is present among the $K$ items closest to the query, it is counted as a hit.
  - Mathematical Formula: $ \text{Top-K Accuracy} = \frac{\text{Number of queries where the ground-truth is in the top K results}}{\text{Total number of queries}} $
  - Symbol Explanation:
    - Number of queries where the ground-truth is in the top K results: The count of query samples for which a true matching item (an item of the same class/identity) appears within the top $K$ positions of the ranked retrieval list.
    - Total number of queries: The total count of query samples evaluated.
  - Specific Metrics: Top-1, Top-5, and Top-20 accuracies are reported, meaning the ground-truth must appear within the top 1, 5, or 20 positions, respectively.
- Mean Average Precision (mAP) @R1000 (used for ImageNet1K and ImageNet100):
  - Conceptual Definition: mAP is a commonly used metric for evaluating information retrieval systems. It is the mean of the average precision (AP) scores over all queries; AP for a single query averages the precision values at each relevant item in the ranked list. The @R1000 indicates that only the top 1000 retrieved results are considered when calculating AP. A higher mAP indicates better retrieval performance across all queries.
  - Mathematical Formula: $ \text{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP}_q $ where $ \text{AP}_q = \frac{\sum_{k=1}^{n} P(k) \cdot \text{rel}(k)}{\text{Number of relevant items for query } q} $
  - Symbol Explanation:
    - $Q$: The total number of queries.
    - $\text{AP}_q$: The average precision for query $q$.
    - $n$: The total number of retrieved items considered (here, up to 1000).
    - P(k): The precision at cutoff $k$ (i.e., the proportion of relevant items among the top $k$ retrieved items).
    - $\text{rel}(k)$: An indicator function equal to 1 if the item at rank $k$ is relevant, and 0 otherwise.
    - Number of relevant items for query $q$: The total count of relevant items for query $q$ in the entire gallery.

(A short sketch of this computation follows.)
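A short Python sketch of mAP@R computed exactly as the formula above (AP normalized by the number of relevant items for the query). Conventions for mAP@R vary slightly across papers, so treat this as one reasonable reading rather than the paper's exact evaluation code.

```python
import numpy as np

def average_precision_at_r(ranked_relevance, total_relevant, R=1000):
    """AP over the top-R retrieved items for one query.
    ranked_relevance: binary array, 1 where the retrieved item shares the query's class."""
    rel = np.asarray(ranked_relevance[:R], dtype=float)
    if total_relevant == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # P(k) for k = 1..R
    return float((precision_at_k * rel).sum() / total_relevant)

def mean_average_precision(all_ranked_relevance, all_total_relevant, R=1000):
    return float(np.mean([average_precision_at_r(r, n, R)
                          for r, n in zip(all_ranked_relevance, all_total_relevant)]))
```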
5.3. Baselines
The paper compares FPPQ against a wide range of competitive hashing methods, encompassing both traditional and deep learning-based approaches:
- Traditional PQ [20]: The foundational Product Quantization method, serving as a direct reference for FPPQ's improvements.
- HHF [42]: Hashing-guided Hinge Function for Deep Hashing Retrieval.
- CSQ [46]: Central Similarity Quantization.
- OrthoHash [18]: One loss for all: Deep hashing with a single cosine similarity based learning objective.
- GreedyHash [37]: Towards fast optimization for accurate hash coding in CNN.
- DPN [12]: Deep Polarized Network for Supervised Learning of Accurate Binary Hashing Codes.
- DFH [25]: Deep Fisher Hashing.
- DCDH [48]: Deep Center-based Dual-constrained Hashing for discriminative face image retrieval.
- JMLH [36]: Embarrassingly Simple Binary Representation Learning.
- DPQ [23]: End-to-end Supervised Product Quantization for Image Search and Retrieval.
- HBS-RL [43]: Hash Bit Selection with Reinforcement Learning for Image Retrieval.
- DWDM [10]: Deep Hashing with Discrete Wasserstein Distributional Matching.

The authors note that some methods, such as DCDH and DWDM, were abandoned because their large-scale matrix multiplications exceeded computational resource limits on Glint360K, while JMLH and DPQ did not achieve effective performance in their tests. This highlights the scalability challenges. For methods using predefined hash centers (CSQ, OrthoHash), Bernoulli sampling was used to generate non-redundant hash centers due to the inefficiency of the Hadamard matrix for hundreds of thousands of classes.
5.4. Implementation Details
- Framework: All experiments were conducted using PyTorch.
- Hardware: 8 NVIDIA 2080Ti or 3090Ti GPUs for training.
- Class-level PQ Labels: A pre-trained model was used to extract the features from which class-level PQ labels are derived. This is considered fair, as the comparative methods also use pre-trained models for initialization.
- Hyperparameters (General):
  - $s$ (scaling factor for the angular loss): Set to 64.
  - margin (angular margin for CosFace): Set to 0.2.
  - Bits per segment: Fixed at 8 bits, a common setting in practical applications, so each sub-branch has $K = 2^8 = 256$ categories.
  - Number of segments (M): Varied with the target total bits (total bits = 8M): M = 2 for 16 bits, M = 4 for 32 bits, M = 8 for 64 bits, and M = 16 for 128 bits.
- Baseline Implementations: Methods were implemented based on the cisip-FIRe project (a common benchmark framework for deep hashing) and public projects provided by the respective authors, using the default cisip-FIRe settings or the hyperparameters specified in the papers.
- Training Strategy (General):
  - All methods were trained with the same backbone, number of epochs, and batch sizes.
  - Most methods used fine-tuning based on a pre-trained model.
  - Hash layer initial learning rate: 0.0001, decreased to 0.00001 after 80% of the total epochs.
  - Backbone learning rate: 0.1 times that of the hash layer.
- Specifics for Glint360K:
  - Backbone: iResnet [11] (variants iResnet18, iResnet50, iResnet100).
  - Feature dimension: 512.
  - Optimizer: SGD with weight decay of 0.0001.
  - Batch size: likely 128 samples per GPU across 8 GPUs.
  - Learning rate: Initialized at 0.1 and linearly decreased to 0 over the training steps.
- Specifics for ImageNet100/ImageNet1K:
  - Backbone: ResNet50 [17], initialized with parameters pre-trained on ImageNet1K.
  - Feature dimension: 2048.
  - ImageNet100 special case: To prevent overfitting on the limited data, the backbone was fixed and an additional linear layer (2048 to 128 dimensions) was added at the end for learning; fine-tuned for 100 epochs with a learning rate of 0.0001.
  - ImageNet1K special case: Addresses the difficulty of obtaining good class-level PQ labels due to low feature discriminability and high duplication rates for PQ codes of average features. The authors adopt OrthoHash's approach of generating hash codes by repeated Bernoulli sampling to ensure sufficient Hamming distance, then segment them and convert them to decimal codes to serve as PQ labels. The training strategy follows the cisip-FIRe project, fine-tuning for 90 epochs.
- Retrieval: Asymmetric retrieval was used, and FPPQ performed no feature preprocessing during retrieval.
6. Results & Analysis
6.1. Core Results Analysis
The experiments demonstrate FPPQ's significant superiority, particularly on large-scale datasets and at shorter encoding lengths, where traditional PQ and many deep hashing methods struggle.
The following are the results from Table 1 of the original paper:
| Method | 32 bits Top-1 | 32 bits Top-5 | 32 bits Top-20 | 64 bits Top-1 | 64 bits Top-5 | 64 bits Top-20 | 128 bits Top-1 | 128 bits Top-5 | 128 bits Top-20 |
|---|---|---|---|---|---|---|---|---|---|
| PQ [20] | 0.1724 | 0.2999 | 0.4254 | 0.6665 | 0.7749 | 0.8404 | 0.9184 | 0.9492 | 0.9610 |
| HHF [42] | 0.0001 | 0.0014 | 0.0054 | 0.0773 | 0.1646 | 0.2661 | 0 | 0 | 0 |
| CSQ [46] | 0.0061 | 0.0113 | 0.0183 | 0.0124 | 0.0197 | 0.0269 | 0.0283 | 0.0372 | 0.0451 |
| OrthoHash [18] | 0.3098 | 0.3706 | 0.4171 | 0.5526 | 0.6007 | 0.6351 | 0.6626 | 0.7104 | 0.7454 |
| GreedyHash [37] | 0.3688 | 0.5220 | 0.6323 | 0.6102 | 0.7452 | 0.8251 | 0.8259 | 0.9003 | 0.9338 |
| FPPQ (Ours) | 0.9343 | 0.9601 | 0.9652 | 0.9467 | 0.9571 | 0.9628 | 0.9568 | 0.9690 | 0.9731 |
Table 1: Performance comparison of multiple methods at multiple code lengths on the large-scale Glint360K dataset. Note that some methods are not listed because they were ineffective or required unrealistic computational resources.
- Dominance on Glint360K: FPPQ consistently achieves the highest Top-1, Top-5, and Top-20 accuracies across all tested bit lengths (32, 64, 128 bits) on the challenging Glint360K dataset.
  - At 32 bits: FPPQ reaches a remarkable Top-1 accuracy of 0.9343, vastly outperforming GreedyHash (0.3688), OrthoHash (0.3098), and especially PQ (0.1724). This translates to approximately a 57% increase in Top-1 performance over GreedyHash and demonstrates FPPQ's ability to maintain high accuracy even with highly compressed codes, a major breakthrough for PQ.
  - At 64 bits: FPPQ (0.9467 Top-1) still significantly surpasses all baselines, including PQ (0.6665), GreedyHash (0.6102), and OrthoHash (0.5526), roughly a 28% gain over traditional PQ.
  - At 128 bits: FPPQ (0.9568 Top-1) maintains its lead, with approximately a 4% improvement over PQ (0.9184). Even at longer bit lengths where traditional PQ is strong, FPPQ still extracts more discriminative features.
- Limitations of Baselines:
  - Many advanced deep hashing methods (HHF, CSQ, OrthoHash, GreedyHash) struggle or fail on Glint360K, especially at shorter bit lengths; some methods were excluded entirely due to "ineffective performance or requiring unrealistic computational resources."
  - HHF exhibited very poor performance at shorter bits and failed completely at 128 bits, suggesting issues with hyperparameters or convergence at this scale.
  - CSQ also performed poorly across all bit lengths.
  - Even OrthoHash and GreedyHash, which are competitive deep hashing methods, fall below the original PQ at 64 and 128 bits, highlighting the challenge of scaling deep hashing. This strongly validates the authors' initial motivation that existing methods are not well suited to large-scale datasets.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Performance with Different Backbone Settings
The authors investigate the impact of different backbone architectures on FPPQ's performance.
The following are the results from Table 2 of the original paper:
Dataset: Glint360K.
| Method | Metric | iResnet18 128 bits | iResnet18 64 bits | iResnet18 32 bits | iResnet50 128 bits | iResnet50 64 bits | iResnet50 32 bits | iResnet100 128 bits | iResnet100 64 bits | iResnet100 32 bits |
|---|---|---|---|---|---|---|---|---|---|---|
| PQ [20] | Top-1 | 0.8005 | 0.4588 | 0.0973 | 0.9184 | 0.6665 | 0.1724 | 0.9465 | 0.7380 | 0.2074 |
| PQ [20] | Top-5 | 0.8659 | 0.5958 | 0.1940 | 0.9492 | 0.7749 | 0.2999 | 0.9658 | 0.8378 | 0.3569 |
| PQ [20] | Top-20 | 0.9025 | 0.6903 | 0.3031 | 0.9610 | 0.8404 | 0.4254 | 0.9720 | 0.8878 | 0.4941 |
| FPPQ (Ours) | Top-1 | 0.8231 | 0.7019 | 0.6945 | 0.9568 | 0.9467 | 0.9342 | 0.9679 | 0.9578 | 0.9305 |
| FPPQ (Ours) | Top-5 | 0.8824 | 0.7890 | 0.7576 | 0.9690 | 0.9571 | 0.9601 | 0.9746 | 0.9728 | 0.9700 |
| FPPQ (Ours) | Top-20 | 0.9133 | 0.8414 | 0.8006 | 0.9731 | 0.9628 | 0.9652 | 0.9772 | 0.9753 | 0.9734 |
Table 2: Under different network structures and bit settings, FPPQ achieves consistent and significant improvements.
- Consistent Improvements: FPPQ consistently shows significant advantages over PQ across all iResnet architectures (iResnet18, iResnet50, iResnet100) and bit settings. The performance gap is particularly large at shorter bit lengths (e.g., 32 bits), where PQ's performance severely degrades.
- Impact of Backbone Size: Generally, larger backbones (e.g., iResnet50 vs. iResnet18, iResnet100 vs. iResnet50) lead to better performance for both PQ and FPPQ, as more powerful backbones extract richer and more discriminative features.
- 32-bit Performance Limit: Interestingly, for iResnet100 at 32 bits, the Top-1 performance slightly decreases compared to iResnet50 (0.9305 vs. 0.9342), although Top-5 and Top-20 still show slight improvements. The authors suggest this might be due to an "intrinsic constraint of compression strength" at very low bit settings, implying that even with a very powerful backbone there is a limit to how much information can be preserved in extremely short codes.
6.2.2. Low-Slope Decay of Retrieval Performance
-
Analysis of Figure 2:
Figure 2: On Glint360K, comparison of the asymmetric retrieval performance of PQ as compression increases, together with the results of our method. 'L2' denotes retrieval with the original (uncompressed) features; the other extreme divides each feature into 256 segments for PQ retrieval, with 8 bits per segment.
Figure 2 visually confirms one of FPPQ's main claims: it achieves a "low-slope decay of retrieval performance" as the encoding bit-length decreases.
- The curve for naive Product Quantization (PQ) shows a steep drop in performance as the bit length decreases, which makes traditional PQ impractical for highly compressed scenarios.
- In contrast, the curve for FPPQ maintains a much higher performance level at shorter bit lengths, and its degradation with decreasing bits is far less pronounced (a "lower slope"). FPPQ thus effectively addresses PQ's weakness, enabling accurate retrieval even with very short codes and expanding its applicability.
- The L2 retrieval (using original features) provides an upper bound, showing the ideal scenario without any compression. FPPQ gets significantly closer to this ideal than PQ, especially at intermediate bit lengths. The 256-segment, 8-bits-per-segment configuration mentioned in the caption corresponds to a very long code, but the general trend of FPPQ's superiority holds across the range.
6.2.3. t-SNE Visualization
-
Analysis of Figure 3:
Figure 3: Visualization of the sub-vectors and clustering centers for 10k classes of the Glint360K dataset under two approaches: (a) classification loss with PQ and (b) the proposed method. Black dots represent sub-features; in (a) the red dots are the clustering centers obtained by k-means on the sub-features of all classes, while in (b) the red dots are the codewords trained by the proposed method.
To illustrate how FPPQ shapes feature distributions for PQ, t-SNE [38] is used to visualize sub-vectors and clustering centers for 10,000 classes from Glint360K (specifically, the first segment of the average features when compressed to 32 bits).
- Figure 3(a) (traditional classification loss with PQ): The sub-vectors (black dots) are largely evenly distributed throughout the sub-space, and the k-means clustering centers (red dots) are scattered accordingly. This makes efficient quantization difficult, because sub-vectors from different classes may be close to the same centroid, leading to high quantization error and poor discriminability.
- Figure 3(b) (FPPQ): In stark contrast, FPPQ learns to cluster the sub-vectors (black dots) into distinct, tight groups around the learned codewords (red dots), which are themselves well separated. This confirms that FPPQ successfully learns the class-level PQ labels, guiding the sub-vectors to form well-defined clusters in the sub-space, which is crucial for reducing quantization error and improving retrieval accuracy.
6.2.4. Retrieval Result Visualization
-
Analysis of Figure 4:
Figure 4: Visualization of the Top-20 retrieval samples using FaceScrub as the query set and MegaFace as the gallery set.
Figure 4 presents qualitative Top-20 retrieval results from MegaFace (gallery) queried with FaceScrub. It compares FPPQ under various L2 and PQ4 (4 segments, 32 bits total, 8 bits per segment) retrieval settings.
- The visual comparison shows that FPPQ produces highly similar and accurate Top-20 results under both L2 retrieval and PQ4 retrieval.
- The labels PQ4-F (PQ4 without feature processing), PQ4-Fnor (PQ4 with full feature normalization), and PQ4-SegFnor (PQ4 with segment normalization) test the impact of feature preprocessing. FPPQ generates similar Top-20 results under all of these settings, "further validating the insensitivity of our method to feature preprocessing." This is a practical advantage that simplifies deployment, since no complex data preparation is needed during inference.
6.3. Results on Smaller Scale Datasets
6.3.1. Results on ImageNet100
The experiment on ImageNet100 addresses a scenario with a small data scale and fewer categories (100 classes) than the number of clusters per segment ($K = 2^8 = 256$ for 8 bits per segment).
The following are the results from Table 3 of the original paper:
| Method | 16 bits | 32 bits | 64 bits |
|---|---|---|---|
| HashNet [7] | 0.5101 | 0.7059 | 0.7997 |
| CSQ [46] | 0.8379 | 0.8750 | 0.8874 |
| DPN [12] | 0.8543 | 0.8799 | 0.8927 |
| DFH [25] | 0.8352 | 0.8781 | 0.8849 |
| DCDH [48] | 0.7856 | 0.8158 | 0.8400 |
| JMLH [36] | 0.8366 | 0.8671 | 0.8799 |
| GreedyHash [37] | 0.8544 | 0.8796 | 0.8886 |
| HBS-RL [43] | 0.8465 | 0.8770 | 0.8660 |
| DPQ [23] | 0.8860 | 0.8910 | 0.8960 |
| HHF [42] | 0.8710 | 0.8869 | 0.8994 |
| OrthoHash [18] | 0.8693 | 0.8910 | 0.8960 |
| FPPQ (Ours) | 0.8956 | 0.9128 | 0.9154 |
Table 3: mAP@R1000 on ImageNet100
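As a side note on the metric reported in Table 3, the sketch below shows one common way mAP@R1000 is computed in the deep hashing literature; conventions differ slightly between papers, so this is illustrative rather than the exact evaluation script used for these numbers.

```python
# Illustrative sketch of mAP@R (e.g., R=1000) under one common convention:
# for each query, average Precision@k over the relevant positions within the
# top-R ranked results, normalized by the number of relevant items found there.
import numpy as np

def map_at_r(retrieved_labels, query_labels, r=1000):
    """retrieved_labels: (num_queries, >=r) labels of ranked gallery items per query.
    query_labels: (num_queries,) ground-truth label of each query."""
    aps = []
    for ranked, q in zip(retrieved_labels, query_labels):
        rel = (np.asarray(ranked[:r]) == q).astype(np.float64)
        n_rel = rel.sum()
        if n_rel == 0:                 # no relevant item retrieved in the top R
            aps.append(0.0)
            continue
        precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append(float((precision_at_k * rel).sum() / n_rel))
    return float(np.mean(aps))
```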
- FPPQ achieved the best performance (mAP@R1000) on ImageNet100 across all bit lengths (16, 32, and 64 bits).
- At 32 bits, FPPQ (0.9128) outperforms DPQ (0.8910), OrthoHash (0.8910), and HHF (0.8869).
- This indicates that FPPQ is effective even in scenarios where generating class-level PQ labels is initially challenging because the number of categories is small relative to the number of clusters (k-means cannot be executed directly on class average features). The workaround of training PQ on all instance features and then encoding the class average features, combined with the fixed backbone and linear layer, proved effective (a sketch of this label-generation step follows this list).
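A minimal sketch of that workaround, using scikit-learn in place of whatever quantizer the authors actually used (array names and parameters are assumptions): fit a per-segment codebook on instance-level features, then encode each class average feature to obtain its class-level PQ label.

```python
# Minimal sketch (assumed names; not the authors' pipeline): train per-segment
# codebooks on instance features, then encode class average features to get
# class-level PQ labels, as described for ImageNet100.
import numpy as np
from sklearn.cluster import KMeans

def class_level_pq_labels(instance_feats, class_mean_feats,
                          n_segments=4, n_codewords=256, seed=0):
    """instance_feats: (N, dim); class_mean_feats: (C, dim); dim % n_segments == 0."""
    dim = instance_feats.shape[1]
    sub_dim = dim // n_segments
    labels = np.empty((class_mean_feats.shape[0], n_segments), dtype=np.int64)
    for s in range(n_segments):
        sl = slice(s * sub_dim, (s + 1) * sub_dim)
        km = KMeans(n_clusters=n_codewords, n_init=10, random_state=seed)
        km.fit(instance_feats[:, sl])                        # codebook from instances
        labels[:, s] = km.predict(class_mean_feats[:, sl])   # encode class means
    return labels  # (C, n_segments) codeword indices = class-level PQ labels
```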
6.3.2. Results on ImageNet1K
ImageNet1K presents another challenge: the difficulty of obtaining good class-level PQ labels, due either to potentially low feature discrimination from the pre-trained model or to duplicate PQ codes when class average features are compressed to short bit lengths (e.g., >50% duplication for 1000 classes at 32 bits).
The following are the results from Table 4 of the original paper:
| Method | 16 bits | 32 bits | 64 bits |
| --- | --- | --- | --- |
| HHF [42] | 0.2961 | 0.5979 | 0.6121 |
| DCDH [48] | 0.3666 | 0.4764 | 0.5299 |
| CSQ [46] | 0.5040 | 0.6061 | 0.6093 |
| JMLH [36] | 0.5286 | 0.5876 | 0.6098 |
| GreedyHash [37] | 0.5424 | 0.5896 | 0.5952 |
| OrthoHash [18] | 0.5936 | 0.6514 | 0.6761 |
| FPPQ(Ours) | 0.6207 | 0.6543 | 0.6649 |
Table 4: mAP@R1000 on ImageNet1K
- FPPQ achieved competitive results on ImageNet1K, especially at shorter encoding lengths.
- At 16 bits, FPPQ (0.6207) significantly outperforms all other methods, including OrthoHash (0.5936).
- At 32 bits, FPPQ (0.6543) is slightly better than OrthoHash (0.6514).
- At 64 bits, OrthoHash (0.6761) slightly edges out FPPQ (0.6649).
- The strategy of using Bernoulli sampling (similar to OrthoHash) to generate PQ labels when class average features are not discriminative enough or produce too many duplicate codes is crucial here. This shows FPPQ's adaptability to different PQ label generation strategies, maintaining strong performance even when direct k-means on average features is suboptimal (a plausible sketch of such a sampling fallback follows this list).
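The paper's exact sampling procedure is not reproduced here, but a plausible reading of the Bernoulli-sampling fallback is sketched below: each bit of every segment index is drawn from a fair Bernoulli distribution, and duplicate class codes are resampled so that all class-level labels stay distinct. Treat this as an assumption-laden illustration, not the authors' implementation.

```python
# Plausible sketch only (the paper's exact procedure may differ): draw each
# class's PQ label by Bernoulli-sampling the bits of every segment index,
# resampling duplicates so that all class-level codes remain distinct.
import numpy as np

def bernoulli_pq_labels(n_classes, n_segments=4, bits_per_segment=8, seed=0):
    rng = np.random.default_rng(seed)
    seen = set()
    labels = np.empty((n_classes, n_segments), dtype=np.int64)
    for c in range(n_classes):
        while True:
            bits = rng.integers(0, 2, size=(n_segments, bits_per_segment))
            idx = bits.dot(1 << np.arange(bits_per_segment))  # bits -> codeword index
            key = tuple(idx.tolist())
            if key not in seen:                               # keep class codes unique
                seen.add(key)
                labels[c] = idx
                break
    return labels  # (n_classes, n_segments) indices in [0, 2**bits_per_segment)
```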
7. Conclusion & Reflections
7.1. Conclusion Summary
This study rigorously explored the limitations of current advanced deep hashing methods when applied to large-scale image retrieval scenarios, demonstrating their inadequacy in terms of computational resources or accuracy. To overcome these challenges, the paper proposed FPPQ (Full Potential Product Quantization), a novel end-to-end deep hashing framework. FPPQ employs multiple softmax branches supervised by specific class-level PQ labels, effectively guiding the backbone to learn features that are highly discriminative and well-suited for Product Quantization. A key achievement is its ability to mitigate the rapid performance decay of traditional PQ at shorter encoding lengths, leading to a "low-slope decay" in retrieval performance. The method is designed to be efficient, avoiding large-scale matrix operations, and integrates seamlessly with traditional PQ retrieval. Extensive experiments on Glint360K, ImageNet1K, and ImageNet100 consistently validated FPPQ's superior performance across various scales and bit-lengths.
7.2. Limitations & Future Work
The authors implicitly point out some limitations and future considerations:
- Intrinsic Compression Strength Limit: The observation that iResnet100 did not consistently improve Top-1 performance over iResnet50 at 32 bits suggests an "intrinsic constraint of compression strength." Beyond a certain point, simply using a larger backbone may not yield proportional gains when features are compressed to extremely short bit lengths, as fundamental information loss becomes dominant.
- PQ Label Generation Challenges: For datasets like ImageNet1K, where feature discriminability is low or PQ code duplication is high, the simple generation of class-level PQ labels (via k-means on average features) needs adaptation (e.g., using Bernoulli sampling). The quality of these predefined labels is crucial and may require careful consideration or alternative strategies depending on dataset characteristics.
- Specific Loss Functions: While CosFace is effective, exploring other metric learning losses or loss functions specialized for quantization could yield further improvements.
7.3. Personal Insights & Critique
The FPPQ paper offers several compelling insights and makes significant strides in large-scale image retrieval using deep hashing:
- Practical Relevance: The paper's focus on large-scale real-world scenarios (hundreds of thousands of categories and millions of images) is highly practical and addresses a critical gap where many academic deep hashing methods fall short. The finding that even traditional PQ can outperform advanced deep hashing methods on Glint360K is a powerful validation of the motivation.
- Elegant Integration of PQ: The method integrates the strengths of Product Quantization (efficiency, scalability) with deep learning (powerful, discriminative feature learning). By making the PQ codebook learned and differentiable, and supervising it with class-level PQ labels, FPPQ effectively trains the backbone to "think" in terms of PQ clusters rather than forcing features into a PQ structure post-training. This end-to-end optimization is key to its success.
- Robustness to Preprocessing: The demonstrated insensitivity to feature preprocessing during retrieval is a valuable practical characteristic, simplifying deployment and making the system more robust in varied environments.
- Adaptability in Label Generation: The flexibility to adapt PQ label generation (e.g., using Bernoulli sampling for ImageNet1K) shows that the framework is not rigidly tied to one specific label generation method, allowing customization based on dataset properties.
- Potential Broader Applications: The core idea of learning feature representations optimized for a specific quantization scheme (like PQ) could transfer to other compression or approximate search techniques; similar principles could be applied to Additive Quantization (AQ) or Composite Quantization (CQ) in deep learning contexts.
-
  - Sensitivity to Pre-trained Model: The method relies on a pre-trained model to generate class average features for PQ label creation. The quality of these initial features could heavily influence the effectiveness of the PQ labels. An ablation study on the impact of different pre-trained models or label generation strategies could provide deeper insights.
  - Theoretical Guarantees: While the empirical results are strong, more theoretical analysis of why the softmax-based PQ branch (especially with an angular margin loss) leads to better clusterable sub-spaces could strengthen the paper.
  - Computational Overhead for Label Generation: For truly massive datasets where new classes appear frequently, generating class average features and performing k-means (even on just the average features) might still incur some computational cost. Discussing strategies for dynamic or incremental PQ label updates would be valuable for real-time systems.
  - Energy Efficiency/Edge Devices: While edge devices are mentioned as a motivation for short codes, a more direct analysis or experiment on actual energy consumption or inference time on constrained hardware could further solidify the value proposition for such applications.

Overall, FPPQ presents a highly promising and practical approach to making deep hashing and Product Quantization viable for the most demanding large-scale image retrieval tasks. Its rigorous evaluation and significant performance gains position it as a strong baseline and a step forward in the field.