
Unleashing the Full Potential of Product Quantization for Large-Scale Image Retrieval

Published: 2023-11-02
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a product quantization-based deep hashing framework that addresses the computational cost and accuracy issues of existing methods on large-scale datasets. It efficiently learns class-level PQ codes through a differentiable PQ branch and demonstrates superior retrieval performance on datasets ranging from ImageNet100 to Glint360K.

Abstract

Due to its promising performance, deep hashing has become a prevalent method for approximate nearest neighbors search (ANNs). However, most current deep hashing methods are validated on relatively small-scale datasets, leaving potential risks when they are applied to large-scale real-world scenarios. Specifically, they can be constrained either by the computational cost caused by the large number of training categories and samples, or by unsatisfactory accuracy. To tackle these issues, we propose a novel deep hashing framework based on product quantization (PQ). It uses a softmax-based differentiable PQ branch to learn a set of predefined PQ codes of the classes. Our method is easy to implement, does not involve large-scale matrix operations, and learns highly discriminative compact codes. We validate our method on multiple large-scale datasets, including ImageNet100, ImageNet1K, and Glint360K, where the category size scales from 100 to 360K and the sample number scales from 10K to 17 million. Extensive experiments demonstrate the superiority of our method. Code is available at https://github.com/yuleung/FPPQ.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Unleashing the Full Potential of Product Quantization for Large-Scale Image Retrieval

1.2. Authors

The paper lists the following authors and their affiliations:

  • Yu Liang (College of Computer Science and Electronic Engineering, Hunan University; Intellifusion Inc.)
  • Shiliang Zhang (National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University)
  • Kenli Li (College of Computer Science and Electronic Engineering, Hunan University)
  • Xiaoyu Wang (The Hong Kong University of Science and Technology (Guangzhou))

1.3. Journal/Conference

The paper was published on OpenReview.net on 2023-11-02. OpenReview is a platform often used for peer review and dissemination of research papers, particularly for conferences like ICLR and NeurIPS. Its presence on OpenReview suggests it was likely submitted to or accepted by a reputable conference in machine learning or computer vision.

1.4. Publication Year

2023

1.5. Abstract

Deep hashing has emerged as a prominent method for Approximate Nearest Neighbors Search (ANNs) due to its strong performance. However, its application to large-scale, real-world datasets is often hindered by computational costs (due to numerous training categories and samples) or unsatisfactory accuracy, as most current methods are validated on smaller datasets. To address these limitations, the paper proposes a novel deep hashing framework based on Product Quantization (PQ). This framework utilizes a softmax-based differentiable PQ branch to learn a set of predefined PQ codes for classes. The method is designed to be easy to implement, avoids large-scale matrix operations, and learns highly discriminative compact codes. The authors validate their approach on multiple large-scale datasets, including ImageNet100, ImageNet1K, and Glint360K, which feature category sizes ranging from 100 to 360K and sample numbers from 10K to 17 million. Extensive experiments are presented to demonstrate the method's superiority.

https://openreview.net/pdf?id=scG0cwftEe (published 2023-11-02)

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the limitation of existing deep hashing methods when applied to large-scale datasets in real-world scenarios. While Approximate Nearest Neighbors Search (ANNs) is crucial for various applications like image retrieval, video search, recommendation systems, and anomaly detection, and deep hashing has significantly advanced ANNs by mapping high-dimensional data into compact binary codes, current deep hashing techniques face two major challenges:

  1. Computational Cost: Many methods become prohibitively expensive due to the large number of training categories and samples in real-world large-scale datasets (e.g., millions of categories and hundreds of millions of samples in face recognition).

  2. Unsatisfactory Accuracy: Performance tends to degrade significantly when these methods are scaled up, often underperforming simpler, traditional methods like Product Quantization (PQ) on large data.

    Product Quantization (PQ) is a widely used and efficient quantization method. However, its major drawback is that its retrieval performance rapidly declines as the encoding length (number of bits) decreases, making it impractical for applications requiring very short codes (e.g., edge devices) due to increased quantization errors.

The paper's innovative idea or entry point is to "further unlock the potential of PQ" by addressing its performance decay at shorter encoding lengths. The authors hypothesize that this decay stems from insufficient clustering conditions in the sub-spaces of the original feature vector, leading to large quantization errors. They aim to obtain a feature representation that is better suited for cluster assignment in multiple sub-spaces.

2.2. Main Contributions / Findings

The paper's primary contributions are threefold:

  1. First-time Comprehensive Evaluation on Extremely Large Datasets: The authors comprehensively investigate the retrieval performance of popular and advanced deep hashing methods on extremely large datasets with extensive categories and samples (e.g., Glint360K with 360K classes and 17 million images). They demonstrate that many existing deep hashing methods are indeed unsuitable for such large-scale scenarios due to computational resource limitations or performance degradation.
  2. Novel End-to-End Deep Product Quantization Method (FPPQ): They propose a simple yet effective end-to-end deep feature quantization compression method, named FPPQ (Full Potential Product Quantization). This method is based on class-level product quantization coding supervision and achieves a significantly lower-slope decay of retrieval performance with decreasing code lengths compared to traditional PQ. It can be seamlessly integrated into traditional PQ retrieval modules.
  3. Extensive Experimental Validation: The method is rigorously validated through extensive experiments on diverse datasets: Glint360K (360K classes, 17M images), ImageNet1K (1K classes, 1.2M images), and ImageNet100 (100 classes, subset of ImageNet1K). The experimental results consistently demonstrate the superiority and effectiveness of FPPQ across various scales and bit-lengths. Key findings include:
    • FPPQ significantly outperforms existing deep hashing methods and traditional PQ on Glint360K at multiple bit lengths, especially at shorter code lengths (e.g., nearly 57% increase in Top-1 performance at 32-bits over GreedyHash).
    • It maintains consistent and significant improvements across different network structures (iResnet18, iResnet50, iResnet100).
    • Visualization using t-SNE shows that FPPQ effectively learns class-level PQ labels, leading to better clustering of sub-vectors in the subspace.
    • FPPQ also performs well on smaller datasets like ImageNet100 and achieves competitive results on ImageNet1K, even in challenging scenarios with limited categories or feature discriminability.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the paper, several fundamental concepts in information retrieval, machine learning, and deep learning are essential:

  • Approximate Nearest Neighbors Search (ANNs):

    • Conceptual Definition: In many applications (like image search, recommendation systems), we need to find items in a large database that are "similar" or "closest" to a given query item. When the database is huge (millions or billions of items) and items are represented by high-dimensional vectors, finding the exact nearest neighbors becomes computationally infeasible. ANNs are algorithms that aim to find neighbors that are "approximately" closest to the query, trading off some accuracy for significant speed-up.
    • How it works: ANNs typically build an index structure (e.g., hash tables, tree-based structures, graph-based structures) over the dataset that allows for quick lookups, even if it doesn't guarantee finding the absolute closest neighbor every time.
  • Hashing:

    • Conceptual Definition: Hashing is a technique that transforms high-dimensional data (e.g., image features represented by floating-point vectors) into compact, low-dimensional binary codes (sequences of 0s and 1s). The goal is to preserve the similarity relationships: if two original high-dimensional items are similar, their hash codes should also be similar (e.g., having a small Hamming distance).
    • Why it's used:
      • Storage Efficiency: Binary codes require significantly less memory than floating-point vectors.
      • Retrieval Speed: Comparing binary codes (e.g., using Hamming distance) is much faster than computing Euclidean or cosine distances between high-dimensional floating-point vectors.
    • Hamming Distance: The Hamming distance between two binary codes of equal length is the number of positions at which the corresponding bits are different. A smaller Hamming distance implies greater similarity between the items.
  • Deep Hashing:

    • Conceptual Definition: Traditional hashing methods often rely on hand-crafted features or data-independent projections. Deep Hashing combines the power of deep neural networks with hashing. A deep learning model (e.g., a Convolutional Neural Network or CNN) is trained to directly learn optimal feature representations and convert them into binary hash codes in an end-to-end manner.
    • Advantages: Deep learning models can learn highly discriminative features from raw data, leading to more effective hash codes and better retrieval performance compared to non-deep methods.
  • Product Quantization (PQ):

    • Conceptual Definition: Product Quantization is a specific and popular technique for vector quantization used in ANNs. Instead of quantizing an entire high-dimensional vector at once, it divides the vector into several independent sub-vectors (or segments). Each sub-vector is then quantized separately using its own small codebook. The final representation of the original vector is a concatenation of the indices (or IDs) of the codewords from each sub-vector's codebook.
    • How it works (simplified):
      1. Divide: An input feature vector $\mathcal{F} \in \mathbb{R}^D$ is split into $M$ equal-length sub-vectors: $\mathcal{F} = [\mathcal{F}_1, \dots, \mathcal{F}_M]$, where each $\mathcal{F}_m \in \mathbb{R}^{D/M}$.
      2. Cluster (Codebook Learning): For each segment $m$, all sub-vectors $\{\mathcal{F}_m\}$ from the entire dataset are clustered into $K$ centroids (codewords) using an algorithm like k-means. These $K$ centroids form the codebook for that segment.
      3. Encode: For a given feature vector, each of its $M$ sub-vectors is assigned to its closest centroid in its respective segment's codebook. The index of this centroid is the PQ code for that sub-vector. The full PQ code for the original feature is a sequence of these $M$ indices: $[k_1, \dots, k_M]$.
      4. Retrieval: During retrieval, distances are computed efficiently using Look-Up Tables (LUTs). The distance between a query vector and a database vector can be approximated by summing up the distances between their corresponding sub-vector codewords. (A toy code sketch of this pipeline is given at the end of this list.)
    • Benefits: Reduces memory usage significantly and speeds up distance calculations by converting a high-dimensional distance computation into $M$ low-dimensional lookups and summations.
    • k-means clustering: An unsupervised machine learning algorithm used to partition $N$ data points into $K$ clusters. It iteratively assigns each data point to the closest cluster centroid and then updates the centroids to be the mean of all data points assigned to that cluster. This process minimizes the within-cluster sum of squares.
  • Codebook: In Product Quantization, a codebook is a collection of representative vectors (called codewords or centroids). For each segment of a feature vector, there is a separate codebook containing $K$ codewords. These codewords are learned (e.g., via k-means) to best represent the sub-vectors in that segment.

  • Softmax Function:

    • Conceptual Definition: The softmax function is typically used in the output layer of a neural network for multi-class classification problems. It takes a vector of real numbers (logits) and converts them into a probability distribution, where each value is between 0 and 1, and all values sum to 1.
    • Mathematical Formula: For an input vector $\mathbf{z} = [z_1, z_2, \dots, z_K]$, the softmax output $\sigma(\mathbf{z})$ is calculated as: $ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} $
    • Symbol Explanation:
      • $z_i$: The $i$-th element of the input vector (logit).
      • $e$: Euler's number (the base of the natural logarithm).
      • $\sum_{j=1}^K e^{z_j}$: The sum of the exponentials of all elements in the input vector, acting as a normalization factor.
      • $\sigma(\mathbf{z})_i$: The $i$-th element of the output probability distribution.
  • Cross-Entropy Loss:

    • Conceptual Definition: Cross-entropy loss (also known as log loss) is a commonly used loss function for classification tasks, especially when the output is a probability distribution (from a softmax layer). It measures the difference between the true probability distribution (one-hot encoded ground truth) and the predicted probability distribution. The goal during training is to minimize this loss.
    • Mathematical Formula (for a single sample with $K$ classes): $ L_{CE} = -\sum_{i=1}^K y_i \log(\hat{y}_i) $. For one-hot encoding where $y_k=1$ for the true class $k$ and $y_j=0$ for $j \neq k$: $ L_{CE} = -\log(\hat{y}_k) $
    • Symbol Explanation:
      • $y_i$: The true probability for class $i$ (1 if $i$ is the ground-truth class, 0 otherwise).
      • $\hat{y}_i$: The predicted probability for class $i$ (output of the softmax function).
      • $\log$: The natural logarithm.
  • Metric Learning Losses (e.g., CosFace):

    • Conceptual Definition: These losses are designed to learn feature embeddings where semantically similar data points are mapped close to each other, and dissimilar points are mapped far apart in the embedding space. This is crucial for tasks like face recognition or image retrieval, where the discriminative power of features is paramount.
    • CosFace (Additive Angular Margin Loss): A specific metric learning loss that enhances feature discriminability by adding an angular margin to the cosine similarity score of the ground-truth class. This pushes features of the same class closer in angular space while pushing features of different classes further apart.
    • Mathematical Formula (simplified for the paper's context, based on [39]): $ L = -\log \frac{e^{s(\cos(\theta_{k^l} + \text{margin}))}}{e^{s(\cos(\theta_{k^l} + \text{margin}))} + \sum_{k \neq k^l} e^{s \cos(\theta_k)}} $
    • Symbol Explanation:
      • $\theta_{k^l}$: The angle between the feature vector and the weight vector corresponding to the ground-truth class $k^l$.
      • $\text{margin}$: A fixed positive parameter (e.g., 0.2) that controls the magnitude of the angular margin. Adding it to $\theta_{k^l}$ (or subtracting it from $\cos(\theta_{k^l})$) makes it harder for the model to correctly classify the target, forcing it to learn more discriminative features.
      • $s$: A scaling factor (e.g., 64) that normalizes the magnitude of the logits, making the angular separation more significant.
      • $\cos(\theta_k)$: The cosine similarity between the feature vector and the weight vector of class $k$.
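
As referenced in the PQ bullet above, the following is a minimal, self-contained sketch of the PQ pipeline (per-segment codebook learning, encoding, and asymmetric look-up-table retrieval). It is purely illustrative and not from the paper; the dimensions, the use of scikit-learn's KMeans, and all variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
D, M, K = 64, 8, 16                                      # feature dim, segments, codewords per segment (4 bits each)
db = rng.normal(size=(1000, D)).astype(np.float32)       # toy database features
query = rng.normal(size=(D,)).astype(np.float32)

d_sub = D // M
codebooks = []                                           # M codebooks, each of shape (K, D/M)
codes = np.empty((len(db), M), dtype=np.int64)           # PQ codes of the database

# 1) Learn one codebook per segment (k-means) and encode the database sub-vectors.
for m in range(M):
    sub = db[:, m * d_sub:(m + 1) * d_sub]
    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_.astype(np.float32))
    codes[:, m] = km.labels_

# 2) Asymmetric retrieval (ADC): one squared-distance look-up table per segment.
luts = np.stack([
    ((codebooks[m] - query[m * d_sub:(m + 1) * d_sub]) ** 2).sum(axis=1)
    for m in range(M)
])                                                       # shape (M, K)
dists = luts[np.arange(M), codes].sum(axis=1)            # approximate squared L2 distances
print("top-5 neighbours:", np.argsort(dists)[:5])
```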

3.2. Previous Works

The paper categorizes previous hashing methods in several ways and discusses their limitations, especially for large-scale datasets.

  • Hash Distance Calculation Methods:

    • Hamming distance-based methods: Aim to learn binary codes such that similarity is measured by Hamming distance. Examples include LSH [16], GreedyHash [37], OrthoHash [18], CSQ [46], HHF [42], HashNet [7], DPN [12], DFH [25], JMLH [36], DCDH [48].
      • LSH (Locality Sensitive Hashing) [16]: A data-independent random hashing algorithm. It projects high-dimensional data onto lower dimensions using random hyperplanes or other methods, then buckets similar items into the same hash bins.
    • Dictionary-based methods (Quantization methods): Calculate distance by looking up a codebook. These methods typically involve clustering or learning codewords. Examples include PQ [20], AQ [3], DPQ [23], DTQ [27], PQN [45], ADSVQ [51], OPQN [49].
  • Non-deep vs. Deep Methods:

    • Non-deep methods:

      • LSH [16]: As mentioned above, a foundational random projection method.
      • PQ (Product Quantization) [20]: Divides vectors into sub-vectors and quantizes each independently.
      • OPQ (Optimized Product Quantization) [14]: Improves PQ by applying an orthogonal transformation to the original data before quantization to reduce quantization error.
      • LOPQ (Locally Optimized Product Quantization) [21]: An extension of OPQ that considers local relationships of vectors to further improve quantization.
      • CQ (Composite Quantization) [50], AQ (Additive Quantization) [3]: Other quantization methods that represent vectors as sums of codewords from multiple codebooks.
      • Characteristics: Generally higher efficiency but lower performance compared to deep methods.
    • Deep methods: Leverage neural networks to learn hash codes and feature representations.

      • Gradient Issue: The quantization process (converting real-valued features to discrete binary or integer codes) is non-differentiable, making end-to-end training challenging. Researchers use various techniques:
        • Sigmoid or tanh functions to approximate the sign function [7].
        • Straight-Through (ST) Estimator [5]: A technique to allow gradients to flow through non-differentiable operations by treating the non-differentiable function as an identity during the backward pass. Used in DPQ [23] and GreedyHash [37].
      • Examples: HashNet [7], DPQ [23], GreedyHash [37], OrthoHash [18], DQN [47], PQN [45], DTQ [27], OPQN [49], ADSVQ [51], DDH [48].
  • Limitations of Existing Methods on Large-Scale Datasets:

    • DPQ [23]: Uses a Gini Impurity-related penalty. This constraint is difficult to apply to large-scale data and can cause network collapse, mapping many data points to the same code.
    • PQN [45], DTQ [27]: Computationally expensive due to constructing triples (combinations of anchor, positive, and negative samples), even with optimizations like dividing datasets into groups.
    • DQN [47]: Requires frequent clustering of the entire training data to update hash code centers, which is inefficient for large datasets.
    • ADSVQ [51], DCDH [48]: Need to calculate a square matrix of the number of categories during training, limiting the scalability to hundreds of thousands or millions of classes.
    • OPQN [49]: Number of bits is limited to the feature dimension, leading to significant training overhead when longer bit numbers are needed for performance.
    • Predefined Hash Codes: Works like DPN [12], CSQ [46], OrthoHash [18] use predefined hash codes (e.g., Hadamard matrix [41] or Bernoulli sampling) as labels. Hadamard matrix becomes impractical for many categories due to its large dimensions. Bernoulli sampling might struggle to achieve near-uniform distribution of compressed hash codes as supervised targets, making convergence difficult.

3.3. Technological Evolution

Hashing technology has evolved from simple data-independent methods (like LSH in the late 90s) to data-dependent learning-based methods, and then to deep hashing in the past decade.

  1. Early Hashing (e.g., LSH): Focused on randomized projections, simple, but often suboptimal performance.

  2. Learning-to-Hash (Non-deep): Methods that learn hash functions from data, often using spectral methods or boosting. PQ and its variants (OPQ, LOPQ) emerged as powerful quantization-based approaches.

  3. Deep Hashing: Integrated deep learning to learn both features and hash codes end-to-end, showing significant performance gains by extracting more discriminative features. This era saw methods addressing the non-differentiability of hashing.

  4. Current Challenge: Large-Scale Data: The current frontier, which this paper addresses, is scaling deep hashing methods to real-world large-scale datasets that have millions of samples and hundreds of thousands of classes. Many deep hashing methods, while effective on smaller benchmarks, struggle with the computational burden and performance degradation when faced with this scale.

    This paper's work fits within the technological timeline by pushing the boundaries of deep hashing to effectively operate in these large-scale scenarios, specifically by enhancing Product Quantization through a deep learning framework.

3.4. Differentiation Analysis

Compared to the main methods in related work, FPPQ differentiates itself by:

  • Scalability: Unlike ADSVQ, DCDH, DPQ, PQN, DTQ, and DQN which suffer from large-scale matrix operations, high computational costs, or frequent clustering of entire datasets, FPPQ is designed to be efficient for large-scale data. It "does not involve large-scale matrix operations, converges quickly, and overcomes the limitations of previous methods for large-scale datasets."
  • PQ Performance at Short Bits: FPPQ directly tackles the core weakness of traditional PQ: the rapid decline in retrieval performance at short encoding lengths. It aims for a "low-slope decay" of performance by learning features better suited for PQ. While traditional PQ is strong at long bits, FPPQ enhances it significantly at short bits.
  • Class-Level Supervision for PQ: Instead of relying on generic hash code supervision (like Hadamard matrix or Bernoulli sampling used by DPN, CSQ, OrthoHash), FPPQ generates class-level PQ labels from pre-trained features. This implicitly captures inter-class relationships in a way that is tailored to the PQ structure.
  • Differentiable PQ Branch: It integrates a softmax-based differentiable PQ branch directly into the deep learning framework, enabling end-to-end learning of a codebook suitable for PQ without requiring post-training k-means for the codebook.
  • Flexibility in Loss: The use of CosFace [39] (a metric learning loss) for PQ sub-branches enhances discriminability, which is often superior to simple softmax cross-entropy for retrieval tasks.
  • Robustness to Preprocessing: The method demonstrates insensitivity to feature preprocessing (normalization, segment normalization) during retrieval, simplifying deployment.

4. Methodology

4.1. Principles

The original Product Quantization (PQ) method aims to minimize quantization error through clustering. However, when the compression strength is increased (i.e., the encoding bits are reduced), the number of segments $M$ decreases and the sub-vector dimension $D/M$ increases. This leads to two main problems that degrade PQ performance:

  1. Increased Impact of Misallocation: As the sub-vector dimension grows, the quantization error within a segment increases. Also, with fewer segments $M$, each segment becomes more critical, so a misallocation (assigning a sub-vector to a wrong cluster center) has a larger impact on the overall accuracy.

  2. Uniform Sub-vector Distribution: With increased sub-vector dimension and fewer segments, sub-vectors start to resemble complete features more closely, often exhibiting stronger discriminability and a more uniform distribution. When $N$ classes (where $N$ is the number of categories in the dataset) need to be mapped to $K$ clustering centers ($K = 2^b$, where $b$ is the number of bits per segment), and $N \gg K$, it becomes highly probable for features of different classes to be assigned to the same PQ code, or for samples of the same class to be assigned to different PQ codes. This negatively impacts retrieval accuracy.

To mitigate these issues, the core idea behind FPPQ (Full Potential Product Quantization) is to train a backbone neural network to generate feature representations that are inherently better suited for short PQ encoding. Specifically, the method aims to achieve two goals:

  1. Class-level PQ Code Assignment: Features belonging to the same class should ideally be assigned to the same PQ code.

  2. Sub-vector Clustering within Sub-spaces: Within each sub-space (i.e., for each segment), the sub-vectors that belong to the same desired PQ code should cluster together tightly.

These goals are achieved by introducing a differentiable PQ branch that is supervised by predefined class-level PQ labels, guiding the feature learning process to produce more PQ-friendly embeddings.

4.2. Core Methodology In-depth (Layer by Layer)

The FPPQ framework operates in an end-to-end manner, as illustrated in Figure 1.

The overall flow of the framework consists of two parallel branches during the training phase:

  1. A one-hot classification branch for the entire feature vector.
  2. A PQ classification branch that constrains the sub-vectors of the feature.

Figure 1: The overall flow of our framework. It consists of two branches during the training phase: a one-hot classification branch for the entire feature and a PQ classification branch that constrains sub-vectors of the feature. We use the predefined class-level PQ labels to guide the learning of the PQ branch and improve feature separation in sub-spaces. During the retrieval phase, the only difference with traditional PQ is that our framework does not require clustering for the codebook. Instead, we directly use the learned codebook obtained from the PQ branch, which comprises the fully connected weights of the multiple sub-branches in the PQ branch.

Training Phase: Given an input image, a backbone neural network extracts a deep feature representation $\mathcal{F} \in \mathbb{R}^D$. This feature vector is then fed into two parallel branches:

4.2.1. One-Hot Classification Branch

This is a standard classification branch that takes the entire feature $\mathcal{F}$ and uses a fully connected (FC) layer followed by a softmax activation to predict the class label. It is optimized using a standard classification loss (e.g., softmax cross-entropy loss), denoted as $\mathcal{L}_{cls}$. This branch ensures that the backbone learns generally discriminative features for overall classification.

4.2.2. PQ Classification Branch

This is the novel component designed to optimize features specifically for Product Quantization.

  • Feature Segmentation: The feature vector $\mathcal{F}$ is first divided into $M$ equal-length segments (sub-vectors), consistent with the PQ encoding process: $ \mathcal{F} = [ \mathcal{F}_1, \cdots, \mathcal{F}_m, \cdots, \mathcal{F}_M ] $ where each sub-vector $\mathcal{F}_m \in \mathbb{R}^{D/M}$.

  • Multiple Sub-Branches: For each of the $M$ segments, there is a separate PQ sub-branch. Each PQ sub-branch is essentially a fully connected (FC) layer that maps the corresponding sub-vector $\mathcal{F}_m$ to $K$ output units. Here, $K$ represents the number of clusters (or codewords) in each segment's codebook (typically $K=2^b$, where $b$ is the number of bits per segment, e.g., $2^8 = 256$ for 8 bits).

  • Softmax-based Loss: Each PQ sub-branch applies a softmax function to its $K$ outputs and is trained to maximize the posterior probability of the ground-truth PQ code for that segment. The loss for the PQ branch is defined as: $ L_{pq} = - \frac{1}{B M} \sum_{b=1}^{B} \sum_{m=1}^{M} \log \frac{e^{z_{mk^l}}}{\sum_{k=1}^{K} e^{z_{mk}}} $

    • Symbol Explanation:
      • $B$: The batch size, representing the number of samples processed in one training iteration.
      • $M$: The number of PQ code segments (or sub-vectors) into which the main feature $\mathcal{F}$ is divided.
      • $K$: The number of clusters (codewords) in each segment's codebook, which is typically $2^b$ for $b$ bits per segment.
      • $z_{mk}$: The logit (raw output before softmax) of the $k$-th unit in the $m$-th PQ sub-branch.
      • $k^l$: The ground-truth code value (index) for the $m$-th segment, derived from the predefined class-level PQ label.
  • Optimizing Sub-branches with Angle (using Metric Learning Principles): To enhance the discriminability of each sub-branch's output, FPPQ leverages metric learning principles. The output $z_{mk}$ from an FC layer can be seen as a dot product between the input sub-vector $\mathcal{F}_m$ and the weight vector $w_{mk}$ of the $k$-th unit in the $m$-th sub-branch ($z_{mk} = \mathcal{F}_m^T w_{mk}$).

    • Euclidean Distance Interpretation: The Euclidean distance between the sub-vector $\mathcal{F}_m$ and a classification weight $w_{mk}$ is: $ D_{Euclidean}\langle \mathcal{F}_m, w_{mk} \rangle = \sqrt{ \|\mathcal{F}_m\|^2 + \|w_{mk}\|^2 - 2 \|\mathcal{F}_m\| \|w_{mk}\| \cos \theta_{mk} } $

      • Symbol Explanation:
        • $\mathcal{F}_m$: The $m$-th sub-vector of the input feature.
        • $w_{mk}$: The weight vector of the $k$-th unit in the $m$-th PQ sub-branch. This vector effectively acts as a codeword for the $k$-th cluster in the $m$-th segment.
        • $\|\mathcal{F}_m\|^2$: The squared Euclidean norm (magnitude) of the sub-vector $\mathcal{F}_m$.
        • $\|w_{mk}\|^2$: The squared Euclidean norm of the weight vector $w_{mk}$.
        • $\cos \theta_{mk}$: The cosine of the angle $\theta_{mk}$ between $\mathcal{F}_m$ and $w_{mk}$.
    • Normalization for Angular Optimization: By normalizing the weights $\{w_{mk}\}$ such that $\|w_{mk}\| = 1$, the equation above simplifies to: $ D_{Euclidean}\langle \mathcal{F}_m, w_{mk} \rangle = \sqrt{ \|\mathcal{F}_m\|^2 + 1 - 2 \|\mathcal{F}_m\| \cos \theta_{mk} } $ This shows that the ranking of Euclidean distances (and thus the choice of the closest codeword) for a given $\mathcal{F}_m$ depends solely on the angles $\{\theta_{mk}\}$, because $\|\mathcal{F}_m\|$ is constant for a given sub-vector. To further stabilize training and accelerate convergence, the authors:

      1. Omit the bias term in the FC layer.
      2. Normalize $\mathcal{F}_m$ such that $\|\mathcal{F}_m\| = 1$. This results in the output $z_{mk}$ becoming: $ z_{mk} = \|\mathcal{F}_m\| \|w_{mk}\| \cos \theta_{mk} = \cos \theta_{mk} $
      • Symbol Explanation:
        • $\|\mathcal{F}_m\|$: The Euclidean norm (magnitude) of the sub-vector $\mathcal{F}_m$.
        • $\|w_{mk}\|$: The Euclidean norm of the weight vector $w_{mk}$.
        • $\cos \theta_{mk}$: The cosine similarity between the normalized sub-vector $\mathcal{F}_m$ and the normalized weight vector $w_{mk}$.
    • Incorporating Metric Learning Loss (CosFace): A simple softmax cross-entropy loss might not provide sufficient discriminability for retrieval tasks. Therefore, FPPQ integrates CosFace [39] (an additive angular margin loss) into the PQ branch loss to enhance the separation between classes in the angular space. The modified $L_{pq}$ becomes: $ L_{pq} = - \frac{1}{B M} \sum_{b=1}^{B} \sum_{m=1}^{M} \log \frac{e^{s(\cos(\theta_{mk^l} + \text{margin}))}}{e^{s(\cos(\theta_{mk^l} + \text{margin}))} + \sum_{k=1, k \neq k^l}^{K} e^{s \cos(\theta_{mk})}} $

      • Symbol Explanation:
        • $\text{margin}$: A fixed parameter (e.g., 0.2) to control the magnitude of the cosine margin, making the decision boundary stricter for the ground-truth class.
        • $s$: A scaling factor (e.g., 64) applied to the logits to emphasize angular differences.
        • $\theta_{mk^l}$: The angle between the $m$-th sub-vector and the weight vector for its ground-truth PQ code $k^l$.
        • $\theta_{mk}$: The angle between the $m$-th sub-vector and the weight vector for any other PQ code $k$.
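
To make the PQ branch concrete, here is a rough PyTorch sketch reconstructed from the description above. It is not the authors' released code (see their repository for the official implementation); the class name PQBranch and all shapes are assumptions, and the margin is applied here in the additive-cosine (CosFace) form $\cos\theta - m$, so the exact margin placement should be checked against the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PQBranch(nn.Module):
    """M sub-branches; each holds K codewords of dimension D/M as bias-free FC weights."""
    def __init__(self, feat_dim=512, M=4, K=256, s=64.0, margin=0.2):
        super().__init__()
        self.M, self.K, self.s, self.margin = M, K, s, margin
        self.d_sub = feat_dim // M
        # One weight matrix per segment, shape (M, K, D/M); its rows act as codewords.
        self.weight = nn.Parameter(torch.randn(M, K, self.d_sub) * 0.01)

    def forward(self, feat, pq_label):
        # feat: (B, D) backbone features; pq_label: (B, M) ground-truth codeword index per segment.
        B = feat.size(0)
        sub = F.normalize(feat.view(B, self.M, self.d_sub), dim=-1)   # ||F_m|| = 1
        w = F.normalize(self.weight, dim=-1)                          # ||w_mk|| = 1
        cos = torch.einsum('bmd,mkd->bmk', sub, w)                    # cos(theta_mk), shape (B, M, K)
        # Additive cosine margin on the ground-truth codeword, then scaling by s.
        onehot = F.one_hot(pq_label, self.K).to(cos.dtype)
        logits = self.s * (cos - self.margin * onehot)
        return F.cross_entropy(logits.reshape(B * self.M, self.K),
                               pq_label.reshape(B * self.M))

# The total objective then combines both branches, e.g.:
# loss = F.cross_entropy(cls_logits, class_label) + pq_branch(features, class_pq_labels[class_label])
```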

4.2.3. Final Optimization Objective

The overall training objective combines the classification loss for the full feature and the PQ branch loss: $ \mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{pq} $

  • Symbol Explanation:
    • $\mathcal{L}$: The total loss function to be minimized during training.

    • $\mathcal{L}_{cls}$: The classification loss (e.g., softmax cross-entropy) for the main classification branch, which helps learn a generally discriminative backbone.

    • $\mathcal{L}_{pq}$: The PQ branch loss (incorporating CosFace), which specifically optimizes the sub-vectors for Product Quantization.

      This combined loss enables end-to-end optimization, where the backbone learns features that are both generally discriminative and specifically well-suited for PQ encoding.

4.2.4. Predefined Class-Level PQ Label

A crucial part of FPPQ is the generation of class-level PQ labels to supervise the PQ branch. Unlike prior works that use generic hash codes (e.g., Hadamard matrix or Bernoulli sampling), FPPQ creates PQ codes that implicitly capture inter-class relationships derived from the data itself.

  • Process:
    1. Obtain Class Average Features: A pre-trained model (whose classifier is assumed to be discarded, a common scenario in retrieval) is used to extract features. For each class, the average feature $\mathcal{F}_{avg}$ is computed by averaging the features of all samples belonging to that class.
    2. PQ Encoding on Average Features: Traditional PQ encoding is then performed on these class average features. This involves:
      • Dividing each average feature into $M$ segments.
      • Performing k-means clustering in each of the $M$ sub-spaces across all class average features.
      • Assigning each sub-vector of each class average feature to its closest centroid to generate a PQ code for that segment.
    3. Construct Class-Level PQ Labels: The resulting $M$-tuple of segment indices forms the class-level PQ label for each class: $ PQ_{label} = PQ(\{\mathcal{F}_{avg}\}) = \{ [k_1, \cdots, k_m, \cdots, k_M]_{cls} ;\ cls \in [1, class\_num] \} $
      • Symbol Explanation:
        • $PQ_{label}$: The set of Product Quantization labels for all classes.
        • $\mathcal{F}_{avg}$: The average feature vector for a given class.
        • $[k_1, \cdots, k_m, \cdots, k_M]_{cls}$: The $M$-tuple PQ code assigned to a specific class $cls$, where $k_m$ is the index of the codeword for the $m$-th segment.
        • $class\_num$: The total number of classes in the dataset.
    4. Handle Duplication: In scenarios with many data categories and short encoding bits, PQ code duplication can occur (i.e., different classes might end up with the same PQ code). To ensure unique and meaningful relationships, FPPQ removes duplicates by assigning them to the position with the lowest quantization error, provided that position is not already occupied by another PQ code.
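
A minimal sketch of this label-generation step is given below. It assumes pre-extracted features from a pre-trained model and uses scikit-learn's KMeans; the function name class_level_pq_labels is an assumption, and the duplicate-removal step described above is only hinted at in a comment.

```python
import numpy as np
from sklearn.cluster import KMeans

def class_level_pq_labels(feats, labels, M=4, K=256, seed=0):
    """feats: (N, D) features from a pre-trained model; labels: (N,) class ids.
    Returns an array of shape (num_classes, M): one PQ code per class.
    Assumes num_classes >= K so that k-means on the class averages is well defined."""
    classes = np.unique(labels)
    # 1) Class average features.
    avg = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    D = avg.shape[1]
    d_sub = D // M
    pq_labels = np.empty((len(classes), M), dtype=np.int64)
    # 2) k-means per sub-space over the class average features, then assign each class.
    for m in range(M):
        sub = avg[:, m * d_sub:(m + 1) * d_sub]
        km = KMeans(n_clusters=K, n_init=4, random_state=seed).fit(sub)
        pq_labels[:, m] = km.labels_
    # 3) The paper additionally reassigns duplicated class codes to the free code with
    #    the lowest quantization error; that step is omitted here for brevity.
    return pq_labels
```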

4.2.5. Encode and Retrieval

After the training phase is complete, the FPPQ framework is used for encoding and retrieval:

  1. Feature Extraction: The trained backbone is used to extract feature representations for all input images in the gallery set (database) and any query images.
  2. Codebook from PQ Branch Weights: Crucially, the weights of the $M$ FC layers in the PQ branch are directly used as the codebook for the PQ algorithm. These weights, denoted as $\widetilde{w}_{mk} \in \mathbb{R}^{D/M}$, are normalized: $\widetilde{w}_{mk} = w_{mk} / \|w_{mk}\|$. The complete codebook $\mathcal{C} \in \mathbb{R}^{M \times K \times (D/M)}$ is formed where $\mathcal{C}_{mk} = \widetilde{w}_{mk}$. At this point, the PQ algorithm no longer requires its own clustering step; it uses the learned codebook directly.
  3. Encoding Gallery Samples: Each sample $\mathcal{F}_{\mathcal{G}}$ in the gallery set is encoded into an $M$-tuple PQ code: $ \overline{PQ(\mathcal{F}_{\mathcal{G}})} = \{ [k_1, \cdots, k_m, \cdots, k_M]_i ;\ i \in [1, num\_images] \} $ where $k_m$ is the index of the closest codeword $\mathcal{C}_{mk}$ to the $m$-th sub-vector of $\mathcal{F}_{\mathcal{G}}$.

Product Quantization supports two main retrieval methods:

  • Symmetric Retrieval (SDC - Symmetric Distance Computation):

    • Conceptual Definition: In symmetric retrieval, both the query vector and the gallery vectors are first quantized into their PQ codes. The distance between a query and a gallery item is then computed as the sum of distances between their corresponding quantized representations (i.e., their codewords).
    • Mathematical Formula: The distance between a query (whose $m$-th segment is represented by codeword $\mathcal{C}_{mk^*}$) and a gallery sample (whose $m$-th segment is represented by codeword $\mathcal{C}_{mk^i}$) is: $ \mathcal{D}_{sym} = \sum_{m=1}^{M} \langle \mathcal{C}_{mk^*}, \mathcal{C}_{mk^i} \rangle $
      • Symbol Explanation:
        • $\mathcal{D}_{sym}$: The symmetric distance between the query and gallery sample.
        • $M$: Number of segments.
        • $\langle \cdot, \cdot \rangle$: Denotes a distance metric (e.g., Euclidean distance or cosine distance) between two vectors.
        • $k^*$: The index of the codeword in the $m$-th segment's codebook that is closest to the query's $m$-th sub-vector $\mathcal{F}_m^{query}$.
        • $\mathcal{C}_{mk^*}$: The actual codeword (quantized representation) for the $m$-th segment of the query sample.
        • $\mathcal{C}_{mk^i}$: The actual codeword (quantized representation) for the $m$-th segment of the $i$-th gallery sample.
    • Codeword Selection: For the query, $k^*$ is determined by finding the codeword $\mathcal{C}_{mk}$ that minimizes the distance to the query's $m$-th sub-vector $\mathcal{F}_m$: $ k^* = \underset{k \in [0, K-1]}{\arg\min}\ \langle \mathcal{F}_m, \mathcal{C}_{mk} \rangle $
      • Symbol Explanation:
        • $K$: Number of codewords per segment.
        • $\mathcal{F}_m$: The query's $m$-th sub-vector.
  • Asymmetric Retrieval (ADC - Asymmetric Distance Computation):

    • Conceptual Definition: In asymmetric retrieval, the query vector is used in its original, unquantized (floating-point) form, while the gallery vectors are represented by their PQ codes (quantized form). The distance is computed between the original query sub-vectors and the codeword representations of the gallery sub-vectors. This typically offers higher accuracy than symmetric retrieval because it avoids quantizing the query.
    • Mathematical Formula: The distance between a query ($\mathcal{F}_m^{query}$) and a gallery sample (whose $m$-th segment is represented by codeword $\mathcal{C}_{mk^i}$) is: $ \mathcal{D}_{asym} = \sum_{m=1}^{M} \langle \mathcal{F}_m^{query}, \mathcal{C}_{mk^i} \rangle $
      • Symbol Explanation:
        • $\mathcal{D}_{asym}$: The asymmetric distance between the query and gallery sample.
        • $\mathcal{F}_m^{query}$: The original, unquantized $m$-th sub-vector of the query sample.
        • $\mathcal{C}_{mk^i}$: The codeword for the $m$-th segment of the $i$-th gallery sample.
  • Feature Preprocessing for Retrieval: The paper notes that during training, segment normalization (setting $\|\mathcal{F}_m\|=1$) is performed. However, for retrieval, FPPQ found that applying segment normalization or full feature normalization to the query feature had little impact on the ultimate retrieval performance due to the learned compactness of the encoding. Thus, FPPQ "does not need to perform any additional processing on the features compared with PQ algorithm" during retrieval.
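
The encoding and retrieval stage can be sketched as follows, using the normalized PQ-branch weights as the codebook. This is an illustrative reconstruction, not the authors' code; the function names pq_encode, sdc_distance, and adc_distance, and the use of squared L2 distances, are assumptions.

```python
import torch
import torch.nn.functional as F

# The learned codebook is simply the normalized PQ-branch weights, shape (M, K, D/M), e.g.:
# codebook = F.normalize(pq_branch.weight, dim=-1).detach()

def pq_encode(feats, codebook):
    """feats: (N, D) -> PQ codes (N, M), picking the codeword with highest cosine similarity."""
    M, K, d_sub = codebook.shape
    sub = F.normalize(feats.view(len(feats), M, d_sub), dim=-1)
    cos = torch.einsum('nmd,mkd->nmk', sub, codebook)
    return cos.argmax(dim=-1)

def sdc_distance(query_code, gallery_codes, codebook):
    """Symmetric distance (SDC): both sides quantized; read from per-segment K x K tables."""
    M = codebook.shape[0]
    table = torch.cdist(codebook, codebook)                      # (M, K, K) codeword-to-codeword distances
    per_seg = table[torch.arange(M), query_code]                 # (M, K): rows picked by the query's codewords
    return per_seg[torch.arange(M), gallery_codes].sum(dim=1)    # (N,)

def adc_distance(query_feat, gallery_codes, codebook):
    """Asymmetric distance (ADC): raw query sub-vectors vs. quantized gallery."""
    M, K, d_sub = codebook.shape
    q_sub = query_feat.view(M, 1, d_sub)
    lut = torch.cdist(q_sub, codebook).squeeze(1)                # (M, K) query-to-codeword look-up table
    return lut[torch.arange(M), gallery_codes].sum(dim=1)        # (N,)
```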

5. Experimental Setup

5.1. Datasets

The authors validate their method on a variety of datasets, emphasizing large-scale scenarios.

  • Glint360K [2]:

    • Source & Scale: One of the largest publicly available face datasets. It contains over 360,000 unique identities (IDs) and approximately 17 million images.
    • Characteristics & Domain: A large-scale facial recognition dataset.
    • Usage: All available data from Glint360K is used to train the hashing methods. Performance is then evaluated on unseen retrieval tasks using MegaFace and FaceScrub datasets, ensuring no overlap.
  • MegaFace [22] and FaceScrub [33]:

    • Source & Scale:
      • MegaFace: Contains one million images of over 690,000 different individuals.
      • FaceScrub: Comprises over 100,000 photos of 530 celebrities.
    • Characteristics & Domain: Both are standard datasets for evaluating face recognition performance.
    • Usage:
      • MegaFace is used as the gallery set (database) for evaluation.
      • FaceScrub is used to construct the query set. Following standard practice [22], a subset of FaceScrub with 80 individuals (40 females, 40 males), each having more than 50 images, is used. After noise removal and de-duplication (following insightface project recommendations), the final query set consists of 3529 face images from these 80 individuals.
      • Evaluation Protocol: For each query individual, one image is taken as a query, and the remaining images of that individual are incorporated into the gallery set to test retrieval.
  • ImageNet1K [35]:

    • Source & Scale: A widely used large-scale visual recognition challenge dataset. It contains 1000 classes and over 1.2 million images.
    • Characteristics & Domain: General object recognition dataset.
    • Usage:
      • The training set of ImageNet1K is used to train the hashing methods.
      • The validation set (50,000 images) is used for evaluation.
      • Query Set: Top 5 images from each of the 1000 classes in the validation set, totaling 5000 images.
      • Gallery Set: The remaining 45,000 images from the validation set.
  • ImageNet100 [7, 35]:

    • Source & Scale: A subset of ImageNet1K, consisting of 100 classes.
    • Characteristics & Domain: Smaller-scale general object recognition.
    • Usage: Popularly used in previous deep hashing methods for benchmarking.

5.2. Evaluation Metrics

The retrieval performance is assessed using standard metrics relevant to image retrieval tasks.

  • Top-K Accuracy (used for Glint360K, MegaFace, FaceScrub):

    • Conceptual Definition: Measures how often the correct item (i.e., an image of the same individual as the query) is found among the top KK retrieved results. If the ground-truth item is present in the KK closest items to the query, it's considered a hit.
    • Mathematical Formula: $ \text{Top-K Accuracy} = \frac{\text{Number of queries where ground-truth is in top K results}}{\text{Total number of queries}} $
    • Symbol Explanation:
      • Number of queries where ground-truth is in top K results: The count of query samples for which the true matching item (or an item of the same class/identity) is found within the top KK positions of the ranked retrieval list.
      • Total number of queries: The total count of query samples evaluated.
    • Specific Metrics: Top-1, Top-5, and Top-20 accuracies are reported, meaning the ground-truth is expected in the 1st, 5th, or 20th position, respectively.
  • Mean Average Precision (mAP) @R1000 (used for ImageNet1K and ImageNet100):

    • Conceptual Definition: mAP is a commonly used metric for evaluating information retrieval systems. It's the mean of the average precision (AP) scores for each query. AP for a single query averages the precision values at each relevant item in the ranked list. The @R1000 indicates that only the top 1000 retrieved results are considered for calculating AP. A higher mAP indicates better retrieval performance across all queries.
    • Mathematical Formula: $ \text{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP}_q $ where $ \text{AP}_q = \frac{\sum_{k=1}^{n} P(k) \cdot \text{rel}(k)}{\text{Number of relevant items for query } q} $
    • Symbol Explanation:
      • $Q$: The total number of queries.
      • $\text{AP}_q$: The Average Precision for query $q$.
      • $n$: The total number of retrieved items considered (here, up to 1000).
      • $P(k)$: The precision at cutoff $k$ (i.e., the proportion of relevant items among the top $k$ retrieved items): $P(k) = \frac{\text{Number of relevant items in top } k}{k}$.
      • $\text{rel}(k)$: An indicator function, equal to 1 if the item at rank $k$ is relevant, and 0 otherwise.
      • Number of relevant items for query $q$: The total count of relevant items for query $q$ in the entire gallery.
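
A short sketch of how mAP@R can be computed from ranked retrieval results is shown below. It is illustrative only; the function name map_at_r is an assumption, and the AP normalization used here (number of relevant items found within the top R) is one common convention that may differ slightly from a given benchmark's implementation.

```python
import numpy as np

def map_at_r(ranked_labels, query_labels, R=1000):
    """ranked_labels: (Q, N) gallery labels sorted by ascending distance for each query;
    query_labels: (Q,). AP is computed over the top-R results of each query."""
    aps = []
    for ranked, q in zip(ranked_labels, query_labels):
        rel = (np.asarray(ranked[:R]) == q).astype(np.float64)       # rel(k) indicator
        if rel.sum() == 0:
            aps.append(0.0)
            continue
        p_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)          # precision P(k)
        aps.append(float((p_at_k * rel).sum() / rel.sum()))          # AP for this query
    return float(np.mean(aps))
```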

5.3. Baselines

The paper compares FPPQ against a wide range of competitive hashing methods, encompassing both traditional and deep learning-based approaches:

  • Traditional PQ [20]: The foundational Product Quantization method, serving as a direct comparison for FPPQ's improvements.

  • HHF [42]: Hashing-guided Hinge Function for Deep Hashing Retrieval.

  • CSQ [46]: Central Similarity Quantization.

  • OrthoHash [18]: One loss for all: Deep hashing with a single cosine similarity based learning objective.

  • GreedyHash [37]: Towards fast optimization for accurate hash coding in CNN.

  • DPN [12]: Deep Polarized Network for Supervised Learning of Accurate Binary Hashing Codes.

  • DFH [25]: Deep Fisher Hashing.

  • DCDH [48]: Deep Center-based Dual-constrained Hashing for discriminative face image retrieval.

  • JMLH [36]: Embarrassingly Simple Binary Representation Learning.

  • DPQ [23]: End-to-end Supervised Product Quantization for Image Search and Retrieval.

  • HBS-RL [43]: Hash Bit Selection with Reinforcement Learning for Image Retrieval.

  • DWDM [10]: Deep Hashing with Discrete Wasserstein Distributional Matching.

    The authors note that some methods like DCDH and DWDM were abandoned due to exceeding computational resource limitations with large-scale matrix multiplication on Glint360K, while JMLH and DPQ did not achieve effective performance in their tests. This highlights the scalability challenges. For methods using predefined hash centers (CSQ, OrthoHash), Bernoulli sampling was used to generate non-redundant hash centers due to the inefficiency of Hadamard matrix for hundreds of thousands of classes.

5.4. Implementation Details

  • Framework: All experiments were conducted using PyTorch.
  • Hardware: 8 NVIDIA 2080Ti or 3090Ti GPUs for training.
  • Class-level PQ Labels: A pre-trained model was used to get features for acquiring class-level PQ labels. This is considered fair as other comparative methods also use pre-trained models for initialization.
  • Hyperparameters (General):
    • $s$ (scaling factor for angular loss): Set to 64.
    • margin (angular margin for CosFace): Set to 0.2.
    • Bits per segment: Fixed at 8 bits, a common setting in practical applications. This means the number of categories $K$ per sub-branch was $2^8 = 256$.
    • Number of segments ($M$): Varied depending on the target total bits (see the short sketch at the end of this subsection).
      • 16 bits: $M=2$ segments.
      • 32 bits: $M=4$ segments.
      • 64 bits: $M=8$ segments.
      • 128 bits: $M=16$ segments.
  • Baseline Implementations: Methods were implemented based on the cisip-FIRe project (a common benchmark framework for deep hashing) and public projects provided by specific authors. Default settings from cisip-FIRe or paper-specified hyperparameters were used.
  • Training Strategy (General):
    • All methods were trained with the same backbone, epochs, and batch sizes.
    • Most methods used fine-tuning based on a pre-trained model.
    • Hash layer initial learning rate: 0.0001, decreased to 0.00001 after 80% of total epochs.
    • Backbone learning rate: 0.1 times that of the hash layer.
  • Specifics for Glint360K:
    • Backbone: iResnet [11] (variants iResnet18, iResnet50, iResnet100).
    • Feature dimension: 512.
    • Optimizer: SGD with weight decay of 0.0001.
    • Batch size: $128 \times 8$ (likely 128 samples per GPU across 8 GPUs).
    • Learning rate: Initialized at 0.1 and linearly decreased to 0 based on the number of training steps.
  • Specifics for ImageNet100/ImageNet1K:
    • Backbone: ResNet50 [17].
    • Initialization: Parameters pre-trained on ImageNet1K.
    • Feature dimension: 2048.
    • ImageNet100 Special Case: To prevent overfitting due to limited data, the backbone was fixed, and an additional linear layer (2048 to 128 dimensions) was added at the end for learning. Fine-tuned for 100 epochs with a learning rate of 0.0001.
    • ImageNet1K Special Case: Addressed the challenge of obtaining good class-level PQ labels due to low feature discriminability and high duplication rates for PQ codes of average features. Adopted OrthoHash's approach of generating hash codes using repeated Bernoulli sampling to ensure sufficient Hamming distance, then segmented and converted to decimal codes for PQ labels. Training strategy from cisip-FIRe project, fine-tuning for 90 epochs.
  • Retrieval: Asymmetric retrieval was used, and FPPQ performed no feature preprocessing during retrieval.
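
As referenced in the hyperparameter list above, a small sketch of the relationship between total code length, the number of segments $M$, and the number of codewords $K$ per segment, using the values reported in this section:

```python
BITS_PER_SEGMENT = 8                 # fixed to 8 bits per segment in the paper's experiments
K = 2 ** BITS_PER_SEGMENT            # 256 codewords per sub-branch

def num_segments(total_bits, bits_per_segment=BITS_PER_SEGMENT):
    """Total code length = M * bits_per_segment, e.g. 32 bits -> M = 4 segments."""
    assert total_bits % bits_per_segment == 0
    return total_bits // bits_per_segment

print({bits: num_segments(bits) for bits in (16, 32, 64, 128)})   # {16: 2, 32: 4, 64: 8, 128: 16}
```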

6. Results & Analysis

6.1. Core Results Analysis

The experiments demonstrate FPPQ's significant superiority, particularly on large-scale datasets and at shorter encoding lengths, where traditional PQ and many deep hashing methods struggle.

The following are the results from Table 1 of the original paper:

| Method | 32 bits Top-1 | 32 bits Top-5 | 32 bits Top-20 | 64 bits Top-1 | 64 bits Top-5 | 64 bits Top-20 | 128 bits Top-1 | 128 bits Top-5 | 128 bits Top-20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PQ [20] | 0.1724 | 0.2999 | 0.4254 | 0.6665 | 0.7749 | 0.8404 | 0.9184 | 0.9492 | 0.9610 |
| HHF [42] | 0.0001 | 0.0014 | 0.0054 | 0.0773 | 0.1646 | 0.2661 | 0 | 0 | 0 |
| CSQ [46] | 0.0061 | 0.0113 | 0.0183 | 0.0124 | 0.0197 | 0.0269 | 0.0283 | 0.0372 | 0.0451 |
| OrthoHash [18] | 0.3098 | 0.3706 | 0.4171 | 0.5526 | 0.6007 | 0.6351 | 0.6626 | 0.7104 | 0.7454 |
| GreedyHash [37] | 0.3688 | 0.5220 | 0.6323 | 0.6102 | 0.7452 | 0.8251 | 0.8259 | 0.9003 | 0.9338 |
| FPPQ (Ours) | 0.9343 | 0.9601 | 0.9652 | 0.9467 | 0.9571 | 0.9628 | 0.9568 | 0.9690 | 0.9731 |

Table 1: Performance comparison of multiple methods under multiple code lengths on the large-scale dataset Glint360K. Note that some methods are not listed because they were ineffective or required unrealistic computational resources.

  • Dominance on Glint360K:
    • FPPQ consistently achieves the highest Top-1, Top-5, and Top-20 accuracies across all tested bit lengths (32, 64, 128 bits) on the challenging Glint360K dataset.
    • At 32 bits: FPPQ shows a remarkable Top-1 accuracy of 0.9343, vastly outperforming GreedyHash (0.3688), OrthoHash (0.3098), and especially PQ (0.1724). This translates to approximately a 57% increase in Top-1 performance over GreedyHash. This demonstrates FPPQ's ability to maintain high accuracy even with highly compressed codes, which is a major breakthrough for PQ.
    • At 64 bits: FPPQ (0.9467 Top-1) still significantly surpasses all baselines, including PQ (0.6665), GreedyHash (0.6102), and OrthoHash (0.5526). This is roughly a 28% performance growth over traditional PQ.
    • At 128 bits: FPPQ (0.9568 Top-1) maintains its lead, showing approximately a 4% improvement over PQ (0.9184). This indicates that even at longer bit lengths where traditional PQ is strong, FPPQ can still extract more discriminative features.
  • Limitations of Baselines:
    • Many advanced deep hashing methods (HHF, CSQ, OrthoHash, GreedyHash) struggle or fail to perform well on Glint360K, especially at shorter bit lengths. Some methods were even excluded due to "ineffective performance or requiring unrealistic computational resources."
    • HHF exhibited very poor performance at shorter bits and failed completely at 128 bits, suggesting issues with hyperparameters or convergence on such a large scale.
    • CSQ also performed poorly across all bit lengths.
    • Even OrthoHash and GreedyHash, which are competitive deep hashing methods, are inferior to the original PQ at 64 and 128 bits, highlighting the challenge of scaling deep hashing. This strongly validates the authors' initial motivation that existing methods are not well-suited for large-scale datasets.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Performance with Different Backbone Settings

The authors investigate the impact of different backbone architectures on FPPQ's performance.

The following are the results from Table 2 of the original paper:

Glint360K (Top-K accuracy):

| Method | Metric | iResnet18 128 bits | iResnet18 64 bits | iResnet18 32 bits | iResnet50 128 bits | iResnet50 64 bits | iResnet50 32 bits | iResnet100 128 bits | iResnet100 64 bits | iResnet100 32 bits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PQ [20] | Top-1 | 0.8005 | 0.4588 | 0.0973 | 0.9184 | 0.6665 | 0.1724 | 0.9465 | 0.7380 | 0.2074 |
| PQ [20] | Top-5 | 0.8659 | 0.5958 | 0.1940 | 0.9492 | 0.7749 | 0.2999 | 0.9658 | 0.8378 | 0.3569 |
| PQ [20] | Top-20 | 0.9025 | 0.6903 | 0.3031 | 0.9610 | 0.8404 | 0.4254 | 0.9720 | 0.8878 | 0.4941 |
| FPPQ (Ours) | Top-1 | 0.8231 | 0.7019 | 0.6945 | 0.9568 | 0.9467 | 0.9342 | 0.9679 | 0.9578 | 0.9305 |
| FPPQ (Ours) | Top-5 | 0.8824 | 0.7890 | 0.7576 | 0.9690 | 0.9571 | 0.9601 | 0.9746 | 0.9728 | 0.9700 |
| FPPQ (Ours) | Top-20 | 0.9133 | 0.8414 | 0.8006 | 0.9731 | 0.9628 | 0.9652 | 0.9772 | 0.9753 | 0.9734 |

Table 2: Under different network structures and bit settings, our method achieves consistent and significant improvements.

  • Consistent Improvements: FPPQ consistently shows significant advantages over PQ across all iResnet architectures (iResnet18, iResnet50, iResnet100) and bit settings. The performance gap is particularly large at shorter bit lengths (e.g., 32 bits), where PQ's performance severely degrades.
  • Impact of Backbone Size: Generally, larger backbones (e.g., iResnet50 vs. iResnet18, iResnet100 vs. iResnet50) lead to better performance for both PQ and FPPQ. This is expected as more powerful backbones extract richer and more discriminative features.
  • 32-bit Performance Limit: Interestingly, for iResnet100 at 32 bits, the Top-1 performance slightly decreases compared to iResnet50 (0.9305 vs. 0.9342), although Top-5 and Top-20 still show slight improvements. The authors suggest this might be due to an "intrinsic constraint of compression strength" at very low bit settings, implying that even with a very powerful backbone, there's a limit to how much information can be preserved in extremely short codes.

6.2.2. Low-Slope Decay of Retrieval Performance

  • Analysis of Figure 2:

    Figure 2: In Glint360K, comparison of the asymmetric retrieval performance trend of PQ with compression efficiency and the results of our method. 'L2' represents using the original features for L2 retrieval, while $\hat{M}=256$ represents dividing each feature into 256 segments for PQ retrieval, with 8 bits per segment.

    Figure 2 visually confirms one of FPPQ's main claims: it achieves a "low-slope decay of retrieval performance" as the encoding bit-length decreases.

    • The curve for naive Product Quantization (PQ) shows a steep drop in performance as the bit length decreases, especially from longer to shorter bits. This makes traditional PQ impractical for highly compressed scenarios.
    • In contrast, the curve for FPPQ maintains a much higher performance level at shorter bit lengths, and its performance degradation with decreasing bits is much less pronounced (a "lower slope"). This demonstrates that FPPQ effectively addresses the weakness of PQ by enabling highly accurate retrieval even with very short codes, expanding its applicability.
    • The L2 retrieval (using original features) provides an upper bound for performance, showing the ideal scenario without any compression. FPPQ gets significantly closer to this ideal than PQ, especially at intermediate bit lengths. The $\hat{M}=256$ curve is a specific configuration mentioned in the figure caption (dividing each feature into 256 segments, 8 bits per segment), likely corresponding to a very high bit length, but the general trend for FPPQ remains superior.

6.2.3. t-SNE Visualization

  • Analysis of Figure 3:

    Figure 3: The visualization of the sub-vectors and clustering centers for the 10k classes in the Glint360K dataset for two approaches: (a) classification loss with PQ and (b) our proposed method. The black dots represent sub-features, while the red dots in (a) depict the clustering centers obtained by performing k-means clustering on the sub-features of all classes. In contrast, in (b), the red dots represent the codewords trained by our proposed method.

    To illustrate the effectiveness of FPPQ in shaping feature distributions for PQ, t-SNE [38] is used to visualize sub-vectors and clustering centers for 10,000 classes from Glint360K (specifically, the first segment of the class average features when compressed to 32 bits); a toy sketch of this procedure appears after the bullet points below.

    • Figure 3(a) (Traditional classification loss with PQ): Shows that the sub-vectors (black dots) are largely evenly distributed throughout the sub-space. The k-means clustering centers (red dots) are scattered, reflecting this uniform distribution. This makes efficient quantization difficult because sub-vectors from different classes might be close to the same centroid, leading to high quantization error and poor discriminability.
    • Figure 3(b) (FPPQ): In stark contrast, FPPQ effectively learns to cluster the sub-vectors (black dots) into distinct, tighter groups around the learned codewords (red dots). The codewords are also well-separated. This visualization confirms that FPPQ successfully learns class-level PQ labels, guiding the sub-vectors to form well-defined clusters in the sub-space, which is crucial for reducing quantization error and improving retrieval accuracy.
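
As a rough illustration of how a plot like Figure 3(a) can be produced, the sketch below runs k-means on one sub-vector segment and embeds both sub-vectors and centroids with t-SNE. It assumes a 512-dim feature split into 4 segments and uses random data plus a reduced class count purely as stand-ins; it is not the authors' visualization code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in for class average features: random data in place of real embeddings.
# Assumed layout: 512-dim features, 4 segments (32-bit code), so the first
# segment spans dimensions [0, 128). The paper visualizes 10k classes; a
# smaller sample keeps this toy example fast.
rng = np.random.default_rng(0)
num_classes, dim, num_segments = 2000, 512, 4
seg_dim = dim // num_segments
class_feats = rng.normal(size=(num_classes, dim))
sub = class_feats[:, :seg_dim]                 # first sub-vector segment

# k-means with K=256 mimics the codebook construction of the PQ baseline in
# Figure 3(a); FPPQ's trained codewords would replace these centers in (b).
km = KMeans(n_clusters=256, n_init=4, random_state=0).fit(sub)

# Joint t-SNE embedding of sub-vectors (black) and centers/codewords (red).
points = np.vstack([sub, km.cluster_centers_])
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(points)
plt.scatter(emb[:num_classes, 0], emb[:num_classes, 1], s=1, c="black")
plt.scatter(emb[num_classes:, 0], emb[num_classes:, 1], s=10, c="red")
plt.savefig("subspace_tsne.png", dpi=150)
```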

6.2.4. Retrieval Result Visualization

  • Analysis of Figure 4:

    Figure 4: Visualization of the Top-20 retrieval samples using FaceScrub as the query set and MegaFace as the gallery set.

    Figure 4 presents qualitative retrieval results (Top-20 samples) from MegaFace (gallery) queried by FaceScrub. It compares FPPQ under L2 retrieval and under PQ4 retrieval (4 segments, 8 bits/segment, 32 bits total) with different feature preprocessing settings.

    • The visual comparison shows that FPPQ (both L2 retrieval and PQ4 retrieval) produces highly similar and accurate Top-20 results.
    • The labels PQ4-F (PQ4 without feature processing), PQ4-Fnor (PQ4 with full-feature normalization), and PQ4-SegFnor (PQ4 with per-segment normalization) test the impact of feature preprocessing. FPPQ generates similar Top-20 results under all of these settings, "further validating the insensitivity of our method to feature preprocessing." This is a practical advantage: deployment does not require complex data preparation steps during inference. A hedged sketch of these three preprocessing variants follows this list.
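
For concreteness, here is a hedged sketch of the three preprocessing variants named by the PQ4-F / PQ4-Fnor / PQ4-SegFnor labels (raw features, full-feature L2 normalization, per-segment L2 normalization). The function and parameter names are illustrative; only the normalization semantics are taken from the label descriptions above.

```python
import numpy as np

def preprocess(feats: np.ndarray, mode: str, num_segments: int = 4) -> np.ndarray:
    """Illustrative preprocessing variants matching the Figure 4 labels.

    'F'       : raw features, no processing          (PQ4-F)
    'Fnor'    : L2-normalize the whole feature       (PQ4-Fnor)
    'SegFnor' : L2-normalize each PQ segment         (PQ4-SegFnor)
    """
    if mode == "F":
        return feats
    if mode == "Fnor":
        return feats / np.linalg.norm(feats, axis=1, keepdims=True)
    if mode == "SegFnor":
        n, d = feats.shape
        seg = feats.reshape(n, num_segments, d // num_segments)
        seg = seg / np.linalg.norm(seg, axis=2, keepdims=True)
        return seg.reshape(n, d)
    raise ValueError(mode)

feats = np.random.default_rng(0).normal(size=(8, 512))
for mode in ("F", "Fnor", "SegFnor"):
    print(mode, preprocess(feats, mode).shape)
```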

6.3. Results on Smaller Scale Datasets

6.3.1. Results on ImageNet100

The experiment on ImageNet100 addresses a scenario with a small data scale and fewer categories (100 classes) than the number of clusters per segment ($K=256$ for 8 bits/segment).

The following are the results from Table 3 of the original paper:

| Method | 16 bits | 32 bits | 64 bits |
|---|---|---|---|
| HashNet [7] | 0.5101 | 0.7059 | 0.7997 |
| CSQ [46] | 0.8379 | 0.8750 | 0.8874 |
| DPN [12] | 0.8543 | 0.8799 | 0.8927 |
| DFH [25] | 0.8352 | 0.8781 | 0.8849 |
| DCDH [48] | 0.7856 | 0.8158 | 0.8400 |
| JMLH [36] | 0.8366 | 0.8671 | 0.8799 |
| GreedyHash [37] | 0.8544 | 0.8796 | 0.8886 |
| HBS-RL [43] | 0.8465 | 0.8770 | 0.8660 |
| DPQ [23] | 0.8860 | 0.8910 | 0.8960 |
| HHF [42] | 0.8710 | 0.8869 | 0.8994 |
| OrthoHash [18] | 0.8693 | 0.8910 | 0.8960 |
| FPPQ (Ours) | 0.8956 | 0.9128 | 0.9154 |

Table 3: mAP@R1000 on ImageNet100

  • FPPQ achieved the best performance (mAP@R1000) on ImageNet100 across all bit lengths (16, 32, 64 bits).
  • For 32 bits, FPPQ (0.9128) outperforms DPQ (0.8910), OrthoHash (0.8910), and HHF (0.8869).
  • This indicates that FPPQ is effective even when generating class-level PQ labels is initially challenging because the number of categories (100) is smaller than the number of clusters per segment, so k-means cannot be run directly on the class average features. The workaround of training PQ codebooks on all instance features and then encoding the class average features, combined with the fixed backbone and linear layer, proved effective; a minimal sketch of this label-generation workaround follows this list.
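
The sketch below illustrates this workaround under stated assumptions: per-segment codebooks are fitted with k-means on all instance sub-features, and the class average features are then encoded to obtain class-level PQ labels. Feature dimension, sample counts, and variable names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-ins: ImageNet100 has ~130k training images over 100 classes;
# smaller counts and a 512-dim feature are used here purely for illustration.
M, K = 4, 256                                   # 4 segments x 8 bits = 32-bit codes
rng = np.random.default_rng(0)
inst_feats = rng.normal(size=(20_000, 512))     # instance features from a pre-trained model
labels = rng.integers(0, 100, size=20_000)      # 100 classes
d = inst_feats.shape[1] // M

# 1) Fit per-segment codebooks on *instance* sub-features; k-means on only
#    100 class averages could not yield 256 centroids per segment.
codebooks = [
    KMeans(n_clusters=K, n_init=4, random_state=0).fit(inst_feats[:, m * d:(m + 1) * d])
    for m in range(M)
]

# 2) Encode each *class average* feature to obtain its class-level PQ label.
class_avg = np.stack([inst_feats[labels == c].mean(axis=0) for c in range(100)])
class_pq_labels = np.stack(
    [codebooks[m].predict(class_avg[:, m * d:(m + 1) * d]) for m in range(M)], axis=1
)                                               # shape (100, M), entries in [0, K)
print(class_pq_labels.shape)
```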

6.3.2. Results on ImageNet1K

ImageNet1K presents another challenge: the difficulty of obtaining good class-level PQ labels due to potential low feature discrimination from a pre-trained model or duplicate PQ codes when class average features are compressed to short bits (e.g., >50% duplication for 1000 classes at 32 bits).

The following are the results from Table 4 of the original paper:

| Method | 16 bits | 32 bits | 64 bits |
|---|---|---|---|
| HHF [42] | 0.2961 | 0.5979 | 0.6121 |
| DCDH [48] | 0.3666 | 0.4764 | 0.5299 |
| CSQ [46] | 0.5040 | 0.6061 | 0.6093 |
| JMLH [36] | 0.5286 | 0.5876 | 0.6098 |
| GreedyHash [37] | 0.5424 | 0.5896 | 0.5952 |
| OrthoHash [18] | 0.5936 | 0.6514 | 0.6761 |
| FPPQ (Ours) | 0.6207 | 0.6543 | 0.6649 |

Table 4: mAP@R1000 on ImageNet1K

  • FPPQ achieved competitive results on ImageNet1K, especially at shorter encoding lengths.
  • At 16 bits, FPPQ (0.6207) significantly outperforms all other methods, including OrthoHash (0.5936).
  • At 32 bits, FPPQ (0.6543) is slightly better than OrthoHash (0.6514).
  • At 64 bits, OrthoHash (0.6761) slightly edges out FPPQ (0.6649).
  • The strategy of using Bernoulli sampling (similar to OrthoHash) to generate PQ labels when class average features are not discriminative enough or lead to too many duplicates is crucial here. This shows FPPQ's adaptability to different PQ label generation strategies, maintaining strong performance even when direct k-means on average features is suboptimal; a hedged sketch of one such sampling scheme follows this list.
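
The exact sampling procedure is not detailed in this summary, so the following is only one plausible reading of a Bernoulli-style label generator: each class draws random bits, consecutive 8-bit groups are read as per-segment codeword indices, and duplicate class codes are re-drawn. The function name and the de-duplication step are assumptions, not the paper's specification.

```python
import numpy as np

def sample_pq_labels(num_classes: int, num_segments: int = 4, bits_per_seg: int = 8,
                     seed: int = 0) -> np.ndarray:
    """Sample distinct class-level PQ labels from Bernoulli(0.5) bits (illustrative)."""
    rng = np.random.default_rng(seed)
    weights = 2 ** np.arange(bits_per_seg)            # bit -> integer conversion
    codes, seen = [], set()
    while len(codes) < num_classes:
        bits = rng.integers(0, 2, size=(num_segments, bits_per_seg))  # Bernoulli(0.5)
        label = tuple((bits * weights).sum(axis=1))   # one codeword index per segment
        if label not in seen:                         # avoid duplicate class codes
            seen.add(label)
            codes.append(label)
    return np.array(codes, dtype=np.int64)            # (num_classes, num_segments)

labels = sample_pq_labels(1000)                       # e.g., 1000 classes at 32 bits
print(labels.shape, labels[:2])
```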

7. Conclusion & Reflections

7.1. Conclusion Summary

This study rigorously explored the limitations of current advanced deep hashing methods when applied to large-scale image retrieval scenarios, demonstrating their inadequacy in terms of computational resources or accuracy. To overcome these challenges, the paper proposed FPPQ (Full Potential Product Quantization), a novel end-to-end deep hashing framework. FPPQ employs multiple softmax branches supervised by specific class-level PQ labels, effectively guiding the backbone to learn features that are highly discriminative and well-suited for Product Quantization. A key achievement is its ability to mitigate the rapid performance decay of traditional PQ at shorter encoding lengths, leading to a "low-slope decay" in retrieval performance. The method is designed to be efficient, avoiding large-scale matrix operations, and integrates seamlessly with traditional PQ retrieval. Extensive experiments on Glint360K, ImageNet1K, and ImageNet100 consistently validated FPPQ's superior performance across various scales and bit-lengths.

7.2. Limitations & Future Work

The authors implicitly point out some limitations and future considerations:

  • Intrinsic Compression Strength Limit: The observation that iResnet100 did not consistently improve Top-1 performance over iResnet50 at 32 bits suggests that there might be an "intrinsic constraint of compression strength." This implies that beyond a certain point, simply using a larger backbone might not yield proportional gains when features are compressed to extremely short bit lengths, as fundamental information loss becomes dominant.
  • PQ Label Generation Challenges: For datasets like ImageNet1K, where feature discriminability is low or PQ code duplication is high, the simple generation of class-level PQ labels (via k-means on average features) needs adaptation (e.g., using Bernoulli sampling). This highlights that the quality of these predefined labels is crucial and might require careful consideration or alternative strategies depending on the dataset characteristics.
  • Specific Loss Functions: While CosFace is effective, exploration of other metric learning losses or specialized loss functions for quantization could potentially yield further improvements.

7.3. Personal Insights & Critique

The FPPQ paper offers several compelling insights and makes significant strides in large-scale image retrieval using deep hashing:

  • Practical Relevance: The paper's focus on large-scale real-world scenarios (millions of categories, millions of images) is highly practical and addresses a critical gap where many academic deep hashing methods often fall short. The finding that even traditional PQ can outperform advanced deep hashing methods on Glint360K is a powerful validation of their motivation.
  • Elegant Integration of PQ: The method elegantly integrates the strengths of Product Quantization (efficiency, scalability) with deep learning (powerful feature learning, discriminability). By making the PQ codebook learned and differentiable, and supervising it with class-level PQ labels, FPPQ effectively trains the backbone to "think" in terms of PQ clusters, rather than just forcing features into a PQ structure post-training. This end-to-end optimization is key to its success.
  • Robustness to Preprocessing: The demonstrated insensitivity to feature preprocessing during retrieval is a valuable practical characteristic, simplifying deployment and making the system more robust in varied environments.
  • Adaptability in Label Generation: The flexibility to adapt PQ label generation (e.g., using Bernoulli sampling for ImageNet1K) shows that the framework is not rigidly tied to one specific label generation method, allowing for customization based on dataset properties.
  • Potential Broader Applications: The core idea of learning feature representations optimized for a specific quantization scheme (like PQ) could be transferable to other compression or approximate search techniques. For instance, similar principles could be applied to improve Additive Quantization (AQ) or Composite Quantization (CQ) in deep learning contexts.
  • Critique/Areas for Improvement:
    • Sensitivity to Pre-trained Model: The method relies on a pre-trained model to generate class average features for PQ label creation. The quality of these initial features could heavily influence the effectiveness of the PQ labels. An ablation study on the impact of different pre-trained models or label generation strategies could provide deeper insights.

    • Theoretical Guarantees: While empirical results are strong, providing more theoretical analysis on why the softmax-based PQ branch (especially with angular margin loss) leads to better clusterable sub-spaces could strengthen the paper.

    • Computational Overhead for Label Generation: For truly massive datasets with new classes frequently appearing, generating class average features and performing k-means (even on just average features) might still incur some computational cost. Discussing strategies for dynamic or incremental PQ label updates would be valuable for real-time systems.

    • Energy Efficiency/Edge Devices: While mentioning edge devices as a motivation for short codes, a more direct analysis or experiment on actual energy consumption or inference time on constrained hardware could further solidify its value proposition for such applications.

Overall, FPPQ presents a highly promising and practical approach to making deep hashing and Product Quantization viable for the most demanding large-scale image retrieval tasks. Its rigorous evaluation and significant performance gains position it as a strong baseline and a step forward in the field.
