Training data-efficient image transformers & distillation through attention
TL;DR Summary
This paper enables data-efficient Vision Transformer training on ImageNet, bypassing large datasets. It introduces a novel attention-based distillation method using a 'distillation token' for student models, achieving competitive ImageNet accuracy (up to 85.2%) and strong transfer performance on downstream tasks.
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Training data-efficient image transformers & distillation through attention
- Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
- Affiliations: The authors are affiliated with Facebook AI (now Meta AI) and Sorbonne University. This indicates a strong background in both industrial research and academia, with a focus on large-scale machine learning and computer vision.
- Journal/Conference: The paper was submitted to arXiv, a preprint server. While not peer-reviewed at the time of this version's publication, it has since been highly influential and widely cited, effectively becoming a foundational work in the field of Vision Transformers.
- Publication Year: The first version was submitted in December 2020.
- Abstract: The paper tackles the major limitation of Vision Transformers (ViTs): their reliance on massive, often private, datasets (e.g., JFT-300M) for pre-training. The authors demonstrate that a convolution-free transformer can be trained competitively using only the standard ImageNet-1k dataset. Their baseline model, DeiT-B, achieves 83.1% top-1 accuracy on ImageNet. The key innovation is a novel teacher-student distillation strategy specifically designed for transformers, which uses a dedicated distillation token. This token learns from the teacher model's predictions through the transformer's attention mechanism. This method, particularly effective with a convolutional network (convnet) as the teacher, pushes performance up to 85.2% on ImageNet, making transformers competitive with state-of-the-art convnets in a data-efficient setting.
- Original Source Link:
- Publication Status: Preprint on arXiv.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The groundbreaking Vision Transformer (ViT) model by Dosovitskiy et al. (2020) showed that a pure transformer architecture could achieve state-of-the-art results on image classification. However, this success came with a significant caveat: ViTs were shown to "not generalize well when trained on insufficient amounts of data," requiring pre-training on enormous datasets like JFT-300M (300 million images). This requirement for massive data and the associated computational cost made ViTs inaccessible to most researchers and practitioners.
- Gap in Prior Work: There was a clear performance gap between ViTs trained on ImageNet-1k alone and state-of-the-art convolutional neural networks (CNNs). This suggested that transformers lacked the inherent inductive biases (like locality and translation invariance) that allow CNNs to learn effectively from smaller datasets. The central question was: can this gap be closed without resorting to massive external datasets?
- Innovation: This paper, titled DeiT (Data-efficient Image Transformer), introduces a training and distillation strategy that allows a standard ViT architecture to be trained from scratch on ImageNet-1k to a competitive level. The most novel contribution is a distillation method that integrates the teacher's knowledge directly into the transformer's architecture via a new distillation token.
- Main Contributions / Findings (What):
  - Data-Efficient Training Recipe: The authors developed a comprehensive training scheme involving specific data augmentation, regularization, and optimization strategies that enable a ViT to achieve high performance on ImageNet-1k alone. Their baseline model, DeiT-B, significantly outperforms the original ViT-B trained under similar conditions.
  - Novel Distillation through Attention: They proposed a new distillation procedure that adds a dedicated distillation token to the input sequence of the transformer. This token learns to reproduce the teacher model's predictions, interacting with the patch tokens and the standard class token through the self-attention mechanism.
  - Superiority of Convnets as Teachers: The study reveals that using a CNN as a teacher for a transformer student yields better results than using another transformer teacher. This suggests that distillation effectively transfers the beneficial inductive biases of CNNs to the transformer architecture.
  - State-of-the-Art Performance in a Data-Efficient Regime: The resulting DeiT models achieve a better accuracy-throughput trade-off than highly optimized CNNs like EfficientNet, demonstrating that transformers can be both accurate and efficient without massive pre-training. Their best model reaches 85.2% top-1 accuracy on ImageNet.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Convolutional Neural Networks (CNNs): For decades, the standard architecture for computer vision. CNNs use convolutional filters to scan images, which builds in an inductive bias for locality (pixels nearby are related) and translation invariance (an object is the same regardless of its position). This makes them very data-efficient. Examples include ResNet and EfficientNet.
- Transformers: An architecture originally designed for Natural Language Processing (NLP) tasks like machine translation. Its core component is the self-attention mechanism, which allows the model to weigh the importance of all other elements in a sequence when processing a given element. This provides a global receptive field from the very first layer.
- Vision Transformer (ViT): The first work to show that a pure transformer could succeed at image classification. The key idea is to treat an image as a sequence of fixed-size patches. Each patch is flattened and linearly projected into an embedding, forming a sequence of "tokens" that the transformer can process, similar to words in a sentence. A special [CLS] (class) token is added to the sequence to aggregate global information for the final classification.
- Knowledge Distillation (KD): A model compression and training technique where a smaller "student" model is trained to mimic the output of a larger, pre-trained "teacher" model. Instead of only learning from the hard ground-truth labels (e.g., "cat" = [0, 1, 0]), the student also learns from the teacher's softened probability distribution over all classes (e.g., "cat" = [0.05, 0.9, 0.05]). This provides richer supervisory signals.
- Previous Works & Technological Evolution:
- The field of image classification was dominated by CNNs, with continuous architectural improvements leading to models like AlexNet, VGG, ResNet, and EfficientNet.
- Attention mechanisms were gradually incorporated into CNNs (Squeeze-and-Excitation, Split-Attention Networks).
- The ViT paper [15] marked a paradigm shift by removing convolutions entirely. However, it concluded that transformers need huge datasets (JFT-300M) to outperform CNNs, framing them as data-hungry.
- This paper (DeiT) directly challenges that conclusion. It builds upon the ViT architecture but focuses on making the training process itself more data-efficient.
- Differentiation:
  - Versus ViT: The core architectural model (DeiT-B) is identical to ViT-B. The difference lies entirely in the training strategy and the introduction of the distillation token. DeiT proves that the training methodology, not just the architecture, was the key to unlocking performance on smaller datasets.
  - Versus Standard Distillation: Traditional knowledge distillation modifies only the loss function. DeiT's distillation is architectural: it introduces a new distillation token that acts as a dedicated pathway for the teacher's knowledge to flow through the model's layers via self-attention, making the knowledge transfer more deeply integrated.
4. Methodology (Core Technology & Implementation)
The paper's methodology can be broken down into two main parts: a recap of the ViT architecture and the introduction of their novel distillation strategy.
- Vision Transformer Architecture (Recap):
  - Input Processing: An input image is split into a grid of non-overlapping patches (e.g., 16×16-pixel patches for a 224×224 image, giving 14×14 = 196 patches).
  - Patch Embeddings: Each patch is flattened into a vector and linearly projected into a D-dimensional embedding space. These are the patch tokens.
  - Positional Embeddings: Since self-attention is permutation-invariant, learnable positional embeddings are added to each patch token to encode its spatial location.
  - Class Token: A special learnable [CLS] token is prepended to the sequence of patch tokens. This token is designed to aggregate global information from all patches as it passes through the transformer layers.
  - Transformer Encoder: The sequence of tokens (class token + patch tokens) is fed through a stack of standard transformer blocks. Each block consists of:
    - Multi-Head Self-Attention (MSA): Allows tokens to interact and exchange information based on their content.
    - Feed-Forward Network (FFN): A two-layer MLP applied to each token independently.
    - Layer Normalization and residual connections around both sub-layers.
  - Classification Head: After the final transformer block, only the output embedding corresponding to the [CLS] token is used. It is passed through a linear layer to produce the final class predictions. (A minimal code sketch of this pipeline follows below.)
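To make the recap concrete, here is a minimal, self-contained PyTorch sketch of the input pipeline and one encoder block. It assumes a DeiT-B-like configuration (224×224 inputs, 16×16 patches, embedding dimension 768, 12 heads); the class and parameter names are illustrative and not taken from the official repository.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Image -> token sequence: [CLS] + linearly projected patches + positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2               # 196 for 224 / 16
        # A strided conv is equivalent to flattening each patch and applying a linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                          # x: (B, 3, H, W)
        patches = self.proj(x).flatten(2).transpose(1, 2)          # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)            # (B, 1, dim)
        return torch.cat([cls, patches], dim=1) + self.pos_embed   # (B, N + 1, dim)

class EncoderBlock(nn.Module):
    """One transformer block: MSA and FFN, each with LayerNorm and a residual connection."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]          # tokens exchange information
        return x + self.ffn(self.norm2(x))                         # position-wise MLP
```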
- Distillation through Attention (DeiT's Core Contribution): The authors propose a new distillation strategy that is specific to the transformer architecture.
  - Principles: The core idea is to introduce a second source of supervision from a teacher model and integrate it deeply into the student transformer, rather than just at the final output layer. This is achieved by adding a new token that is solely responsible for learning from the teacher.
  - Steps & Procedures: As illustrated in Figure 2, the process is as follows:
    - A new learnable vector, the distillation token, is added to the input sequence alongside the class token and patch tokens. It is initialized randomly and learns via back-propagation.
    - This combined sequence of tokens ([CLS], [DISTIL], patch_1, ..., patch_N) is processed by the transformer encoder.
    - Crucially, the distillation token participates in the self-attention mechanism in every layer, just like the other tokens. This allows it to exchange information with both the patch representations and the class token.
    - At the output of the transformer, two separate linear classifiers are used (see the sketch after this list):
      - One classifier takes the output class token embedding to predict the class. Its loss is calculated against the ground-truth labels.
      - Another classifier takes the output distillation token embedding to predict the class. Its loss is calculated against the teacher's predictions.
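The sketch below (an assumed illustration, not the authors' code) shows how the distillation token and the two heads could be wired around an existing encoder; `encoder` is any stack of transformer blocks and all names are hypothetical.

```python
import torch
import torch.nn as nn

class DistilledClassifier(nn.Module):
    """Adds a learnable distillation token and two linear heads on top of an encoder,
    mirroring the procedure described above."""
    def __init__(self, encoder, dim=768, num_classes=1000):
        super().__init__()
        self.encoder = encoder                                   # stack of transformer blocks
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))   # learned via back-propagation
        self.head_cls = nn.Linear(dim, num_classes)              # supervised by ground-truth labels
        self.head_dist = nn.Linear(dim, num_classes)             # supervised by the teacher

    def forward(self, tokens):                                   # tokens: (B, 1 + N, dim) = [CLS] + patches
        dist = self.dist_token.expand(tokens.shape[0], -1, -1)
        # Insert the distillation token right after [CLS]: [CLS, DISTIL, patch_1, ..., patch_N].
        # (A full implementation would also extend the positional embeddings to cover it.)
        x = torch.cat([tokens[:, :1], dist, tokens[:, 1:]], dim=1)
        x = self.encoder(x)                                      # both special tokens attend to everything
        return self.head_cls(x[:, 0]), self.head_dist(x[:, 1])   # two logit vectors
```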
  - Mathematical Formulas & Key Details: The paper explores two main types of distillation objectives for the distillation token.
    - Soft Distillation: This minimizes the Kullback-Leibler (KL) divergence between the student's and teacher's softened softmax outputs. The total loss is:

      $$\mathcal{L}_{\text{global}} = (1-\lambda)\,\mathcal{L}_{\text{CE}}\big(\psi(Z_s), y\big) + \lambda\,\tau^{2}\,\mathrm{KL}\big(\psi(Z_s/\tau),\, \psi(Z_t/\tau)\big)$$

      - $Z_s$ and $Z_t$ are the logits (pre-softmax outputs) of the student and teacher models, respectively.
      - $y$ is the ground-truth label.
      - $\psi$ is the softmax function.
      - $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss.
      - $\mathrm{KL}$ is the Kullback-Leibler divergence loss.
      - $\tau$ is the distillation temperature, which softens the probability distributions. A higher $\tau$ creates a softer distribution.
      - $\lambda$ is a coefficient that balances the standard cross-entropy loss with the distillation loss.
    - Hard-Label Distillation: The authors find this simpler variant to be more effective. Here, the student is trained to predict the teacher's "hard" decision (the class with the highest probability) as if it were a new ground-truth label. Let $y_t = \arg\max_c Z_t(c)$ be the hard label predicted by the teacher. The loss function becomes a simple average of two cross-entropy terms:

      $$\mathcal{L}_{\text{global}}^{\text{hardDistill}} = \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}\big(\psi(Z_s), y\big) + \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}\big(\psi(Z_s), y_t\big)$$

      - The first term trains the student on the true label $y$. In the DeiT architecture, this loss is applied to the class token's output.
      - The second term trains the student on the teacher's predicted label $y_t$. This loss is applied to the distillation token's output.
  - Inference: At test time, the final prediction can be made using the class token head, the distillation token head, or a late fusion of both (by adding their softmax outputs). The paper shows that the fused approach works best.
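A minimal sketch of both objectives and the late-fusion inference rule described above, assuming the logits of the two heads are available; the temperature and weighting values are illustrative placeholders, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(z_cls, z_teacher, y, tau=3.0, lam=0.1):
    """(1 - lam) * CE(student, y) + lam * tau^2 * KL(teacher || student), with softened logits."""
    ce = F.cross_entropy(z_cls, y)
    kl = F.kl_div(F.log_softmax(z_cls / tau, dim=-1),
                  F.softmax(z_teacher / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    return (1.0 - lam) * ce + lam * kl

def hard_distillation_loss(z_cls, z_dist, z_teacher, y):
    """Class head matches the true label; distillation head matches the teacher's argmax."""
    y_teacher = z_teacher.argmax(dim=-1)              # the teacher's "hard" decision
    return 0.5 * F.cross_entropy(z_cls, y) + 0.5 * F.cross_entropy(z_dist, y_teacher)

@torch.no_grad()
def fused_prediction(z_cls, z_dist):
    """Late fusion at test time: add the softmax outputs of the two heads."""
    return F.softmax(z_cls, dim=-1) + F.softmax(z_dist, dim=-1)
```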
5. Experimental Setup
- Datasets: The experiments use a variety of standard public benchmarks. Table 6 from the paper is transcribed below.
This is a manual transcription of Table 6.
| Dataset | Train size | Test size | #classes |
|---|---|---|---|
| ImageNet [42] | 1,281,167 | 50,000 | 1000 |
| iNaturalist 2018 [26] | 437,513 | 24,426 | 8,142 |
| iNaturalist 2019 [27] | 265,240 | 3,003 | 1,010 |
| Flowers-102 [38] | 2,040 | 6,149 | 102 |
| Stanford Cars [30] | 8,144 | 8,041 | 196 |
| CIFAR-100 [31] | 50,000 | 10,000 | 100 |
| CIFAR-10 [31] | 50,000 | 10,000 | 10 |

- ImageNet-1k: The primary dataset for pre-training and main evaluation.
- Others: Used for transfer learning to evaluate the generalization ability of the pre-trained models.
- Evaluation Metrics:
  - Top-1 Accuracy:
    - Conceptual Definition: This metric measures the standard classification accuracy. It calculates the percentage of images in the test set for which the model's predicted class with the highest probability (the "top-1" prediction) is the correct ground-truth class.
    - Mathematical Formula:

      $$\text{Top-1 Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big(\hat{y}_i = y_i\big)$$

    - Symbol Explanation:
      - $N$ is the total number of samples in the test set.
      - $y_i$ is the true label for the $i$-th sample.
      - $\hat{y}_i$ is the predicted label for the $i$-th sample (i.e., the class with the highest output probability).
      - $\mathbb{1}(\cdot)$ is the indicator function, which is 1 if the condition inside is true and 0 otherwise.
- Throughput:
- Conceptual Definition: Measures the model's inference speed. It is defined as the number of images that can be processed per second on a given hardware setup. Higher throughput is better.
- The paper measures this on a single V100 GPU with the largest possible batch size to maximize hardware utilization (see the helper sketch below).
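For concreteness, here is a hedged sketch of how these two metrics could be computed; the helper names, batch size, and warm-up/iteration counts are illustrative choices, not the paper's exact measurement protocol.

```python
import time
import torch

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-probability class matches the ground truth."""
    return (logits.argmax(dim=-1) == labels).float().mean().item()

@torch.no_grad()
def measure_throughput(model, batch_size=256, img_size=224, warmup=10, iters=30):
    """Images processed per second for a fixed batch on a single GPU."""
    model.eval().cuda()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(warmup):                       # warm up kernels before timing
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
```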
- Baselines:
  - Convolutional Networks: ResNet, EfficientNet, and RegNetY. These represent the state-of-the-art in CNNs and are the primary performance targets.
  - Vision Transformers: The original ViT models [15] trained on ImageNet-1k and the larger JFT-300M dataset. The DeiT models are directly compared to ViT to isolate the benefit of the new training and distillation strategy.
  - DeiT Models: The paper introduces several variants of their model, differing in size: DeiT-Ti (Tiny, 5M params), DeiT-S (Small, 22M params), and DeiT-B (Base, 86M params), which is architecturally identical to ViT-B. The alembic symbol ⚗ (written DeiT-B⚗ in the paper) denotes models trained with their distillation procedure.
6. Results & Analysis
The paper presents a comprehensive set of experiments to validate its claims.
Core Results: Distillation Strategy
- Convnet Teachers are Better: Table 2 shows that using a RegNetY (a CNN) as a teacher consistently yields better student performance than using a DeiT-B model as a teacher, even when the DeiT-B teacher has a higher accuracy (81.8% vs. RegNetY-8GF's 81.7%). This strongly supports the hypothesis that CNNs transfer valuable inductive biases to the transformer student.
This is a manual transcription of Table 2.
| Teacher model | Teacher acc. | Student (DeiT-B): pretrain | Student (DeiT-B): ↑384 |
|---|---|---|---|
| DeiT-B | 81.8 | 81.9 | 83.1 |
| RegNetY-4GF | 80.0 | 82.7 | 83.6 |
| RegNetY-8GF | 81.7 | 82.7 | 83.8 |
| RegNetY-12GF | 82.4 | 83.1 | 84.1 |
| RegNetY-16GF | 82.9 | 83.1 | 84.2 |

- Superiority of the Distillation Token: Table 3 provides a clear comparison of different distillation methods.
  - Hard vs. Soft Distillation: Hard distillation (83.0% for DeiT-B) significantly outperforms soft distillation (81.8%), which performs no better than the baseline without distillation.
  - Distillation Token Benefit: The proposed DeiT method with the distillation token and hard labels further boosts performance to 83.4%.
  - Complementary Tokens: The class and distillation tokens provide complementary information. Using either one alone gives good results (83.0% and 83.1%), but combining them (class + distillation) gives the best performance (83.4%).

This is a manual transcription of Table 3. The four rightmost columns report ImageNet top-1 (%).

| Method | Label supervision | Teacher supervision | Ti 224 | S 224 | B 224 | B↑384 |
|---|---|---|---|---|---|---|
| DeiT – no distillation | ✓ | ✗ | 72.2 | 79.8 | 81.8 | 83.1 |
| DeiT – usual distillation | ✗ | soft | 72.2 | 79.8 | 81.8 | 83.2 |
| DeiT – hard distillation | ✗ | hard | 74.3 | 80.9 | 83.0 | 84.0 |
| DeiT: class embedding | ✓ | hard | 73.9 | 80.9 | 83.0 | 84.2 |
| DeiT: distil. embedding | ✓ | hard | 74.6 | 81.1 | 83.1 | 84.4 |
| DeiT: class+distillation | ✓ | hard | 74.5 | 81.2 | 83.4 | 84.5 |
- Effect of Training Epochs: Figure 3 shows that while the baseline DeiT's performance saturates around 400 epochs, the distilled models continue to improve with longer training schedules, reaching over 85% accuracy after 1000 epochs.
Figure 3 (description): A line chart of ImageNet top-1 accuracy versus the number of training epochs for DeiT-B under different distillation methods. It compares no distillation, usual distillation, hard distillation, the distillation-token approach, and its ↑384 variant, showing that the distillation-token method clearly outperforms the others and keeps improving as training is extended.
Core Results: Accuracy vs. Efficiency
- Closing the Gap with Convnets: Figure 1 visually demonstrates the main success of the paper.
  - The baseline DeiT models ("Ours") significantly outperform the original ViT models trained on ImageNet-1k, establishing a much stronger transformer baseline.
  - The distilled DeiT models ("Ours", marked with the alembic symbol) push the performance-throughput curve above that of the highly optimized EfficientNet family. For example, DeiT-B achieves higher accuracy than EfficientNet-B5 with significantly higher throughput.
Figure 1 (description): A scatter plot comparing models on ImageNet by throughput (images/s, x-axis) and top-1 accuracy (%, y-axis). It includes EfficientNet, ViT, the authors' method (Ours), and its distilled version, showing that the authors' models reach high accuracy at reasonable speed despite being trained on a single machine with limited data.
- Detailed Performance Comparison: Table 5 provides extensive numerical results. The distilled DeiT-B↑384 model trained for 1000 epochs reaches 85.2% top-1 accuracy on ImageNet, outperforming the original ViT-B/16 (77.9%) by a massive margin and even surpassing the ViT-B/16 pre-trained on the JFT-300M dataset (84.15% reported in the ViT paper). This is achieved using only ImageNet-1k data.

This is a manual transcription of the relevant parts of Table 5.
| Network | #param. | Image size | Throughput (image/s) | ImNet top-1 | Real top-1 | V2 top-1 |
|---|---|---|---|---|---|---|
| Convnets | | | | | | |
| RegNetY-16GF* | 84M | 224² | 334.7 | 82.9 | 88.1 | 72.4 |
| EfficientNet-B4 | 19M | 380² | 349.4 | 82.9 | 88.3 | 73.6 |
| EfficientNet-B7 | 66M | 600² | 55.1 | 84.3 | - | - |
| Transformers | | | | | | |
| ViT-B/16 [15] | 86M | 384² | 85.9 | 77.9 | 83.6 | - |
| DeiT-B | 86M | 224² | 292.3 | 81.8 | 86.7 | 71.5 |
| DeiT-B↑384 | 86M | 384² | 85.9 | 83.1 | 87.7 | 72.4 |
| DeiT-B (distilled) | 87M | 224² | 290.9 | 83.4 | 88.3 | 73.2 |
| DeiT-B↑384 (distilled) | 87M | 384² | 85.8 | 84.5 | 89.0 | 74.8 |
| DeiT-B↑384 / 1000 epochs (distilled) | 87M | 384² | 85.8 | 85.2 | 89.3 | 75.2 |
Ablations / Parameter Sensitivity
Table 8 provides a detailed ablation study on the ingredients of the data-efficient training recipe. Key takeaways include:
- Optimizer: AdamW is crucial; SGD performs much worse.
- Data Augmentation: A combination of strong augmentations (Rand-Augment, Mixup, CutMix, Random Erasing) is essential. Removing them individually hurts performance. Rand-Augment is shown to be slightly better than AutoAugment for this setup.
- Regularization: Stochastic Depth and Repeated Augmentation are key contributors to the final performance. Repeated Augmentation, in particular, provides a significant boost. (An illustrative configuration summary follows below.)
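As a reference point, the sketch below collects the recipe's main ingredients into a single configuration dictionary. The values follow commonly cited DeiT settings and should be treated as assumptions to verify against the paper's hyperparameter table, not as an exact reproduction.

```python
# Illustrative summary of the DeiT-style training recipe discussed above.
# Values follow commonly cited DeiT settings; verify against the paper before reuse.
deit_recipe = {
    "optimizer": "AdamW",                 # SGD performs much worse in the ablation
    "base_lr": 5e-4,                      # typically scaled as 5e-4 * batch_size / 512
    "weight_decay": 0.05,
    "epochs": 300,
    "warmup_epochs": 5,
    "label_smoothing": 0.1,
    "stochastic_depth": 0.1,              # drop-path regularization
    "repeated_augmentation": True,        # notable boost per the ablation
    "rand_augment": "rand-m9-mstd0.5",    # RandAugment policy (timm-style string)
    "mixup_alpha": 0.8,
    "cutmix_alpha": 1.0,
    "random_erasing_prob": 0.25,
}
```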
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that Vision Transformers do not inherently require massive datasets to perform well. Through a carefully designed training recipe and a novel distillation method, the authors show that a standard ViT architecture can be trained on ImageNet-1k alone to achieve performance that is not only competitive with but can exceed state-of-the-art convolutional networks. The core innovation, a distillation token, provides a powerful mechanism for transferring inductive biases from a teacher (preferably a CNN) into the student transformer, making the training process far more data-efficient.
- Limitations & Future Work: The authors suggest that their work is just the beginning of optimizing transformers for vision. CNNs have benefited from nearly a decade of research into architecture and optimization. The training strategies used in DeiT are largely adapted from those developed for CNNs. The authors propose that future research into data augmentation and regularization techniques specifically designed for transformers could lead to even further gains.
- Personal Insights & Critique:
  - Impact: This paper was a landmark publication that significantly democratized research and application of Vision Transformers. By removing the dependency on private, large-scale datasets, DeiT made powerful transformer models accessible to the wider community, fueling a surge of innovation in the field.
  - Strength: The simplicity and elegance of the distillation token idea are a major strength. It is an architectural modification that is intuitive, easy to implement, and demonstrably effective. The thoroughness of the experimental analysis, especially the ablation studies and comparisons, makes the paper's conclusions highly credible.
  - Critique/Open Questions: While the paper convincingly shows that a CNN is a better teacher, the underlying reasons are still framed as transferring "inductive biases." A deeper theoretical or empirical analysis of what exactly is being transferred through the distillation token (e.g., attention patterns that mimic locality, feature map characteristics) would be a valuable extension. Furthermore, the reliance on hard-label distillation is interesting, as it contradicts some findings in other domains where soft labels are preferred; exploring this discrepancy could yield further insights into the learning dynamics of transformers.