Training data-efficient image transformers & distillation through attention
TL;DR Summary
This paper enables data-efficient Vision Transformer training on ImageNet, bypassing large datasets. It introduces a novel attention-based distillation method using a 'distillation token' for student models, achieving competitive ImageNet accuracy (up to 85.2%) and strong transfer performance on downstream tasks.
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Training data-efficient image transformers & distillation through attention
- Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
- Affiliations: The authors are affiliated with Facebook AI (now Meta AI) and Sorbonne University. This indicates a strong background in both industrial research and academia, with a focus on large-scale machine learning and computer vision.
- Journal/Conference: The paper was submitted to arXiv, a preprint server. While not peer-reviewed at the time of this version's publication, it has since been highly influential and widely cited, effectively becoming a foundational work in the field of Vision Transformers.
- Publication Year: The first version was submitted in December 2020.
- Abstract: The paper tackles the major limitation of Vision Transformers (ViTs): their reliance on massive, often private, datasets (e.g., JFT-300M) for pre-training. The authors demonstrate that a convolution-free transformer can be trained competitively using only the standard ImageNet-1k dataset. Their baseline model, DeiT-B, achieves 83.1% top-1 accuracy on ImageNet. The key innovation is a novel teacher-student distillation strategy specifically designed for transformers, which uses a dedicated distillation token. This token learns from the teacher model's predictions through the transformer's attention mechanism. This method, particularly effective with a convolutional network (convnet) as the teacher, pushes performance up to 85.2% on ImageNet, making transformers competitive with state-of-the-art convnets in a data-efficient setting.
- Original Source Link:
- Publication Status: Preprint on arXiv.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The groundbreaking Vision Transformer (ViT) model by Dosovitskiy et al. (2020) showed that a pure transformer architecture could achieve state-of-the-art results on image classification. However, this success came with a significant caveat: ViTs were shown to "not generalize well when trained on insufficient amounts of data," requiring pre-training on enormous datasets like JFT-300M (300 million images). This requirement for massive data and the associated computational cost made ViTs inaccessible to most researchers and practitioners.
- Gap in Prior Work: There was a clear performance gap between ViTs trained on ImageNet-1k alone and state-of-the-art convolutional neural networks (CNNs). This suggested that transformers lacked the inherent inductive biases (like locality and translation invariance) that allow CNNs to learn effectively from smaller datasets. The central question was: can this gap be closed without resorting to massive external datasets?
- Innovation: This paper, titled DeiT (Data-efficient Image Transformer), introduces a training and distillation strategy that allows a standard ViT architecture to be trained from scratch on ImageNet-1k to a competitive level. The most novel contribution is a distillation method that integrates the teacher's knowledge directly into the transformer's architecture via a new distillation token.
- Main Contributions / Findings (What):
  - Data-Efficient Training Recipe: The authors developed a comprehensive training scheme involving specific data augmentation, regularization, and optimization strategies that enable a ViT to achieve high performance on ImageNet-1k alone. Their baseline model, DeiT-B, significantly outperforms the original ViT-B trained under similar conditions.
  - Novel Distillation through Attention: They proposed a new distillation procedure that adds a dedicated distillation token to the input sequence of the transformer. This token learns to reproduce the teacher model's predictions, interacting with the patch tokens and the standard class token through the self-attention mechanism.
  - Superiority of Convnets as Teachers: The study reveals that using a CNN as a teacher for a transformer student yields better results than using another transformer teacher. This suggests that distillation effectively transfers the beneficial inductive biases of CNNs to the transformer architecture.
  - State-of-the-Art Performance in a Data-Efficient Regime: The resulting DeiT models achieve a better accuracy-throughput trade-off than highly optimized CNNs like EfficientNet, demonstrating that transformers can be both accurate and efficient without massive pre-training. Their best model reaches 85.2% top-1 accuracy on ImageNet.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Convolutional Neural Networks (CNNs): For decades, the standard architecture for computer vision. CNNs use convolutional filters to scan images, which builds in an inductive bias for locality (pixels nearby are related) and translation invariance (an object is the same regardless of its position). This makes them very data-efficient. Examples include ResNet and EfficientNet.
- Transformers: An architecture originally designed for Natural Language Processing (NLP) tasks like machine translation. Its core component is the self-attention mechanism, which allows the model to weigh the importance of all other elements in a sequence when processing a given element. This provides a global receptive field from the very first layer.
- Vision Transformer (ViT): The first work to show that a pure transformer could succeed at image classification. The key idea is to treat an image as a sequence of fixed-size patches. Each patch is flattened and linearly projected into an embedding, forming a sequence of "tokens" that the transformer can process, similar to words in a sentence. A special [CLS] (class) token is added to the sequence to aggregate global information for the final classification.
- Knowledge Distillation (KD): A model compression and training technique where a smaller "student" model is trained to mimic the output of a larger, pre-trained "teacher" model. Instead of only learning from the hard ground-truth labels (e.g., "cat" = [0, 1, 0]), the student also learns from the teacher's softened probability distribution over all classes (e.g., "cat" = [0.05, 0.9, 0.05]). This provides richer supervisory signals.
- Previous Works & Technological Evolution:
- The field of image classification was dominated by CNNs, with continuous architectural improvements leading to models like AlexNet, VGG, ResNet, and EfficientNet.
- Attention mechanisms were gradually incorporated into CNNs (Squeeze-and-Excitation, Split-Attention Networks).
- The ViT paper [15] marked a paradigm shift by removing convolutions entirely. However, it concluded that transformers need huge datasets (JFT-300M) to outperform CNNs, framing them as data-hungry.
- This paper (DeiT) directly challenges that conclusion. It builds upon the ViT architecture but focuses on making the training process itself more data-efficient.
- Differentiation:
  - Versus ViT: The core architectural model (DeiT-B) is identical to ViT-B. The difference lies entirely in the training strategy and the introduction of the distillation token. DeiT proves that the training methodology, not just the architecture, was the key to unlocking performance on smaller datasets.
  - Versus Standard Distillation: Traditional knowledge distillation modifies only the loss function. DeiT's distillation is architectural: it introduces a new distillation token that acts as a dedicated pathway for the teacher's knowledge to flow through the model's layers via self-attention, making the knowledge transfer more deeply integrated.
4. Methodology (Core Technology & Implementation)
The paper's methodology can be broken down into two main parts: a recap of the ViT architecture and the introduction of their novel distillation strategy.
- Vision Transformer Architecture (Recap):
  - Input Processing: An input image is split into a grid of non-overlapping patches (e.g., 16×16-pixel patches for a 224×224 image, giving 14×14 = 196 patches).
  - Patch Embeddings: Each patch is flattened into a vector and linearly projected into a D-dimensional embedding space. These are the patch tokens.
  - Positional Embeddings: Since self-attention is permutation-invariant, learnable positional embeddings are added to each patch token to encode its spatial location.
  - Class Token: A special learnable [CLS] token is prepended to the sequence of patch tokens. This token is designed to aggregate global information from all patches as it passes through the transformer layers.
  - Transformer Encoder: The sequence of tokens (class token + patch tokens) is fed through a stack of standard transformer blocks. Each block consists of:
    - Multi-Head Self-Attention (MSA): Allows tokens to interact and exchange information based on their content.
    - Feed-Forward Network (FFN): A two-layer MLP applied to each token independently.
    - Layer Normalization and residual connections around both sub-layers.
  - Classification Head: After the final transformer block, only the output embedding corresponding to the [CLS] token is used. It is passed through a linear layer to produce the final class predictions. (A minimal code sketch of this pipeline follows below.)
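To make the recap concrete, here is a minimal, self-contained PyTorch sketch of the input pipeline and one encoder block. It assumes a DeiT-B-like configuration (224×224 inputs, 16×16 patches, embedding dimension 768, 12 heads); the class and parameter names are illustrative and not taken from the official repository.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Image -> token sequence: [CLS] + linearly projected patches + positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2               # 196 for 224 / 16
        # A strided conv is equivalent to flattening each patch and applying a linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                          # x: (B, 3, H, W)
        patches = self.proj(x).flatten(2).transpose(1, 2)          # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)            # (B, 1, dim)
        return torch.cat([cls, patches], dim=1) + self.pos_embed   # (B, N + 1, dim)

class EncoderBlock(nn.Module):
    """One transformer block: MSA and FFN, each with LayerNorm and a residual connection."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]          # tokens exchange information
        return x + self.ffn(self.norm2(x))                         # position-wise MLP
```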
- Distillation through Attention (DeiT's Core Contribution): The authors propose a new distillation strategy that is specific to the transformer architecture.
  - Principles: The core idea is to introduce a second source of supervision from a teacher model and integrate it deeply into the student transformer, rather than just at the final output layer. This is achieved by adding a new token that is solely responsible for learning from the teacher.
  - Steps & Procedures: As illustrated in Figure 2, the process is as follows:
    - A new learnable vector, the distillation token, is added to the input sequence alongside the class token and patch tokens. It is initialized randomly and learns via back-propagation.
    - This combined sequence of tokens ([CLS], [DISTIL], patch_1, ..., patch_N) is processed by the transformer encoder.
    - Crucially, the distillation token participates in the self-attention mechanism in every layer, just like the other tokens. This allows it to exchange information with both the patch representations and the class token.
    - At the output of the transformer, two separate linear classifiers are used (see the sketch after this list):
      - One classifier takes the output class token embedding to predict the class. Its loss is calculated against the ground-truth labels.
      - Another classifier takes the output distillation token embedding to predict the class. Its loss is calculated against the teacher's predictions.
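The sketch below (an assumed illustration, not the authors' code) shows how the distillation token and the two heads could be wired around an existing encoder; `encoder` is any stack of transformer blocks and all names are hypothetical.

```python
import torch
import torch.nn as nn

class DistilledClassifier(nn.Module):
    """Adds a learnable distillation token and two linear heads on top of an encoder,
    mirroring the procedure described above."""
    def __init__(self, encoder, dim=768, num_classes=1000):
        super().__init__()
        self.encoder = encoder                                   # stack of transformer blocks
        self.dist_token = nn.Parameter(torch.zeros(1, 1, dim))   # learned via back-propagation
        self.head_cls = nn.Linear(dim, num_classes)              # supervised by ground-truth labels
        self.head_dist = nn.Linear(dim, num_classes)             # supervised by the teacher

    def forward(self, tokens):                                   # tokens: (B, 1 + N, dim) = [CLS] + patches
        dist = self.dist_token.expand(tokens.shape[0], -1, -1)
        # Insert the distillation token right after [CLS]: [CLS, DISTIL, patch_1, ..., patch_N].
        # (A full implementation would also extend the positional embeddings to cover it.)
        x = torch.cat([tokens[:, :1], dist, tokens[:, 1:]], dim=1)
        x = self.encoder(x)                                      # both special tokens attend to everything
        return self.head_cls(x[:, 0]), self.head_dist(x[:, 1])   # two logit vectors
```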
  - Mathematical Formulas & Key Details: The paper explores two main types of distillation objectives for the distillation token.
    - Soft Distillation: This minimizes the Kullback-Leibler (KL) divergence between the student's and teacher's softened softmax outputs. The total loss is:

      $$\mathcal{L}_{\text{global}} = (1-\lambda)\,\mathcal{L}_{\text{CE}}\big(\psi(Z_s), y\big) + \lambda\,\tau^{2}\,\mathrm{KL}\big(\psi(Z_s/\tau),\, \psi(Z_t/\tau)\big)$$

      - $Z_s$ and $Z_t$ are the logits (pre-softmax outputs) of the student and teacher models, respectively.
      - $y$ is the ground-truth label.
      - $\psi$ is the softmax function.
      - $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss.
      - $\mathrm{KL}$ is the Kullback-Leibler divergence loss.
      - $\tau$ is the distillation temperature, which softens the probability distributions. A higher $\tau$ creates a softer distribution.
      - $\lambda$ is a coefficient that balances the standard cross-entropy loss with the distillation loss.
    - Hard-Label Distillation: The authors find this simpler variant to be more effective. Here, the student is trained to predict the teacher's "hard" decision (the class with the highest probability) as if it were a new ground-truth label. Let $y_t = \arg\max_c Z_t(c)$ be the hard label predicted by the teacher. The loss function becomes a simple average of two cross-entropy terms:

      $$\mathcal{L}_{\text{global}}^{\text{hardDistill}} = \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}\big(\psi(Z_s), y\big) + \tfrac{1}{2}\,\mathcal{L}_{\text{CE}}\big(\psi(Z_s), y_t\big)$$

      - The first term trains the student on the true label $y$. In the DeiT architecture, this loss is applied to the class token's output.
      - The second term trains the student on the teacher's predicted label $y_t$. This loss is applied to the distillation token's output.
  - Inference: At test time, the final prediction can be made using the class token head, the distillation token head, or a late fusion of both (by adding their softmax outputs). The paper shows that the fused approach works best.
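A minimal sketch of both objectives and the late-fusion inference rule described above, assuming the logits of the two heads are available; the temperature and weighting values are illustrative placeholders, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(z_cls, z_teacher, y, tau=3.0, lam=0.1):
    """(1 - lam) * CE(student, y) + lam * tau^2 * KL(teacher || student), with softened logits."""
    ce = F.cross_entropy(z_cls, y)
    kl = F.kl_div(F.log_softmax(z_cls / tau, dim=-1),
                  F.softmax(z_teacher / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    return (1.0 - lam) * ce + lam * kl

def hard_distillation_loss(z_cls, z_dist, z_teacher, y):
    """Class head matches the true label; distillation head matches the teacher's argmax."""
    y_teacher = z_teacher.argmax(dim=-1)              # the teacher's "hard" decision
    return 0.5 * F.cross_entropy(z_cls, y) + 0.5 * F.cross_entropy(z_dist, y_teacher)

@torch.no_grad()
def fused_prediction(z_cls, z_dist):
    """Late fusion at test time: add the softmax outputs of the two heads."""
    return F.softmax(z_cls, dim=-1) + F.softmax(z_dist, dim=-1)
```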
5. Experimental Setup
- Datasets: The experiments use a variety of standard public benchmarks. Table 6 from the paper is transcribed below.
This is a manual transcription of Table 6.
| Dataset | Train size | Test size | #classes |
|---|---|---|---|
| ImageNet [42] | 1,281,167 | 50,000 | 1000 |
| iNaturalist 2018 [26] | 437,513 | 24,426 | 8,142 |
| iNaturalist 2019 [27] | 265,240 | 3,003 | 1,010 |
| Flowers-102 [38] | 2,040 | 6,149 | 102 |
| Stanford Cars [30] | 8,144 | 8,041 | 196 |
| CIFAR-100 [31] | 50,000 | 10,000 | 100 |
| CIFAR-10 [31] | 50,000 | 10,000 | 10 |

- ImageNet-1k: The primary dataset for pre-training and main evaluation.
- Others: Used for transfer learning to evaluate the generalization ability of the pre-trained models.
- Evaluation Metrics:
  - Top-1 Accuracy:
    - Conceptual Definition: This metric measures the standard classification accuracy. It calculates the percentage of images in the test set for which the model's predicted class with the highest probability (the "top-1" prediction) is the correct ground-truth class.
    - Mathematical Formula:

      $$\text{Top-1 Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big(\hat{y}_i = y_i\big)$$

    - Symbol Explanation:
      - $N$ is the total number of samples in the test set.
      - $y_i$ is the true label for the $i$-th sample.
      - $\hat{y}_i$ is the predicted label for the $i$-th sample (i.e., the class with the highest output probability).
      - $\mathbb{1}(\cdot)$ is the indicator function, which is 1 if the condition inside is true and 0 otherwise.
- Throughput:
- Conceptual Definition: Measures the model's inference speed. It is defined as the number of images that can be processed per second on a given hardware setup. Higher throughput is better.
- The paper measures this on a single V100 GPU with the largest possible batch size to maximize hardware utilization (see the helper sketch below).
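For concreteness, here is a hedged sketch of how these two metrics could be computed; the helper names, batch size, and warm-up/iteration counts are illustrative choices, not the paper's exact measurement protocol.

```python
import time
import torch

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-probability class matches the ground truth."""
    return (logits.argmax(dim=-1) == labels).float().mean().item()

@torch.no_grad()
def measure_throughput(model, batch_size=256, img_size=224, warmup=10, iters=30):
    """Images processed per second for a fixed batch on a single GPU."""
    model.eval().cuda()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(warmup):                       # warm up kernels before timing
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
```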
- Baselines:
  - Convolutional Networks: ResNet, EfficientNet, and RegNetY. These represent the state-of-the-art in CNNs and are the primary performance targets.
  - Vision Transformers: The original ViT models [15] trained on ImageNet-1k and the larger JFT-300M dataset. The DeiT models are directly compared to ViT to isolate the benefit of the new training and distillation strategy.
  - DeiT Models: The paper introduces several variants of their model, differing in size: DeiT-Ti (Tiny, 5M params), DeiT-S (Small, 22M params), and DeiT-B (Base, 86M params), which is architecturally identical to ViT-B. The alembic symbol ⚗ (written DeiT-B⚗ in the paper) denotes models trained with their distillation procedure.
6. Results & Analysis
The paper presents a comprehensive set of experiments to validate its claims.
Core Results: Distillation Strategy
- Convnet Teachers are Better: Table 2 shows that using a RegNetY (a CNN) as a teacher consistently yields better student performance than using a DeiT-B model as a teacher, even when the DeiT-B teacher has a higher accuracy (81.8% vs. RegNetY-8GF's 81.7%). This strongly supports the hypothesis that CNNs transfer valuable inductive biases to the transformer student.
This is a manual transcription of Table 2.
| Teacher model | Teacher acc. | Student (DeiT-B): pretrain | Student (DeiT-B): ↑384 |
|---|---|---|---|
| DeiT-B | 81.8 | 81.9 | 83.1 |
| RegNetY-4GF | 80.0 | 82.7 | 83.6 |
| RegNetY-8GF | 81.7 | 82.7 | 83.8 |
| RegNetY-12GF | 82.4 | 83.1 | 84.1 |
| RegNetY-16GF | 82.9 | 83.1 | 84.2 |

- Superiority of the Distillation Token: Table 3 provides a clear comparison of different distillation methods.
  - Hard vs. Soft Distillation: Hard distillation (83.0% for DeiT-B) significantly outperforms soft distillation (81.8%), which performs no better than the baseline without distillation.
  - Distillation Token Benefit: The proposed DeiT method with the distillation token and hard labels further boosts performance to 83.4%.
  - Complementary Tokens: The class and distillation tokens provide complementary information. Using either one alone gives good results (83.0% and 83.1%), but combining them (class + distillation) gives the best performance (83.4%).

This is a manual transcription of Table 3. The four rightmost columns report ImageNet top-1 (%).

| Method | Label supervision | Teacher supervision | Ti 224 | S 224 | B 224 | B↑384 |
|---|---|---|---|---|---|---|
| DeiT – no distillation | ✓ | ✗ | 72.2 | 79.8 | 81.8 | 83.1 |
| DeiT – usual distillation | ✗ | soft | 72.2 | 79.8 | 81.8 | 83.2 |
| DeiT – hard distillation | ✗ | hard | 74.3 | 80.9 | 83.0 | 84.0 |
| DeiT: class embedding | ✓ | hard | 73.9 | 80.9 | 83.0 | 84.2 |
| DeiT: distil. embedding | ✓ | hard | 74.6 | 81.1 | 83.1 | 84.4 |
| DeiT: class+distillation | ✓ | hard | 74.5 | 81.2 | 83.4 | 84.5 |
- Effect of Training Epochs: Figure 3 shows that while the baseline DeiT's performance saturates around 400 epochs, the distilled models continue to improve with longer training schedules, reaching over 85% accuracy after 1000 epochs.
Figure 3 (description): A line chart of ImageNet top-1 accuracy versus the number of training epochs for DeiT-B under different distillation methods. It compares no distillation, usual distillation, hard distillation, the distillation-token approach, and its ↑384 variant, showing that the distillation-token method clearly outperforms the others and keeps improving as training is extended.
Core Results: Accuracy vs. Efficiency
- Closing the Gap with Convnets: Figure 1 visually demonstrates the main success of the paper.
  - The baseline DeiT models ("Ours") significantly outperform the original ViT models trained on ImageNet-1k, establishing a much stronger transformer baseline.
  - The distilled DeiT models ("Ours", marked with the alembic symbol) push the performance-throughput curve above that of the highly optimized EfficientNet family. For example, DeiT-B achieves higher accuracy than EfficientNet-B5 with significantly higher throughput.
Figure 1 (description): A scatter plot comparing models on ImageNet by throughput (images/s, x-axis) and top-1 accuracy (%, y-axis). It includes EfficientNet, ViT, the authors' method (Ours), and its distilled version, showing that the authors' models reach high accuracy at reasonable speed despite being trained on a single machine with limited data.
- Detailed Performance Comparison: Table 5 provides extensive numerical results. The distilled DeiT-B↑384 model trained for 1000 epochs reaches 85.2% top-1 accuracy on ImageNet, outperforming the original ViT-B/16 (77.9%) by a massive margin and even surpassing the ViT-B/16 pre-trained on the JFT-300M dataset (84.15% reported in the ViT paper). This is achieved using only ImageNet-1k data.

This is a manual transcription of the relevant parts of Table 5.
| Network | #param. | Image size | Throughput (image/s) | ImNet top-1 | Real top-1 | V2 top-1 |
|---|---|---|---|---|---|---|
| Convnets | | | | | | |
| RegNetY-16GF* | 84M | 224² | 334.7 | 82.9 | 88.1 | 72.4 |
| EfficientNet-B4 | 19M | 380² | 349.4 | 82.9 | 88.3 | 73.6 |
| EfficientNet-B7 | 66M | 600² | 55.1 | 84.3 | - | - |
| Transformers | | | | | | |
| ViT-B/16 [15] | 86M | 384² | 85.9 | 77.9 | 83.6 | - |
| DeiT-B | 86M | 224² | 292.3 | 81.8 | 86.7 | 71.5 |
| DeiT-B↑384 | 86M | 384² | 85.9 | 83.1 | 87.7 | 72.4 |
| DeiT-B (distilled) | 87M | 224² | 290.9 | 83.4 | 88.3 | 73.2 |
| DeiT-B↑384 (distilled) | 87M | 384² | 85.8 | 84.5 | 89.0 | 74.8 |
| DeiT-B↑384 / 1000 epochs (distilled) | 87M | 384² | 85.8 | 85.2 | 89.3 | 75.2 |
Ablations / Parameter Sensitivity
Table 8 provides a detailed ablation study on the ingredients of the data-efficient training recipe. Key takeaways include:
- Optimizer: AdamW is crucial; SGD performs much worse.
- Data Augmentation: A combination of strong augmentations (Rand-Augment, Mixup, CutMix, Random Erasing) is essential. Removing them individually hurts performance. Rand-Augment is shown to be slightly better than AutoAugment for this setup.
- Regularization: Stochastic Depth and Repeated Augmentation are key contributors to the final performance. Repeated Augmentation, in particular, provides a significant boost. (An illustrative configuration summary follows below.)
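As a reference point, the sketch below collects the recipe's main ingredients into a single configuration dictionary. The values follow commonly cited DeiT settings and should be treated as assumptions to verify against the paper's hyperparameter table, not as an exact reproduction.

```python
# Illustrative summary of the DeiT-style training recipe discussed above.
# Values follow commonly cited DeiT settings; verify against the paper before reuse.
deit_recipe = {
    "optimizer": "AdamW",                 # SGD performs much worse in the ablation
    "base_lr": 5e-4,                      # typically scaled as 5e-4 * batch_size / 512
    "weight_decay": 0.05,
    "epochs": 300,
    "warmup_epochs": 5,
    "label_smoothing": 0.1,
    "stochastic_depth": 0.1,              # drop-path regularization
    "repeated_augmentation": True,        # notable boost per the ablation
    "rand_augment": "rand-m9-mstd0.5",    # RandAugment policy (timm-style string)
    "mixup_alpha": 0.8,
    "cutmix_alpha": 1.0,
    "random_erasing_prob": 0.25,
}
```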
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that Vision Transformers do not inherently require massive datasets to perform well. Through a carefully designed training recipe and a novel distillation method, the authors show that a standard ViT architecture can be trained on ImageNet-1k alone to achieve performance that is not only competitive with but can exceed state-of-the-art convolutional networks. The core innovation, a distillation token, provides a powerful mechanism for transferring inductive biases from a teacher (preferably a CNN) into the student transformer, making the training process far more data-efficient.
- Limitations & Future Work: The authors suggest that their work is just the beginning of optimizing transformers for vision. CNNs have benefited from nearly a decade of research into architecture and optimization. The training strategies used in DeiT are largely adapted from those developed for CNNs. The authors propose that future research into data augmentation and regularization techniques specifically designed for transformers could lead to even further gains.
- Personal Insights & Critique:
  - Impact: This paper was a landmark publication that significantly democratized research and application of Vision Transformers. By removing the dependency on private, large-scale datasets, DeiT made powerful transformer models accessible to the wider community, fueling a surge of innovation in the field.
  - Strength: The simplicity and elegance of the distillation token idea are a major strength. It is an architectural modification that is intuitive, easy to implement, and demonstrably effective. The thoroughness of the experimental analysis, especially the ablation studies and comparisons, makes the paper's conclusions highly credible.
  - Critique/Open Questions: While the paper convincingly shows that a CNN is a better teacher, the underlying reasons are still framed as transferring "inductive biases." A deeper theoretical or empirical analysis of what exactly is being transferred through the distillation token (e.g., attention patterns that mimic locality, feature map characteristics) would be a valuable extension. Furthermore, the reliance on hard-label distillation is interesting, as it contradicts some findings in other domains where soft labels are preferred; exploring this discrepancy could yield further insights into the learning dynamics of transformers.