DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models
TL;DR Summary
DART proposes a differentiable, dynamic, adaptive region tokenizer to overcome fixed-grid tokenization bottlenecks in vision models. By using learnable scores and quantile partitioning, it creates content-aware, variable-sized patches. This method allows smaller models to match the accuracy of much larger ones, e.g., a DART-equipped DeiT-Small matching DeiT-Base at nearly double the inference speed.
Abstract
The content-agnostic, fixed-grid tokenizers used by standard large-scale vision models like Vision Transformer (ViT) and Vision Mamba (Vim) represent a fundamental performance bottleneck, creating a trade-off between capturing fine-grained detail and suffering from redundant computation. To resolve this dilemma, we introduce DART, a fully differentiable Dynamic Adaptive Region Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes, intelligently allocating a higher token density to information-rich regions. The impact of this approach is profound: it unlocks a more intelligent scaling paradigm, where a DART-equipped DeiT-Small (22M parameters) matches the performance of a DeiT-Base (86M) with nearly double the inference speed by efficiently capturing high-resolution details in key regions. Furthermore, the principle of adaptive tokenization proves its generality with clear benefits in dense prediction and spatiotemporal video tasks. We argue that by resolving the tokenizer bottleneck at its source, adaptive tokenization is a key component for building the next generation of more efficient and capable foundation models for multimodal AI, robotics, and content generation. Code is available at https://github.com/HCPLab-SYSU/DART.
In-depth Reading
1. Bibliographic Information
- Title: DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models
- Authors: Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin. The authors are affiliated with Sun Yat-sen University, China.
- Journal/Conference: The paper is available on arXiv, a public repository for electronic preprints of scientific papers. This indicates it has been shared with the research community but may not have completed a formal peer-review process for a specific conference or journal yet. Given the topic and quality, it would be suitable for top-tier computer vision venues like CVPR, ICCV, or ECCV, or general machine learning conferences like NeurIPS or ICLR.
- Publication Year: The paper's arXiv identifier (2506.10390) indicates a submission date of June 2025, which is consistent with the cited works being from 2024 and earlier.
- Abstract: The paper addresses a key limitation in standard vision models like the Vision Transformer (ViT): their use of a fixed-grid tokenizer. This rigid approach creates a trade-off between capturing fine details (requiring high resolution and thus high computational cost) and efficiency (suffering from redundant processing of unimportant background areas). The authors introduce DART, a fully differentiable tokenizer that creates content-aware patches of varying sizes by learning region importance scores. This allows the model to allocate more tokens (higher resolution) to information-rich areas. The key finding is that DART enables a more intelligent scaling paradigm; for example, a small model (DeiT-Small) equipped with DART can match the performance of a much larger model (DeiT-Base) while being nearly twice as fast. The authors demonstrate DART's general applicability across various models and tasks (including dense prediction and video) and argue that adaptive tokenization is a crucial technology for future foundation models.
- Original Source Link: The paper is available as a preprint at https://arxiv.org/abs/2506.10390, and the PDF can be accessed at http://arxiv.org/pdf/2506.10390v3.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern vision foundation models, especially the widely adopted Vision Transformer (ViT), rely on a primitive "tokenizer" that divides an image into a uniform grid of patches. This is computationally inefficient, as it treats a blank sky with the same importance as a detailed foreground object. Increasing image resolution to capture small objects makes the token count grow quadratically with image side length, and self-attention cost grows quadratically with token count, so most of the added computation is wasted on redundant background information.
- Importance & Gaps: While hierarchical models like Swin Transformer were developed to mitigate this, they alter the simple, uniform architecture of ViT, which has become a de facto standard in Large Multimodal Models (LMMs) like LLaVA and Gemini. There is a need for a solution that resolves the tokenizer bottleneck without sacrificing the architectural simplicity and compatibility of the uniform ViT paradigm.
- Innovation: DART proposes to solve this problem "at the source." Instead of changing the model's architecture or trying to prune tokens after they've been created, DART replaces the fixed-grid tokenizer with a dynamic, content-aware one. This module intelligently allocates a fixed token budget by creating small, dense patches for important regions and large, coarse patches for the background.
- Main Contributions / Findings (What):
- A Novel Differentiable Tokenizer (DART): The paper introduces a lightweight, fully differentiable module that can be dropped into existing models. It uses a learnable scoring mechanism and a novel quantile-based partitioning algorithm to create content-adaptive tokens.
- An Intelligent Scaling Paradigm: DART's most profound impact is that it enables smaller models to achieve the performance of much larger ones with significantly less computation. A DART-equipped DeiT-Small (22M parameters) matches a DeiT-Base (86M parameters) in accuracy but with 4x fewer parameters and nearly 2x the inference speed. This demonstrates that how a computational budget is spent is as important as its size.
- Broad Applicability and Validation: The authors show that DART is a universal enhancement, providing consistent improvements for both Transformer-based (DeiT) and State-Space-Model-based (Vision Mamba) backbones. Its benefits also extend beyond image classification to dense prediction (semantic segmentation) and spatiotemporal video understanding tasks.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Vision Transformer (ViT): The standard ViT model processes images by first splitting them into a grid of non-overlapping, fixed-size patches (e.g., 16x16 pixels). Each patch is flattened and linearly projected into a vector, called a "token." These tokens, along with positional information, are then fed into a standard Transformer encoder, which uses self-attention to learn relationships between them. The core limitation DART addresses is this initial, content-agnostic patching step (a minimal sketch of which appears after this list).
- Uniform vs. Hierarchical Architectures: A uniform architecture, like ViT, maintains the same number of tokens and feature resolution throughout all its layers. A hierarchical architecture, like Swin Transformer, progressively merges patches in deeper layers, creating a feature pyramid with decreasing spatial resolution, similar to a traditional Convolutional Neural Network (CNN). While efficient, this complexity can create a mismatch with Large Language Models (LLMs) that expect a flat sequence of tokens.
- Tokenization in Vision: This refers to the process of converting a continuous input, like an image, into a sequence of discrete units (tokens) for processing by a sequence model like a Transformer. DART rethinks this fundamental step.
- Differentiability: A function or module is differentiable if its output's derivative with respect to its input can be calculated. In deep learning, this is essential because it allows gradients to flow backward through the module during training (via backpropagation), enabling the model's parameters to be learned end-to-end using optimization algorithms like gradient descent. DART's fully differentiable design is a key advantage over methods that make discrete, non-differentiable decisions.
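For contrast with DART's adaptive partitioning, here is a minimal PyTorch sketch of the standard fixed-grid patching step (the 16x16 patch size and tensor layout follow the usual ViT convention; the helper name is ours):

```python
import torch

def fixed_grid_patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """The content-agnostic tokenizer DART replaces: split an image into
    a uniform grid of non-overlapping patches and flatten each one.
    img: (B, C, H, W) with H and W divisible by `patch`."""
    B, C, H, W = img.shape
    x = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x  # (B, num_patches, C*p*p), ready for the linear projection
```

Every patch here has identical size regardless of content, which is exactly the rigidity DART removes.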
- Previous Works & Differentiation: The paper categorizes previous attempts to solve the tokenizer bottleneck into two main philosophies, contrasting them with DART's approach.
- Architectural Solutions (e.g., Swin Transformer, PVT): These models build the solution into the backbone itself by using mechanisms like patch merging to create multi-scale feature pyramids.
- Limitation: This is a "heavyweight" solution that fundamentally changes the simple ViT architecture, making it less compatible with the ecosystem of large, pretrained uniform ViTs used in LMMs. Furthermore, the merging process is typically structural and content-agnostic.
- Post-Tokenization Adaptation (e.g., DynamicViT, A-ViT): These methods first create tokens using the standard rigid grid and then use an internal module to prune (discard) or merge unimportant tokens in later layers.
- Limitation: This is a "post-hoc remedy" that acts after the inefficient tokenization has already occurred. The decision to discard a token is discrete and not naturally differentiable, often requiring complex proxies like Gumbel-Softmax for training. This can also lead to variable-length token sequences, which can be inefficient to process in batches on hardware like GPUs.
- DART's Differentiating Approach (Pre-Tokenization Adaptation): DART is a "front-end solution" that addresses the problem at its source, before the tokens are fed to the backbone.
- Innovation: It is a proactive optimization that ensures the token sequence is information-dense from the start. Its partitioning mechanism is fully differentiable, allowing for smooth, end-to-end learning of patch boundaries. Finally, it always produces a fixed-length token sequence, maintaining compatibility with standard training pipelines and hardware.
4. Methodology (Core Technology & Implementation)
DART replaces the standard fixed-grid patcher with a three-stage dynamic process. The core innovation is a fully differentiable method for partitioning an image based on learned importance.
- Principles: The central idea is to invest a small amount of computation upfront in a "scouting" network to generate an information density map of the input image. This map then guides a partitioning algorithm to allocate a fixed token budget intelligently, creating small patches in high-information areas and large patches in low-information areas.
- Steps & Procedures:
1. Score Prediction Network:
- An input image is passed through a lightweight, pretrained CNN (e.g., MobileNetV3).
- This extracts a feature map, which is then fed to a shallow Multi-Layer Perceptron (MLP) to predict a single-channel score map $S$.
- The raw scores are normalized to produce a stable 2D probability distribution $P$, where each value represents the relative importance of that location. This involves a `sigmoid` function to constrain values to `[0, 1]`, followed by a per-sample normalization so the entries of $P$ sum to 1 (see the sketch below).
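A minimal PyTorch sketch of such a scoring network follows. The `mobilenet_v3_small` backbone, the 1x1-convolution head, and the layer sizes are illustrative assumptions; the paper specifies only a lightweight pretrained CNN followed by a shallow MLP:

```python
import torch
import torch.nn as nn
from torchvision import models

class ScoreNet(nn.Module):
    """Scouting network sketch: CNN features -> shallow head -> sigmoid
    -> per-sample normalization into a 2D probability map P."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        # Lightweight pretrained feature extractor (MobileNetV3 per the paper).
        self.backbone = models.mobilenet_v3_small(weights="DEFAULT").features
        self.head = nn.Sequential(          # shallow "MLP" head as 1x1 convs
            nn.Conv2d(576, hidden_dim, 1),  # 576 = backbone output channels
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, 1, 1),    # single-channel score map S
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, H, W)
        s = self.head(self.backbone(x))          # raw scores, (B, 1, h, w)
        p = torch.sigmoid(s)                     # constrain to [0, 1]
        p = p / p.sum(dim=(2, 3), keepdim=True)  # per sample, sums to 1
        return p.squeeze(1)                      # probability map P, (B, h, w)
```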
2. Differentiable Quantile Partitioning: This is the technical heart of DART. It finds boundaries that divide a probability distribution into segments of equal cumulative probability (quantiles).
- Mathematical Formalism:
Given a 1D discrete probability distribution $p = (p_1, \dots, p_N)$, the goal is to find $K-1$ boundary points.
- Construct CDF: The discrete distribution is used to construct a continuous, piecewise-linear Cumulative Distribution Function (CDF), denoted $F(x)$. At any integer point $n$, its value is the sum of all preceding probabilities: $F(n) = \sum_{i=1}^{n} p_i$. Within any interval $[n, n+1]$, the CDF is a straight line with slope $p_{n+1}$: $F(x) = F(n) + p_{n+1}(x - n)$.
- Invert CDF: To find the boundary $x_k$ corresponding to a target quantile (cumulative probability) $q_k$ (e.g., $k/K$), we solve $F(x_k) = q_k$. This is done by first finding the interval $[n, n+1]$ where $q_k$ falls, and then solving the linear equation for $x_k$: $x_k = n + \frac{q_k - F(n)}{p_{n+1}}$. This entire process is composed of differentiable operations, allowing gradients to flow back to the score prediction network (sketched in code below).
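The following PyTorch sketch reconstructs this CDF-inversion step from the description above (it is not the authors' code; the batched interface and the small slope floor are our assumptions):

```python
import torch

def quantile_boundaries(p: torch.Tensor, K: int) -> torch.Tensor:
    """Differentiable 1D quantile partitioning sketch.
    p: (B, N) per-bin probabilities, each row summing to 1.
    Returns (B, K-1) boundary positions in [0, N] that split the
    piecewise-linear CDF into K segments of equal probability mass."""
    B, N = p.shape
    cdf = torch.cumsum(p, dim=1)                                # F(1)..F(N)
    cdf = torch.cat([cdf.new_zeros(B, 1), cdf], dim=1)          # prepend F(0)=0
    q = torch.arange(1, K, device=p.device, dtype=p.dtype) / K  # targets k/K
    q = q.expand(B, -1).contiguous()
    # Locate the interval [n, n+1] containing each q_k. The index itself is
    # discrete, but the boundary below remains differentiable w.r.t. p.
    n = (torch.searchsorted(cdf, q) - 1).clamp(0, N - 1)
    F_n = cdf.gather(1, n)                       # F(n)
    slope = p.gather(1, n).clamp_min(1e-8)       # p_{n+1}, guard against /0
    return n.to(p.dtype) + (q - F_n) / slope     # x_k = n + (q_k - F(n)) / p_{n+1}
```

As a sanity check, a uniform distribution over 6 bins with K = 3 yields boundaries at exactly 2.0 and 4.0 (the fixed grid); any non-uniformity in `p` pulls the cuts toward high-mass bins.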
3. Partitioning Strategies: DART offers two variants for applying this 1D algorithm to a 2D image.
- DART-Grid (Grid-Preserving): As shown in Image 5, this method creates a non-uniform grid (a code sketch follows Image 5's caption below).
- The 2D score map is summed along its rows and columns to get two 1D marginal probability distributions, $P_{\text{row}}$ and $P_{\text{col}}$.
- The 1D quantile algorithm is applied independently to $P_{\text{row}}$ to find horizontal boundaries (rows of varying heights) and to $P_{\text{col}}$ to find vertical boundaries (columns of varying widths).
- The result is a grid where patch areas are inversely proportional to information density, but the overall grid topology is maintained.
Image 5: Illustration of the core partitioning mechanism. (a) A 1D distribution is partitioned by finding the x-values on its piecewise-linear CDF that correspond to uniform quantiles (e.g., 1/3, 2/3). (b) DART-Grid applies this 1D algorithm to the horizontal and vertical marginal distributions of a 2D score map to create a non-uniform grid.
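Under the same assumptions, DART-Grid amounts to two calls of the 1D routine sketched earlier, one per marginal (hypothetical helper reusing `quantile_boundaries`):

```python
import torch

def dart_grid(p2d: torch.Tensor, Kh: int, Kw: int):
    """DART-Grid sketch: non-uniform but grid-shaped partition from the
    two marginals of the score map. p2d: (B, h, w), summing to 1."""
    p_row = p2d.sum(dim=2)                          # row marginal, (B, h)
    p_col = p2d.sum(dim=1)                          # column marginal, (B, w)
    p_row = p_row / p_row.sum(dim=1, keepdim=True)  # renormalize each marginal
    p_col = p_col / p_col.sum(dim=1, keepdim=True)
    y_cuts = quantile_boundaries(p_row, Kh)         # Kh-1 horizontal boundaries
    x_cuts = quantile_boundaries(p_col, Kw)         # Kw-1 vertical boundaries
    return y_cuts, x_cuts                           # rows/columns of varying size
```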
- DART-Flow (Topology-Breaking): This is the primary, more powerful method. It allows for a global reallocation of the token budget. The process is visualized in Image 6 and sketched in code after its caption.
- Adaptive Row Partitioning: First, it partitions the image horizontally into rows of varying heights, just like DART-Grid.
- Global Token Allocation: It then conceptually concatenates these adaptive rows into a single, very long 1D sequence. The 1D quantile algorithm is applied once to the probability distribution of this flattened sequence to find all $N_{\text{total}} - 1$ boundaries for the final patches. This key step allows tokens to "flow" from low-information rows to high-information rows, breaking the rigid grid structure and enabling a much more flexible concentration of resources.
Image 6: The DART-Flow process. The partitioning is sequential: first, adaptive horizontal rows are created. Then, the token budget is allocated globally across these virtually flattened rows, allowing tokens to concentrate in the most important areas regardless of their initial row.
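A simplified sketch of the two-stage DART-Flow procedure (one simplification to note: we flatten the raw score map row-major, whereas the paper concatenates the adaptive rows produced in stage 1):

```python
import torch

def dart_flow(p2d: torch.Tensor, Kh: int, n_total: int):
    """DART-Flow sketch: adaptive rows first, then one global 1D
    quantile pass so the token budget can migrate across rows."""
    B, h, w = p2d.shape
    # Stage 1: adaptive horizontal cuts from the row marginal.
    p_row = p2d.sum(dim=2)
    p_row = p_row / p_row.sum(dim=1, keepdim=True)
    row_cuts = quantile_boundaries(p_row, Kh)            # (B, Kh-1)
    # Stage 2: one global pass over the flattened sequence places all
    # n_total-1 boundaries at once, breaking the grid topology.
    p_flat = p2d.reshape(B, h * w)
    p_flat = p_flat / p_flat.sum(dim=1, keepdim=True)
    token_cuts = quantile_boundaries(p_flat, n_total)    # (B, n_total-1)
    return row_cuts, token_cuts
```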
4. Differentiable Resampling and Positional Transformation:
- Image Content: Once the non-uniform patch boundaries are defined, a fixed-size patch (e.g., 16x16) is sampled from each region using bilinear interpolation. The boundaries define an affine transformation that maps the sampling grid to the input image. This entire sampling process is differentiable.
- Positional Embeddings (PE): To inform the model about the location and size of each adaptive patch, the PEs are also transformed. The standard PE grid is treated as a learnable map. The PE for each new token is sampled from this map at the token's center coordinate, again using bilinear interpolation. This is a critical step to preserve spatial awareness (both samplings are sketched in code below).
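Both samplings can be expressed with `torch.nn.functional.grid_sample`; below is a minimal sketch for one region (the normalized box coordinates and the helper signature are our assumptions):

```python
import torch
import torch.nn.functional as F

def sample_patch(img: torch.Tensor, box, patch: int = 16) -> torch.Tensor:
    """Resample one adaptive region into a fixed-size patch by bilinear
    interpolation. img: (B, C, H, W); box = (x0, y0, x1, y1) in the
    normalized [-1, 1] coordinates grid_sample expects. Sampling the
    learnable PE map at each patch's center uses the same mechanism."""
    x0, y0, x1, y1 = box                    # may be tensors carrying gradients
    t = torch.linspace(0, 1, patch, device=img.device)
    ys = y0 + (y1 - y0) * t                 # differentiable in the boundaries
    xs = x0 + (x1 - x0) * t
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)    # (patch, patch, 2), (x, y) order
    grid = grid.unsqueeze(0).expand(img.size(0), -1, -1, -1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```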
- Application to Video: The DART framework extends naturally to video. All frames of a clip are vertically concatenated into a single large image, and DART is applied. To capture temporal importance, the scoring network processes the difference between consecutive frames, causing it to focus on motion. As shown in Image 7, this efficiently compresses temporal redundancy, as static objects receive fewer tokens over time (a sketch follows Image 7's caption).
Image 7: An example of DART's partitioning on a video from the SSv2 dataset. The partitioning changes over time, allocating more tokens to regions with significant motion (e.g., the hand and the object it interacts with).
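A sketch of how the motion-sensitive scoring input might be assembled (the paper states only that the scoring network processes consecutive-frame differences and that frames are concatenated vertically; the first-frame padding here is our assumption):

```python
import torch

def video_score_input(clip: torch.Tensor) -> torch.Tensor:
    """Build the scoring-network input for video: frame differences
    highlight motion, stacked vertically into one tall image.
    clip: (B, T, C, H, W)."""
    diff = (clip[:, 1:] - clip[:, :-1]).abs()     # (B, T-1, C, H, W)
    diff = torch.cat([diff[:, :1], diff], dim=1)  # repeat first diff -> T frames
    B, T, C, H, W = diff.shape
    # Vertical concatenation of the T frames into a single tall image.
    return diff.permute(0, 2, 1, 3, 4).reshape(B, C, T * H, W)
```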
5. Experimental Setup
- Datasets:
- ImageNet-1K: A large-scale dataset for image classification with 1.28 million training images across 1000 classes. It's the standard benchmark for evaluating vision backbones.
- ADE20k: A challenging scene parsing dataset for semantic segmentation, containing over 20,000 images with pixel-level annotations for 150 semantic categories.
- Something-Something-V2 (SSv2): A large-scale video dataset focused on human-object interactions, where temporal reasoning and motion are critical.
- Kinetics-400: A large-scale action recognition dataset with 400 human action classes, which is more scene-centric than SSv2.
- Evaluation Metrics:
- Top-1 Accuracy:
- Conceptual Definition: This metric measures the percentage of predictions for which the class with the highest predicted probability is the correct class. It is the most common metric for image classification.
- Mathematical Formula: $\text{Top-1 Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$
- Symbol Explanation: $N$ is the total number of samples, $y_i$ is the true label for the $i$-th sample, $\hat{y}_i$ is the predicted label with the highest probability, and $\mathbb{1}(\cdot)$ is the indicator function (1 if the condition is true, 0 otherwise).
- mean Intersection over Union (mIoU):
- Conceptual Definition: A standard metric for evaluating semantic segmentation. Intersection over Union (IoU) for a single class is the ratio of the area of overlap between the predicted segmentation mask and the ground truth mask to the area of their union. mIoU is simply the average of the IoU scores calculated over all classes in the dataset.
- Mathematical Formula: $\text{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}$
- Symbol Explanation: $C$ is the number of classes. For each class $c$, $TP_c$ (True Positives), $FP_c$ (False Positives), and $FN_c$ (False Negatives) are the counts of pixels correctly classified, incorrectly classified as this class, and incorrectly classified as another class, respectively. (Minimal reference implementations of both metrics follow this list.)
- FLOPs (Floating Point Operations): A measure of the total number of arithmetic operations required for a single forward pass of the model. It's a hardware-independent metric for computational complexity. Reported in GFLOPs (GigaFLOPs, or billions of FLOPs).
- Latency and Throughput (Img/s): Hardware-dependent metrics for inference speed. Latency is the time taken to process a single image or batch (in milliseconds). Throughput is the number of images that can be processed per second.
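For reference, minimal NumPy implementations of both metrics under the definitions above (skipping classes absent from both prediction and ground truth is one common convention, not necessarily the paper's):

```python
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples whose highest-scoring class is correct.
    logits: (N, num_classes); labels: (N,)."""
    return float((logits.argmax(axis=1) == labels).mean())

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU over flat per-pixel label arrays of equal shape."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # correctly labeled c
        fp = np.sum((pred == c) & (gt != c))   # predicted c, truly other
        fn = np.sum((pred != c) & (gt == c))   # truly c, predicted other
        denom = tp + fp + fn
        if denom > 0:                          # skip classes absent everywhere
            ious.append(tp / denom)
    return float(np.mean(ious))
```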
- Baselines:
- Backbones: `DeiT` (Data-efficient Image Transformer) in Tiny, Small, and Base sizes; `Vision Mamba (Vim)` in Tiny, Small, and Base sizes; `VideoMamba`; and `Swin Transformer`. This covers both standard Transformer and emerging State Space Model backbones.
- Dynamic Inference Methods: `DynamicViT`, `A-ViT`, `IA-RED2`. These are representative token pruning/merging methods used for comparison.
6. Results & Analysis
The experiments robustly validate DART's effectiveness and its core thesis of enabling a more intelligent scaling paradigm.
- Core Results: Unlocking an Intelligent Scaling Paradigm. The most compelling results demonstrate that using DART with a small model is a better path to high performance than simply training a larger model.
- Efficiency on High-Resolution Inputs: Table 1 shows that when fine-tuned on higher-resolution inputs (which generates more tokens), a `DART-DeiT-S` achieves virtually the same accuracy (81.5% vs. 81.6%) as the baseline `DeiT-S` while using only 46% of the FLOPs (7.2G vs. 15.5G).

| Backbone | Tokenizer | Params | Patches | FLOPs | Top-1 (%) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| DeiT-S† | Baseline | 22M | 576 | 15.5G | 81.6 |
| DeiT-S | DART | 24M | 288 | 7.2G | 81.5 |
| VideoMamba-Ti† | Baseline | 7M | 1296 | 7.11G | 79.6 |
| VideoMamba-Ti | DART | 8M | 392 | 2.24G | 79.7 |

Note: This table is a transcription of Table 1 from the original paper. † denotes long-sequence fine-tuning.
- Smaller Models Matching Larger Counterparts: Table 2 and Figure 2 show that `DART-DeiT-S` (22M params) matches the 81.8% accuracy of a DeiT-Base (86M params) while using only a quarter of the parameters, fewer FLOPs (10.1G vs 17.5G), and achieving nearly double the inference speed (from Table 6: 1.7x throughput). Similarly, `DART-Vim-S` surpasses `Vim-Base`. This is the central evidence for the "intelligent scaling" claim.
| Backbone | Params | Patches | FLOPs | Top-1 (%) |
| :--- | :--- | :--- | :--- | :--- |
| **DeiT Family** | | | | |
| DeiT-B (Target) | 86M | 196 | 17.5G | 81.8 |
| DeiT-S† | 22M | 576 | 15.5G | 81.6 |
| DeiT-S‡ + DART | 24M | 392 | 10.1G | 81.8 |
| **Vim Family** | | | | |
| Vim-B (Target) | 98M | 196 | 19.9G | 81.9 |
| Vim-S‡ | 26M | 784 | 19.6G | 81.6 |
| Vim-S† + DART | 29M | 392 | 10.9G | 82.2 |
Note: This table is a transcription of Table 2 from the original paper. †/‡ denotes long-sequence fine-tuning.
Image 4: This plot visualizes the accuracy vs. FLOPs trade-off. The DeiT-S + DART line (blue) shows a much better scaling curve, achieving high accuracy at significantly lower computational cost compared to the baseline approach of scaling up the model size (DeiT-Base) or naively increasing sequence length.
- Ablations / Parameter Sensitivity:
- Partitioning Strategy: Table 8 confirms that the topology-breaking `DART-Flow` significantly outperforms the grid-preserving `DART-Grid` (+0.7% on DeiT-Ti), which in turn beats the baseline. This validates the importance of global token reallocation.

| Method | Top-1 (%) |
| :--- | :--- |
| DeiT-Ti | 72.2 |
| +DART-Grid | 73.1 |
| +DART-Flow | 73.8 |

Note: This table is a transcription of Table 8 from the original paper.
- Input Resolution: The ablation in Image 8 is insightful. For DART, performance steadily increases with input resolution because the higher pixel density allows it to extract more faithful details from its small, dense patches. In contrast, the baseline DeiT's performance degrades at very high resolution, as its fixed-size patches become too coarse to capture object semantics. This confirms DART's gains come from its fine-grained partitioning of high-resolution inputs.

*该图像为图表,展示了不同分辨率下DeiT与DeiT+DART模型的准确率对比及其计算量(FLOPs)。横轴为图像分辨率,纵轴为准确率(%)。图中蓝线表示DeiT+DART模型准确率随分辨率升高而提高且趋势稳定,红线表示DeiT准确率先升后降。圆圈大小代表计算量,DeiT在高分辨率下计算量最大但准确率却下降,表明DART能更高效地提升模型性能。*
Image 8: This plot shows accuracy as a function of input resolution. The DeiT+DART model (blue line) consistently benefits from higher resolution, while the baseline `DeiT` (red line) eventually suffers. The size of the circles indicates FLOPs, highlighting DART's superior efficiency.
- Learning Process: The visualization in Image 9 clearly shows that as training progresses, the DART module learns to focus its token budget on the salient object (the bird), with the partition boundaries becoming progressively tighter around it. This is a direct result of the end-to-end differentiable learning process.

*该图像为示意图,展示了DART动态自适应区域分割在不同训练周期(Epoch 0、40、70、300)中对图像的分块变化。随着训练进行,分块区域逐渐根据图像中信息丰富的部分(如动物头部)变得更细致,体现了自适应分块技术在捕捉关键细节方面的效果提升。*
Image 9: This visualization shows the evolution of the score map and patch boundaries from Epoch 0 to 300. The model starts with a diffuse focus and gradually learns to concentrate tokens precisely on the foreground object.
- Generalization to Other Tasks and Architectures:
- Drop-in Enhancement: As a simple drop-in replacement (Table 4), DART provides consistent gains (+0.8% to +1.6%) across `DeiT`, `Vim`, and `VideoMamba` at various scales, proving its universal applicability.
- Dense Prediction: On semantic segmentation with Swin Transformer (Table 3), DART-Grid provides a +0.5 mIoU improvement. This demonstrates that even for a strong hierarchical baseline, content-aware tokenization offers complementary benefits.
- Video Classification: DART improves performance on both motion-centric (SSv2) and scene-centric (Kinetics-400) video datasets (Table 5), highlighting its ability to adapt in the spatiotemporal domain.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces DART, a differentiable, content-aware tokenizer that resolves the fundamental efficiency-versus-detail dilemma of uniform vision backbones. Its primary contribution is not just an incremental improvement but the unlocking of a more intelligent and cost-effective scaling paradigm. By allowing smaller models to match or exceed the performance of models four times their size with significantly less computation, DART establishes that smart resource allocation at the tokenization stage is a powerful principle. The authors position adaptive tokenization as a key enabling technology for the next generation of efficient and capable foundation models.
- Limitations & Future Work: The authors propose several promising directions in Appendix A:
- Integration into Large-Scale Systems: Applying DART as the vision front-end for LMMs, robotics, and generative models where efficient processing of complex visual input is critical.
- Domain-Specific Adaptation: Fine-tuning the scoring network on specialized data (e.g., medical or satellite imagery) to further boost performance on niche tasks.
- Inter-Sample Dynamic Allocation: Extending the framework to vary the total token budget based on sample complexity (i.e., allocating more tokens to "hard" images and fewer to "easy" ones). DART's ability to handle arbitrary sequence lengths provides the technical foundation for this.
- Co-design with Hierarchical Models: Developing new adaptive tokenizers that are more deeply integrated with the architectural principles of models like Swin Transformer.
- Personal Insights & Critique:
- Elegance of the Approach: DART's design is elegant and intuitive. The "proactive optimization" philosophy of fixing the token representation at the source is fundamentally more efficient than "reactive" methods like token pruning, which attempt to correct a flawed representation later on.
- Differentiability is Key: The fully differentiable nature of the quantile-based partitioning is a significant technical achievement. It avoids the need for non-differentiable decision-making and associated complex training techniques (like reinforcement learning or Gumbel-Softmax), leading to stable and effective end-to-end optimization.
- A Generalizable Principle: The core idea—"investing a small computational overhead to scout the input allows for a much more efficient allocation of the main processing budget"—is a powerful principle that could be applied far beyond vision.
- Untested Assumptions: While the results are compelling, the claim that these efficiency gains will persist at "extreme scales" (e.g., ViT-L/H or larger) is a hypothesis that remains to be tested and would require substantial computational resources.
- Potential Bottleneck: The current scoring network is a lightweight CNN. For extremely high-resolution inputs (e.g., gigapixel images), this initial scouting step could itself become a bottleneck. However, for the resolutions commonly used in today's models, this is not an issue.
- Overall, DART is a strong contribution that convincingly addresses a well-known and important problem in computer vision. Its impressive results on scaling efficiency make it a highly practical and impactful innovation.