TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
TL;DR Summary
TokDrift reveals that subword tokenization misalignment with syntax significantly impacts code LLMs, causing major behavioral shifts from minor formatting changes. Grammar-aware tokenization is essential for reliable code understanding and generation.
Abstract
Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
In-depth Reading
1. Bibliographic Information
- Title: TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
- Authors: Yinxi Li, Yuntian Deng, Pengyu Nie
- Affiliations: University of Waterloo
- Journal/Conference: This paper is a preprint available on arXiv. The venue is not specified, but its content is typical of top-tier conferences in Software Engineering (e.g., ICSE, FSE) or Natural Language Processing (e.g., ACL, EMNLP).
- Publication Year: 2025 (the arXiv ID indicates an October 2025 submission).
- Abstract: The paper investigates a fundamental misalignment between how Large Language Models (LLMs) and programming language compilers process code. LLMs use statistical subword tokenizers (like BPE), while compilers use grammar-based tokenizers. This causes semantically identical code to be tokenized differently based on superficial changes like whitespace or variable naming. The authors introduce
TokDrift, a framework to measure this impact by applying semantic-preserving rewrites to code. Their experiments on nine code LLMs (up to 30B+ parameters) show that even minor formatting changes cause significant shifts in model predictions. A layer-wise analysis reveals the problem starts at the initial embedding layers, where subword tokenization fails to respect grammatical boundaries. The authors conclude that this misaligned tokenization is a major obstacle for reliable code models and call for the development of grammar-aware tokenization.
- Original Source Link: https://arxiv.org/abs/2510.14972v1
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Large Language Models (LLMs) designed for code generation and understanding rely on subword tokenizers (e.g., Byte-Pair Encoding or BPE). These tokenizers are learned from large text corpora and split words based on statistical frequency, not grammatical rules. In contrast, programming languages (PLs) have strict, grammar-defined lexical rules that deterministically break code into tokens like keywords, identifiers, and operators. This creates a fundamental misalignment: the "words" an LLM sees are often different from the "words" a compiler or a human programmer sees.
- Importance: This misalignment means that two code snippets that are semantically identical (i.e., they do the exact same thing) can be represented by completely different token sequences inside the LLM just because of trivial differences like adding a space or changing a variable name from `myVar` to `my_var`. This raises a critical question about the robustness and reliability of code LLMs: if their understanding is so fragile, can they be trusted for complex software engineering tasks?
- Innovation: While the problem has been discussed conceptually, this paper is one of the first to systematically quantify its impact at a large scale. The authors introduce a novel framework, TokDrift, to precisely measure the behavioral changes in LLMs caused only by tokenization differences, while keeping the code's meaning constant.
- Main Contributions / Findings (What):
- TokDrift Framework: The paper introduces TokDrift, a framework that applies a set of semantic-preserving rewrite rules (e.g., changing identifier casing, adding spaces around operators) to a codebase. It then feeds both the original and rewritten code to an LLM and measures how often the model's output changes, specifically focusing on whether the correctness of the output flips.
- Substantial Sensitivity: Across nine different code LLMs (including large models like Llama-70B and Qwen-32B), the study finds that models are surprisingly sensitive to these trivial changes. For instance, the best-performing model still changed its prediction over 6% of the time on average, and some specific rewrite rules caused correctness to flip in up to 60% of affected samples.
- Root Cause Identified: Through layer-by-layer analysis of the models' internal states (hidden states), the paper traces the problem to the very first step: the input embeddings. The subword segmentation fundamentally fails to capture the grammatical structure of the code, and this initial error propagates through the entire network.
- Scaling Doesn't Solve It: While larger models tend to be slightly more robust, the problem persists even in models with over 30 billion parameters. This suggests that simply scaling up current architectures will not eliminate this fundamental flaw.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data to understand and generate human-like language. Code LLMs are specialized versions trained on massive repositories of source code (like GitHub) in addition to natural language.
- Tokenization: The process of breaking down raw text into smaller units called tokens. These tokens are the basic vocabulary that an LLM works with.
- Subword Tokenization (e.g., BPE): A common technique used by modern LLMs. It starts with individual characters and iteratively merges the most frequent adjacent pairs into a single token. For example, `tokenization` might become `["token", "ization"]`. For code, an identifier like `sortedList` might become `["sorted", "L", "ist"]`, which is not semantically meaningful.
- Programming Language (PL) Tokenization (Lexing): The first phase of a compiler. It uses a formal grammar to deterministically scan the source code and convert it into a stream of PL tokens. For example, an expression like `x + 1` is always tokenized into three units: an identifier (`x`), an operator (`+`), and a literal (`1`). Whitespace is typically ignored, so `x+1` produces the exact same token sequence (a small comparison sketch follows this list).
- Abstract Syntax Tree (AST): A tree representation of the abstract syntactic structure of source code. It is generated by a parser from the sequence of tokens produced by the lexer. The AST represents the code's grammatical structure, independent of superficial formatting.
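To make the contrast concrete, here is a minimal sketch that tokenizes the same snippet with a subword tokenizer and with Python's grammar-based lexer. It assumes the `transformers` library is installed; the checkpoint name is an illustrative choice, not necessarily one used in the paper.

```python
import io
import tokenize

from transformers import AutoTokenizer

code = "sortedList = sortedList + 1"

# Subword view: statistics-driven segmentation; identifiers may be split arbitrarily.
bpe = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")  # illustrative checkpoint
print(bpe.tokenize(code))  # output is model-dependent, e.g. ['sorted', 'List', ...]

# Grammar view: Python's own lexer, driven entirely by the language grammar.
pl_tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(code + "\n").readline)
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)
]
print(pl_tokens)  # ['sortedList', '=', 'sortedList', '+', '1']
```

The subword segmentation depends on corpus statistics and superficial context, whereas the lexer output is fully determined by the grammar; this gap is exactly the misalignment TokDrift measures.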
- Previous Works:
- Tokenization for Code: Some prior work has explored alternative tokenizers for code. Chirkova and Troshin (2023) introduced a tokenizer designed to better align with PL syntax, but this paper provides a much broader analysis of the problem itself across many existing models.
- Robustness to Representation: Other research has examined how LLMs handle different input formats. Zheng et al. (2025) showed that instruction-tuned models can adapt to unconventional tokenizations, but still suffer a performance drop. Yan et al. (2025) found similar inconsistencies in the chemistry domain. This paper focuses specifically on the programming language domain and provides a structured framework for analysis.
- Syntax-Aware Code Modeling: To mitigate this issue, researchers have proposed methods to enforce grammatical correctness during code generation. Approaches like Synchromesh (Poesia et al., 2022) and SynCode (Ugare et al., 2024) use techniques to filter out grammatically incorrect tokens at each generation step. These are solutions to a problem that
TokDrift aims to diagnose and measure.
- Differentiation: TokDrift stands out because it doesn't propose a new model or tokenizer. Instead, it provides a diagnostic tool and a comprehensive empirical study showing that the problem of tokenization misalignment is real, widespread, and significant. It isolates the effect of tokenization from other variables by using semantic-preserving rewrites, offering a clean and powerful experimental methodology.
4. Methodology (TokDrift Framework)
The core methodology is the TokDrift framework, which quantifies model sensitivity to tokenization changes. Its workflow is illustrated in Figure 1 from the paper.
[Figure 1: Overview of the TokDrift framework workflow.]
Workflow Steps:
- Select a Baseline Program: An input code snippet is taken from a benchmark dataset.
- Apply a Rewrite Rule: A single, predefined, semantic-preserving rewrite rule (R) is applied to the baseline program to create a "variant" program. This variant is functionally identical but will be tokenized differently by an LLM's subword tokenizer.
- Feed Both to an LLM: Both the original (baseline) program and the variant program are fed as input to the same LLM.
- Evaluate and Compare Outputs: The outputs generated by the LLM for both inputs are evaluated for functional correctness (e.g., by running unit tests). The framework then checks if the correctness "flips" between the baseline and the variant (e.g., the baseline output was correct but the variant output is incorrect, or vice-versa). This "flip" is a direct measure of the model's sensitivity.
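The loop below is a minimal sketch of this workflow. `apply_rule`, `run_model`, and `passes_tests` are hypothetical placeholders for the rewrite engine, the LLM call, and the unit-test harness; none of their implementations are specified by the paper excerpt above.

```python
from typing import Callable, Iterable


def drift_eval(
    programs: Iterable[str],
    apply_rule: Callable[[str], str],     # semantic-preserving rewrite rule R
    run_model: Callable[[str], str],      # LLM inference on a single input
    passes_tests: Callable[[str], bool],  # functional-correctness check (unit tests)
) -> dict:
    affected = flipped = 0
    for baseline in programs:
        variant = apply_rule(baseline)
        if variant == baseline:           # rule did not match: sample is unaffected
            continue
        affected += 1
        ok_baseline = passes_tests(run_model(baseline))
        ok_variant = passes_tests(run_model(variant))
        if ok_baseline != ok_variant:     # correctness flipped in either direction
            flipped += 1
    return {
        "affected": affected,
        "flipped": flipped,
        "sensitivity": 100 * flipped / affected if affected else 0.0,
    }
```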
Core Components of the Framework:
- Benchmarks: The study uses eight benchmarks across three common programming tasks to ensure the findings are generalizable. The programming languages are Java and Python.
Manual Transcription of Table 1: Benchmarks in our experiments.
| Benchmark | Source | Task | Input PL | Output PL | # Samples |
|---|---|---|---|---|---|
| HumanEval-Fix-py | HumanEvalPack (Muennighoff et al., 2023) | bug fixing | Python | Python | 164 |
| HumanEval-Fix-java | HumanEvalPack (Muennighoff et al., 2023) | bug fixing | Java | Java | 164 |
| HumanEval-Explain-py | HumanEvalPack (Muennighoff et al., 2023) | code summarization | Python | Python | 164 |
| HumanEval-Explain-java | HumanEvalPack (Muennighoff et al., 2023) | code summarization | Java | Java | 164 |
| Avatar-py2java | Avatar (Ahmad et al., 2023; Pan et al., 2024) | code translation | Python | Java | 244 |
| Avatar-java2py | Avatar (Ahmad et al., 2023; Pan et al., 2024) | code translation | Java | Python | 246 |
| CodeNet-py2java | CodeNet (Puri et al., 2021; Pan et al., 2024) | code translation | Python | Java | 200 |
| CodeNet-java2py | CodeNet (Puri et al., 2021; Pan et al., 2024) | code translation | Java | Python | 200 |

- Models: The experiments cover nine popular open-source code LLMs from three different model families and three size categories (Small, Medium, Large) to analyze the effect of model scale.
Manual Transcription of Table 2: Models used in our experiments.
| Series | S | M | L |
|---|---|---|---|
| Llama-3 | 3B | 8B | 70B |
| Qwen2.5-Coder | 1.5B | 7B | 32B |
| DeepSeek-Coder | 1.3B | 6.7B | 33B |

- Rewrite Rules: The framework uses 24 carefully designed rewrite rules that change the code's appearance but not its meaning. They fall into two categories, naming (N) and spacing (S); a short sketch of two example rules follows Table 3.
Manual Transcription of Table 3: Rewrite rules supported by TokDrift.
| No. | PL | Rewrite Rule | Description | Example |
|---|---|---|---|---|
| N1 | J | `camelCase` → `snake_case` | Convert identifiers from the most common casing styles. | `sortedLst` → `sorted_lst` |
| N2 | J | `camelCase` → `PascalCase` | | `closestPair` → `ClosestPair` |
| N3 | J | `camelCase` → `SCREAMING_CASE` | | `possibleSolutions` → `POSSIBLE_SOLUTIONS` |
| N4 | P | `snake_case` → `camelCase` | | `input_clipboard` → `inputClipboard` |
| N5 | P | `snake_case` → `PascalCase` | | `string_xor` → `StringXor` |
| N6 | P | `snake_case` → `SCREAMING_CASE` | | `triangle_area` → `TRIANGLE_AREA` |
| S1 | P | `OP-` → `OP -` | Add space between operator and minus sign | `::-1` → |
| S2 | P | `OP[` → `OP _[` | Add space between operator and left square bracket | `))[2:]` → `)) [2:]` |
| S3 | P | `).` → `) .` | Add space between right parentheses and period | `').replace` → `') .replace` |
| S4 | J,P | `])` → `] )` | Add space between right square bracket and right parentheses | `:]):` → `:]) :` |
| S5 | P | `OP]` → `OP _]` | Add space between operator and right square bracket | `+[[` → `+ [[` |
| S6 | J | `OP(` → `OP (` | Add space between operator and left parentheses | `!(isTrue` → `! (isTrue` |
| S7 | P | `[ID` → `[ ID` | Add space between left square bracket and identifier | `([vowels` → `([ vowels` |
| S8 | J | `++)` → `++ )` | Add space between increment operator and right parentheses | |
| S9 | J | `.*` → `. *` | Add space between period and asterisk | `.*;` → `. *;` |
| S10 | P | `):` → `) :` | Add space between right parentheses and colon | `main():` → `main() :` |
| S11 | J | `);` → `) ;` | Add space between right parentheses and semicolon | `<>();` → `<>() ;` |
| S12 | J | `OP;` → `OP _;` | Add space between operator and semicolon | `++;` → `++ ;` |
| S13 | J,P | `))` → `) )` | Add space between two right parentheses | `.toCharArray()))` → `.toCharArray()) )` |
| S14 | J,P | `((` → `( (` | Add space between two left parentheses | `alpha((u)` → `alpha( (u)` |
| S15 | J,P | `.ID` → `. ID` | Add space between period and identifier | `.factorial` → `. factorial` |
| S16 | J,P | `(ID` → `( ID` | Add space between left parentheses and identifier | `(String` → `( String` |
| S17 | J,P | `OP ID` → `OP _ ID` | Add space between operator and identifier | |
| S18 | J,P | `OP ALL` → `OP _ ALL` | Add space between operator and identifier/operator | → `( 1 + list` |
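To give a feel for what these rules do, the sketch below implements rough string/regex versions of N1 (camelCase → snake_case) and S14 (space between two left parentheses). These are illustrative approximations; the paper's rules presumably operate over grammar tokens rather than raw text.

```python
import re


def camel_to_snake(identifier: str) -> str:
    """N1-style rewrite: sortedLst -> sorted_lst (illustrative regex version)."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", lambda m: "_" + m.group(1).lower(), identifier)


def space_double_lparen(code: str) -> str:
    """S14-style rewrite: '((' -> '( (' (illustrative; the real rule works on PL tokens)."""
    return code.replace("((", "( (")


print(camel_to_snake("possibleSolutions"))  # possible_solutions
print(space_double_lparen("alpha((u)"))     # alpha( (u)
```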
5. Experimental Setup
- Datasets: As detailed in Table 1, the experiments use the HumanEvalPack, Avatar, and CodeNet benchmarks, which cover Python and Java for bug fixing, code summarization, and code translation tasks.
- Evaluation Metrics: The paper uses two main metrics to quantify the impact of the rewrites.
- Δ Accuracy:
- Conceptual Definition: This measures the absolute change in the model's overall accuracy on a benchmark after a rewrite rule is applied. It's calculated as the accuracy on the rewritten (variant) dataset minus the accuracy on the original (baseline) dataset. A negative value means the rewrite hurt performance.
- Mathematical Formula: $\Delta\text{Accuracy} = \text{Acc}_{\text{variant}} - \text{Acc}_{\text{baseline}}$
- Symbol Explanation:
- $\text{Acc}_{\text{variant}}$: The percentage of correctly solved problems using the rewritten code as input.
- $\text{Acc}_{\text{baseline}}$: The percentage of correctly solved problems using the original code as input.
- Sensitivity:
- Conceptual Definition: This is the paper's key unbiased metric. It measures how often a model's prediction flips in correctness (correct to incorrect, or incorrect to correct) only for the samples that were actually modified by the rewrite rule. This avoids dilution from unaffected samples and provides a more direct measure of instability. A higher sensitivity score indicates lower robustness (a small computation sketch of both metrics follows this list).
- Mathematical Formula: $\text{Sensitivity} = \dfrac{\#\,\text{samples with flipped correctness}}{\#\,\text{samples affected by the rewrite rule}} \times 100\%$
- Symbol Explanation:
- Number of samples with flipped correctness: Counts how many times the model's output for the baseline was correct and the variant was incorrect, OR the baseline was incorrect and the variant was correct.
- Number of samples affected by the rewrite rule: Counts only those input programs where the rewrite rule found a substring to change.
- Baselines: The baseline for each experiment is the performance of the LLM on the original, unmodified code snippet. The comparison is not against other models but against the model's own performance under slightly different, semantically-equivalent inputs.
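To make the two metrics concrete, here is a small sketch that computes them from per-sample correctness flags. The names `base_ok`, `var_ok`, and `affected` are hypothetical: correctness of the baseline output, correctness of the variant output, and whether the rewrite rule actually matched each sample.

```python
def delta_accuracy(base_ok: list[bool], var_ok: list[bool]) -> float:
    """Accuracy on variants minus accuracy on baselines, in percentage points."""
    n = len(base_ok)
    return 100 * (sum(var_ok) - sum(base_ok)) / n


def sensitivity(base_ok: list[bool], var_ok: list[bool], affected: list[bool]) -> float:
    """Share of affected samples whose correctness flips, in percent."""
    flips = sum(1 for b, v, a in zip(base_ok, var_ok, affected) if a and b != v)
    n_affected = sum(affected)
    return 100 * flips / n_affected if n_affected else 0.0
```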
6. Results & Analysis
Core Results (Accuracy and Sensitivity)
The results demonstrate that tokenization drift has a significant and measurable impact on LLM performance.
Manual Transcription of Table 4: Accuracy and ∆accuracy (in parenthesis) of each model on each rewrite rule. (Note: This is a large table, transcribed into two parts as in the original paper.)
Input PL = Java
| Variant | Llama-3B | Llama-8B | Llama-70B | Qwen-1.5B | Qwen-7B | Qwen-32B | DS-1.3B | DS-6.7B | DS-33B | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline | 32.04 | 43.15 | 57.24 | 33.59 | 57.36 | 70.41 | 38.50 | 58.01 | 57.36 | 49.74 |
| N1 | 32.69 (+0.65) | 43.54 (+0.39) | 57.49 (+0.25) | 35.27 (+1.68) | 57.62 (+0.26) | 70.28 (-0.13) | 37.98 (-0.52) | 57.36 (-0.65) | 57.11 (-0.25) | 49.93 (+0.19) |
| N2 | 32.17 (+0.13) | 43.54 (+0.39) | 56.85 (-0.39) | 35.27 (+1.68) | 57.75 (+0.39) | 70.41 (+0.00) | 39.02 (+0.52) | 58.14 (+0.13) | 57.36 (+0.00) | 50.06 (+0.32) |
| N3 | 32.56 (+0.52) | 44.19 (+1.04) | 56.20 (-1.04) | 35.53 (+1.94) | 58.01 (+0.65) | 69.12 (-1.29) | 38.37 (-0.13) | 56.33 (-1.68) | 56.46 (-0.90) | 49.64 (-0.10) |
| S3 | 31.65 (-0.39) | 43.02 (-0.13) | 56.20 (-1.04) | 34.37 (+0.78) | 56.72 (-0.64) | 70.41 (+0.00) | 37.34 (-1.16) | 58.66 (+0.65) | 57.88 (+0.52) | 49.58 (-0.16) |
| S6 | 31.52 (-0.52) | 43.02 (-0.13) | 57.62 (+0.38) | 33.20 (-0.39) | 57.49 (+0.13) | 70.28 (-0.13) | 37.98 (-0.52) | 58.53 (+0.52) | 57.49 (+0.13) | 49.68 (-0.06) |
| S8 | 31.91 (-0.13) | 43.28 (+0.13) | 57.24 (+0.00) | 34.11 (+0.52) | 56.72 (-0.64) | 71.45 (+1.04) | 38.63 (+0.13) | 57.49 (-0.52) | 58.27 (+0.91) | 49.90 (+0.16) |
| S9 | 32.30 (+0.26) | 40.96 (-2.19) | 58.66 (+1.42) | 33.46 (-0.13) | 58.14 (+0.78) | 69.51 (-0.90) | 36.95 (-1.55) | 56.59 (-1.42) | 57.35 (+0.39) | 49.37 (-0.37) |
| S11 | 32.69 (+0.65) | 44.57 (+1.42) | 55.17 (-2.07) | 35.14 (+1.55) | 56.33 (-1.03) | 71.58 (+1.17) | 37.34 (-1.16) | 57.11 (-0.90) | 57.11 (-0.25) | 49.67 (-0.07) |
| S12 | 30.49 (-1.55) | 43.02 (-0.13) | 56.07 (-1.17) | 34.75 (+1.16) | 55.81 (-1.55) | 67.05 (-3.36) | 38.63 (+0.13) | 55.94 (-2.07) | 58.53 (+1.17) | 48.92 (-0.82) |
| S13 | 32.43 (+0.39) | 42.64 (-0.51) | 56.59 (-0.65) | 33.46 (-0.13) | 57.36 (+0.00) | 69.77 (-0.64) | 37.47 (-1.03) | 58.27 (+0.26) | 56.98 (-0.38) | 49.44 (-0.30) |
| S14 | 29.84 (-2.20) | 41.09 (-2.06) | 54.13 (-3.11) | 32.17 (-1.42) | 56.85 (-0.51) | 71.19 (+0.78) | 37.86 (-0.64) | 57.11 (-0.90) | 57.62 (+0.26) | 48.65 (-1.09) |
| S15 | 30.62 (-1.42) | 36.82 (-6.33) | 57.24 (+0.00) | 33.46 (-0.13) | 56.72 (-0.64) | 70.28 (-0.13) | 37.34 (-1.16) | 55.43 (-2.58) | 59.43 (+2.07) | 48.59 (-1.15) |
| S16 | 30.88 (-1.16) | 40.83 (-2.32) | 55.94 (-1.30) | 34.88 (+1.29) | 57.36 (+0.00) | 71.96 (+1.55) | 36.43 (-2.07) | 57.49 (-0.52) | 58.66 (+1.30) | 49.38 (-0.36) |
| S17 | 28.68 (-3.36) | 37.34 (-5.81) | 56.07 (-1.17) | 35.66 (+2.07) | 55.43 (-1.93) | 70.03 (-0.38) | 35.40 (-3.10) | 55.04 (-2.97) | 58.91 (+1.55) | 48.06 (-1.68) |
| S18 | 25.97 (-6.07) | 34.88 (-8.27) | 56.85 (-0.39) | 34.11 (+0.52) | 56.07 (-1.29) | 70.28 (-0.13) | 33.98 (-4.52) | 53.10 (-4.91) | 56.33 (-1.03) | 46.84 (-2.90) |
Input PL = Python
| Variant | Llama-3B | Llama-8B | Llama-70B | Qwen-1.5B | Qwen-7B | Qwen-32B | DS-1.3B | DS-6.7B | DS-33B | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline | 39.12 | 49.87 | 69.04 | 40.67 | 64.51 | 76.17 | 44.82 | 61.92 | 68.13 | 57.14 |
| N4 | 40.03 (+0.91) | 51.04 (+1.17) | 68.91 (-0.13) | 39.77 (-0.90) | 65.03 (+0.52) | 77.85 (+1.68) | 44.30 (-0.52) | 61.53 (-0.39) | 68.39 (+0.26) | 57.43 (+0.29) |
| N5 | 37.56 (-1.56) | 50.91 (+1.04) | 68.65 (-0.39) | 39.25 (-1.42) | 64.77 (+0.26) | 77.72 (+1.55) | 42.88 (-1.94) | 61.53 (-0.39) | 68.39 (+0.26) | 56.85 (-0.29) |
| N6 | 38.08 (-1.04) | 50.65 (+0.78) | 66.19 (-2.85) | 39.38 (-1.29) | 64.51 (+0.00) | 76.81 (+0.64) | 42.23 (-2.59) | 61.14 (-0.78) | 67.62 (-0.51) | 56.29 (-0.85) |
| S1 | 39.38 (+0.26) | 50.39 (+0.52) | 68.65 (-0.39) | 40.54 (-0.13) | 64.51 (+0.00) | 76.68 (+0.51) | 44.69 (-0.13) | 62.56 (+0.64) | 67.62 (-0.51) | 57.22 (+0.08) |
| S2 | 39.64 (+0.52) | 50.65 (+0.78) | 68.78 (-0.26) | 40.41 (-0.26) | 64.77 (+0.26) | 75.91 (-0.26) | 43.65 (-1.17) | 62.44 (+0.52) | 67.75 (-0.38) | 57.11 (-0.03) |
| S4 | 39.77 (+0.65) | 50.65 (+0.78) | 69.30 (+0.26) | 40.54 (-0.13) | 64.51 (+0.00) | 73.19 (-2.98) | 44.82 (+0.00) | 61.92 (+0.00) | 67.36 (-0.77) | 56.90 (-0.24) |
| S5 | 38.60 (-0.52) | 50.78 (+0.91) | 68.91 (-0.13) | 40.80 (+0.13) | 64.12 (-0.39) | 76.94 (+0.77) | 44.43 (-0.39) | 62.69 (+0.77) | 66.71 (-1.42) | 57.11 (-0.03) |
| S7 | 40.03 (+0.91) | 49.35 (-0.52) | 68.26 (-0.78) | 40.67 (+0.00) | 63.34 (-1.17) | 76.42 (+0.25) | 44.30 (-0.52) | 62.69 (+0.77) | 67.23 (-0.90) | 56.92 (-0.22) |
| S10 | 38.47 (-0.65) | 50.65 (+0.78) | 69.17 (+0.13) | 40.67 (+0.00) | 63.99 (-0.52) | 77.46 (+1.29) | 44.56 (-0.26) | 62.05 (+0.13) | 67.10 (-1.03) | 57.12 (-0.02) |
| S13 | 37.95 (-1.17) | 50.13 (+0.26) | 69.30 (+0.26) | 40.54 (-0.13) | 64.90 (+0.39) | 76.55 (+0.38) | 44.30 (-0.52) | 62.05 (+0.13) | 67.10 (-1.03) | 56.98 (-0.16) |
| S14 | 38.73 (-0.39) | 49.22 (-0.65) | 68.39 (-0.65) | 39.38 (-1.29) | 63.73 (-0.78) | 74.09 (-2.08) | 45.08 (+0.26) | 61.66 (-0.26) | 67.49 (-0.64) | 56.42 (-0.72) |
| S15 | 39.12 (+0.00) | 50.26 (+0.39) | 67.49 (-1.55) | 39.77 (-0.90) | 62.69 (-1.82) | 76.30 (+0.13) | 44.17 (-0.65) | 61.66 (-0.26) | 67.23 (-0.90) | 56.52 (-0.62) |
| S16 | 40.16 (+1.04) | 49.87 (+0.00) | 69.04 (+0.00) | 39.64 (-1.03) | 63.08 (-1.43) | 76.68 (+0.51) | 43.65 (-1.17) | 61.27 (-0.65) | 67.23 (-0.90) | 56.74 (-0.40) |
| S17 | 40.41 (+1.29) | 50.39 (+0.52) | 67.62 (-1.42) | 39.38 (-1.29) | 61.92 (-2.59) | 76.55 (+0.38) | 42.62 (-2.20) | 60.49 (-1.43) | 66.32 (-1.81) | 56.19 (-0.95) |
| S18 | 37.44 (-1.68) | 49.87 (+0.00) | 67.62 (-1.42) | 38.34 (-2.33) | 63.08 (-1.43) | 75.13 (-1.04) | 42.49 (-2.33) | 62.05 (+0.13) | 67.36 (-0.77) | 55.93 (-1.21) |
Analysis of Table 4: The changes in accuracy (Δ accuracy) are often non-trivial. For example, for Java code, applying rule S18 (add a space after any operator) causes Llama-8B's accuracy to drop by 8.27 points. On average, this single rule decreases accuracy by 2.90 points across all models. These are significant drops, often larger than the claimed improvements of new models.
The Sensitivity metric, shown in Figure 3, provides an even clearer picture.
[Figure 3: Sensitivity of each model to the naming and spacing rewrite rules.]
Analysis of Figure 3:
- (a) Naming Rewrites: The average sensitivity to naming changes is 9.26%. Models are less sensitive to conversions between camelCase and snake_case (rules N1, N4), which are common conventions, but more sensitive to less common styles like SCREAMING_CASE.
- (b) Spacing Rewrites: The average sensitivity to spacing is 8.29%. The "wildcard" rules S17 and S18 (add a space after any operator) cause the highest sensitivity, often exceeding 10%. Rules like S15 (space after a period) and S14 (space between parentheses) are also highly impactful. This shows that seemingly meaningless formatting choices can dramatically alter model behavior.
- (c) By Model: All models exhibit non-negligible sensitivity. The Llama-3 series appears more sensitive than Qwen and DeepSeek. Even the most robust model, Qwen-32B, has an average sensitivity of 5.71% for spacing rules, meaning its output correctness still flips in nearly 1 out of 17 affected cases due to a simple space change.
Impact of Model Size
Manual Transcription of Table 5: Impact of model size on sensitivity.
| Rewrite Rule | Model Series | S | M | L |
|---|---|---|---|---|
| Naming | Llama-3 | 11.48 | 10.68 | 9.43 |
| | Qwen2.5-Coder | 7.73 | 7.95 | 8.27 |
| | DeepSeek-Coder | 9.88 | 8.95 | 8.95 |
| Spacing | Llama-3 | 10.22 | 10.99 | 8.51 |
| | Qwen2.5-Coder | 7.07 | 8.87 | 5.71 |
| | DeepSeek-Coder | 8.36 | 8.71 | 6.26 |
Analysis: The trend shows that larger models are generally more robust (have lower sensitivity), especially for spacing rules. For example, Qwen-32B (L) is significantly less sensitive to spacing changes than Qwen-1.5B (S). However, this trend is not universal (e.g., Qwen-32B is slightly more sensitive to naming changes than its smaller versions), and the problem persists even in the largest models. Scaling alone does not solve the tokenization misalignment problem.
Impact of Identifier Fragment Changes
The paper hypothesizes that much of the sensitivity comes from identifiers being split into different subword "fragments."
Manual Transcription of Table 6: Impact of identifier fragment changes on sensitivity.
| Rewrite Rule | Model | Unchanged | Changed |
|---|---|---|---|
| Naming | Llama-70B | 8.13 | 11.21 |
| | Qwen-32B | 6.58 | 10.57 |
| | DS-33B | 6.61 | 10.82 |
| Spacing | Llama-70B | 7.24 | 11.89 |
| | Qwen-32B | 5.09 | 7.37 |
| | DS-33B | 5.80 | 7.12 |
Analysis: This is a crucial finding. For both naming and spacing rules, the sensitivity is consistently and significantly higher in the "Changed" group, where the subword fragments of an identifier were altered. For example, for naming rules on DS-33B, sensitivity jumps from 6.61% to 10.82%. This strongly suggests that the arbitrary and inconsistent way subword tokenizers break down identifiers is a primary driver of model instability.
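One way to reproduce this grouping is to compare the subword fragments of a code span before and after the rewrite; samples where the fragments covering an identifier change fall into the "Changed" group. The sketch below uses an illustrative tokenizer checkpoint and simply compares whole fragment sequences; the paper's exact grouping procedure may differ.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")  # illustrative checkpoint

# S15 example: '.factorial' vs '. factorial'. The fragments covering the
# identifier itself may change (e.g. ['.fact', 'orial'] vs ['.', ' factorial']).
before = tok.tokenize("x.factorial()")
after = tok.tokenize("x. factorial()")
print(before)
print(after)
print("fragments changed:", before != after)
```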
Root Cause Analyses
- Word Frequency Analysis: The authors checked whether the rewrites change common code patterns into rare ones.
Manual Transcription of Table 7: Word frequency of rewrite rules' left-hand side (LHS) and right-hand side (RHS) on GitHub. (Partial table shown for brevity)
| PL | Rewrite Rule | LHS | RHS | Ratio [%] |
|---|---|---|---|---|
| Java | S3: `).` → `) .` | 78.9M | 45.7K | 0.06 |
| Java | S14: `((` → `( (` | 144M | 195K | 0.14 |
| Java | S15: `.ID` → `. ID` | 175M | 45.9M | 26.22 |
| Python | S14: `((` → `( (` | 78.1M | 71.7K | 0.09 |
| Python | S15: `.ID` → `. ID` | 107M | 40.6M | 37.94 |

Analysis: The rewrite rules that cause high sensitivity (such as S14) often have a very low RHS-to-LHS frequency ratio, meaning they transform a very common pattern into a very rare one. The models are less familiar with the rare patterns, leading to degraded performance.
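A rough sketch of how such frequency ratios could be computed over a local code corpus; the corpus path, file glob, and raw substring matching are illustrative, whereas the paper counts occurrences over GitHub code.

```python
from pathlib import Path


def pattern_ratio(corpus_dir: str, lhs: str, rhs: str, glob: str = "*.py") -> float:
    """Return the RHS-to-LHS occurrence ratio (in percent) over all matching files."""
    lhs_count = rhs_count = 0
    for path in Path(corpus_dir).rglob(glob):
        text = path.read_text(errors="ignore")
        lhs_count += text.count(lhs)
        rhs_count += text.count(rhs)
    return 100 * rhs_count / lhs_count if lhs_count else float("nan")


# Hypothetical usage for S14 on Python code: '((' vs '( ('
# print(pattern_ratio("path/to/corpus", "((", "( ("))
```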
- Hidden State Analysis: The authors analyzed the model's internal representations.
[Figure 4: Layer-wise cosine similarity between hidden states of the baseline and rewritten code.]
Analysis of Figure 4: This figure plots the cosine similarity between the hidden states of the original and rewritten code at each layer.
- The similarity is near zero at the first layer (input embeddings), confirming that the token sequences are completely different.
- It increases in the middle layers, where the model is thought to capture abstract semantics.
- Crucially, for the most problematic spacing rules (S14, S3), the similarity remains low even in the middle layers. This implies the model perceives the original and rewritten code as semantically different, even though they are not. This is a clear failure of semantic understanding.
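A minimal sketch of this layer-wise comparison, assuming an illustrative checkpoint and mean-pooling of hidden states over tokens (the paper's exact aggregation may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "deepseek-ai/deepseek-coder-1.3b-base"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()


def layer_similarity(baseline: str, variant: str) -> list[float]:
    """Cosine similarity between baseline and variant hidden states, per layer."""
    with torch.no_grad():
        hidden_b = model(**tok(baseline, return_tensors="pt")).hidden_states
        hidden_v = model(**tok(variant, return_tensors="pt")).hidden_states
    sims = []
    for layer_b, layer_v in zip(hidden_b, hidden_v):  # embeddings + every transformer layer
        b = layer_b.mean(dim=1).squeeze(0)            # mean-pool over the token dimension
        v = layer_v.mean(dim=1).squeeze(0)
        sims.append(torch.cosine_similarity(b, v, dim=0).item())
    return sims
```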
[Figure 5: t-SNE visualization of the differences in hidden states between baseline and rewritten code.]

Analysis of Figure 5: This figure visualizes the differences in hidden states using t-SNE.
- (a) shows a clear separation between the diffs caused by naming rules and spacing rules.
- (b) and (c) show that different rules within the same category also form distinct clusters.
- This confirms that different types of tokenization changes systematically alter the model's internal representations in predictable but distinct ways, reinforcing the idea that the model is reacting to surface-level syntax, not just semantics.
7. Conclusion & Reflections
- Conclusion Summary: The paper provides compelling evidence that the standard subword tokenization used in code LLMs is fundamentally misaligned with the grammar of programming languages. This misalignment is not a minor issue but a "hidden obstacle" that causes significant instability in model behavior, even for state-of-the-art models. The TokDrift framework offers a robust way to measure this instability. The findings strongly suggest that future progress in reliable code generation will require a move towards grammar-aware tokenization.
- Limitations & Future Work:
- The study focuses on a specific set of rewrite rules for Java and Python; the effect could be different or even more pronounced for other languages or other types of rewrites.
- The analysis is limited to specific Transformer-based LLMs. The findings might not generalize to other architectures (e.g., state-space models).
- The paper diagnoses the problem but does not propose or evaluate solutions. Future work could explore mitigation strategies like retraining tokenizers, using ensembles, or developing new model architectures that are less sensitive to token boundaries.
- Personal Insights & Critique:
- Significance: This is a highly impactful paper. It tackles a foundational, often-overlooked issue with methodical rigor. The TokDrift framework is a simple yet powerful contribution that can be used to benchmark the robustness of any new code LLM.
- Clarity of the Problem: The paper does an excellent job of demonstrating why this is a problem. The example in Figure 1b, where adding a single space corrects a function name and fixes the program, is a striking illustration of the brittleness that this tokenization misalignment can cause.
- Future Implications: The conclusions of this paper should be a call to action for the entire field of code AI. It suggests a potential ceiling on the performance and reliability of current models. The most promising path forward may not be just bigger models, but smarter tokenization that respects the inherent structure of the data it is processing. This work lays the groundwork for a new generation of more robust and trustworthy code LLMs.