
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Published: 05/19/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study introduces Low-Rank Clone (LRC), an efficient knowledge distillation method that significantly enhances training efficiency for small language models. By compressing teacher model weights and aligning student activations, LRC achieves comparable performance using only 20B training tokens, an over 1,000x gain in training efficiency.

Abstract

Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens--while using only 20B tokens, achieving over 1,000x training efficiency. Our codes and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.


In-depth Reading

1. Bibliographic Information

1.1. Title

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

1.2. Authors

The authors of the paper are:

  • Jitai Hao (Harbin Institute of Technology, Shenzhen)

  • Qiang Huang (Harbin Institute of Technology, Shenzhen)

  • Hao Liu (Baidu Inc.)

  • Xinyan Xiao (Baidu Inc.)

  • Zhaochun Ren (Leiden University)

  • Jun Yu (Harbin Institute of Technology, Shenzhen & Pengcheng Laboratory)

    The author affiliations suggest a collaboration between academic institutions in China and the Netherlands, and a major industrial research lab (Baidu Inc.).

1.3. Journal/Conference

The paper was published on arXiv, a preprint server for academic papers in fields such as physics, mathematics, and computer science. This means the paper has not yet undergone formal peer review for a specific journal or conference; it appears to have been submitted to, or intended for, a 2025 venue.

1.4. Publication Year

2025 (as listed on the preprint). The version analyzed is v2, submitted on May 19, 2025 (UTC).

1.5. Abstract

The abstract introduces the problem that training high-performing Small Language Models (SLMs) is computationally expensive, even when using techniques like knowledge distillation from larger "teacher" models. It identifies three key challenges with existing methods: information loss from aggressive "hard" pruning, inefficient alignment of model representations, and the underutilization of knowledge from Feed-Forward Networks (FFNs).

To solve these issues, the paper proposes Low-Rank Clone (LRC), a new pre-training method. LRC trains a set of low-rank projection matrices to create an SLM that behaves similarly to a larger teacher model. These matrices serve a dual purpose: they perform "soft pruning" by compressing the teacher's weights and enable "activation clone" by aligning the student's internal activations (including FFNs) with the teacher's. This unified design avoids the need for separate alignment modules and maximizes knowledge transfer.

Experiments using open-source teachers like Llama-3.2-3B and Qwen2.5-7B show that LRC can produce SLMs that match or exceed the performance of state-of-the-art models, while using only 20 billion training tokens compared to trillions. This represents a training efficiency improvement of over 1,000 times.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: The primary challenge addressed is the immense cost of training high-quality Small Language Models (SLMs). While SLMs are smaller and more efficient to run than Large Language Models (LLMs), their initial training still requires vast computational resources and enormous datasets, often involving trillions of text tokens. This high barrier to entry limits their development to a few well-resourced organizations.
  • Importance and Gaps: SLMs are crucial for deploying advanced AI in real-world applications where resources are limited, such as on mobile devices or in scenarios requiring low latency. To reduce training costs, researchers often use knowledge distillation, where a smaller "student" model learns from a powerful pre-trained "teacher" LLM. However, existing methods have significant drawbacks:
    1. Information Loss from Hard Pruning: Many methods first prune the teacher model by permanently removing components (neurons, attention heads, layers). This "hard pruning" can lead to a significant loss of knowledge stored in the teacher's weights, causing a sharp drop in performance that is difficult to recover.
    2. Inefficient Representation Alignment: Feature-based distillation methods try to align the internal activations of the student and teacher models. This often requires extra, trainable projection modules to match the different dimensionalities of their representations, adding complexity and making the training process less efficient.
    3. Underutilization of FFN Activations: Previous work has predominantly focused on aligning attention-related information, largely ignoring the activations from the Feed-Forward Network (FFN) layers. The authors argue that these FFNs store a wealth of factual and world knowledge, and failing to distill this information is a missed opportunity.
  • Innovative Idea: The paper's central innovation is Low-Rank Clone (LRC), a method that elegantly solves these three problems within a single, unified framework. Instead of training a student model from scratch or by crudely pruning a teacher, LRC generates the student's weights by applying a set of learnable low-rank projection matrices to the teacher's weights. This approach constitutes a "soft pruning" that preserves more information. Crucially, the same projection matrices are reused to align the student's and teacher's internal activations, including those from the FFNs. This eliminates the need for separate alignment modules and ensures a more comprehensive transfer of knowledge.

2.2. Main Contributions / Findings

  • Main Contributions:
    1. A Novel Distillation Method (LRC): The paper proposes LRC, a highly efficient pre-training method that creates SLMs by learning to be a "low-rank clone" of a larger teacher model.
    2. Unified Soft Pruning and Distillation: LRC introduces a single-stage framework where trainable low-rank projection matrices simultaneously compress the teacher's weights (soft pruning) and align the student's activations with the teacher's (knowledge distillation).
    3. Full Utilization of FFN Knowledge: The method explicitly clones activations from FFN layers, demonstrating that these signals are critical for transferring factual and semantic knowledge, a factor often overlooked in prior work.
    4. Alignment-Free Design: By reusing the weight projection matrices for activation alignment, LRC eliminates the need for additional alignment modules, simplifying the training process and improving efficiency.
  • Key Findings:
    1. Extreme Training Efficiency: The most striking finding is that LRC can achieve performance comparable or superior to state-of-the-art SLMs while using over 1,000 times fewer training tokens. For instance, an LRC model trained on just 20 billion tokens can outperform a model like Qwen3-1.7B that was trained on 36 trillion tokens.

    2. Superiority of Soft Pruning: The experiments demonstrate that LRC's low-rank projection approach (soft pruning) is more effective than hard pruning methods like Sheared Llama, as it better preserves the teacher's knowledge.

    3. Critical Role of FFNs: Ablation studies confirm that cloning FFN activations is more important for successful knowledge transfer than cloning attention activations. Removing the FFN clone loss leads to a significant and persistent drop in performance.

    4. Scalability and Robustness: LRC proves effective across different model sizes (1.5B to 4B parameters) and various teacher models (Llama, Qwen), demonstrating its general applicability.

      The following chart from the paper (Figure 1) visually summarizes the core finding of achieving high accuracy with drastically fewer training tokens.

      Figure 1: LRC results achieve higher accuracy with 1,000× fewer training tokens, significantly boosting efficiency. The chart plots the number of training tokens against average accuracy for a range of models; the points labeled LRC-1.5B and LRC-1.7B show up to 1,000× higher training efficiency.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Knowledge Distillation (KD): This is a machine learning technique where knowledge from a large, complex model (the "teacher") is transferred to a smaller, more efficient model (the "student"). The goal is for the student to mimic the teacher's behavior and achieve similar performance. There are two main types:
    • Output-based Distillation: The student is trained to match the final output probability distribution (logits) of the teacher.
    • Feature-based Distillation: The student is trained to match the teacher's intermediate representations (activations) at various layers. LRC is a sophisticated form of feature-based distillation.
  • Model Pruning: This is a technique for reducing the size of a neural network by removing "unimportant" components.
    • Hard Pruning: This involves permanently deleting weights, neurons, attention heads, or entire layers based on some importance metric (e.g., magnitude). This reduces the model's size and computational cost but can cause a permanent loss of information.
    • Soft Pruning: This is a more flexible approach where the model is encouraged to learn a compressed representation. LRC's use of low-rank projection is a form of soft pruning, as it doesn't discard teacher weights but rather learns an optimal way to compress them into a smaller space.
  • Transformer Architecture: This is the foundational architecture for modern language models like Llama and GPT. Its key components are:
    • Self-Attention: A mechanism that allows the model to weigh the importance of different words in the input sequence when processing a particular word. It computes Query (Q), Key (K), and Value (V) vectors for each token to determine contextual relationships.
    • Feed-Forward Network (FFN): A sub-layer within each Transformer block that applies a non-linear transformation to the output of the attention mechanism. It typically consists of two linear layers with an activation function in between. FFNs are believed to be where the model stores much of its factual and world knowledge. This paper uses the SwiGLU activation, a variant popular in models like Llama.
  • Low-Rank Approximation (or Factorization): A core mathematical concept from linear algebra: a large matrix can be approximated by the product of two or more smaller matrices. For a matrix $W \in \mathbb{R}^{m \times n}$, a low-rank approximation is $W \approx UV^\top$, where $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, and the rank $r$ is much smaller than $m$ and $n$. LRC uses this principle to compress the teacher's large weight matrices into smaller student weight matrices (a toy sketch follows this list).
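To make this concrete, here is a minimal PyTorch sketch (ours, not from the paper) that builds a rank-$r$ approximation of a random matrix with a truncated SVD. Note the contrast with LRC, which learns its projections by gradient descent rather than computing them analytically.

```python
import torch

# Toy low-rank approximation: W (m x n) is approximated by U @ V.T with
# rank r << min(m, n). Truncated SVD gives the best rank-r approximation
# in the Frobenius norm (Eckart-Young); LRC instead *learns* its projections.
m, n, r = 512, 256, 16
W = torch.randn(m, n)

U_full, S, Vh = torch.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * S[:r]   # (m, r), singular values folded into U
V = Vh[:r, :].T             # (n, r)

W_approx = U @ V.T          # rank-r approximation of W
rel_err = torch.norm(W - W_approx) / torch.norm(W)
print(f"relative error at rank {r}: {rel_err.item():.3f}")
```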

3.2. Previous Works

  • Feature-based Distillation (TinyBERT, MiniLM): These methods were pioneers in using intermediate representations for distillation in Transformer models. For example, TinyBERT aligns the attention matrices and hidden states between the student and teacher at each layer. A key challenge is that the student and teacher often have different hidden state dimensions, requiring an additional learnable projection matrix $W_{proj}$ to match them, adding complexity. LRC's "alignment-free" design is a direct improvement on this.
  • Combined Pruning and Distillation (Minitron, Sheared Llama): These more recent methods combine hard pruning with distillation. They typically follow a multi-stage process:
    1. Prune: Start with a large teacher model and remove a fixed percentage of its layers, heads, or neurons.
    2. Distill/Train: Fine-tune the pruned model using knowledge distillation or continue pre-training to recover the performance lost during pruning. The authors of LRC argue this multi-stage approach is inefficient and the initial hard pruning step discards valuable knowledge.
  • PCA-based Pruning (SliceGPT): This method uses Principal Component Analysis (PCA), a classic linear dimensionality reduction technique, to find a projection that preserves the most variance in the weight matrices before pruning. While more principled than simple magnitude pruning, PCA's linear assumptions may not fully capture the complex, non-linear structures within LLM weights. LRC improves on this by using learnable low-rank projections, which can adapt more flexibly to the data and model structure.

3.3. Technological Evolution

The field of model compression for LLMs has evolved from simple techniques to more integrated and sophisticated approaches:

  1. Early KD: Focused on matching the final output logits (soft labels).
  2. Feature-based KD: Advanced to matching intermediate layer activations (TinyBERT, MiniLM), providing richer guidance to the student.
  3. Hard Pruning: Methods emerged to permanently remove parts of the model (LLM-Pruner).
  4. Pruning + KD Hybrids: Researchers combined hard pruning with distillation to recover performance (Minitron, Sheared Llama).
  5. Principled Compression: More advanced methods like SliceGPT used mathematical tools like PCA for more structured pruning.
  6. Unified Framework (LRC): This paper represents the next step in this evolution. It proposes a single, unified framework where pruning is not a separate, destructive step but a "soft," learnable compression that is integrated directly with a comprehensive distillation process.

3.4. Differentiation Analysis

  • vs. TinyBERT and other feature-based methods:
    • Weight Initialization: LRC generates student weights directly from the teacher via projection, preserving structural knowledge. TinyBERT typically initializes the student from scratch or from a subset of teacher layers.
    • Alignment: LRC is "alignment-free," reusing its projection matrices. TinyBERT requires dedicated, separate alignment matrices.
    • Knowledge Source: LRC distills knowledge from both FFN and attention layers. TinyBERT focuses primarily on attention.
  • vs. Minitron and Sheared Llama:
    • Process: LRC uses a single-stage, unified process. Minitron/Sheared Llama use a multi-stage (prune then train) pipeline.
    • Pruning Method: LRC uses "soft pruning" (low-rank projection) which preserves information. Minitron/Sheared Llama use "hard pruning" which discards information. This is a key reason for LRC's superior efficiency and performance.
  • vs. SliceGPT:
    • Projection Method: LRC uses learnable projection matrices that are optimized via backpropagation. SliceGPT uses a fixed projection derived from PCA. The learnable nature of LRC's projections allows it to find a better compression that is tailored to the distillation objective, leading to better knowledge retention.


4. Methodology

4.1. Principles

The core principle of Low-Rank Clone (LRC) is to create a small language model (student) that is a "behavioral clone" of a larger, more powerful teacher model. This is achieved not by training the student's weights from scratch, but by generating them through a learnable, low-rank projection of the teacher's weights. This process has two simultaneous effects:

  1. Soft Pruning: The low-rank projection naturally compresses the high-dimensional weight matrices of the teacher into the smaller dimensions required by the student, acting as an information-preserving form of model pruning.

  2. Knowledge Distillation: The method forces the student's intermediate activations to mimic the teacher's. The same projection matrices used for weight generation are cleverly reused to align these activations, making the process highly efficient and "alignment-free."

    The only parameters trained during this process are the low-rank projection matrices and the student's RMSNorm scaling parameters, which constitute less than 1% of a typical model's total parameters. This drastically reduces the training overhead.

4.2. Core Methodology In-depth (Layer by Layer)

The LRC method unfolds in two main steps within each training iteration, as illustrated in the figure below (Figure 2 from the paper).

Figure 2: The overall procedure of LRC (attention and normalization modules omitted for clarity). LRC involves two main steps: (1) Low-Rank Projection: low-rank projection matrices compress the teacher's weights into a lower-dimensional space, and the results are assigned to the student (e.g., the student weight $W_{\mathrm{up},i}^S$ is computed from the teacher weight $W_{\mathrm{up},i}^T$). (2) Activation Clone: standard forward passes in both models collect intermediate activations, which are aligned using a Mean Squared Error (MSE) loss. The projection matrices are reused for this alignment, maximizing knowledge transfer while lowering training cost.

4.2.1. Step 1: Low-Rank Projection

This step generates the student model's weights from the teacher's weights. This is not a one-time initialization but happens dynamically in every forward pass during training, as the projection matrices are being learned.

The paper defines a set of trainable low-rank projection matrices for each relevant weight matrix in the Transformer architecture: $ W_{m,i}^p, W_{\mathrm{emb}}^p, W_{\mathrm{lm}}^p $, where $m$ denotes the module type ($\mathrm{q}$, $\mathrm{k}$, $\mathrm{v}$, $\mathrm{o}$ for attention; up, gate, down for FFN) and $i$ is the layer index.

Attention and FFN Weight Projection: For each layer $i$ and each module $m$ in the Transformer, the student's weight matrix $W_{m,i}^S$ is computed by post-multiplying the corresponding teacher weight matrix $W_{m,i}^T$ with a trainable projection matrix $W_{m,i}^p$ (a minimal sketch follows the symbol list below).

The formula is: $ W_{m,i}^S = W_{m,i}^T W_{m,i}^p $

  • Symbol Explanation:
    • $W_{m,i}^S \in \mathbb{R}^{d_m^T \times d^S}$: The resulting weight matrix for the student model. Note that the input dimension remains the teacher's ($d_m^T$), while the output dimension is compressed to the student's hidden size ($d^S$).
    • $W_{m,i}^T \in \mathbb{R}^{d_m^T \times d^T}$: The frozen weight matrix from the pre-trained teacher model.
    • $W_{m,i}^p \in \mathbb{R}^{d^T \times d^S}$: The trainable low-rank projection matrix. Its goal is to learn the best mapping from the teacher's hidden space of dimension $d^T$ to the student's smaller hidden space of dimension $d^S$.
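A minimal sketch of this projection step, with hypothetical dimensions and variable names (the real hidden sizes depend on the chosen teacher and student):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: teacher hidden d_T, student hidden d_S, and the
# output dimension d_m_T of one teacher module (e.g., a q_proj weight).
d_T, d_S, d_m_T = 3072, 2048, 3072

W_T = torch.randn(d_m_T, d_T)   # frozen teacher weight W^T (no gradients)
W_p = nn.Parameter(torch.randn(d_T, d_S) * d_T ** -0.5)  # trainable W^p

# The student weight is regenerated in every forward pass while W_p is
# being learned; only W_p (not W_T or W_S) receives gradient updates.
W_S = W_T @ W_p                 # W^S = W^T W^p, shape (d_m_T, d_S)
assert W_S.shape == (d_m_T, d_S)
```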

Embedding and LM Head Projection: Similarly, the token embedding matrix and the final language model (LM) head are projected with their own matrices: $ W_m^S = W_m^T W_m^p $

  • Symbol Explanation:
    • $m \in \{\mathrm{emb}, \mathrm{lm}\}$: Denotes either the embedding or the LM head matrix.

    • $W_m^S \in \mathbb{R}^{|V| \times d^S}$: The student's embedding or LM head matrix.

    • $W_m^T \in \mathbb{R}^{|V| \times d^T}$: The teacher's embedding or LM head matrix.

    • $W_m^p \in \mathbb{R}^{d^T \times d^S}$: The trainable projection matrix for the embedding or LM head.

    • $|V|$: The size of the shared vocabulary.

      After this step, a complete student model with a smaller hidden dimension $d^S$ is constructed, ready for a forward pass.

4.2.2. Step 2: Activation Clone

After generating the student's weights, both the teacher and the newly formed student perform a forward pass on the same input data. During this pass, their intermediate activations at various points are collected and aligned. LRC clones a comprehensive set of activations to ensure the student mimics the teacher's internal computations closely.

The specific activations cloned are:

  1. Pre-computation hidden states: The inputs to the main weight matrices, i.e., $\pmb{h}_m = \pmb{x}\pmb{W}_m^\top$, for $m \in \{\mathrm{q, k, v, up, gate}\}$.

  2. Module outputs: The final outputs of the attention mechanism ($\pmb{o}_{\mathrm{attn}}$) and the FFN ($\pmb{o}_{\mathrm{ffn}}$).

    These activations are aligned using a Mean Squared Error (MSE) loss, which is summed across all layers to form the total clone loss, $\mathcal{L}_{\mathrm{clone}}$.

The overall formula for the clone loss is (a sketch follows the symbol list): $ \mathcal{L}_{\mathrm{clone}} = \sum_i^l \Big[ \mathcal{E}(o_{\mathrm{attn},i}^S, o_{\mathrm{attn},i}^T W_{\mathrm{o},i}^p) + \mathcal{E}(o_{\mathrm{ffn},i}^S, o_{\mathrm{ffn},i}^T W_{\mathrm{down},i}^p) + \sum_m \mathcal{E}(h_{m,i}^S, h_{m,i}^T) \Big] $

  • Symbol Explanation:
    • $\mathcal{L}_{\mathrm{clone}}$: The total activation cloning loss.
    • $\sum_i^l$: Summation over all $l$ layers of the model.
    • $\mathcal{E}(a, b)$: The Mean Squared Error (MSE) between tensors $a$ and $b$.
    • $o_{\mathrm{attn},i}^S$ and $o_{\mathrm{ffn},i}^S$: The output activations of the student's attention and FFN modules at layer $i$.
    • $o_{\mathrm{attn},i}^T$ and $o_{\mathrm{ffn},i}^T$: The corresponding output activations of the teacher.
    • $W_{\mathrm{o},i}^p$ and $W_{\mathrm{down},i}^p$: The same low-rank projection matrices used in Step 1 for the attention output and FFN down-projection weights. They are reused here to project the teacher's output activations into the student's smaller dimensional space for comparison; this reuse is the key to the alignment-free design.
    • $h_{m,i}^S$ and $h_{m,i}^T$: The intermediate hidden states for module $m$ (q, k, v, up, gate) at layer $i$. Since these are computed before the final projection, their dimensions are already compatible between teacher and student, so no extra alignment projection is needed.
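A hedged single-layer sketch of this loss; the tensor names and dict layout are our assumptions, and the paper sums this quantity over all layers:

```python
import torch.nn.functional as F

def layer_clone_loss(h_S, h_T, o_attn_S, o_attn_T, o_ffn_S, o_ffn_T,
                     W_o_p, W_down_p):
    """One layer's contribution to L_clone.

    h_S / h_T: dicts of pre-computation states for m in {q, k, v, up, gate};
    these already share dimensions, so they are compared directly. Module
    outputs differ in width, so the teacher's are first projected through
    the *same* matrices W_o_p / W_down_p used to generate student weights.
    """
    loss = sum(F.mse_loss(h_S[m], h_T[m]) for m in ("q", "k", "v", "up", "gate"))
    loss = loss + F.mse_loss(o_attn_S, o_attn_T @ W_o_p)   # attention output
    loss = loss + F.mse_loss(o_ffn_S, o_ffn_T @ W_down_p)  # FFN output
    return loss
```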

4.2.3. Total Training Objective

The final training objective combines the clone loss with two standard distillation losses (a sketch follows the symbol list): $ \mathcal{L} = \mathcal{L}_{\mathrm{KL}} + \mathcal{L}_{\mathrm{LM}} + \alpha \mathcal{L}_{\mathrm{clone}} $

  • Symbol Explanation:
    • $\mathcal{L}_{\mathrm{KL}}$: A Kullback-Leibler (KL) divergence loss that aligns the student's final output probability distribution over the vocabulary with the teacher's. This is a standard output-based distillation objective.
    • $\mathcal{L}_{\mathrm{LM}}$: A standard next-token prediction (language modeling) loss on the ground-truth labels. This helps the student learn from the training data directly.
    • $\alpha$: A hyperparameter that balances the activation cloning loss against the other losses.
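As a sketch, the combined objective might be computed as follows; the KL direction and reduction choices follow common practice and are our assumptions, not necessarily the paper's exact implementation:

```python
import torch.nn.functional as F

def total_loss(logits_S, logits_T, labels, clone_loss, alpha=1.0):
    # Output-based distillation: KL divergence between student and teacher
    # next-token distributions (kl_div expects log-probs for the input).
    kl = F.kl_div(F.log_softmax(logits_S, dim=-1),
                  F.softmax(logits_T, dim=-1),
                  reduction="batchmean")
    # Standard next-token prediction loss on the ground-truth labels.
    lm = F.cross_entropy(logits_S.view(-1, logits_S.size(-1)), labels.view(-1))
    return kl + lm + alpha * clone_loss
```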

4.2.4. The Alignment-Free Property

The paper formally states and proves why LRC does not need extra alignment matrices. Lemma 1 provides the intuition for the FFN module: if the intermediate FFN activations ($h_{\mathrm{up},i}$ and $h_{\mathrm{gate},i}$) are perfectly cloned, then the student's FFN output is exactly the teacher's FFN output projected by the down-projection matrix $W_{\mathrm{down},i}^p$. Minimizing the difference between $h^S$ and $h^T$ therefore naturally aligns the final outputs $o^S$ and $o^T W^p$, making separate alignment modules redundant. The numeric check below illustrates this.
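The following toy check (our construction, assuming a SwiGLU FFN computed as $o = (\mathrm{SiLU}(h_{\mathrm{gate}}) \odot h_{\mathrm{up}}) W_{\mathrm{down}}$) verifies the lemma numerically; it holds by simple associativity of matrix multiplication:

```python
import torch
import torch.nn.functional as F

d_T, d_S, d_mid = 64, 32, 128
W_down_T = torch.randn(d_mid, d_T)   # frozen teacher down-projection
W_down_p = torch.randn(d_T, d_S)     # learned low-rank projection
W_down_S = W_down_T @ W_down_p       # generated student down-projection

# Assume the intermediate FFN states are perfectly cloned (shared here).
h_gate = torch.randn(5, d_mid)
h_up = torch.randn(5, d_mid)
a = F.silu(h_gate) * h_up            # SwiGLU intermediate activation

o_T = a @ W_down_T                   # teacher FFN output
o_S = a @ W_down_S                   # student FFN output
# Associativity: a @ (W_down_T @ W_down_p) == (a @ W_down_T) @ W_down_p,
# so the student output equals the projected teacher output exactly.
assert torch.allclose(o_S, o_T @ W_down_p, atol=1e-4)
```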

4.2.5. Algorithm Summary

The overall procedure is summarized in Algorithm 1 from the paper.

Algorithm 1: Overall Procedure of LRC

  • Input: Token sequence $\mathcal{T}$, teacher's weights $\{W^T\}$, learnable projection matrices $\{W^p\}$, number of layers $l$.
  • Output: Clone loss $\mathcal{L}_{\mathrm{clone}}$.

Step 1: Low-Rank Projection

  1. For each layer $i$ from 1 to $l$:
  2. For each module $m \in \{\mathrm{q, k, v, o, up, gate, down}\}$:
  3. Generate student weights: $W_{m,i}^S \leftarrow W_{m,i}^T W_{m,i}^p$.
  4. Generate student embedding and LM head weights: $W_{\mathrm{emb}}^S \leftarrow W_{\mathrm{emb}}^T W_{\mathrm{emb}}^p$; $W_{\mathrm{lm}}^S \leftarrow W_{\mathrm{lm}}^T W_{\mathrm{lm}}^p$.

Step 2: Activation Clone

  5. Initialize $\mathcal{L}_{\mathrm{clone}} \leftarrow 0$.
  6. Perform a forward pass with the teacher model to collect its intermediate activations: $h^T, o_{\mathrm{attn}}^T, o_{\mathrm{ffn}}^T$.
  7. Perform a forward pass with the generated student model to collect its activations: $h^S, o_{\mathrm{attn}}^S, o_{\mathrm{ffn}}^S$.
  8. For each layer $i$ from 1 to $l$:
  9. For each intermediate state $m \in \{\mathrm{q, k, v, gate, up}\}$:
  10. Accumulate the loss on intermediate states: $\mathcal{L}_{\mathrm{clone}} \leftarrow \mathcal{L}_{\mathrm{clone}} + \mathcal{E}(h_{m,i}^S, h_{m,i}^T)$.
  11. Accumulate the loss on module outputs, reusing the projection matrices for alignment: $\mathcal{L}_{\mathrm{clone}} \leftarrow \mathcal{L}_{\mathrm{clone}} + \mathcal{E}(o_{\mathrm{attn},i}^S, o_{\mathrm{attn},i}^T W_{\mathrm{o},i}^p) + \mathcal{E}(o_{\mathrm{ffn},i}^S, o_{\mathrm{ffn},i}^T W_{\mathrm{down},i}^p)$.
  12. Return $\mathcal{L}_{\mathrm{clone}}$.

This loss is then combined with $\mathcal{L}_{\mathrm{KL}}$ and $\mathcal{L}_{\mathrm{LM}}$ to perform a gradient update on the trainable parameters: the projection matrices $\{W^p\}$ and the student's `RMSNorm` weights. A sketch of this setup follows.
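A sketch of the corresponding parameter setup; the layer count, dimensions, and learning rate are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

d_T, d_S, n_layers = 3072, 2048, 28   # hypothetical teacher/student sizes

# Only the projection matrices {W^p} (per-layer modules plus the embedding
# and LM-head projections) and the student's RMSNorm scales are trainable;
# all teacher weights stay frozen.
proj = nn.ParameterDict({
    f"{m}_{i}": nn.Parameter(torch.randn(d_T, d_S) * d_T ** -0.5)
    for i in range(n_layers)
    for m in ("q", "k", "v", "o", "up", "gate", "down")
})
proj["emb"] = nn.Parameter(torch.randn(d_T, d_S) * d_T ** -0.5)
proj["lm"] = nn.Parameter(torch.randn(d_T, d_S) * d_T ** -0.5)
rmsnorm_scales = nn.ParameterList(
    [nn.Parameter(torch.ones(d_S)) for _ in range(2 * n_layers + 1)]
)

optimizer = torch.optim.AdamW(
    list(proj.parameters()) + list(rmsnorm_scales), lr=1e-4
)
# Each step: loss = L_KL + L_LM + alpha * L_clone; loss.backward(); optimizer.step()
```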

5. Experimental Setup

5.1. Datasets

  • Pre-training Datasets: The authors constructed a training corpus by mixing several high-quality datasets:
    • Fineweb-Edu: A large dataset filtered for high-quality educational content. This formed the primary component.
    • DCLM (DataComp for Language Models): A large-scale, high-quality web corpus curated for language model pre-training.
    • Cosmopedia v2: A large-scale, knowledge-rich synthetic dataset.
    • OpenHermes: A dataset of synthetic conversational data. The combined dataset was shuffled randomly without any curriculum learning.
  • Supervised Fine-Tuning (SFT) Dataset:
    • UltraChat: A multi-round dialogue dataset used for instruction-tuning the models to create "instruct" versions that can follow commands.
  • Justification: The choice of datasets, particularly the emphasis on Fineweb-Edu, reflects a focus on high-quality data. The authors later show that data quality has a significant impact on LRC's performance, making this a well-justified choice for an efficiency-focused method.

5.2. Evaluation Metrics

The models were evaluated on a suite of nine standard NLP benchmarks covering reasoning, commonsense, and knowledge. The primary metric for all tasks is accuracy.

  1. ARC-E (AI2 Reasoning Challenge - Easy) and ARC-C (Challenge)
    • Conceptual Definition: Measures grade-school level science question answering ability. It tests reasoning and knowledge.
    • Mathematical Formula (Accuracy): $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation: The formula is straightforward. A prediction is correct if the model chooses the right answer from a multiple-choice list.
  2. LogiQA
    • Conceptual Definition: A reading comprehension task that requires logical reasoning over paragraphs.
    • Formula: Standard accuracy.
  3. CSQA (CommonsenseQA)
    • Conceptual Definition: A multiple-choice question answering dataset that requires commonsense knowledge. Questions are designed to be easy for humans but hard for AI.
    • Formula: Standard accuracy.
  4. PIQA (Physical Interaction QA)
    • Conceptual Definition: Tests commonsense reasoning about how to interact with everyday objects physically.
    • Formula: Standard accuracy.
  5. WinoG (WinoGrande)
    • Conceptual Definition: A pronoun resolution task that requires commonsense reasoning to resolve ambiguities.
    • Formula: Standard accuracy.
  6. BoolQ
    • Conceptual Definition: A reading comprehension dataset with yes/no questions based on a given text passage.
    • Formula: Standard accuracy.
  7. SciQ
    • Conceptual Definition: A question-answering dataset focused on scientific knowledge, also from a grade-school science curriculum.
    • Formula: Standard accuracy.
  8. MMLU (Massive Multitask Language Understanding)
    • Conceptual Definition: A comprehensive benchmark designed to measure knowledge acquired during pre-training. It covers 57 subjects across STEM, humanities, social sciences, etc. Performance on MMLU is often considered a key indicator of a model's general knowledge and reasoning abilities.
    • Formula: Standard accuracy.
  9. Avg. ↑
    • Conceptual Definition: The arithmetic mean of the scores on all the above benchmarks. It provides a single number to summarize the model's overall performance.

5.3. Baselines

The paper compares LRC against a strong set of baseline models:

  • Direct Competitors (Distillation/Pruning Methods):
    • Sheared Llama: A state-of-the-art method combining hard pruning and continued training. The authors conduct a direct, fair comparison by using the same teacher and training data.
    • Minitron: Another recent method that integrates pruning and distillation.
    • TinyBERT: A classic feature-based distillation method, which the authors adapt to the Llama architecture for comparison.
  • State-of-the-Art Open-Source SLMs:
    • Qwen3 family (e.g., Qwen3-1.7B, Qwen3-4B): Highly competitive models from Alibaba.
    • MiniCPM: A strong SLM from the OpenBMB community.
    • SmolLM2: An SLM family from Hugging Face.
    • Gemma3: A family of models from Google.
    • InternLM: A model family from the Shanghai AI Laboratory.

      These baselines are representative because they include both alternative distillation/pruning techniques and the best-performing publicly available SLMs of similar sizes, providing a comprehensive evaluation of LRC's standing.

6. Results & Analysis

6.1. Core Results Analysis

The main results demonstrate LRC's exceptional training efficiency and performance.

Models with < 2B Parameters: The following are the results from Table 1 of the original paper:

| Model | InternLM2-1.8B | LRC-1.7B | Qwen3-1.7B | SmolLM2-1.7B | LRC-1.5B | MiniCPM-1.2B |
|---|---|---|---|---|---|---|
| Teacher | N/A | Qwen2.5-3B | N/A | N/A | Llama3-3B | N/A |
| # Tokens | 2T | 20B | 36T | 11T | 10B | 1T |
| Dataset | N/A | Mixed-1.1 | N/A | SmolLM | Mixed-1.1 | N/A |
| ARC-E | 71.04 | 74.62 | 72.47 | 69.11 | 74.75 | 70.16 |
| ARC-C | 42.06 | 44.20 | 43.00 | 43.52 | 44.97 | 39.68 |
| LogiQA | 28.42 | 30.88 | 28.42 | 28.88 | 30.72 | 30.88 |
| CSQA | 70.11 | 70.19 | 64.78 | 51.19 | 65.77 | 64.29 |
| PIQA | 74.27 | 73.07 | 72.20 | 76.01 | 73.07 | 74.65 |
| WinoG | 63.77 | 63.30 | 61.48 | 68.98 | 62.25 | 60.77 |
| BoolQ | 75.50 | 79.82 | 77.65 | 68.47 | 75.78 | 67.58 |
| SciQ | 94.50 | 93.80 | 93.10 | 89.80 | 94.60 | 91.50 |
| MMLU | 43.75 | 54.93 | 55.44 | 48.50 | 49.42 | 44.23 |
| Avg. ↑ | 62.60 | 64.98 | 63.17 | 60.50 | 63.48 | 60.42 |
  • Analysis: LRC-1.7B, trained on only 20 billion tokens, achieves the highest average score (64.98), outperforming Qwen3-1.7B (63.17), which was trained on 36 trillion tokens. That is a 1,800x gain in token efficiency with superior performance (the arithmetic is spelled out below). Similarly, LRC-1.5B (trained on 10B tokens) significantly outperforms SmolLM2-1.7B (trained on 11T tokens). This strongly validates the paper's central claim of massive efficiency gains.
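The headline ratio follows directly from the token counts in the table:

$ \frac{36\,\mathrm{T\ tokens\ (Qwen3\text{-}1.7B)}}{20\,\mathrm{B\ tokens\ (LRC\text{-}1.7B)}} = \frac{36{,}000\,\mathrm{B}}{20\,\mathrm{B}} = 1{,}800\times $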

Models with > 2B Parameters: The following are the results from Table 2 of the original paper:

| Model | Gemma3-4B | Minitron-4B | Qwen3-4B | LRC-4B | LRC-2.7B-B | Sheared-Llama-2.7B-B |
|---|---|---|---|---|---|---|
| Teacher | N/A | Nemotron4-15B | N/A | Qwen2.5-7B | Llama2-7B | Llama2-7B |
| # Tokens | 4T | 94B | 36T | 18B | 10B | 50B |
| Dataset | N/A | N/A | N/A | Mixed-2.0 | Redpajama | Redpajama |
| ARC-E | 82.53 | 79.59 | 80.47 | 78.37 | 58.59 | 67.30 |
| ARC-C | 57.08 | 54.35 | 53.58 | 52.47 | 29.61 | 33.58 |
| LogiQA | 33.03 | 30.26 | 33.64 | 34.10 | 29.03 | 28.26 |
| CSQA | 69.37 | 71.09 | 75.76 | 79.28 | 36.36 | 18.92 |
| PIQA | 76.44 | 77.64 | 75.08 | 76.82 | 66.97 | 76.17 |
| WinoG | 69.38 | 65.93 | 65.27 | 67.72 | 62.43 | 65.04 |
| BoolQ | 83.94 | 82.60 | 84.95 | 84.50 | 74.31 | 65.99 |
| SciQ | 95.50 | 96.60 | 95.50 | 95.00 | 85.50 | 91.10 |
| MMLU | 57.58 | 56.77 | 68.38 | 64.41 | 31.20 | 26.56 |
| Avg. ↑ | 69.43 | 68.31 | 70.29 | 70.30 | 52.67 | 52.55 |
  • Analysis: At a larger scale, LRC-4B (trained on 18B tokens) matches the performance of Qwen3-4B (trained on 36T tokens), again showcasing an enormous 2000x efficiency gain. It also surpasses Minitron-4B, which used 5x more data. In a direct comparison with Sheared-Llama-2.7B-B (base models without SFT), LRC-2.7B-B achieves comparable performance using 5x fewer tokens, demonstrating the superiority of its soft pruning approach.

6.2. Ablation Studies / Parameter Analysis

Ablation studies were conducted to isolate the contributions of LRC's key components.

Low-Rank Projection vs. Training from Scratch: The chart below (Figure 3 from the paper) shows the convergence of the language modeling loss over time.

Figure 3: Effect of LRC component ablations on LM loss convergence over training time. The line chart tracks the language modeling (LM) loss under different training conditions, comparing LRC and each of its ablated variants against TinyBERT.

  • Analysis: The LRC line (blue) converges to a low loss much faster than TinyBERT (green). LRC reaches an LM loss of 3.0 about 2.7 times faster than TinyBERT. This confirms that generating student weights via low-rank projection is far more effective and efficient than initializing the student randomly and training it from scratch.

Importance of Activation Clone Components:

  • Term-level Ablation: The authors removed individual activation clone terms to see their impact. The following are the results from Table 3 of the original paper:

    | Removed Term | None | Attn q | Attn k | Attn v | Attn o | FFN gate | FFN up | FFN down |
    |---|---|---|---|---|---|---|---|---|
    | LM Loss ↓ | 2.639 | 2.630 | 2.629 | 2.639 | 2.636 | 2.677 | 2.639 | 2.651 |
    • Analysis: Removing the FFN-related terms, especially FFN gate and FFN down, leads to a significant increase in the final LM loss (worse performance). This provides strong evidence that cloning FFN activations is crucial for effective knowledge transfer.
  • Module-level Ablation (from Figure 3):

    • LRC w/o FFN (orange line) shows a substantial and persistent performance degradation throughout training compared to the full LRC.
    • LRC w/o Attn (red line) initially performs worse but eventually recovers to near the level of full LRC.
    • This further reinforces the conclusion that FFN activations are a more critical source of knowledge for distillation than attention activations.
  • Effectiveness of Activation Clone (from Figure 3):

    • Comparing LRC (blue line) with LRC w/o All Clone Loss (purple line) shows that activation cloning provides a significant speedup. LRC reaches an LM loss of 3.0 more than 2 times faster, demonstrating that the clone loss provides a powerful training signal.

Alignment-Free Property:

  • Analysis (from Figure 3): The LRC w/o Alignment Free variant (brown line), which adds explicit, trainable alignment matrices, performs worse than the standard LRC. It takes longer to train and converges to a higher final loss. This confirms that the elegant, built-in alignment mechanism of LRC is not only sufficient but also more efficient and stable.

6.3. Model Analysis

Performance Trend During Training: The following chart (Figure 4 from the paper) tracks the MMLU score as training progresses.

Figure 4: The trend of MMLU scores with increasing training tokens (in billions) for LRC-1.5B and LRC-4B. As training tokens increase, LRC-4B's score climbs markedly, exceeding 60, while LRC-1.5B scores lower at small token counts, fluctuating between roughly 35 and 50.

  • Analysis: The performance of both LRC-1.5B and LRC-4B improves steadily as they are trained on more tokens. This shows that the method is stable and continues to benefit from more data, indicating good scalability and learning dynamics. Competitive performance is achieved even with half the total training data.

Impact of Training Data Quality: The following are the results from Table 4 of the original paper:

| Model | LRC-1.5B | LRC-1.5B | LRC-1.5B |
|---|---|---|---|
| Teacher | Llama3-3B | Llama3-3B | Llama3-3B |
| # Tokens | 20B | 10B | 10B |
| Dataset | Mixed-2.0 | Mixed-1.0 | Mixed-1.1 |
| Avg. ↑ | 62.12 | 61.35 | 62.48 |
  • Analysis: LRC-1.5B trained on 10B tokens of a higher-quality, filtered dataset (Mixed-1.1) outperforms the same model architecture trained on 20B tokens of a lower-quality dataset (Mixed-2.0). This demonstrates that LRC is highly sample-efficient and can effectively leverage high-quality data to achieve better performance with even fewer resources.

Analysis of Knowledge Transfer in FFNs: The authors hypothesized that FFNs store factual knowledge. To test this, they ran a neuron-masking experiment. The following are the results from Table 5 of the original paper:

| Score Type | Teacher | Student |
|---|---|---|
| Original Score | 0.85 | 0.48 |
| Important Neurons Masked | 0.62 (-27%) | 0.33 (-31%) |
| Random Neurons Masked | 0.85 | 0.49 |
  • Analysis: Masking FFN neurons that were highly active on factual questions in the teacher model caused a significant performance drop in both the teacher and the LRC student. In contrast, masking random neurons had almost no effect. This provides strong evidence that (1) specific FFN neurons indeed encode factual knowledge, and (2) LRC successfully transfers this knowledge structure to the student by aligning their activation patterns.


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces Low-Rank Clone (LRC), a simple, elegant, and highly efficient method for training Small Language Models (SLMs) through knowledge distillation. LRC's core innovation is a unified framework that uses trainable low-rank projection matrices to simultaneously perform soft pruning (compressing teacher weights) and activation cloning (aligning intermediate representations). A key finding is the critical importance of cloning activations from Feed-Forward Networks (FFNs), which are shown to be rich sources of transferable knowledge.

Extensive experiments demonstrate that LRC achieves state-of-the-art performance, matching or surpassing SLMs trained on trillions of tokens while using over 1,000 times less training data. This positions LRC as a groundbreaking approach that dramatically lowers the computational and data requirements for developing high-performance SLMs, making advanced language model creation more accessible and sustainable.

7.2. Limitations & Future Work

The authors acknowledge a few limitations and suggest directions for future work:

  • Performance Ceiling at Scale: While LRC is highly efficient on a modest training budget (e.g., 20B tokens), its performance ceiling when trained on much larger datasets (e.g., trillions of tokens) remains unexplored. It is unclear if the gains would diminish at a very large scale.
  • Architectural Compression: The current implementation of LRC focuses on compressing the hidden dimension ($d$) but keeps the FFN's intermediate dimension ($d_{mid}$) the same for both teacher and student. This was a design choice that prioritizes distillation efficiency over maximum architectural compression.
  • Future Work - Integration with Other Methods: The authors suggest that LRC is compatible with other post-hoc compression techniques. They demonstrate this by applying LLM-Pruner to an LRC model, successfully creating an even smaller model that still outperforms strong baselines. This suggests that LRC can serve as a highly efficient starting point for further compression.

7.3. Personal Insights & Critique

  • Strengths and Inspirations:
    • Elegance and Simplicity: The core idea of using a single set of low-rank matrices for both weight compression and activation alignment is exceptionally elegant. It solves multiple problems (information loss, alignment complexity) with one unified mechanism, which is the hallmark of a strong research contribution.
    • Focus on FFNs: The paper's emphasis on FFNs as a key repository of transferable knowledge is a crucial insight. This challenges the community's traditional focus on attention mechanisms in distillation and opens up new avenues for research into what different parts of a Transformer model learn and how to best transfer that knowledge.
    • Practical Impact: The claimed 1,000x efficiency gain is not just an incremental improvement; it's a paradigm shift. If these results hold up to broader scrutiny and application, LRC could democratize the creation of custom, high-performance SLMs, enabling smaller academic labs, startups, and even individuals to compete with large industrial labs.
  • Potential Issues and Areas for Improvement:
    • Teacher Dependency: Like all distillation methods, LRC's performance is fundamentally capped by the quality of the teacher model. The student cannot learn knowledge that the teacher does not possess. This dependency is a structural limitation of the approach.
    • Hyperparameter Sensitivity: The choice of the student's rank (i.e., its hidden dimension $d^S$) is a critical hyperparameter that dictates the compression ratio and model capacity. The paper does not provide an in-depth analysis of how to choose the optimal rank for a given teacher and task. Similarly, the weight $\alpha$ for the clone loss may require careful tuning.
    • Generalization Beyond Llama/GPT Architectures: The experiments are conducted on standard decoder-only Transformer architectures. While the principles of LRC seem general, its effectiveness on other architectures (e.g., encoder-decoder models like T5, or mixture-of-experts models) would need to be verified.
  • Transferability: The core concept of using learnable low-rank projections as a unified tool for compression and knowledge transfer is highly transferable. It could be applied to other domains where large models are distilled into smaller ones, such as in computer vision (e.g., distilling a large Vision Transformer into a smaller one) or speech recognition. The underlying principle is model-agnostic and could prove to be a valuable addition to the general model compression toolkit.
