Papers

Tag Filter: Long-Context Modeling
Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
Published:6/21/2024
Sliding-Window Attention Optimization · Improving Inference Efficiency of Large Language Models · Mixture of Attention Spans · Adaptive Window Length Configuration · Long-Context Modeling
This paper introduces the Mixture of Attention Spans (MoA), which optimizes inference efficiency for large language models (LLMs) by tailoring sliding-window lengths for different attention heads and layers, significantly improving effective context length and retrieval accuracy.
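
As a rough illustration of the idea (not the paper's code), the sketch below builds a different sliding-window mask for each attention head; the window lengths here are made up, whereas MoA searches for them per head and layer.

```python
# Minimal sketch of heterogeneous sliding-window attention: each head gets its
# own window length. Window values below are illustrative only.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask that also hides keys more than `window` positions back."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # True = may attend

def heterogeneous_window_attention(q, k, v, windows):
    """q, k, v: [heads, seq, dim]; windows: one window length per head."""
    heads, seq_len, dim = q.shape
    out = torch.empty_like(q)
    for h, w in enumerate(windows):
        mask = sliding_window_mask(seq_len, w)
        scores = (q[h] @ k[h].T) / dim ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        out[h] = torch.softmax(scores, dim=-1) @ v[h]
    return out

# Example: four heads with short local spans plus one long-range head.
q = k = v = torch.randn(4, 128, 64)
print(heterogeneous_window_attention(q, k, v, windows=[16, 32, 64, 128]).shape)
```
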
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Published:6/17/2024
Adaptive Structured Sparse Attention · LLM Inference Acceleration · Long-Context Modeling · Near-Lossless Sparse Attention
SampleAttention is introduced as an adaptive, near-lossless sparse attention method for long-context LLMs, significantly reducing Time-to-First-Token (TTFT) latency while maintaining accuracy, achieving up to a 2.42x TTFT reduction compared to FlashAttention.
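
The snippet below is only a rough sketch of the general recipe (probe a few queries to estimate which keys matter, then attend sparsely to those); the sampling scheme, coverage threshold, and absence of a causal mask are simplifications, not SampleAttention's actual structured patterns.

```python
# Rough sketch of sampling-based adaptive sparse attention: estimate key
# importance from a handful of probe queries, keep the smallest key set that
# covers most of the attention mass, then attend only to those keys.
import torch

def sampled_sparse_attention(q, k, v, n_probe=32, coverage=0.95):
    seq_len, dim = q.shape
    probe_idx = torch.randperm(seq_len)[:n_probe]
    probe = torch.softmax(q[probe_idx] @ k.T / dim ** 0.5, dim=-1)  # [n_probe, seq]
    importance = probe.mean(dim=0)                                  # per-key attention mass
    order = torch.argsort(importance, descending=True)
    covered = torch.cumsum(importance[order], dim=0)
    keep = order[: int((covered < coverage).sum()) + 1]             # smallest covering set
    scores = q @ k[keep].T / dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v[keep]

q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(sampled_sparse_attention(q, k, v).shape)  # torch.Size([1024, 64])
```
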
WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
Published:2/28/2025
Long-Context Modeling · Large Language Model Training · Weight Pipeline Parallelism · Distributed Training Optimization · Communication Efficiency Enhancement
WeiPipe is a weight pipeline parallelism method that effectively reduces communication costs in large model training by overlapping communication and computation, significantly enhancing scalability and throughput compared to existing methods.
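
The toy schedule below only illustrates the general principle of hiding weight transfers behind computation in a single process; WeiPipe's actual ring-style schedule across workers, and its backward pass, are far more involved. All names and timings here are invented.

```python
# Toy single-process illustration of overlapping weight communication with
# computation. Stage counts and sleep times are stand-ins for real transfers
# and micro-batch compute.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_weights(stage):
    time.sleep(0.1)                       # stand-in for receiving a weight shard
    return f"weights[{stage}]"

def compute(stage, weights):
    time.sleep(0.1)                       # stand-in for forward/backward work
    return f"output of stage {stage}"

def pipelined_pass(num_stages=4):
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(fetch_weights, 0)      # prefetch the first shard
        for stage in range(num_stages):
            weights = pending.result()               # usually already arrived
            if stage + 1 < num_stages:
                pending = comm.submit(fetch_weights, stage + 1)  # overlap next transfer
            out = compute(stage, weights)            # compute hides the transfer
    return out

start = time.time()
pipelined_pass()
print(f"overlapped: ~{time.time() - start:.2f}s vs ~0.8s if done serially")
```
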
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Published:12/9/2025
RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · Sequence Policy Optimization · Long-Context Modeling · Reinforcement Learning for Math Reasoning
This study examines whether reinforcement learning (RL) truly enhances reasoning capabilities in language models, offering a transparent framework. Key findings include RL's effectiveness at the model's competence edge, with minimal pre-training seed needed for transfer, while mid-training…
Jenga: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
Large Language Model Fine-Tuning · Long-Context Modeling · Sparse Attention Mechanism
Jenga is a novel LLM fine-tuning system that optimizes activation memory usage in long-context applications using Contextual Token Sparsity. It employs token elimination, pattern prediction, and kernel optimization, achieving up to 1.93x memory reduction and 1.36x acceleration.
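
As a hand-wavy sketch of contextual token sparsity (not Jenga's predictor or kernels), the snippet below scores tokens by the attention they receive and runs the feed-forward block only on the high-scoring ones; the scoring rule and keep ratio are arbitrary illustrative choices.

```python
# Sketch: run the heavy feed-forward block only on "important" tokens and let
# the rest pass through the residual path. Importance here is simply the
# attention mass a token receives, an illustrative stand-in.
import torch

def sparse_ffn(hidden, attn_probs, ffn, keep_ratio=0.5):
    """hidden: [seq, dim]; attn_probs: [heads, seq, seq] from the same layer."""
    seq_len = hidden.shape[0]
    received = attn_probs.mean(dim=0).sum(dim=0)            # attention each token receives
    keep = torch.topk(received, int(seq_len * keep_ratio)).indices
    out = hidden.clone()                                    # dropped tokens: residual only
    out[keep] = out[keep] + ffn(hidden[keep])               # FFN on kept tokens only
    return out

ffn = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(),
                          torch.nn.Linear(256, 64))
hidden = torch.randn(512, 64)
attn = torch.softmax(torch.randn(8, 512, 512), dim=-1)
print(sparse_ffn(hidden, attn, ffn).shape)  # torch.Size([512, 64])
```
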
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
RL Training for Large Language Models · Long-Context Modeling · LLM Reasoning Capacity Enhancement · Sparse Attention Mechanism
DeepSeek-V3.2 is presented, balancing computational efficiency and reasoning capabilities through three innovations: a sparse attention mechanism that reduces complexity, a scalable reinforcement learning framework rivaling GPT-5, and a synthesis pipeline enhancing generalization.
Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning
Published:6/9/2025
Causal Attention Mechanism · Token Pruning · Gradient-Guided Knowledge Distillation · LLM Reasoning Capacity Enhancement · Long-Context Modeling
LeaF uses gradient-guided token pruning to remove confounding tokens, aligning student attention with the teacher's causal focus, improving reasoning accuracy and interpretability across multiple benchmarks.
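
To make the mechanism concrete, here is an illustrative saliency computation: score each input token by the gradient of the loss with respect to its embedding and drop the least salient ones. LeaF's actual teacher-guided, causal criterion is more specific; the tiny model, random data, and 20% threshold below are invented.

```python
# Illustrative gradient-guided token saliency: tokens whose embeddings barely
# influence the loss are treated as distracting and pruned.
import torch
import torch.nn.functional as F

embed = torch.nn.Embedding(1000, 64)
head = torch.nn.Linear(64, 1000)

tokens = torch.randint(0, 1000, (128,))
target = torch.randint(0, 1000, (128,))

emb = embed(tokens)                         # [seq, dim]
emb.retain_grad()                           # keep gradients for a non-leaf tensor
loss = F.cross_entropy(head(emb), target)
loss.backward()

saliency = emb.grad.norm(dim=-1)            # per-token gradient magnitude
keep = saliency > saliency.quantile(0.2)    # prune the bottom 20% as confounders
print(f"kept {int(keep.sum())} of {len(tokens)} tokens")
```
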
The Devil in Linear Transformer
Published:10/19/2022
Transformer architecture · Long-Context Modeling · Sparse Attention Efficiency · Transformer-Based Efficient Forward Prediction
This paper identifies unbounded gradients and attention dilution as key flaws in kernel-based linear transformers, then introduces TransNormer, which stabilizes training via normalized attention and enhances local focus with diagonal attention, achieving superior accuracy and efficiency.
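
A minimal sketch of the two fixes named in the summary, assuming an illustrative ELU feature map, LayerNorm, and block size rather than TransNormer's exact configuration: (1) linear attention with the unbounded per-token denominator removed and a normalization applied to the output instead, and (2) strictly local block-diagonal softmax attention.

```python
# Sketch of normalized linear attention plus block-diagonal local attention.
import torch
import torch.nn.functional as F

def norm_linear_attention(q, k, v, norm):
    """Linear attention without the 1/(q · sum(k)) denominator; normalize output instead."""
    q, k = F.elu(q) + 1, F.elu(k) + 1           # positive feature map
    kv = torch.einsum("sd,se->de", k, v)        # key/value summary, [dim, dim]
    return norm(torch.einsum("sd,de->se", q, kv))

def block_diagonal_attention(q, k, v, block=64):
    """Softmax attention restricted to non-overlapping local blocks."""
    seq_len, dim = q.shape
    out = torch.empty_like(v)
    for s in range(0, seq_len, block):
        qb, kb, vb = q[s:s+block], k[s:s+block], v[s:s+block]
        out[s:s+block] = torch.softmax(qb @ kb.T / dim ** 0.5, dim=-1) @ vb
    return out

q = k = v = torch.randn(256, 64)
norm = torch.nn.LayerNorm(64)
print(norm_linear_attention(q, k, v, norm).shape,
      block_diagonal_attention(q, k, v).shape)
```
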