Papers

Tag Filter: Long-Context Modeling
Mixture of Attention Spans: Optimizing LLM Inference Efficiency with Heterogeneous Sliding-Window Lengths
Published:6/21/2024
Sliding-Window Attention Optimization · Improving Inference Efficiency of Large Language Models · Mixture of Attention Spans · Adaptive Window Length Configuration · Long-Context Modeling
This paper introduces the Mixture of Attention Spans (MoA), which optimizes inference efficiency for large language models (LLMs) by tailoring sliding-window lengths for different attention heads and layers, significantly improving effective context length and retrieval accuracy.
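
As a rough illustration of the idea (not the paper's code), the sketch below builds a different sliding-window mask for each attention head; the window lengths here are made up, whereas MoA searches for them per head and layer.

```python
# Minimal sketch of heterogeneous sliding-window attention: each head gets its
# own window length. Window values below are illustrative only.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask that also hides keys more than `window` positions back."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # True = may attend

def heterogeneous_window_attention(q, k, v, windows):
    """q, k, v: [heads, seq, dim]; windows: one window length per head."""
    heads, seq_len, dim = q.shape
    out = torch.empty_like(q)
    for h, w in enumerate(windows):
        mask = sliding_window_mask(seq_len, w)
        scores = (q[h] @ k[h].T) / dim ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        out[h] = torch.softmax(scores, dim=-1) @ v[h]
    return out

# Example: four heads with short local spans plus one long-range head.
q = k = v = torch.randn(4, 128, 64)
print(heterogeneous_window_attention(q, k, v, windows=[16, 32, 64, 128]).shape)
```
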
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Published:6/17/2024
Adaptive Structured Sparse Attention · LLM Inference Acceleration · Long-Context Modeling · Near-Lossless Sparse Attention
SampleAttention is introduced as an adaptive, near-lossless sparse attention method for long-context LLMs, significantly reducing Time-to-First-Token (TTFT) latency while maintaining accuracy, achieving up to a 2.42x TTFT reduction compared to FlashAttention.
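
The snippet below is only a rough sketch of the general recipe (probe a few queries to estimate which keys matter, then attend sparsely to those); the sampling scheme, coverage threshold, and absence of a causal mask are simplifications, not SampleAttention's actual structured patterns.

```python
# Rough sketch of sampling-based adaptive sparse attention: estimate key
# importance from a handful of probe queries, keep the smallest key set that
# covers most of the attention mass, then attend only to those keys.
import torch

def sampled_sparse_attention(q, k, v, n_probe=32, coverage=0.95):
    seq_len, dim = q.shape
    probe_idx = torch.randperm(seq_len)[:n_probe]
    probe = torch.softmax(q[probe_idx] @ k.T / dim ** 0.5, dim=-1)  # [n_probe, seq]
    importance = probe.mean(dim=0)                                  # per-key attention mass
    order = torch.argsort(importance, descending=True)
    covered = torch.cumsum(importance[order], dim=0)
    keep = order[: int((covered < coverage).sum()) + 1]             # smallest covering set
    scores = q @ k[keep].T / dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v[keep]

q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(sampled_sparse_attention(q, k, v).shape)  # torch.Size([1024, 64])
```
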
WeiPipe: Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training
Published:2/28/2025
Long-Context Modeling · Large Language Model Training · Weight Pipeline Parallelism · Distributed Training Optimization · Communication Efficiency Enhancement
WeiPipe is a weight pipeline parallelism method that effectively reduces communication costs in large model training by overlapping communication and computation, significantly enhancing scalability and throughput compared to existing methods.
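
The toy schedule below only illustrates the general principle of hiding weight transfers behind computation in a single process; WeiPipe's actual ring-style schedule across workers, and its backward pass, are far more involved. All names and timings here are invented.

```python
# Toy single-process illustration of overlapping weight communication with
# computation. Stage counts and sleep times are stand-ins for real transfers
# and micro-batch compute.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_weights(stage):
    time.sleep(0.1)                       # stand-in for receiving a weight shard
    return f"weights[{stage}]"

def compute(stage, weights):
    time.sleep(0.1)                       # stand-in for forward/backward work
    return f"output of stage {stage}"

def pipelined_pass(num_stages=4):
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(fetch_weights, 0)      # prefetch the first shard
        for stage in range(num_stages):
            weights = pending.result()               # usually already arrived
            if stage + 1 < num_stages:
                pending = comm.submit(fetch_weights, stage + 1)  # overlap next transfer
            out = compute(stage, weights)            # compute hides the transfer
    return out

start = time.time()
pipelined_pass()
print(f"overlapped: ~{time.time() - start:.2f}s vs ~0.8s if done serially")
```
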
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
Published:12/9/2025
RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · Sequence Policy Optimization · Long-Context Modeling · Reinforcement Learning for Math Reasoning
This study examines whether reinforcement learning (RL) truly enhances reasoning capabilities in language models, offering a transparent framework. Key findings include RL's effectiveness at the model's competence edge, with minimal pre-training seed needed for transfer, while mid-training…
Jenga: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity
Large Language Model Fine-Tuning · Long-Context Modeling · Sparse Attention Mechanism
Jenga is a novel LLM fine-tuning system that optimizes activation memory usage in long-context applications using Contextual Token Sparsity. It employs token elimination, pattern prediction, and kernel optimization, achieving up to 1.93x memory reduction and 1.36x acceleration.
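
As a hand-wavy sketch of contextual token sparsity (not Jenga's predictor or kernels), the snippet below scores tokens by the attention they receive and runs the feed-forward block only on the high-scoring ones; the scoring rule and keep ratio are arbitrary illustrative choices.

```python
# Sketch: run the heavy feed-forward block only on "important" tokens and let
# the rest pass through the residual path. Importance here is simply the
# attention mass a token receives, an illustrative stand-in.
import torch

def sparse_ffn(hidden, attn_probs, ffn, keep_ratio=0.5):
    """hidden: [seq, dim]; attn_probs: [heads, seq, seq] from the same layer."""
    seq_len = hidden.shape[0]
    received = attn_probs.mean(dim=0).sum(dim=0)            # attention each token receives
    keep = torch.topk(received, int(seq_len * keep_ratio)).indices
    out = hidden.clone()                                    # dropped tokens: residual only
    out[keep] = out[keep] + ffn(hidden[keep])               # FFN on kept tokens only
    return out

ffn = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(),
                          torch.nn.Linear(256, 64))
hidden = torch.randn(512, 64)
attn = torch.softmax(torch.randn(8, 512, 512), dim=-1)
print(sparse_ffn(hidden, attn, ffn).shape)  # torch.Size([512, 64])
```
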
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
RL Training for Large Language Models · Long-Context Modeling · LLM Reasoning Capacity Enhancement · Sparse Attention Mechanism
DeepSeek-V3.2 is presented, balancing computational efficiency and reasoning capabilities through three innovations: a sparse attention mechanism that reduces complexity, a scalable reinforcement learning framework rivaling GPT-5, and a synthesis pipeline enhancing generalization.
Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning
Published:6/9/2025
Causal Attention Mechanism · Token Pruning · Gradient-Guided Knowledge Distillation · LLM Reasoning Capacity Enhancement · Long-Context Modeling
LeaF uses gradient-guided token pruning to remove confounding tokens, aligning student attention with the teacher's causal focus, improving reasoning accuracy and interpretability across multiple benchmarks.
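
To make the mechanism concrete, here is an illustrative saliency computation: score each input token by the gradient of the loss with respect to its embedding and drop the least salient ones. LeaF's actual teacher-guided, causal criterion is more specific; the tiny model, random data, and 20% threshold below are invented.

```python
# Illustrative gradient-guided token saliency: tokens whose embeddings barely
# influence the loss are treated as distracting and pruned.
import torch
import torch.nn.functional as F

embed = torch.nn.Embedding(1000, 64)
head = torch.nn.Linear(64, 1000)

tokens = torch.randint(0, 1000, (128,))
target = torch.randint(0, 1000, (128,))

emb = embed(tokens)                         # [seq, dim]
emb.retain_grad()                           # keep gradients for a non-leaf tensor
loss = F.cross_entropy(head(emb), target)
loss.backward()

saliency = emb.grad.norm(dim=-1)            # per-token gradient magnitude
keep = saliency > saliency.quantile(0.2)    # prune the bottom 20% as confounders
print(f"kept {int(keep.sum())} of {len(tokens)} tokens")
```
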
The Devil in Linear Transformer
Published:10/19/2022
Transformer architecture · Long-Context Modeling · Sparse Attention Efficiency · Transformer-Based Efficient Forward Prediction
This paper identifies unbounded gradients and attention dilution as key flaws in kernel-based linear transformers, then introduces TransNormer, which stabilizes training via normalized attention and enhances local focus with diagonal attention, achieving superior accuracy and efficiency.
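
A minimal sketch of the two fixes named in the summary, assuming an illustrative ELU feature map, LayerNorm, and block size rather than TransNormer's exact configuration: (1) linear attention with the unbounded per-token denominator removed and a normalization applied to the output instead, and (2) strictly local block-diagonal softmax attention.

```python
# Sketch of normalized linear attention plus block-diagonal local attention.
import torch
import torch.nn.functional as F

def norm_linear_attention(q, k, v, norm):
    """Linear attention without the 1/(q · sum(k)) denominator; normalize output instead."""
    q, k = F.elu(q) + 1, F.elu(k) + 1           # positive feature map
    kv = torch.einsum("sd,se->de", k, v)        # key/value summary, [dim, dim]
    return norm(torch.einsum("sd,de->se", q, kv))

def block_diagonal_attention(q, k, v, block=64):
    """Softmax attention restricted to non-overlapping local blocks."""
    seq_len, dim = q.shape
    out = torch.empty_like(v)
    for s in range(0, seq_len, block):
        qb, kb, vb = q[s:s+block], k[s:s+block], v[s:s+block]
        out[s:s+block] = torch.softmax(qb @ kb.T / dim ** 0.5, dim=-1) @ vb
    return out

q = k = v = torch.randn(256, 64)
norm = torch.nn.LayerNorm(64)
print(norm_linear_attention(q, k, v, norm).shape,
      block_diagonal_attention(q, k, v).shape)
```
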