Papers
LLM Inference Acceleration
PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization
Published: 10/26/2025
Tags: DRAM-PIM Deep Learning Acceleration · LLM Inference Acceleration · Mixed-Precision Quantization Algorithm · High Bandwidth Memory Optimization · PIM-Based Computation Scheduling
PLAIN is a novel software/hardware co-design framework for accelerating large language model inference through mixed-precision quantization. It optimizes parameter quantization and leverages PIM characteristics, achieving up to 5.03x and 1.69x performance improvements with negligible accuracy loss.
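The abstract does not spell out PLAIN's quantization algorithm, but the core idea of mixed-precision quantization can be illustrated with a minimal sketch: quantize each layer's weights at a bit-width chosen by a simple sensitivity score. The function names, the 4/8-bit budgets, and the error-based allocation rule below are illustrative assumptions, not PLAIN's published method.

```python
# Hypothetical sketch of mixed-precision weight quantization: each layer is
# uniformly quantized at a bit-width chosen by its quantization error.
# Not PLAIN's actual algorithm; bit budgets and threshold are assumptions.
import numpy as np

def quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization of w to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def assign_bits(layers: dict[str, np.ndarray], budgets=(4, 8)) -> dict[str, int]:
    """Give the low bit-width to layers whose low-bit quantization error is small."""
    lo, hi = budgets
    err = {name: np.mean((w - quantize(w, lo)) ** 2) for name, w in layers.items()}
    threshold = np.median(list(err.values()))
    return {name: (lo if err[name] <= threshold else hi) for name in layers}

# Toy usage: half the layers end up at 4 bits, half at 8 bits.
layers = {f"layer{i}": np.random.randn(256, 256) for i in range(4)}
bits = assign_bits(layers)
quantized = {name: quantize(w, bits[name]) for name, w in layers.items()}
```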
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Published: 6/17/2024
Tags: Adaptive Structured Sparse Attention · LLM Inference Acceleration · Long-Context Modeling · Near-Lossless Sparse Attention
SampleAttention is introduced as an adaptive, near-lossless sparse attention method for long-context LLMs, significantly reducing Time-to-First-Token (TTFT) latency while maintaining accuracy and achieving up to a 2.42x TTFT reduction compared to FlashAttention.
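Loosely in the spirit of this abstract, sampling-based sparse attention can be sketched as: score keys with a small sample of query rows, keep only the highest-scoring key/value positions, then run full attention over that subset. The sampling ratio, top-k selection, and function name below are illustrative assumptions, not SampleAttention's published algorithm.

```python
# Minimal sketch of sampling-based sparse attention (illustrative only):
# a random sample of query rows estimates which keys matter, and full
# attention is then restricted to those positions.
import numpy as np

def sampled_sparse_attention(q, k, v, sample_ratio=0.05, keep_ratio=0.2):
    n, d = q.shape
    # 1. Score all keys using a small random sample of query rows.
    idx = np.random.choice(n, max(1, int(n * sample_ratio)), replace=False)
    sample_scores = q[idx] @ k.T / np.sqrt(d)      # (samples, n)
    importance = sample_scores.max(axis=0)         # per-key importance
    # 2. Keep only the top-scoring key/value positions.
    keep = np.argsort(importance)[-max(1, int(n * keep_ratio)):]
    # 3. Softmax attention restricted to the kept positions.
    scores = q @ k[keep].T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v[keep]

# Toy usage: attends to ~20% of the 1024 positions per query.
n, d = 1024, 64
q, k, v = (np.random.randn(n, d) for _ in range(3))
out = sampled_sparse_attention(q, k, v)            # shape (1024, 64)
```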