AiPaper

Papers

Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
Published:8/22/2025
Post Neural Architecture Search · Hybrid-Architecture Language Models · Efficient Generation Inference · Linear Attention Mechanism · Hardware-Aware Hyperparameter Search
Jet-Nemotron uses PostNAS to freeze pretrained MLP weights and optimize attention blocks, creating efficient hybrid-architecture language models that match or surpass the accuracy of leading models while boosting generation throughput by up to 53.6×.
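A minimal sketch of the freeze-then-search idea described above, assuming an HF-style checkpoint whose MLP parameter names contain "mlp"; the actual PostNAS pipeline also searches attention block types and hardware-aware hyperparameters.

```python
from transformers import AutoModelForCausalLM

# Illustrative sketch: keep pretrained MLP weights frozen, leave attention (and
# other) parameters trainable so candidate attention blocks can be optimized.
model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in checkpoint, not the paper's model

for name, param in model.named_parameters():
    # Assumption: MLP sub-modules have "mlp" in their parameter names.
    param.requires_grad = "mlp" not in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable}/{total}")
```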
Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning
Published:6/9/2025
Causal Attention Mechanism · Token Pruning · Gradient-Guided Knowledge Distillation · LLM Reasoning Capacity Enhancement · Long-Context Modeling
LeaF uses gradient-guided token pruning to remove confounding tokens, aligning student attention with the causal focus of teachers, improving reasoning accuracy and interpretability across multiple benchmarks.
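A hedged sketch of gradient-guided token scoring, assuming an HF-style teacher model that accepts inputs_embeds and labels; the paper's criterion for identifying confounding tokens may differ from the simple keep-top-k policy shown here.

```python
import torch

def token_saliency(model, input_embeds, labels):
    """Score each token by the gradient norm of the loss w.r.t. its embedding."""
    input_embeds = input_embeds.clone().detach().requires_grad_(True)
    loss = model(inputs_embeds=input_embeds, labels=labels).loss
    loss.backward()
    return input_embeds.grad.norm(dim=-1)  # shape: (batch, seq_len)

def prune_tokens(input_ids, saliency, keep_ratio=0.8):
    """Keep the top-k most salient tokens per sequence (illustrative policy)."""
    k = max(1, int(saliency.size(1) * keep_ratio))
    keep = saliency.topk(k, dim=1).indices.sort(dim=1).values  # preserve original order
    return torch.gather(input_ids, 1, keep)
```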
TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State
Published:7/12/2024
Distributed Trace Sampling · System Runtime State Awareness · Online Adaptive Sampling Framework · Root Cause Analysis Optimization · Microservice Monitoring Data Processing
TraStrainer adaptively samples distributed traces by integrating system runtime state and trace diversity, improving sampling quality and root cause analysis accuracy by over 30% while reducing overhead.
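A toy sketch of the adaptive sampling decision, combining a runtime-state score with a trace-diversity score; the weighting scheme and score definitions here are assumptions, not the paper's formulas.

```python
import random

def sample_decision(trace, system_score, diversity_score, alpha=0.5, budget_bias=1.0):
    """
    Illustrative adaptive sampling: combine a runtime-state score (how loaded or
    anomalous the system currently looks) with a diversity score (how uncommon
    this trace's structure is) into a keep probability.
    """
    p_keep = budget_bias * (alpha * system_score + (1 - alpha) * diversity_score)
    return random.random() < min(1.0, p_keep)

# Usage sketch: a structurally rare trace observed under high CPU pressure is almost always kept.
print(sample_decision(trace={"spans": 12}, system_score=0.9, diversity_score=0.8))
```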
Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems
Published:9/1/2021
Sampling of Distributed Microservice Traces · Attention-Based Sampling Methods · End-to-End Trace Data Analysis · Anomaly Pattern Detection · Robust Random Cut Forest
Sieve uses attention and random cut forests to sample structurally and temporally uncommon traces, improving information retention and reducing storage costs in large-scale microservice systems.
Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis
Published:2/6/2025
Distributed Trace Data Compression · Commonality and Variability Analysis · Cost-Efficient Tracing Framework · All Requests Capture · Trace Data Storage Optimization
Mint introduces a commonality-variability approach for cost-efficient tracing of all requests, significantly reducing storage and network overhead while preserving rich trace information.
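An illustrative sketch of the commonality/variability split, assuming the span call path is the shared template and latencies are the per-request variables; Mint's actual parsing and encoding are more sophisticated.

```python
from collections import defaultdict

# Commonality: the call path is stored once as a template.
# Variability: per-request latencies and IDs are kept as compact records.
template_store = {}                 # template_id -> call path
variable_store = defaultdict(list)  # template_id -> per-request fields

def ingest(trace):
    template = tuple(span["operation"] for span in trace["spans"])
    tid = hash(template)
    template_store.setdefault(tid, template)
    variable_store[tid].append({
        "trace_id": trace["trace_id"],
        "latencies": [span["latency_ms"] for span in trace["spans"]],
    })

ingest({"trace_id": "t1", "spans": [
    {"operation": "gateway", "latency_ms": 3},
    {"operation": "orders/create", "latency_ms": 17},
]})
print(len(template_store), sum(len(v) for v in variable_store.values()))
```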
PEARL: Towards Permutation-Resilient LLMs
Published:2/20/2025
RL Training for Large Language Models · LLM Reasoning Capacity Enhancement · Sequence Policy Optimization · Training-Free Acceleration Methods
PEARL uses distributionally robust optimization and a permutation-proposal network to enhance LLMs' resilience against worst-case input orderings, effectively mitigating permutation attacks and boosting performance across varied contexts.
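A simplified sketch of the distributionally robust training step, with random permutations standing in for the paper's learned permutation-proposal network; model and tokenizer are assumed HF-style objects.

```python
import random
import torch

def worst_case_permutation_loss(model, tokenizer, demos, query, answer, n_perm=4):
    """
    DRO sketch: among n_perm random orderings of the in-context demonstrations,
    return the loss of the hardest one so the update targets the worst case.
    """
    losses = []
    for _ in range(n_perm):
        order = random.sample(demos, len(demos))
        prompt = "\n".join(order) + "\n" + query + " " + answer
        batch = tokenizer(prompt, return_tensors="pt")
        out = model(**batch, labels=batch["input_ids"])
        losses.append(out.loss)
    return torch.stack(losses).max()  # train on the worst-case ordering
```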
EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
Published:5/23/2024
LLM Reasoning Capacity Enhancement · Training-Free Acceleration Methods · Collaborative Edge Computing Inference · Model Sharding Deployment · Dynamic Programming Optimization
EdgeShard uses collaborative edge computing to shard LLMs across distributed devices, optimizing latency and throughput via dynamic programming, reducing inference delay by 50% and doubling throughput while addressing cloud dependency challenges.
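A small dynamic-programming sketch of the layer-partition decision, with made-up per-layer latencies and transfer costs; EdgeShard's real formulation also models device memory limits and pipelined execution.

```python
from functools import lru_cache

# Illustrative layer-partition DP over a chain of two edge devices.
layer_cost = [
    [5, 5, 5, 5, 5, 5],   # per-layer latency (ms) on device 0 (assumed profile)
    [2, 2, 2, 2, 2, 2],   # per-layer latency (ms) on device 1 (assumed profile)
]
transfer_ms = 8            # activation transfer cost between devices (assumed)
n_layers, n_devices = 6, 2

@lru_cache(maxsize=None)
def best(layer, device):
    """Minimum latency to execute layers[layer:] when the previous layer ran on `device`."""
    if layer == n_layers:
        return 0.0
    return min(
        (0 if d == device else transfer_ms) + layer_cost[d][layer] + best(layer + 1, d)
        for d in range(n_devices)
    )

print(min(best(0, d) for d in range(n_devices)))  # end-to-end latency of the best sharding
```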
Learning to Reason without External Rewards
Published:5/26/2025
RL Training for Large Language Models · Sequence Policy Optimization · Training-Free Acceleration Methods · Reinforcement Learning for Math Reasoning
Intuitor leverages a model's self-certainty as an intrinsic reward in reinforcement learning, enabling unsupervised training of LLMs for complex reasoning with strong cross-domain generalization and no reliance on external labels or rewards.
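A hedged sketch of a self-certainty signal computed from the model's own output distributions, here the average KL divergence from uniform over generated tokens; the paper's exact reward definition and normalization may differ.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """
    Intrinsic reward sketch: mean KL(p || uniform) over generated positions.
    logits: (seq_len, vocab_size) for the model's own generated tokens.
    Higher values mean the model is more confident in its own predictions.
    """
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    vocab = logits.size(-1)
    kl_from_uniform = (p * (log_p + math.log(vocab))).sum(-1)
    return kl_from_uniform.mean()

# This scalar can then serve as the only reward in a policy-gradient update,
# with no external labels or learned reward model.
print(self_certainty(torch.randn(10, 32000)))
```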
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
Published:9/4/2025
Study of Drivelology Phenomenon · Multilingual Drivelology Dataset · Pragmatic Understanding Limitations in LLMs · Implicit Semantic Reasoning Tasks · LLM Generation and Classification Evaluation
This study introduces Drivelology, a phenomenon of syntactically coherent yet pragmatically deep nonsense, and evaluates LLMs on a curated multilingual dataset, revealing their limitations in contextual, moral, and emotional understanding.
MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Static Quantization
Published:10/25/2025
Multimodal Large Language Model · Static Quantization · Post-Training Quantization Framework · Modality-Specific Quantization · LLM Reasoning Capacity Enhancement
MQuant introduces a post-training static quantization framework for multimodal large language models, reducing latency and outliers via modality-specific quantization, flexible attention switching, and rotation suppression, boosting inference efficiency across major models.
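A minimal sketch of post-training static quantization with modality-specific scales: the scale is fixed offline from calibration data rather than recomputed per token at inference. The calibration tensors here are random stand-ins, not MQuant's actual procedure.

```python
import torch

def calibrate_scale(calib_activations: torch.Tensor, n_bits: int = 8) -> float:
    """Static (offline) scale from calibration data: symmetric per-tensor quantization."""
    max_abs = calib_activations.abs().max().item()
    return max_abs / (2 ** (n_bits - 1) - 1)

def quantize_static(x: torch.Tensor, scale: float, n_bits: int = 8) -> torch.Tensor:
    """Fake-quantize with a fixed scale (no per-token dynamic range computation)."""
    qmax = 2 ** (n_bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

# Modality-specific sketch: separate static scales for text and vision activations.
text_scale = calibrate_scale(torch.randn(1024, 4096))
vision_scale = calibrate_scale(torch.randn(576, 4096) * 3)  # vision tokens often span a wider range
x_text = quantize_static(torch.randn(16, 4096), text_scale)
```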
AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing
Published:9/16/2024
Academic Literature Structured Text Parsing · Multimodal Model Fine-Tuning · Formulas and Algorithms Parsing Dataset · Structured Text Parsing Dataset
AceParse provides a diverse dataset for parsing academic structured texts. The fine-tuned multimodal AceParser surpasses state-of-the-art models, enhancing parsing accuracy for formulas, tables, lists, and algorithms.
SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
Published:10/12/2025
MoE Inference Acceleration · Speculative Decoding · Expert Prefetching Strategy · Compute-Communication Pipelining · Multi-Token Parallel Verification
SP-MoE introduces an SD-aware expert offloading framework using speculative expert prefetching and a cutoff-layer policy, pipelining computation and communication to reduce memory and bandwidth bottlenecks, achieving up to 3.5× inference speedup on MoE models.
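A toy sketch of speculative expert prefetching: router logits for the draft tokens predict which experts an upcoming layer will need, so their weights can be staged on the accelerator early. The expert store, cache, and routing here are simulated placeholders, not SP-MoE's pipeline or cutoff-layer policy.

```python
import torch

def predict_experts(router_logits: torch.Tensor, top_k: int = 2) -> set:
    """Experts the draft tokens are likely to route to (union of per-token top-k)."""
    top = router_logits.topk(top_k, dim=-1).indices   # (n_draft_tokens, top_k)
    return set(top.flatten().tolist())

def prefetch(expert_ids: set, cpu_experts: dict, gpu_cache: dict):
    """Copy predicted experts' weights to the accelerator before they are needed."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for eid in expert_ids:
        if eid not in gpu_cache:
            gpu_cache[eid] = {k: v.to(device, non_blocking=True)
                              for k, v in cpu_experts[eid].items()}

# Toy usage: 4 draft tokens routed over 8 experts in the next layer.
cpu_experts = {e: {"w1": torch.randn(64, 64)} for e in range(8)}
gpu_cache = {}
router_logits = torch.randn(4, 8)
prefetch(predict_experts(router_logits), cpu_experts, gpu_cache)
print(sorted(gpu_cache))
```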
Contrastive Incomplete Cross-Modal Hashing
Published:6/14/2024
Cross-Modal Hashing · Incomplete Cross-Modal Data Handling · Semantic Similarity Coordination Module · Semantic-Aware Contrastive Hashing · Contextual Correspondence Alignment
CICH addresses incomplete cross-modal data by coordinating semantic similarities and aligning contextual correspondences, using prototypical semantic similarity and contrastive hashing modules to generate discriminative hash codes for robust retrieval.
ORCA: A Distributed Serving System for Transformer-Based Generative Models
Transformer-Based Generative Model Serving · Iteration-Level Scheduling · Selective Batching · Distributed Inference System · Large-Scale Model Inference Acceleration
ORCA, a distributed inference system, improves Transformer generative model serving through iteration-level scheduling and selective batching, boosting flexibility and throughput. It achieves 36.9× higher throughput than NVIDIA FasterTransformer on GPT-3 175B.
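A self-contained sketch of iteration-level scheduling, with a dummy engine standing in for the model runner; selective batching (splitting attention from feed-forward so mixed-length requests can share a batch) is not shown.

```python
from collections import deque

class DummyEngine:
    """Stand-in for the model runner: each step() decodes one token per request."""
    def step(self, batch):
        finished = []
        for req in batch:
            req["remaining"] -= 1
            if req["remaining"] == 0:
                finished.append(req)
        return finished

def serve(engine, waiting: deque, max_batch: int = 4):
    """Iteration-level scheduling: the batch is re-formed after every decoding
    iteration, so finished requests exit immediately and queued ones join at once."""
    running = []
    while running or waiting:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        done = engine.step(running)           # one decoding iteration for the batch
        running = [r for r in running if r not in done]
        for req in done:
            yield req["id"]

queue = deque({"id": i, "remaining": 2 + i % 3} for i in range(6))
print(list(serve(DummyEngine(), queue)))     # requests complete as soon as they finish
```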
Quality-Guided Vision-Language Learning for Long-Term Action Quality Assessment
Published:1/1/2025
Action Quality Assessment · Vision-Language Learning · Fine-Grained Quality Scoring · Progressive Semantic Learning Strategy · Quality-Related Textual Prompts for Scoring
The study introduces a quality-guided vision-language learning method that uses textual prompts and a progressive semantic module to map visual features to fine-grained scores, achieving state-of-the-art results across diverse long-term action quality datasets without extra annotations.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Published:2/6/2024
RL Training for Large Language Models · Math Reasoning Benchmarks · Group Relative Policy Optimization · Large Language Model Fine-Tuning · Public Data-Driven Pretraining
DeepSeekMath 7B continues pretraining on 120B math tokens plus code and natural language, introducing Group Relative Policy Optimization (GRPO) to enhance reasoning while improving memory efficiency, achieving 51.7% on the MATH benchmark, approaching GPT-4 performance.
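A minimal sketch of the group-relative advantage and clipped surrogate at the core of GRPO: sample a group of responses per prompt, normalize each reward by the group mean and standard deviation, and use that as the advantage with no value network. The KL penalty toward a reference policy and token-level bookkeeping are omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group_size,) rewards of responses sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """Clipped surrogate over per-response log-probabilities (KL term omitted)."""
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy usage: a group of 4 sampled solutions, only one of which was correct.
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 0.0]))
loss = grpo_loss(torch.randn(4), torch.randn(4), adv)
print(adv, loss.item())
```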
Large Language Model Offloading using Active Inference in 6G Symbiotic IoT
Published:1/1/2025
Large Language Model Offloading · Active Inference Methods · 6G Edge Computing Resource Scheduling · Cloud-Edge Collaborative Computing · Symbiotic Internet of Things
This paper presents an active inference-based offloading method for large language models in 6G symbiotic IoT, optimizing resource scheduling and computation through cloud-edge collaboration for enhanced system efficiency and intelligent inference services.
Learning with Semantics: Towards a Semantics-Aware Routing Anomaly Detection System
Published:1/1/2024
BGP Routing Anomaly Detection · Network Representation Learning · Autonomous System Role Learning · Semantics-Aware Network Security · Large-Scale Routing Data Analysis
This paper introduces BEAM, a semantics-aware model that learns AS routing roles from BGP data to detect anomalies as unexpected role changes, validated on 18 real datasets with improved accuracy, interpretability, and reduced training overhead.
Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
Published:2/2/2024
Harmful Fine-Tuning Mitigation · Large Language Model Fine-Tuning · LLM Security Mechanism · Robustness of Embedding Representations
This work identifies harmful embedding drift from fine-tuning attacks on LLMs and proposes Vaccine, a perturbation-aware alignment method that generates robust embeddings to resist harmful perturbations, enhancing model safety without compromising benign reasoning.
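A hedged sketch of perturbation-aware alignment, simplified to perturb only the input embeddings (the paper perturbs hidden states layer-wise); it assumes an HF-style causal LM exposing get_input_embeddings() and accepting inputs_embeds, and the epsilon is an arbitrary choice.

```python
import torch

def perturbation_aware_step(model, input_ids, labels, optimizer, eps: float = 0.1):
    """
    Two-pass sketch: (1) find a bounded worst-case perturbation of the input
    embeddings via the gradient of the alignment loss, (2) update the model on
    the perturbed embeddings so alignment survives later embedding drift.
    """
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    loss = model(inputs_embeds=embeds, labels=labels).loss
    grad = torch.autograd.grad(loss, embeds)[0]
    delta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)  # adversarial direction

    optimizer.zero_grad()
    robust_loss = model(inputs_embeds=embeds.detach() + delta.detach(), labels=labels).loss
    robust_loss.backward()   # gradients flow to the transformer weights, not the perturbation
    optimizer.step()
    return robust_loss.item()
```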
Distributed Learning and Inference Systems: A Networking Perspective
Published:1/9/2025
Distributed Machine Learning Systems · Distributed Inference Framework · Privacy-Preserving Distributed Training · Data and Dynamics-Aware Networks · Networking Perspective on AI Systems
This work introduces DAITN, a novel framework addressing centralized ML limitations by enabling efficient, privacy-aware distributed learning and inference, highlighting its components, functions, and key challenges in managing complex decentralized AI systems.