Page 2 - Paper Library - AiPaper

This paper introduces SAGE3D, addressing the limitations of 3D Gaussian Splatting in embodied navigation with object-centric semantic grounding and physics-aware execution. The released InteriorGS dataset contains 1K annotated indoor scenes, and SAGEBench is the first VLN bench

SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation

Published: 1/1/2025

LLM Reasoning Capacity Enhancement, Training-Free Acceleration Methods, Training-Independent Inference Optimization, Utilization of Internal and External Speculation

SPECTRA is a novel framework that accelerates large language model inference through optimized internal and external speculation, requiring no additional training. It achieves up to 4.08x speedup over state-of-the-art methods across various benchmarks, with its implementation pub

REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation

Published: 12/23/2025

Vision-Language-Action Model, Robotic Manipulation Benchmark, Robotic Generalization Evaluation, High-Fidelity Simulation Environment, Task Perturbation Factors

REALM introduces a high-fidelity simulation environment for evaluating the generalization of Vision-Language-Action models, featuring 15 perturbation factors and over 3,500 objects. Validation shows strong alignment between simulated and real-world performance, highlighting ongoi

EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments

Published: 3/12/2025

Robotic Mobile Manipulation Benchmark, Natural Language Instruction Understanding, Long-Horizon Task Execution, LLM Integration with Robotic Systems, Unified Operation Evaluation Framework

This paper introduces EMMOE, a benchmark that integrates high-level and low-level tasks for autonomous home robots. It presents the EMMOE100 dataset and the HoMiEBoT agent system, enhancing natural language understanding and execution of complex tasks.

Does anxiety increase policy learning?

Published: 2/23/2024

Emotional Impact on Policy Learning, Anxiety and Decision-Making Process, Experimental Study on Policy Learning, Public Policy Information Processing, Marcus' Affective Intelligence Model

This study investigates the impact of anxiety on policy learning among local public officials in Switzerland, using Marcus' Affective Intelligence Model. It finds that anxiety positively influences learning, unaffected by prior beliefs or policy complexity, highlighting the signi

Mitty: Diffusion-based Human-to-Robot Video Generation

Published: 12/19/2025

Human-to-Robot Video Generation, Diffusion Transformer, Unlabeled Learning, Pre-trained Video Generation Models, Human-Robot Video Synthesis

The paper presents Mitty, a diffusion-based framework for end-to-end human-to-robot video generation that learns directly from human demonstrations, overcoming information loss and errors from intermediate representations. Leveraging a pre-trained diffusion model, it generates hig


Mean Aggregator is More Robust than Robust Aggregators under Label Poisoning Attacks on Distributed Heterogeneous Data

Published: 4/21/2024

Robustness of Aggregators under Label Poisoning Attacks, Defense Against Malicious Attacks in Distributed Learning, Learning Error Optimization in Heterogeneous Data Environments, Theoretical Analysis of Mean Aggregator, Comparative Study of Robust Aggregators

This study reveals that the mean aggregator outperforms robust aggregators under label poisoning attacks in distributed heterogeneous data, proving order-optimal learning error when data is sufficiently heterogeneous, supported by experimental validation.

DeepSeek-V3 Technical Report

Published: 12/27/2024

DeepSeek-V3 Large Language Model, Mixture-of-Experts Language Model, Multi-head Latent Attention, Auxiliary-Loss-Free Load Balancing Strategy, Supervised Fine-Tuning and Reinforcement Learning

DeepSeek-V3 is a 671B parameter Mixture-of-Experts model leveraging Multi-head Latent Attention and an innovative auxiliary-loss-free load-balancing strategy for efficient inference and cost-effective training. Its pre-training on 14.8 trillion tokens, combined with fine-tuning a

Vision-Language Models for Vision Tasks: A Survey

Published: 4/3/2023

Vision-Language Models, Autoregressive Inference for Vision Tasks, Unsupervised Visual Recognition Methods, Large-Scale Image-Text Pair Datasets, VLM Pre-training and Evaluation

This survey reviews Vision-Language Models (VLMs) for visual tasks, addressing challenges of crowd-label dependency in DNNs and training inefficiency. It analyzes network architectures, pre-training objectives, and existing methods, offering insights for future research.

Vision Foundation Models in Remote Sensing: A Survey

Published: 8/7/2024

Foundation Models in Remote Sensing, Self-Supervised Learning Techniques, Contrastive Learning, Architectures and Pre-Training Datasets of Foundation Models, AI Transformation in Remote Sensing Technology

This paper surveys vision foundation models in remote sensing, categorizing them by architectures, pre-training datasets, and methodologies. It highlights significant advancements and emerging trends while discussing challenges like data quality and computational resources, findi

Process Reinforcement through Implicit Rewards

Published: 2/3/2025

Process Reinforcement with Implicit Rewards, Online Training for Large Language Models, Math Reasoning Benchmarks, Process Reward Models, Multi-Step Reasoning Tasks

The paper introduces PRIME, which enhances reinforcement learning for large language models by using implicit rewards for online process reward model updates. PRIME significantly improves performance by efficiently solving issues related to label collection costs and reward hacki

A Survey on Personalized Content Synthesis with Diffusion Models

Published: 5/9/2024

Personalized Content Synthesis, Diffusion Models, Test-Time Fine-Tuning Methods, Pre-Trained Adaptation Methods, Object Personalization

This paper surveys over 150 methods in personalized content synthesis (PCS) using diffusion models, categorizing them into test-time fine-tuning and pre-trained adaptation frameworks, while addressing challenges like overfitting and proposing future research directions.

Qwen3 Technical Report

Published: 5/14/2025

Large Language Model Series, Mixture-of-Expert Architecture, Dynamic Model Switching, Thinking Budget Mechanism, Expanded Multilingual Support

Qwen3 introduces a unified framework integrating thinking and non-thinking modes for dynamic switching, enhancing performance and multilingual support. It also features a thinking budget mechanism for adaptive resource allocation, expanding language capabilities from 29 to 119 la

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Published: 6/29/2025

Urban Intelligence Multi-Modal Large Language Model, Urban Instruction Dataset, Spatial Reasoning Enhancement, Multi-Stage Training Framework, Urban Task Performance Evaluation

UrbanLLaVA is a multi-modal language model designed for urban intelligence, processing four data types to enhance urban task performance. It leverages a diverse instruction dataset and a multi-stage training framework, achieving strong cross-city generalization.

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Published: 6/19/2023

Remote Sensing Vision-Language Model, Self-Supervised Learning and Image Modeling, Multitask Remote Sensing Applications, Remote Sensing Object Counting, Unified Image-Text Data Format

RemoteCLIP is the first vision-language foundation model for remote sensing, overcoming limitations of existing models by integrating heterogeneous annotations into a unified image-text format, resulting in a 12x larger pre-training dataset that enhances zero-shot and multitask a

A Survey on Remote Sensing Foundation Models: From Vision to Multimodality

Published: 3/28/2025

Remote Sensing Foundation Models, Multimodal Data Fusion, Remote Sensing Task Analysis, Optical and Radar Data, Large-Scale Annotated Datasets

This paper reviews advancements in remote sensing foundation models, highlighting vision and multimodal approaches that integrate diverse data types, enhancing geospatial data analysis. Despite significant improvements in task performance, challenges remain in data diversity, res

End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Published: 12/18/2025

Autoregressive Video Diffusion Models, Self-Resampling Training Method, Long-Horizon Generation Capability, Temporal Causal Mask, Parameter-Free History Retrieval Mechanism

This paper presents a teacher-free self-resampling method for training autoregressive video diffusion models, addressing exposure bias and enabling efficient long-horizon generation with competitive performance and improved temporal consistency.

Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology

Published: 6/6/2025

Multimodal Clinical Decision Support Systems, Application of Vision Transformers in Oncology, Integration of Precision Oncology Tools, Autonomous AI Clinical Agent, Application of GPT-4 in Medical Decision Making

This study developed an autonomous AI agent that integrates GPT-4 with multimodal precision oncology tools. Evaluated on 20 real cases, it achieved 87.5% tool accuracy and 91.0% correct conclusions, significantly boosting decision accuracy to 87.2%, laying the groundwork for pers

Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

Published: 12/21/2025

Real-time Video Generation, Video Generation Framework, Historical Memory Retention, Memory Compression and Generation, Autoregressive Modeling

The MAG framework decouples memory compression from frame generation to enhance long-term consistency in real-time video generation. It utilizes a dedicated memory model for compressing historical data and a generator model for frame synthesis, achieving improved scene consistenc

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

Published: 10/1/2025

Humanoid Robot Loco-Manipulation Data Generation, Interaction-Preserving Data Generation Engine, Dynamic Motion Retargeting, Long-Horizon Task Execution for Robots, Motion Capture Datasets

The paper presents OmniRetarget, an engine addressing the embodiment gap in humanoid robots by preserving key interactive relationships. It generates high-quality trajectories for reinforcement learning through an interaction mesh, enabling complex task execution for durations up
