Papers
Towards Physically Executable 3D Gaussian for Embodied Navigation
Published: 10/24/2025
3D Gaussian Representation, Visual-Language Navigation, Physically Executable Environment, Semantically Aligned 3D Navigation, Indoor Scene Dataset
This paper introduces SAGE3D, addressing the limitations of 3D Gaussian Splatting in embodied navigation with object-centric semantic grounding and physics-aware execution. The released InteriorGS dataset contains 1K annotated indoor scenes, and SAGEBench is the first VLN bench
SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation
Published: 1/1/2025
LLM Reasoning Capacity Enhancement, Training-Free Acceleration Methods, Training-Independent Inference Optimization, Utilization of Internal and External Speculation
SPECTRA is a novel framework that accelerates large language model inference through optimized internal and external speculation, requiring no additional training. It achieves up to 4.08x speedup over state-of-the-art methods across various benchmarks, with its implementation pub
REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation
Published: 12/23/2025
Vision-Language-Action Model, Robotic Manipulation Benchmark, Robotic Generalization Evaluation, High-Fidelity Simulation Environment, Task Perturbation Factors
REALM introduces a high-fidelity simulation environment for evaluating the generalization of Vision-Language-Action models, featuring 15 perturbation factors and over 3,500 objects. Validation shows strong alignment between simulated and real-world performance, highlighting ongoi
EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments
Published: 3/12/2025
Robotic Mobile Manipulation Benchmark, Natural Language Instruction Understanding, Long-Horizon Task Execution, LLM Integration with Robotic Systems, Unified Operation Evaluation Framework
This paper introduces EMMOE, a benchmark that integrates high-level and low-level tasks for autonomous home robots. It presents the EMMOE100 dataset and the HoMiEBoT agent system, enhancing natural language understanding and execution of complex tasks.
Does anxiety increase policy learning?
Published: 2/23/2024
Emotional Impact on Policy Learning, Anxiety and Decision-Making Process, Experimental Study on Policy Learning, Public Policy Information Processing, Marcus' Affective Intelligence Model
This study investigates the impact of anxiety on policy learning among local public officials in Switzerland, using Marcus' Affective Intelligence Model. It finds that anxiety positively influences learning, unaffected by prior beliefs or policy complexity, highlighting the signi
Mitty: Diffusion-based Human-to-Robot Video Generation
Published: 12/19/2025
Human-to-Robot Video Generation, Diffusion Transformer, Unlabeled Learning, Pre-trained Video Generation Models, Human-Robot Video Synthesis
The paper presents Mitty, a diffusion-based framework for end-to-end human-to-robot video generation that learns directly from human demonstrations, overcoming information loss and errors from intermediate representations. Leveraging a pretrained diffusion model, it generates hig
Mean Aggregator is More Robust than Robust Aggregators under Label Poisoning Attacks on Distributed Heterogeneous Data
Published: 4/21/2024
Robustness of Aggregators under Label Poisoning Attacks, Defense Against Malicious Attacks in Distributed Learning, Learning Error Optimization in Heterogeneous Data Environments, Theoretical Analysis of Mean Aggregator, Comparative Study of Robust Aggregators
This study shows that the mean aggregator outperforms robust aggregators under label poisoning attacks on distributed heterogeneous data. It proves that the learning error is order-optimal when the data is sufficiently heterogeneous, and supports the analysis with experimental validation.
DeepSeek-V3 Technical Report
Published: 12/27/2024
DeepSeek-V3 Large Language Model, Mixture-of-Experts Language Model, Multi-head Latent Attention, Auxiliary-Loss-Free Load Balancing Strategy, Supervised Fine-Tuning and Reinforcement Learning
DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model leveraging Multi-head Latent Attention and an innovative auxiliary-loss-free load-balancing strategy for efficient inference and cost-effective training. Its pre-training on 14.8 trillion tokens, combined with fine-tuning a
Vision-Language Models for Vision Tasks: A Survey
Published: 4/3/2023
Vision-Language Models, Autoregressive Inference for Vision Tasks, Unsupervised Visual Recognition Methods, Large-Scale Image-Text Pair Datasets, VLM Pre-training and Evaluation
This survey reviews Vision-Language Models (VLMs) for visual tasks, addressing challenges of crowd-label dependency in DNNs and training inefficiency. It analyzes network architectures, pre-training objectives, and existing methods, offering insights for future research.
Vision Foundation Models in Remote Sensing: A Survey
Published: 8/7/2024
Foundation Models in Remote Sensing, Self-Supervised Learning Techniques, Contrastive Learning, Architectures and Pre-Training Datasets of Foundation Models, AI Transformation in Remote Sensing Technology
This paper surveys vision foundation models in remote sensing, categorizing them by architectures, pre-training datasets, and methodologies. It highlights significant advancements and emerging trends while discussing challenges like data quality and computational resources, findi
Process Reinforcement through Implicit Rewards
Published: 2/3/2025
Process Reinforcement with Implicit Rewards, Online Training for Large Language Models, Math Reasoning Benchmarks, Process Reward Models, Multi-Step Reasoning Tasks
The paper introduces PRIME, which enhances reinforcement learning for large language models by using implicit rewards for online process reward model updates. PRIME significantly improves performance by efficiently solving issues related to label collection costs and reward hacki
A Survey on Personalized Content Synthesis with Diffusion Models
Published: 5/9/2024
Personalized Content Synthesis, Diffusion Models, Test-Time Fine-Tuning Methods, Pre-Trained Adaptation Methods, Object Personalization
This paper surveys over 150 methods in personalized content synthesis (PCS) using diffusion models, categorizing them into test-time fine-tuning and pre-trained adaptation frameworks, while addressing challenges like overfitting and proposing future research directions.
Qwen3 Technical Report
Published: 5/14/2025
Large Language Model Series, Mixture-of-Expert Architecture, Dynamic Model Switching, Thinking Budget Mechanism, Expanded Multilingual Support
Qwen3 introduces a unified framework integrating thinking and non-thinking modes for dynamic switching, enhancing performance and multilingual support. It also features a thinking budget mechanism for adaptive resource allocation, expanding language capabilities from 29 to 119 la
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Published: 6/29/2025
Urban Intelligence Multi-Modal Large Language Model, Urban Instruction Dataset, Spatial Reasoning Enhancement, Multi-Stage Training Framework, Urban Task Performance Evaluation
UrbanLLaVA is a multimodal language model designed for urban intelligence, processing four data types to enhance urban task performance. It leverages a diverse instruction dataset and a multi-stage training framework, achieving strong cross-city generalization.
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
Published: 6/19/2023
Remote Sensing Vision-Language Model, Self-Supervised Learning and Image Modeling, Multitask Remote Sensing Applications, Remote Sensing Object Counting, Unified Image-Text Data Format
RemoteCLIP is the first vision-language foundation model for remote sensing, overcoming limitations of existing models by integrating heterogeneous annotations into a unified image-text format, resulting in a 12x larger pre-training dataset that enhances zero-shot and multitask a
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Published: 3/28/2025
Remote Sensing Foundation Models, Multimodal Data Fusion, Remote Sensing Task Analysis, Optical and Radar Data, Large-Scale Annotated Datasets
This paper reviews advancements in remote sensing foundation models, highlighting vision and multimodal approaches that integrate diverse data types, enhancing geospatial data analysis. Despite significant improvements in task performance, challenges remain in data diversity, res
End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
Published: 12/18/2025
Autoregressive Video Diffusion Models, Self-Resampling Training Method, Long-Horizon Generation Capability, Temporal Causal Mask, Parameter-Free History Retrieval Mechanism
This paper presents a teacher-free self-resampling method for training autoregressive video diffusion models, addressing exposure bias and enabling efficient long-horizon generation with competitive performance and improved temporal consistency.
Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology
Published: 6/6/2025
Multimodal Clinical Decision Support Systems, Application of Vision Transformers in Oncology, Integration of Precision Oncology Tools, Autonomous AI Clinical Agent, Application of GPT-4 in Medical Decision Making
This study developed an autonomous AI agent that integrates GPT-4 with multimodal precision oncology tools. Evaluated on 20 real cases, it achieved 87.5% tool accuracy and 91.0% correct conclusions, significantly boosting decision accuracy to 87.2%, laying the groundwork for pers
Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
Published: 12/21/2025
Real-time Video Generation, Video Generation Framework, Historical Memory Retention, Memory Compression and Generation, Autoregressive Modeling
The MAG framework decouples memory compression from frame generation to enhance long-term consistency in real-time video generation. It utilizes a dedicated memory model for compressing historical data and a generator model for frame synthesis, achieving improved scene consistenc
OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction
Published: 10/1/2025
Humanoid Robot Loco-Manipulation Data Generation, Interaction-Preserving Data Generation Engine, Dynamic Motion Retargeting, Long-Horizon Task Execution for Robots, Motion Capture Datasets
The paper presents OmniRetarget, an engine addressing the embodiment gap in humanoid robots by preserving key interactive relationships. It generates high-quality trajectories for reinforcement learning through an interaction mesh, enabling complex task execution for durations up
…