Page 11 - Paper Library - AiPaper

The paper presents Riemannian Flow Matching Policies (RFMP), a model for learning robot visuomotor strategies that excels in efficient training and inference. RFMP effectively manages high-dimensional, multimodal distributions and incorporates geometric awareness, outperforming e

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

Published: 6/11/2025

Autoregressive Adversarial Post-Training, Real-time Video Generation, video diffusion models, Interactive Video Generation, Long Video Generation

The paper introduces Autoregressive Adversarial Post-Training (AAPT) to transform a pretrained latent video diffusion model into an efficient real-time interactive video generator. It generates one latent frame per evaluation, streams in real time, and responds to user interacti

Fast and Robust Visuomotor Riemannian Flow Matching Policy

Published: 12/14/2024

Riemannian Flow Matching Policy, Visuomotor Policies, Stable Riemannian Flow Matching Policy, Robotic Task Learning, Geometric Constraints

The paper introduces the Riemannian Flow Matching Policy (RFMP) for visuomotor tasks, offering fast inference and easy training. It incorporates geometric constraints for robustness and outperforms traditional diffusion policies in real and simulated tasks.

GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction

Published: 11/7/2025

Upper-body Compliance Learning for Humanoid Robots, Spring-based Impedance Control, Contact-rich Human-Robot Interaction, Safe Object Manipulation, Whole-body Motion Tracking Policy

GentleHumanoid integrates impedance control into a whole-body motion tracking policy for humanoid robots, achieving upper-body compliance. It employs a spring-based model to adapt to diverse human-robot interactions, reducing contact forces while ensuring successful task executio

Learning Human-Humanoid Coordination for Collaborative Object Carrying

Published: 10/16/2025

Human-Humanoid Collaboration, Proprioceptive Reinforcement Learning, Collaborative Carrying Tasks, Dynamic Object Interaction, Closed-Loop Training Environment

The COLA method enables effective human-humanoid collaboration in complex carrying tasks using proprioception-only reinforcement learning. It predicts object motion and human intent, achieving a 24.7% reduction in human effort while maintaining stability, validated across various

Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning

Published: 11/14/2025

Humanoid Whole-Body Control, Reinforcement Learning Training Pipeline, Action Generation in Dynamic Environments, Badminton Motion Control, Multistage Reinforcement Learning

This paper presents a reinforcement learning training pipeline to develop a unified whole-body controller for humanoid badminton, enabling coordinated footwork and striking without reliance on motion priors or expert demonstrations. The training is validated in both simulated and

IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

Published: 5/22/2025

Instruction-Following Capability in Audio LLMs, IFEval-Audio Benchmark Dataset, Multimodal Model Evaluation, Audio Instruction Generation, Audio-Text Instruction Pairing

The study introduces IFEval-Audio, a novel dataset for assessing instruction-following capabilities in audio-based large language models, comprising 280 audio-instruction-answer triples across six dimensions, and benchmarks state-of-the-art audio LLMs.

AHELM: A Holistic Evaluation of Audio-Language Models

Published: 8/29/2025

Evaluation of Audio-Language Models, AHELM Benchmark, PARADE Dataset, Multimodal Model Performance Assessment, Speech Recognition and Language Model Integration

AHELM is a benchmark introduced to holistically assess Audio-Language Models (ALMs), integrating multiple datasets and introducing PARADE and CoReBench. It covers ten key evaluation aspects and standardizes methods for equitable model comparisons.

AudioBench: A Universal Benchmark for Audio Large Language Models

Published: 4/1/2025

Benchmark for Audio Large Language Models, Evaluation of Audio Understanding Tasks, Speech Understanding and Scene Recognition, Datasets for Speech and Voice Understanding, Evaluation Toolkit for Audio Large Language Models

AudioBench is introduced as a universal benchmark for Audio Large Language Models, covering 8 tasks and 26 datasets, including 7 new ones. It evaluates speech and audio scene understanding, addressing gaps in existing benchmarks for instruction-following capabilities. Five models

Prototype memory and attention mechanisms for few shot image generation

Published: 10/6/2021

Few-Shot Image Generation, Prototype Memory Mechanism, Memory Concept Attention (MoCA), Neural Network Visual Processing, Momentum Online Clustering

This study explores the role of "grandmother cells" in the primary visual cortex in image generation, proposing them as prototype memory priors. These are learned via momentum online clustering and utilized through Memory Concept Attention (MoCA), significantly improving synthesi

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Published: 9/11/2024

Speech Interaction with Large Language Models, LLaMA-Omni Speech Model Architecture, Low-Latency Speech Generation, InstructS2S-200K Dataset, Real-Time Speech Response

LLaMA-Omni is a novel speech interaction model that enables low-latency, high-quality interaction with large language models. Utilizing a unique architecture and the InstructS2S-200K dataset, it generates text and speech responses within 226 ms without requiring transcription.

Deanonymizing Ethereum Validators: The P2P Network Has a Privacy Issue

Published: 9/6/2024

Ethereum Validator Deanonymization, Privacy Issues in Blockchain P2P Networks, Validator Geographic Distribution Analysis, Deanonymization Methodology and Experiments, Security Vulnerabilities in Ethereum Network

This study reveals a significant privacy vulnerability in Ethereum's P2P network, demonstrating its failure to protect validator anonymity. The proposed method enables nodes to identify validators on connected peers, locating over 15% of them through data analysis. The paper disc

Active Visual Perception: Opportunities and Challenges

Published: 12/3/2025

Active Visual Perception, Visual Perception in Complex Environments, Robotic Active Perception, Dynamic Decision-Making and Multimodal Inputs, Real-Time Visual Data Processing

Active visual perception enables systems to dynamically interact with the environment for better data acquisition. This paper reviews its potential and challenges, underscoring its significance in robotics, autonomous vehicles, and surveillance, while highlighting issues like rea

Personalized Generation In Large Model Era: A Survey

Published: 3/4/2025

Personalized Content Generation Research, Personalized Generation in Large Model Era, Evaluation Metrics for Personalized Generation Systems, Multimodal Personalized Generation Techniques, Datasets for Personalized Generation

This survey comprehensively investigates Personalized Generation (PGen) in the era of large models, conceptualizing its key components and objectives. A multilevel taxonomy reviews technical advancements and datasets while envisioning PGen's applications and future challenges, p

Large Language Models for Power System Applications: A Comprehensive Literature Survey

Published: 12/15/2025

Large Language Models in Power Systems, Fault Diagnosis in Power Systems, Load Forecasting, Optimization and Control in Power Systems, Simulation and Planning in Power Systems

This review analyzes the applications of Large Language Models (LLMs) in power systems from 2020 to 2025, covering areas like fault diagnosis and load forecasting. It notes the potential of LLMs while highlighting challenges such as data scarcity and safety. Future research shoul

Utilizing LLMs for Industrial Process Automation: A Case Study on Modifying RAPID Programs

Published: 11/14/2025

LLMs in Industrial Process Automation, RAPID Program Modification, Few-Shot Prompting Method, Domain-Specific Programming Languages, Sensitive Data Protection

This study explores using existing Large Language Models for industrial process automation, specifically modifying RAPID programs. It finds that few-shot prompting can effectively address simple issues without extensive model training, while ensuring the security of sensitive

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

Published: 5/21/2025

Multimodal Chain-of-Thought, Large Vision-Language Models, Forms of Visual Thought Expressions, Image-Text Interleaved Generation, Multimodal Task Performance Enhancement

This paper investigates the mechanisms of Multimodal Chain-of-Thought (MCoT) in Large Vision-Language Models (LVLMs), highlighting how visual thoughts enhance performance and interpretability. Four forms of visual thought expressions are defined, demonstrating their impact on MCo

MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

Published: 12/2/2024

Customized Motion Transfer, Multimodal Large Language Model, video diffusion models, Motion Modeling, Text-to-Video Generation

MoTrans introduces a customized motion transfer method using a multimodal large language model recaptioner and an appearance injection module, effectively transferring specific human-centric motions from reference videos to new contexts, outperforming existing techniques.

Motion Prompting: Controlling Video Generation with Motion Trajectories

Published: 12/4/2024

Motion Trajectory Control in Video Generation, Conditioned Training of Video Generation Models, Motion Prompt Expansion Method, Modeling Dynamic Actions and Temporal Compositions, Interactive Applications of Video Models

This paper introduces motion prompting to control video generation via motion trajectories, addressing limitations of text-based prompts. It demonstrates converting high-level requests into detailed motion prompts, showcasing versatility in motion control and image editing with i

ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions

Published: 12/11/2025

Multi-Shot Video Generation, Shot Transition Design, Camera Control Module, Hierarchical Editing Patterns, ShotWeaver40K Dataset

The paper introduces ShotDirector, an efficient framework combining parameter-level camera control and hierarchical editing-pattern-aware prompting, enhancing shot transition design in multi-shot video generation and improving narrative coherence through fine-grained control.
