Papers
Tag Filter: Multimodal Large Language Model
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Published: 6/3/2025
Unified Visual Understanding and Generation Model · High-Resolution Semantic Encoders · Contrastive Semantic Encoding-Based Generative Framework · Image Understanding and Generation · Multimodal Large Language Model
UniWorld-V1 is a generative framework that integrates high-resolution semantic encoders for unified visual understanding and generation, achieving impressive performance across tasks using only 2.7M training samples.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Published: 7/23/2025
Vision-Language-Action Reasoning · Reinforced Visual Latent Planning · Multimodal Large Language Model · Long-Horizon Planning · Robotic Action Execution
ThinkAct proposes a dual-system framework that connects high-level reasoning and low-level action execution through reinforced visual latent planning, enabling few-shot adaptation, long-horizon planning, and self-correction in complex environments.
A Multi-modal Large Language Model with Graph-of-Thought for Effective Recommendation
Published: 1/1/2025
Multimodal Large Language Model · Graph-of-Thought Prompting Technique · Personalized Recommendation System · Multimodal Recommendation Tasks · User-Item Interaction Graphs
The GollaRec model integrates a multimodal large language model with Graph-of-Thought prompting to enhance user-item interaction modeling for effective recommendation, combining visual and textual data. It uses text-graph alignment and tuning, outperforming 12 existing models on multimodal tasks.
Qwen3-Omni Technical Report
Published: 9/22/2025
Multimodal Large Language Model · Qwen3-Omni Architecture · Audio Task Performance Optimization · General-Purpose Audio Captioning · Multilingual Speech Understanding and Generation
Qwen3-Omni is a single multimodal model achieving state-of-the-art performance across text, image, audio, and video, particularly excelling in audio tasks. It uses a mixture-of-experts architecture, supports multilingual audio understanding and generation, and reduces inference latency.
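As background on the mixture-of-experts design mentioned here, the sketch below shows minimal top-k expert routing; the dimensions, expert count, and k are illustrative assumptions, not Qwen3-Omni's actual configuration:

```python
# Minimal top-k mixture-of-experts layer (illustrative sizes, not Qwen3-Omni's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        gate_logits = self.router(x)                   # (tokens, n_experts)
        top_vals, top_idx = gate_logits.topk(self.k, dim=-1)
        top_weights = F.softmax(top_vals, dim=-1)      # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # weighted sum of expert outputs
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 512)).shape)                 # torch.Size([10, 512])
```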
ADriver-I: A General World Model for Autonomous Driving
Published: 11/23/2023
Driving World Models · Multimodal Large Language Model · Vision-Language-Action Model · Diffusion Models · nuScenes Dataset
ADriver-I integrates vision-action pairs using an MLLM and diffusion models to autoregressively predict control signals and future scenes, enabling iterative autonomous driving and significantly improving performance.
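As a schematic illustration of this interleaved loop, the sketch below replaces the MLLM and the diffusion world model with trivial stubs; only the autoregressive predict-act structure reflects the summary, and all names and dynamics are invented:

```python
# Schematic vision-action loop with stub models; only the autoregressive
# structure (predict action -> predict next frame -> repeat) is meaningful.
import numpy as np

class StubMLLM:
    def predict_action(self, frames):                  # control signal from history
        return float(np.mean(frames[-1]))              # e.g. a steering value

class StubWorldModel:                                  # stands in for the diffusion model
    def predict_frame(self, frames, action):
        return frames[-1] * 0.9 + action               # placeholder dynamics

def drive(mllm, world_model, frames, horizon=4):
    actions = []
    for _ in range(horizon):
        action = mllm.predict_action(frames)
        frames = frames[1:] + [world_model.predict_frame(frames, action)]
        actions.append(action)
    return actions

history = [np.zeros((4, 4)) for _ in range(3)]         # toy frame history
print(drive(StubMLLM(), StubWorldModel(), history))
```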
Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
Published: 3/1/2025
Multimodal Large Language Model · LLM-guided motion planning · Graphical User Interface (GUI) Agents · Query-Oriented Reasoning · Coordinate Grounding and Reasoning Alignment
This work introduces query inference, a pivot task that bridges coordinate grounding and action reasoning in MLLM-powered GUI agents, achieving superior performance with minimal data and enhancing reasoning by integrating semantic information.
A Survey on Generative Recommendation: Data, Model, and Tasks
Published: 10/31/2025
Generative Recommendation Systems · Large Language Model Fine-Tuning · Diffusion Models · Multimodal Large Language Model · LLM-based Recommendation Systems
This survey reviews generative recommendation through a unified framework, analyzing data augmentation, model alignment, and task design, and highlighting innovations in large language and diffusion models that enable knowledge integration, natural language understanding, and personalized recommendation.
WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent
Published: 10/8/2025
Multimodal Large Language Model · Vision-Language-Action Model · RL Training for Large Language Models · Complex Information Retrieval Benchmark · Visual-Language Reasoning
WebWatcher is a multimodal deep-research agent that enhances visual-language reasoning via synthetic trajectories and reinforcement learning, validated on the new BrowseComp-VL benchmark for complex visual-text retrieval tasks, where it surpasses existing baselines.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Published: 12/19/2024
Multimodal Large Language Model · Visual-Spatial Intelligence Benchmark · Spatial Reasoning · Video Question Answering · Cognitive Map Generation
This work introduces VSI-Bench to evaluate multimodal large language models' spatial reasoning from videos, revealing emerging spatial awareness and local world models, with cognitive map generation enhancing spatial distance understanding beyond standard linguistic reasoning techniques.
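As a rough illustration of cognitive map generation, the sketch below assumes a model has emitted object positions on a small grid and answers distance queries from that map; the objects and coordinates are invented for the example:

```python
# Toy cognitive map: object -> (x, y) cell on a grid, as a model might emit
# after watching a video; distance questions are then answered from the map.
import math

cognitive_map = {
    "sofa": (2, 3),
    "table": (5, 3),
    "door": (9, 8),
}

def map_distance(a: str, b: str) -> float:
    return math.dist(cognitive_map[a], cognitive_map[b])

print(map_distance("sofa", "table"))   # relative distances read off the map
```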
Emerging Properties in Unified Multimodal Pretraining
Published: 5/21/2025
Multimodal Large Language Model · Large-Scale Multimodal Pretraining · Multimodal Reasoning Capacity Enhancement · Multimodal Generation and Understanding
BAGEL, a unified decoder-only multimodal model pretrained on massive interleaved data, shows emerging capabilities in complex reasoning, outperforming open-source models in generation and understanding, enabling advanced tasks like image manipulation and future frame prediction.
MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Static Quantization
Published: 10/25/2025
Multimodal Large Language Model · Static Quantization · Post-Training Quantization Framework · Modality-Specific Quantization · LLM Reasoning Capacity Enhancement
MQuant introduces a post-training static quantization framework for multimodal large language models, reducing latency and outliers via modality-specific quantization, flexible attention switching, and rotation suppression, boosting inference efficiency across major models.
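For context, here is a minimal sketch of the general technique named above, post-training static quantization: scales are fixed offline from calibration data, so inference computes no per-token statistics. The function names are illustrative, not the MQuant API:

```python
# Static (calibration-time) symmetric per-tensor quantization; names are
# illustrative, not the MQuant API.
import numpy as np

def calibrate_scale(calib_activations, n_bits=8):
    """Fix a scale offline from calibration samples; nothing is computed per token."""
    max_abs = max(np.abs(a).max() for a in calib_activations)
    return max_abs / (2 ** (n_bits - 1) - 1)

def quantize(x, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Modality-specific quantization would keep separate calibrated scales,
# e.g. one for vision tokens and one for text tokens.
calib = [np.random.randn(16, 64).astype(np.float32) for _ in range(8)]
scale = calibrate_scale(calib)
x = np.random.randn(16, 64).astype(np.float32)
x_hat = dequantize(quantize(x, scale), scale)          # quantize-dequantize round trip
print("max reconstruction error:", np.abs(x - x_hat).max())
```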
KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
Published: 10/22/2025
Multimodal Large Language Model · Knowledge Utilization in LLMs · Knowledge Injection Methods · Knowledge Adaptation and Retention · Catastrophic Forgetting Mitigation
KORE enhances knowledge injection in large multimodal models by structured augmentations and constraints, preserving old knowledge via null space projection to mitigate forgetting and enable precise adaptation to new knowledge.
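A toy sketch of the null-space projection idea: project the weight update onto the null space of old-knowledge features so outputs on old inputs are unchanged. Shapes and names below are assumptions for illustration, not KORE's exact formulation:

```python
# Toy null-space projection: remove the part of an update that would change
# outputs on old-knowledge features.
import numpy as np

def null_space_project(delta_w, old_features):
    """Project delta_w (d_out x d_in) onto the null space of old_features
    (n x d_in), so that delta_w @ f stays ~0 for every old feature f."""
    _, s, vt = np.linalg.svd(old_features, full_matrices=False)
    basis = vt[s > 1e-10]                         # orthonormal span of old features
    return delta_w - (delta_w @ basis.T) @ basis  # drop the in-span component

rng = np.random.default_rng(0)
old = rng.standard_normal((5, 32))                # features of old knowledge
dw = rng.standard_normal((8, 32))                 # raw update for new knowledge
dw_safe = null_space_project(dw, old)
print(np.abs(dw_safe @ old.T).max())              # ~0: old responses preserved
```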
m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt
Published: 3/26/2024
Multimodal Large Language Model · Multimodal Multilingual Translation · Multimodal Prompt · Multilingual Multimodal Instruction Dataset · Low-Resource Language Translation
m3P leverages multimodal prompts and visual context as a language-independent representation to align 102 languages semantically, significantly improving translation quality, especially in low-resource settings.
LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation
Published: 12/12/2024
Multimodal Large Language Model · Large-Scale Multi-Agent Systems · Multi-User Behavior Simulation · E-Commerce Scenario Simulation · Self-Consistency Prompting
LMAgent develops a large-scale multimodal agent society for realistic multi-user simulation in e-commerce, enhancing decisions via self-consistency prompting and boosting efficiency with a small-world fast memory, demonstrating human-like behavior and herd effects.
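Self-consistency prompting itself is easy to sketch: sample several stochastic responses and keep the majority answer. `query_agent` below is a hypothetical stand-in for an agent call:

```python
# Majority-vote self-consistency over stochastic agent samples.
from collections import Counter
import random

def query_agent(prompt: str) -> str:
    """Hypothetical stand-in for a temperature>0 agent call."""
    return random.choice(["buy", "buy", "skip"])

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    answers = [query_agent(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # keep the majority answer

print(self_consistent_answer("Should the agent purchase this item?"))
```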
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Published: 3/27/2025
Survey on Large Language Model Agents · RL Training for Large Language Models · LLM-guided motion planning · LLM Reasoning Capacity Enhancement · Multimodal Large Language Model
This survey systematically analyzes large language model agents’ architecture, collaboration, and evolution, unifying fragmented research and outlining evaluation, tools, and applications to guide future advancements toward artificial general intelligence.