OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
TL;DR Summary
OmniVinci introduces architectural innovations and a large-scale omni-modal data pipeline to enhance vision, audio, and language integration, achieving superior performance and efficiency over Qwen2.5-Omni, with promising downstream applications in robotics and healthcare.
Abstract
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
- Authors: Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, and Pavlo Molchanov.
- Affiliations: All authors are affiliated with NVIDIA. The paper highlights a core contribution team and designates corresponding authors and equal advisors, indicating a large, structured research initiative.
- Journal/Conference: This paper is a preprint available on arXiv. arXiv is a popular open-access repository for research papers before they undergo formal peer review for publication in a conference or journal.
- Publication Year: 2025 (the arXiv identifier 2510.15870 corresponds to an October 2025 submission).
- Abstract: The paper introduces OmniVinci, an open-source omni-modal Large Language Model (LLM) designed to perceive and understand information across vision, audio, and text, mimicking human senses. The authors focus on improving model architecture and data curation. Key architectural innovations include OmniAlignNet for aligning vision and audio embeddings, Temporal Embedding Grouping for relative temporal alignment, and Constrained Rotary Time Embedding for absolute temporal encoding. They also developed a data pipeline to generate 24 million high-quality conversations, finding that modalities reinforce each other's learning. The resulting OmniVinci model significantly outperforms its contemporary, Qwen2.5-Omni, on various benchmarks while using 6 times fewer training tokens. The paper concludes by showcasing the model's advantages in downstream applications like robotics, medical AI, and smart manufacturing.
- Original Source Link:
- ArXiv: https://arxiv.org/abs/2510.15870v1
- PDF: https://arxiv.org/pdf/2510.15870v1.pdf
- Status: Preprint.
2. Executive Summary
-
Background & Motivation (Why): While Large Language Models (LLMs) have been successfully extended to understand images (vision) or sound (audio), creating systems that can seamlessly integrate vision, audio (including speech and environmental sounds), and language—known as omni-modal understanding—remains a significant challenge. Training such models is often prohibitively expensive and complex, demanding sophisticated architectural designs for aligning different data types and massive, well-curated datasets. This paper tackles the problem of building a powerful, yet data-efficient, open-source omni-modal LLM by systematically improving both the model's architecture and its training data.
-
Main Contributions / Findings (What):
- Novel Architectural Components: The paper introduces three key techniques to better align and integrate vision and audio signals:
- OmniAlignNet: A module that uses contrastive learning to ensure that the high-level semantics of visual and auditory streams from the same video are represented closely in a shared latent space.
- Temporal Embedding Grouping (TEG): A method that arranges visual and audio embeddings into chronological chunks, explicitly encoding their relative temporal order.
- Constrained Rotary Time Embedding (CRTE): An advanced technique for injecting absolute timestamp information into embeddings, improving the model's ability to reason about when events occur.
- Advanced Data Curation and Synthesis Pipeline: The authors created a pipeline to generate 24 million diverse training samples. A key innovation is an "Omni-Modal Data Engine" that addresses "modality-specific hallucination"—where a model makes errors by relying on only one modality (e.g., video without audio). This engine uses a powerful LLM to cross-reference and correct captions generated from separate vision and audio streams, producing a single, accurate, and comprehensive omni-modal description.
- State-of-the-Art Open-Source Model: The resulting model, OmniVinci, establishes new performance benchmarks for open-source omni-modal models. It significantly outperforms Qwen2.5-Omni (+19.05 on DailyOmni) and other strong baselines across a wide range of tasks.
- High Data Efficiency: OmniVinci achieves these superior results using only 0.2 trillion training tokens, a 6-fold reduction compared to the 1.2 trillion tokens used for Qwen2.5-Omni, demonstrating the effectiveness of its architecture and data strategy.
- Synergy Across Modalities: The research provides strong evidence that integrating audio and vision not only improves perception but also enhances higher-level reasoning capabilities, as demonstrated during reinforcement learning post-training.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks (e.g., GPT, Llama) trained on vast amounts of text. They excel at understanding and generating human-like language by predicting the next word in a sequence. This is known as an auto-regressive process.
- Multimodal LLMs (MLLMs): These are LLMs extended to handle non-text inputs like images, videos, or audio. This typically involves using a separate encoder (e.g., a Vision Transformer for images) to convert the non-text data into numerical representations (embeddings) that the LLM can understand. A projector network then maps these embeddings into the same space as the LLM's text embeddings.
- Omni-modal LLMs: The next evolution of MLLMs, designed to process and reason over multiple modalities simultaneously (e.g., video + audio + text prompt). This paper focuses on building such a system.
- Contrastive Learning (e.g., CLIP): A self-supervised learning technique used to align two different modalities. For instance, CLIP learns to match images with their corresponding text captions. It does this by pulling the embeddings of a correct (image, text) pair together in a shared latent space while pushing incorrect pairs apart.
OmniAlignNet uses this principle for video and audio.
- Rotary Positional Embeddings (RoPE): A method for encoding the position of a token in a sequence. Instead of adding a positional value, RoPE rotates the embedding vector by an angle that depends on its position. This is effective at capturing relative positions. CRTE adapts this idea for absolute time.
-
Previous Works & Differentiation: The paper positions OmniVinci against recent leading models like Qwen2.5-Omni, Gemini-2.5-Pro, and GPT-4o. While these models demonstrate powerful omni-modal capabilities, many are proprietary (closed-source) or require immense computational resources and data for training. OmniVinci differentiates itself in several key ways:
- Open-Source Focus: It provides a strong, open-source alternative, fostering broader research and application development.
- Systematic Design: It presents a detailed, systematic exploration of how to build an omni-modal model, ablating specific architectural and data choices.
- Data Efficiency: It achieves superior performance with significantly less training data, thanks to its intelligent data synthesis pipeline that creates high-quality, explicit omni-modal supervision.
- Architectural Novelty: It introduces a combination of three novel components (OmniAlignNet, TEG, CRTE) specifically designed to tackle the temporal and semantic alignment challenges in video-audio integration, which is more advanced than the simple token concatenation used in some prior works.
4. Methodology (Core Technology & Implementation)
The core of OmniVinci is its architecture for integrating vision, audio, and text into an LLM backbone. The process starts with encoding inputs and then uses a novel mechanism to align them.
-
Omni-Modal Input Embedding: Video is treated as a sequence of image frames, and audio is processed by a unified audio encoder. These are passed through modality-specific projectors to create initial embeddings.
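For orientation, here is a schematic sketch of the projector step described above; the MLP structure and the dimension values are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Map encoder features into the LLM's token-embedding space (illustrative MLP projector)."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, enc_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(features)

# Hypothetical dimensions for illustration only:
# vision_proj = ModalityProjector(enc_dim=1152, llm_dim=4096)
# audio_proj  = ModalityProjector(enc_dim=1280, llm_dim=4096)
```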
-
Omni-Modal Alignment Mechanism: The key challenge is to harmonize these embeddings. OmniVinci employs a three-part strategy:
- OmniAlignNet: Semantic Alignment
- Principle: To align the high-level meaning of the entire video clip and its corresponding audio stream.
- Steps:
- Visual embeddings and audio embeddings are obtained from their respective projectors.
- These sequences are summarized into single, fixed-size vectors $\mathbf{z}_v$ and $\mathbf{z}_a$ using a self-attention mechanism and learnable query embeddings.
- A contrastive loss, inspired by CLIP, is applied. This loss encourages the embeddings $\mathbf{z}_v$ and $\mathbf{z}_a$ from the same video to be similar, while being dissimilar to embeddings from other videos in the same batch.
- Mathematical Formula: The loss is a symmetric cross-entropy function. The loss for aligning vision to audio ($\mathcal{L}_{v \to a}$) and audio to vision ($\mathcal{L}_{a \to v}$) is:
$$\mathcal{L}_{v \to a} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\left(\mathbf{z}_v^{i} \cdot \mathbf{z}_a^{i}\right)}{\sum_{j=1}^{B} \exp\!\left(\mathbf{z}_v^{i} \cdot \mathbf{z}_a^{j}\right)}, \qquad \mathcal{L}_{a \to v} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\left(\mathbf{z}_a^{i} \cdot \mathbf{z}_v^{i}\right)}{\sum_{j=1}^{B} \exp\!\left(\mathbf{z}_a^{i} \cdot \mathbf{z}_v^{j}\right)}$$
and the overall alignment loss averages the two directions.
- $B$: The number of video-audio pairs in a batch.
- $\mathbf{z}_v^{i} \cdot \mathbf{z}_a^{j}$: The dot-product similarity between the visual embedding of video $i$ and the audio embedding of video $j$. The goal is to maximize the similarity for matched pairs ($i = j$).
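To make this concrete, below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss over the pooled vectors $\mathbf{z}_v$ and $\mathbf{z}_a$. The function name and the unit-normalization step are illustrative assumptions rather than the paper's exact implementation (which may also use a learnable temperature).

```python
import torch
import torch.nn.functional as F

def omni_align_loss(z_v: torch.Tensor, z_a: torch.Tensor) -> torch.Tensor:
    """Symmetric CLIP-style contrastive loss between pooled vision and audio vectors.

    z_v, z_a: (B, D) summary embeddings for B video-audio pairs (illustrative shapes).
    """
    z_v = F.normalize(z_v, dim=-1)  # unit-normalize so dot product acts as cosine similarity
    z_a = F.normalize(z_a, dim=-1)
    logits = z_v @ z_a.t()                                   # (B, B) pairwise similarities
    targets = torch.arange(z_v.size(0), device=z_v.device)   # matched pairs lie on the diagonal
    loss_v2a = F.cross_entropy(logits, targets)              # vision -> audio direction
    loss_a2v = F.cross_entropy(logits.t(), targets)          # audio -> vision direction
    return 0.5 * (loss_v2a + loss_a2v)
```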
-
Temporal Embedding Grouping (TEG): Relative Temporal Alignment
- Principle: To preserve the chronological order of events.
- Steps: The timeline is divided into chunks of a fixed duration. All visual and audio embeddings are grouped into these chunks based on their timestamps. The final input sequence for the LLM is formed by concatenating these groups in order (e.g., [Vision Group 1, Audio Group 1, Vision Group 2, Audio Group 2, ...]). This organization implicitly tells the LLM which visual and audio events happened around the same time and in what sequence.
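A rough sketch of this grouping step is shown below, assuming timestamped embedding lists and a hypothetical chunk length; it illustrates the interleaving idea, not the paper's code.

```python
def temporal_embedding_grouping(vis, aud, chunk_sec: float = 1.0):
    """Interleave vision/audio embeddings chunk by chunk in chronological order.

    vis, aud: lists of (timestamp_seconds, embedding) tuples.
    Returns one ordered list: [V-chunk0, A-chunk0, V-chunk1, A-chunk1, ...].
    """
    def bucket(items):
        groups = {}
        for t, emb in items:
            groups.setdefault(int(t // chunk_sec), []).append(emb)
        return groups

    v_groups, a_groups = bucket(vis), bucket(aud)
    sequence = []
    for k in sorted(set(v_groups) | set(a_groups)):
        # vision embeddings of chunk k, then audio embeddings of chunk k
        sequence += v_groups.get(k, []) + a_groups.get(k, [])
    return sequence
```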
-
Constrained Rotary Time Embedding (CRTE): Absolute Temporal Alignment
- Principle: To encode the precise, absolute timestamp of each frame and audio sample into its embedding.
- Steps: This technique adapts RoPE for time.
- Base Frequency Generation: A set of base frequencies is created, ranging from high to low. High frequencies are sensitive to small time differences, while low frequencies capture broad temporal relationships. The coarseness is controlled by a hyperparameter.
- Frequency Modulation: The base frequencies are scaled by the actual timestamp of the sample.
- Rotary Embedding Application: The embedding vector is rotated based on the modulated frequencies. Dimensions are paired up and rotated in 2D planes, with different pairs using different frequencies. This allows the model to simultaneously represent time at multiple scales.
- Mathematical Formula: Each embedding is rotated by time-dependent angles:
$$\mathrm{CRTE}(\mathbf{x}, t) = \mathbf{x} \odot \cos\!\big(\boldsymbol{\omega}(t)\big) + \mathrm{RotateHalf}(\mathbf{x}) \odot \sin\!\big(\boldsymbol{\omega}(t)\big)$$
- $\mathbf{x}$: The original embedding vector.
- $\boldsymbol{\omega}(t)$: The vector of modulated frequencies for a sample at time $t$.
- $\odot$: Element-wise multiplication.
- $\mathrm{RotateHalf}$: A function that swaps and negates adjacent elements, effectively performing a rotation in 2D subspaces.
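Below is a minimal sketch of a RoPE-style rotation keyed to absolute timestamps, in the spirit of CRTE. The half-split form of RotateHalf, the geometric frequency schedule, and the `max_period` knob standing in for the coarseness hyperparameter are assumptions; the paper's exact constraint may differ.

```python
import torch

def constrained_rotary_time_embedding(x: torch.Tensor, t: torch.Tensor,
                                      max_period: float = 10000.0) -> torch.Tensor:
    """Rotate embeddings x (B, D) by angles proportional to absolute time t (B,).

    Dimension pairs are rotated in 2D planes at geometrically spaced frequencies, so the
    model sees time at multiple scales. Assumes D is even; `max_period` is an assumed
    stand-in for the coarseness hyperparameter described above.
    """
    B, D = x.shape
    half = D // 2
    # Frequencies from high (fine time resolution) to low (coarse time resolution).
    freqs = max_period ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = t[:, None] * freqs[None, :]                 # (B, D/2), modulated by timestamp
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x[..., :half], x[..., half:]
    rotate_half = torch.cat([-x2, x1], dim=-1)           # RotateHalf: swap halves, negate one
    return x * cos + rotate_half * sin
```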
-
-
Training Strategy and Data Engine: The model is trained in two stages: modality-specific pre-training followed by joint omni-modal training. The joint training phase uses a mix of data, with a key contribution being the Omni-Modal Data Engine.
Figure 4 (from the paper): The omni-modal caption generation pipeline. Videos are split into 20-second segments; visual and audio captions are generated separately and can contain cross-modal misunderstandings. A second large language model then performs cross-modal correction and summarization to produce an accurate omni-modal caption.
As shown in Figure 4 above, this engine addresses "modality-specific hallucination". For a video of an underwater robot, a vision-only model might focus on "human technology," while an audio-only model hearing the narration might guess "Earth's interior." Neither is fully correct. The engine works as follows:
- Generate separate captions for video and audio.
- Feed both captions to a powerful LLM.
- Instruct the LLM to synthesize a new, corrected omni-modal caption that integrates information from both and resolves conflicts (e.g., "A blue underwater robot glides through dark waters... the narration explains the journey...").
- This high-quality, synthesized data provides explicit omni-modal supervision, which proved highly effective.
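As a sketch of how such a correction step could be wired up (the prompt wording and the `llm_complete` helper are hypothetical, not the paper's actual engine):

```python
def synthesize_omni_caption(vision_caption: str, audio_caption: str, llm_complete) -> str:
    """Cross-modal correction: ask a strong LLM to merge and reconcile two single-modal captions.

    `llm_complete` is a placeholder for any text-completion callable (hypothetical helper).
    """
    prompt = (
        "You are given two descriptions of the same 20-second video clip.\n"
        f"Visual-only caption: {vision_caption}\n"
        f"Audio-only caption: {audio_caption}\n"
        "Write a single omni-modal caption that integrates both, resolves any "
        "contradictions between them, and does not invent details absent from both sources."
    )
    return llm_complete(prompt)
```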
5. Experimental Setup
-
Datasets:
OmniVinci was evaluated on a comprehensive suite of benchmarks:
- Omni-modal Understanding: Worldsense, Dailyomni, Omnibench (tasks requiring joint video-audio or image-audio understanding).
- Audio Understanding: MMAR (general audio QA), MMAU (diverse audio tasks), and several Automatic Speech Recognition (ASR) benchmarks like LibriSpeech and AMI.
- Video Understanding: Video-MME, MVBench, LongVideoBench (video QA and comprehension).
- Image Understanding: A wide array of 10 benchmarks including VQAv2 (visual question answering), DocVQA (document QA), and MathVista (mathematical reasoning in images).
-
Evaluation Metrics:
- Most benchmarks report an accuracy-based score, where higher is better (indicated by ↑).
- For speech recognition (ASR), the primary metric is Word Error Rate (WER), where lower is better (indicated by ↓).
- Conceptual Definition: WER is the standard metric for measuring the performance of a speech recognition system. It calculates the percentage of words that were incorrectly transcribed. Errors are categorized as substitutions (wrong word), deletions (missed word), or insertions (extra word).
- Mathematical Formula:
$$\mathrm{WER} = \frac{S + D + I}{N}$$
- Symbol Explanation:
- $S$: The number of substitutions (words replaced by incorrect ones).
- $D$: The number of deletions (words from the reference transcript that were missed).
- $I$: The number of insertions (words added to the hypothesis that were not in the reference).
- $N$: The total number of words in the reference transcript.
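A minimal reference implementation of WER via word-level edit distance follows (standard dynamic programming, not code from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])      # substitution (or exact match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # vs. deletion, insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over four reference words -> WER = 0.25
print(word_error_rate("the cat sat down", "the cat sat town"))
```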
-
Baselines: The paper compares OmniVinci against a strong field of contemporary models, including:
- Open-Source: Qwen2.5-Omni, InternVL2, LLaVA-NeXT-Video, NVILA.
- Proprietary (Closed-Source): GPT-4o, Gemini 2.5 Pro, Claude 3.5 Sonnet.
6. Results & Analysis
OmniVinci demonstrates superior performance across the board, validating its design choices.
-
Ablation Studies: The authors systematically prove the value of each new component.
-
Omni-modal Alignment (Table 1): Starting with a simple baseline, adding TEG (Temporal Embedding Grouping) improved the average score by +2.21. Adding CRTE (Constrained Rotary Time Embedding) on top of that further boosted the score to +4.74 over the baseline. Finally, adding OmniAlignNet brought the total improvement to +7.08, confirming that all three components contribute significantly.
(Manual transcription of Table 1 from the paper)

| Method | Worldsense ↑ | Dailyomni ↑ | Omnibench ↑ | Average ↑ |
| --- | --- | --- | --- | --- |
| Token Concatenation - Baseline | 42.21 | 54.55 | 36.46 | 45.51 |
| + TEG (ours) | 44.51 (+2.30) | 60.99 (+6.44) | 37.65 (+1.19) | 47.72 (+2.21) |
| ++ Learned Time Embedding | 44.58 (+2.37) | 60.40 (+5.85) | 36.91 (+0.45) | 47.30 (+1.79) |
| ++ RoTE | 44.42 (+2.21) | 60.74 (+6.19) | 38.24 (+1.78) | 47.80 (+2.29) |
| ++ CRTE (ours) | 45.46 (+3.25) | 65.66 (+11.11) | 39.64 (+3.18) | 50.25 (+4.74) |
| +++ OmniAlignNet (ours) | 46.21 (+4.00) | 65.83 (+12.28) | 45.74 (+9.28) | 52.59 (+7.08) |
-
Implicit vs. Explicit Learning (Table 2): Training on videos with audio (Implicit Learning) improved video understanding scores. However, adding the synthesized data from the Omni-Modal Data Engine (Explicit Learning) yielded much larger gains (+5.70 on VideoMME without subtitles), proving the effectiveness of the data synthesis strategy.
(Manual transcription of Table 2 from the paper)

| Method | VideoMME w/ subtitles ↑ | VideoMME w/o subtitles ↑ | Short (w/o sub.) | Medium (w/o sub.) | Long (w/o sub.) |
| --- | --- | --- | --- | --- | --- |
| Visual Alone | 66.37 | 61.67 | 74.22 | 59.67 | 51.11 |
| Visual + Audio (IL) | 66.96 (+0.59) | 63.76 (+2.09) | 71.31 (-2.91) | 64.16 (+4.49) | 55.82 (+4.71) |
| Visual + Audio + Data Engine (EL) | 68.63 (+2.26) | 67.37 (+5.70) | 76.78 (+2.56) | 67.56 (+7.89) | 57.78 (+6.67) |
-
-
Core Benchmark Results:
Figure 1 (from the paper): OmniVinci's performance on six common multimodal, audio, and visual understanding benchmarks, clearly surpassing several comparison models, with gains including Dailyomni +19.05, MMAR +1.7, and Video-MME +3.9.
As summarized in Figure 1, OmniVinci sets new state-of-the-art results for open-source models:
- Omni-modal (Table 3): Achieves an average score of 53.73, surpassing Qwen2.5-Omni (49.66). The most dramatic improvement is on Dailyomni, with a score of 66.50, which is +19.05 points higher than Qwen2.5-Omni.
- Audio (Tables 4, 6): Outperforms Qwen2.5-Omni on MMAR (58.40 vs. 56.70) and MMAU (71.60 vs. 71.00), showing stronger general audio comprehension.
- Video (Table 7): Leads other open-source models on MVBench (70.6), LongVideoBench (61.3), and Video-MME (68.2), confirming that its audio-visual integration translates to better video understanding.
- Image (Table 8): Maintains highly competitive performance on a wide range of image benchmarks, demonstrating that its omni-modal training did not compromise its vision-language capabilities.
-
Omni-Modal Reasoning with Reinforcement Learning: The paper applies
GRPO (Group Relative Policy Optimization), a reinforcement learning technique, to further enhance the model's reasoning.
-
Result (Table 9): GRPO post-training provides consistent improvements across all omni-modal benchmarks.
-
Key Insight 3: The analysis in Figure 6 reveals that training with joint audio-visual input leads to a higher final reward and faster convergence compared to training with only visual input. This finding suggests that audio provides a valuable signal for improving not just perception, but also complex reasoning.
Figure 6 (from the paper): Accuracy-reward and format-reward curves for OmniVinci and Qwen2.5-Omni during reinforcement learning training. The left two plots show OmniVinci achieving a +7.8% gain in accuracy reward and 2.7x faster convergence in format reward; the right plot compares accuracy reward with and without audio input, showing a +0.1 improvement.
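To illustrate the group-relative idea behind GRPO, here is a simplified sketch of the advantage computation; the full training loop (sampling multiple responses per prompt, reward scoring, clipped policy-gradient updates) is omitted, and the function is an illustration rather than the paper's implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each sampled response's reward against its own group.

    rewards: (num_prompts, group_size) scalar rewards for several responses per prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Responses that beat their group's average receive positive advantage.
    return (rewards - mean) / (std + eps)
```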
-
-
Qualitative Examples and Downstream Tasks: The paper showcases
OmniVinci's practical abilities through several examples:
-
Qualitative Analysis (Figures 7-9): The model can accurately describe complex video scenes by integrating visual actions (e.g., a person opening a gift box) with spoken content (e.g., a podcast discussion on AI). It can also handle audio prompts.
-
Downstream Applications: The model is successfully applied to advanced tasks, including speech-instructed robot navigation, analysis of medical scans with physician narration, and monitoring in smart factories, demonstrating its real-world utility. For instance, in robot navigation (Figure 8), the agent uses visual input and spoken commands to navigate an environment.
Figure 8 (from the paper): The OmniVinci-based speech-driven navigation agent. The left panel shows the agent's current visual observation, the middle panel shows a map with the goal location and trajectory history, and the right panel shows the input speech instruction and the agent's predicted action.
-
7. Conclusion & Reflections
-
Conclusion Summary:
OmniVinci represents a significant step forward in building powerful and efficient open-source omni-modal LLMs. The paper provides a clear and effective recipe centered on three architectural innovations (OmniAlignNet, TEG, CRTE) and an intelligent data synthesis pipeline that explicitly corrects for "modality-specific hallucinations." The resulting model achieves state-of-the-art performance on a wide range of benchmarks while being substantially more data-efficient than its competitors, and demonstrates that a true omni-modal approach enhances both perception and reasoning.
-
Limitations & Future Work:
- Dependency on Teacher Model: The "Omni-Modal Data Engine" relies on another powerful (and likely proprietary) LLM to synthesize corrected captions. This creates a dependency and may propagate biases from the teacher model into OmniVinci's training data.
- Complexity: The combination of multiple encoders, projectors, and three distinct alignment mechanisms results in a complex architecture that may be challenging to train and optimize.
- Real-time Performance: While the paper shows latency improvements (Figure 9), real-time, low-latency omni-modal interaction on consumer-grade hardware remains a challenge.
-
Personal Insights & Critique:
- The most impactful contribution of this paper is arguably the "Omni-Modal Data Engine." Identifying and explicitly addressing "modality-specific hallucination" is a crucial insight. It highlights that simply throwing more data at a model is not enough; the quality and nature of the supervision are paramount. This data-centric approach is likely a key reason for the model's impressive data efficiency.
- The work provides a strong blueprint for future research in omni-modal AI. The open-sourcing of the OmniVinci model will enable the community to build upon this foundation, exploring new applications and pushing the boundaries of machine perception.
- An open question is the model's robustness to conflicting or ambiguous signals. For example, how would it behave if the visual information directly contradicts the audio narration? The current framework focuses on synergy and alignment, but handling discordance is a critical next step for real-world intelligence.
- Overall, OmniVinci is an excellent piece of engineering and research that combines architectural ingenuity with a pragmatic, data-driven strategy to deliver a top-tier, efficient, and accessible omni-modal foundation model.