DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
TL;DR Summary
This paper introduces DrVoice, a parallel speech-text conversation model that reduces the LLM input frequency to 5Hz using dual-resolution speech representations, significantly lowering computational costs. Experimental results show state-of-the-art performance across multiple benchmarks.
Abstract
Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods utilize mainly 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz, significantly reducing computational cost and alleviating the frequency discrepancy between speech and text tokens and in turn better exploiting LLMs' capabilities. Experimental results demonstrate that DRVOICE-7B establishes new state-of-the-art (SOTA) on OpenAudioBench and Big Bench Audio benchmarks, while achieving performance comparable to the SOTA on VoiceBench and UltraEval-Audio benchmarks, making it a leading open-source speech foundation model in ~7B models.
In-depth Reading
1. Bibliographic Information
1.1. Title
DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
1.2. Authors
Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye. The authors are affiliated with the Tongyi Lab at Alibaba Group, a prominent research institution known for developing large-scale AI models, including the Qwen series of large language models.
1.3. Journal/Conference
The paper was submitted to arXiv, a repository for electronic preprints of scientific papers. The publication date of June 11, 2025, suggests it is a preprint submitted for review to a future conference. Given that a cited paper is for ICLR 2025, it is plausible that DrVoice was submitted to a top-tier AI conference like ICLR, NeurIPS, or ICML.
1.4. Publication Year
2025 (as per the arXiv submission date).
1.5. Abstract
The paper introduces DrVoice, a novel end-to-end (E2E) voice conversation model that generates speech and text in parallel. It is based on a joint autoregressive modeling approach, where the Large Language Model (LLM) generates both speech and text tokens simultaneously. The core innovation is a dual-resolution speech representation mechanism. While existing models often use a 12.5Hz audio representation, DrVoice reduces the input frequency for the LLM to just 5Hz. This significantly lowers computational costs and resolves the frequency mismatch between slower text tokens and faster speech tokens, allowing the LLM's language capabilities to be better utilized. Experimental results show that the 7-billion-parameter model, DrVoice-7B, achieves new state-of-the-art (SOTA) performance on OpenAudioBench and Big Bench Audio benchmarks. It also performs comparably to SOTA models on VoiceBench and UltraEval-Audio, positioning it as a leading open-source speech foundation model in its size class.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2506.09349
- PDF Link: https://arxiv.org/pdf/2506.09349v3.pdf
- Publication Status: This is a preprint available on arXiv and has not yet been officially published in a peer-reviewed journal or conference proceedings at the time of this analysis.
2. Executive Summary
2.1. Background & Motivation
The field of spoken dialogue systems is rapidly advancing, moving beyond traditional "cascaded" systems towards more integrated "end-to-end" (E2E) models.
- Cascaded Systems: These systems chain together three separate modules: Automatic Speech Recognition (ASR) to convert speech to text, a Large Language Model (LLM) to process the text and generate a response, and Text-to-Speech (TTS) to convert the text response back into audio. This approach suffers from error propagation between modules, high latency, and the loss of non-textual information like emotion and prosody.
- End-to-End (E2E) Systems: These models aim to create a single, unified system that can directly understand speech and generate speech, overcoming the limitations of cascaded systems.
Within E2E systems, two main paradigms have emerged:
- Text-Driven Speech Models: The LLM first generates a complete text response, and then a separate speech decoder uses the LLM's internal states to synthesize the corresponding audio. The key limitation is that the text generation process is "unaware" of the speech synthesis; it cannot adjust its wording based on how the audio is being generated.
- Joint Speech-Text Models: The LLM generates both text tokens and speech tokens (discrete representations of audio) together. This allows for mutual awareness between modalities. However, a major challenge is that the introduction of speech tokens can interfere with and degrade the LLM's original powerful text generation capabilities.

This paper focuses on improving Joint Speech-Text Models, specifically the parallel variant. The authors identify a critical issue in prior state-of-the-art models like Kimi-Audio: the high frequency of audio tokens (e.g., 12.5Hz, or 12.5 tokens per second) compared to the much lower frequency of text tokens (~3Hz). This mismatch leads to two problems:
- High Computational Cost: Processing long sequences of high-frequency audio tokens is computationally expensive.
- Semantic Dilution: The LLM is flooded with speech tokens for every text token, which can dilute the semantic information and prevent the model from fully leveraging its reasoning and language skills.
The innovative idea of DrVoice is to tackle this frequency mismatch head-on with a dual-resolution approach: use a low-resolution (5Hz) representation for the LLM's internal processing to improve efficiency and semantic alignment, and a high-resolution (25Hz) representation for the final speech generation to ensure audio quality.
2.2. Main Contributions / Findings
The paper presents three main contributions:
- DrVoice with Dual-Resolution Speech Representations (DRSR): The paper proposes a novel parallel speech-text model architecture. Its core innovation, DRSR, groups high-frequency speech tokens into a low-frequency representation (25Hz -> 5Hz) before feeding them into the LLM. This reduces computational load and aligns the speech-text token rates. For output, a specialized Speech Refined Head "un-groups" the representation to generate high-quality, high-frequency speech tokens.
- Novel Training Strategies:
  - CoM-Mixing Training: A curriculum-like strategy that uses "Chain-of-Modality" (CoM) prompting, where the model first thinks in text before generating speech. This helps scaffold the learning process and improves modality alignment.
  - Core-Cocktail Training: A two-stage optimization strategy designed to add speech capabilities to a pre-trained LLM without degrading its original knowledge. It involves an initial aggressive training stage followed by a model-merging and fine-tuning stage to preserve the base LLM's performance.
- State-of-the-Art Performance: The resulting model, DrVoice-7B, sets a new SOTA on the OpenAudioBench (for audio understanding) and Big Bench Audio (for audio reasoning) benchmarks. It also achieves performance comparable to the best models on VoiceBench and UltraEval-Audio, all while being significantly more computationally efficient due to its 5Hz processing frequency. This establishes DrVoice as a top-tier open-source speech foundation model.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data. Their core mechanism is autoregressive generation, where they predict the next word (or "token") in a sequence based on all the preceding tokens. This allows them to generate coherent and contextually relevant text. Examples include GPT-4 and Qwen.
- Speech Tokenization: This is the process of converting a continuous audio waveform into a sequence of discrete integer tokens, similar to how text is broken down into word or sub-word tokens. This allows LLMs, which are designed for discrete data, to process audio. There are two main types of speech tokens:
- Acoustic Tokens: These are optimized to reconstruct the original audio waveform with high fidelity. They capture detailed acoustic properties like pitch, timbre, and loudness.
- Semantic Tokens: These are optimized to capture the linguistic content of the speech. They have a stronger correlation with the meaning of the words being spoken, making them more suitable for tasks requiring speech understanding. DrVoice uses S3Tokenizer, a type of semantic tokenizer.
- Neural Audio Codec: A model that performs speech tokenization. It typically consists of an encoder that compresses audio into discrete tokens and a decoder (or detokenizer) that reconstructs the audio from these tokens.
- Automatic Speech Recognition (ASR): The task of converting spoken language into text.
- Text-to-Speech (TTS): The task of synthesizing human-like speech from text.
3.2. Previous Works
The paper situates DrVoice within the landscape of E2E speech foundation models.
- Text-Driven Speech Models: In these models, the LLM acts as a "Thinker" that processes multimodal input and generates a text-only response. This text, along with the LLM's hidden states, is then passed to a "Talker" module (a speech decoder) that generates the audio.
  - Example: Qwen2.5-Omni.
  - Limitation: The information flow is one-way. The "Thinker" completes its text generation before the "Talker" starts. This means the LLM cannot receive feedback from the generated audio, limiting its ability to control fine-grained prosody or adapt its response mid-utterance.
- Joint Speech-Text Models: These models train the LLM to generate both text and speech tokens simultaneously, allowing for interaction between the two modalities.
  - Interleaved Models: These models generate speech and text tokens in an alternating sequence within the same autoregressive process.
    - Examples: GLM-4-Voice, Baichuan-Omni-1.5.
  - Parallel Models: These models generate speech and text tokens for the same time step in parallel. The embeddings of the speech and text tokens are typically combined (e.g., added together) and fed back into the LLM as a single input for the next step.
    - Examples: Moshi, Kimi-Audio. Kimi-Audio is a key predecessor that established strong performance but used a 12.5Hz audio representation, which DrVoice identifies as inefficient.
- Key Components Used: DrVoice builds upon several existing powerful open-source components:
  - S3Tokenizer and CosyVoice: A state-of-the-art semantic speech tokenizer and detokenizer (TTS system) that DrVoice uses to handle the conversion between audio and semantic tokens. They are noted for their robustness and high-quality speech synthesis.
  - Whisper: A large-scale speech recognition model from OpenAI. DrVoice uses its powerful encoder to process user speech input for better understanding.
3.3. Technological Evolution
The field has evolved from clumsy, error-prone cascaded systems to more seamless E2E systems. Within E2E models, the trend has been to grant the core LLM more direct control over speech generation.
- Early E2E (Text-Driven): The LLM's role was primarily text generation, with speech synthesis as a downstream task (e.g., Qwen2.5-Omni).
- Advanced E2E (Joint Modeling): The LLM is directly involved in generating speech tokens, enabling richer interaction. This led to interleaved and parallel architectures.
- Refinement of Joint Models (DrVoice): DrVoice represents the next step, focusing on solving a core efficiency and architectural bottleneck in parallel joint models—the frequency mismatch between speech and text. By introducing the dual-resolution concept, it refines the parallel approach to be more computationally feasible and better aligned with the LLM's natural processing strengths.
3.4. Differentiation Analysis
DrVoice's core innovations differentiate it from previous works, particularly Kimi-Audio:
- Dual-Resolution vs. Single-Resolution: While Kimi-Audio uses a fixed 12.5Hz representation, DrVoice introduces a dual-resolution system. It uses an ultra-low 5Hz frequency for LLM input (via grouping) and a high 25Hz frequency for output (via a refinement head). This is the central architectural innovation.
- Efficiency: The 5Hz input rate makes DrVoice significantly more computationally efficient during training and inference compared to models operating at 12.5Hz or 25Hz.
- Speech Refined Head (SRH): Instead of predicting a group of tokens at once, DrVoice uses the SRH to autoregressively generate each speech token within a group. This allows for finer control and potentially higher quality speech, addressing the information loss from the initial grouping step.
- Systematic Training Strategies: DrVoice introduces the CoM-Mixing and Core-Cocktail training strategies, which are explicit methodologies for improving modality alignment and preserving the base LLM's knowledge, areas that are often handled with less structured approaches in other works.
4. Methodology
4.1. Principles
The core principle of DrVoice is to decouple the speech resolution required for LLM comprehension from the resolution required for high-quality audio synthesis. The model hypothesizes that an LLM does not need to process audio at a high frame rate to understand its content and generate a coherent response. A lower frame rate is sufficient and more efficient, as it better matches the rate of semantic information flow in text. However, generating clear, natural-sounding speech requires a higher frame rate. DrVoice's architecture is designed around this principle, using a low-resolution representation internally and a high-resolution one externally.
The overall architecture is depicted in the figure below.
Figure: Schematic of the DrVoice model. It shows how user speech input is processed, including the speech decoder, shared LLM layers, the text head, and the Speech Refined Head. The SRH generates speech tokens through sub-autoregressive forward passes, producing the predicted text tokens and the speech waveform.
4.2. Core Methodology In-depth
The DrVoice system consists of three main components: input encoders, a Multimodal LLM (MLLM) for generation, and an output speech detokenizer.
4.2.1. Speech Input Encoding and Output Detokenization
- User Speech Input: For understanding the user's speech, DrVoice uses the powerful pre-trained Whisper-Large-v3 speech encoder. This extracts continuous audio representations, which are then passed through an Adapter to match the temporal resolution and hidden dimension of the LLM (see the sketch after this list).
- Assistant Speech Tokenization: For generating its own speech, DrVoice relies on a semantic speech tokenizer. The paper uses S3Tokenizer, which converts a speech waveform into a sequence of discrete semantic tokens.
- Speech Detokenization (Synthesis): To convert the generated speech tokens back into an audio waveform, DrVoice uses the Speech Detokenizer from CosyVoice. This involves a Flow Matching model that converts tokens to a Mel spectrogram, followed by a HiFi-GAN vocoder that transforms the spectrogram into the final audio signal. Both the tokenizer and detokenizer are pre-trained and frozen during DrVoice's training.
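To make the adapter idea concrete, here is a minimal sketch (PyTorch) of a strided-convolution adapter that downsamples continuous encoder features and projects them to the LLM hidden size. The class name, the 2x downsampling factor, and the dimensions (1280 for Whisper-Large-v3 features, 3584 for a Qwen2.5-7B-sized LLM) are illustrative assumptions, not the paper's exact adapter design.

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Illustrative adapter: downsample encoder frames and project to LLM width."""
    def __init__(self, d_enc=1280, d_llm=3584, stride=2):
        super().__init__()
        # a strided convolution both reduces temporal resolution and changes dimension
        self.down = nn.Conv1d(d_enc, d_llm, kernel_size=stride, stride=stride)

    def forward(self, enc_feats):            # (batch, frames, d_enc)
        x = enc_feats.transpose(1, 2)        # (batch, d_enc, frames)
        x = self.down(x)                     # (batch, d_llm, frames // stride)
        return x.transpose(1, 2)             # (batch, frames // stride, d_llm)

feats = torch.randn(1, 100, 1280)            # e.g. 2 s of 50 Hz encoder output
print(SpeechAdapter()(feats).shape)          # torch.Size([1, 50, 3584])
```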
4.2.2. Multimodal Large Language Model (MLLM)
The MLLM is the core of the system, built upon a pre-trained text-only LLM (Qwen2.5). It is designed to process both speech and text inputs and generate them in parallel.
Parallel Joint Speech-Text Modeling
At each generation step $t$, the model generates a pair of tokens: a speech token $s_t$ and a text token $t_t$. The embeddings of these two tokens are added together to form a combined input for the next step. This creates a parallel feedback loop where both modalities influence future generation.
The combined input embedding at timestep $t$ is calculated as: $ c_t = E_{\mathrm{speech}}(s_t) + E_{\mathrm{text}}(t_t) $ where:
- $E_{\mathrm{speech}}$ and $E_{\mathrm{text}}$ are the embedding layers for speech and text tokens, respectively.
- $s_t$ is the speech token generated at step $t$.
- $t_t$ is the text token generated at step $t$.
Since speech and text sequences for an utterance have different lengths, the shorter sequence is padded with a special token to align them. The model is trained to predict the joint output autoregressively.
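A minimal sketch of this combination step is shown below, assuming toy vocabulary sizes and hidden dimension; the padding-token convention is illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

d_model = 8
speech_vocab, text_vocab = 100, 50
TEXT_PAD = 0  # illustrative padding token used to length-align the shorter stream

E_speech = nn.Embedding(speech_vocab, d_model)
E_text = nn.Embedding(text_vocab, d_model)

# One generation step t produces a (speech token, text token) pair.
s_t = torch.tensor([17])        # speech token at step t
t_t = torch.tensor([TEXT_PAD])  # text stream already finished -> padded

# Combined input embedding c_t = E_speech(s_t) + E_text(t_t), fed back to the LLM.
c_t = E_speech(s_t) + E_text(t_t)
print(c_t.shape)  # torch.Size([1, 8])
```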
4.2.3. Dual-Resolution Speech Representations (DRSR)
This is the key innovation, consisting of a grouping mechanism for input and a refinement head for output.
1. Speech Token Grouping (Input to LLM)
To resolve the frequency mismatch between high-rate speech tokens (25Hz) and low-rate text tokens (~3Hz), the model groups consecutive speech tokens. A grouping factor of $k = 5$ is used to compress a 25Hz speech token sequence into a 5Hz sequence of representations.
A group of $k$ speech token embeddings, $\mathbf{s}_{ik}, \dots, \mathbf{s}_{(i+1)k-1}$, are first concatenated and then passed through a linear layer to produce a single grouped representation $\mathbf{g}_i$: $ \mathbf{g}_i = \mathrm{Linear}\left( \big\Vert_{j=ik}^{(i+1)k-1} \mathbf{s}_j \right) \in \mathbb{R}^{d_{\mathrm{text}}} $ where:
- $\mathbf{s}_j$ is the embedding of the $j$-th speech token.
- $\Vert$ denotes the concatenation operation.
- $k$ is the grouping factor (set to 5).
- $\mathbf{g}_i$ is the $i$-th grouped speech representation, which has the same dimension as the text embeddings ($d_{\mathrm{text}}$).
This new sequence of grouped representations is what the main LLM body processes, effectively seeing speech at a 5Hz rate.
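The following is a minimal sketch of the grouping operation under assumed dimensions (d_speech and d_text are illustrative); it concatenates each block of k = 5 consecutive speech-token embeddings and projects the result to the text hidden size, as in the formula above.

```python
import torch
import torch.nn as nn

k = 5                 # grouping factor
d_speech, d_text = 64, 512
T = 25                # 1 second of 25 Hz speech tokens

speech_emb = torch.randn(T, d_speech)          # embeddings of 25 speech tokens
to_group = nn.Linear(k * d_speech, d_text)     # the Linear(...) in the formula above

# Concatenate each block of k consecutive embeddings, then project to d_text.
grouped = to_group(speech_emb.reshape(T // k, k * d_speech))
print(grouped.shape)  # torch.Size([5, 512]) -> 5 Hz representations seen by the LLM
```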
2. Speech Refined Head (SRH) and Ungrouping (Output from LLM)
While grouping is efficient, it loses fine-grained acoustic information necessary for high-quality speech generation. To recover this detail, the SRH is introduced. It takes the hidden state from the LLM and "un-groups" it to generate the original high-frequency speech tokens.
The process is as follows:
- The final hidden state from the shared LLM layer, $\mathbf{h}_{L}^{\mathrm{[SLLM]}}$, is passed through a linear projection $\mathbf{W}_{p}$: $ \mathbf{h}_{ug} = \mathbf{W}_{p} \mathbf{h}_{L}^{\mathrm{[SLLM]}} $
- This projected vector is then split into $k$ separate time-step embeddings: $ \mathbf{H} = \mathrm{Split}_k(\mathbf{h}_{ug}) = [\mathbf{h}_{ug}^{(1)}, \mathbf{h}_{ug}^{(2)}, \dots, \mathbf{h}_{ug}^{(k)}] $
- The SRH, a smaller autoregressive model, then uses these embeddings as conditional information to generate the original speech tokens one by one.
The SRH is trained to maximize the conditional probability of the speech tokens, using its own loss function: $ \mathcal{L}_{\mathrm{SRH}} = - \sum_{i=1}^T \log P(s_i \mid s_{<i}, \mathbf{H}) $ where $s_i$ is the $i$-th speech token to be predicted, conditioned on all previous speech tokens and the contextual embeddings derived from the LLM.
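Below is a minimal sketch of the projection-and-split ("un-grouping") step that conditions the SRH, with illustrative dimensions; the SRH decoder itself is abstracted away.

```python
import torch
import torch.nn as nn

k, d_llm, d_srh = 5, 512, 256          # illustrative sizes
W_p = nn.Linear(d_llm, k * d_srh)      # projection W_p

h_sllm = torch.randn(1, d_llm)         # final shared-LLM hidden state for one 5 Hz step
h_ug = W_p(h_sllm)                     # h_ug = W_p h^[SLLM]
H = h_ug.reshape(1, k, d_srh)          # Split_k: k per-token conditioning embeddings

# A small autoregressive SRH decoder (omitted here) would consume H and predict the
# k original 25 Hz speech tokens one by one, trained with the cross-entropy loss above.
print(H.shape)  # torch.Size([1, 5, 256])
```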
4.2.4. Overall Training Objective
The entire model is trained with a multi-task objective that combines the loss from the main text head ($\mathcal{L}_{\mathrm{TH}}$) and the speech refined head ($\mathcal{L}_{\mathrm{SRH}}$).
The text head loss is a standard autoregressive loss for predicting the next text token: $ \mathcal{L}_{\mathrm{TH}} = - \sum_{i=1}^T \log P(t_i \mid c_{<i}, \mathbf{g}) $ The total MLLM loss is a weighted sum of the two: $ \mathcal{L}_{\mathrm{MLLM}} = \lambda \mathcal{L}_{\mathrm{TH}} + \mu \mathcal{L}_{\mathrm{SRH}} $ where $\lambda$ and $\mu$ are hyperparameters to balance the two tasks (both set to 1 in the experiments).
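As a sketch, the combined objective can be computed as a weighted sum of two cross-entropy losses; the logits, targets, and vocabulary sizes below are dummy placeholders for illustration.

```python
import torch
import torch.nn.functional as F

lam, mu = 1.0, 1.0                      # lambda and mu (both 1 in the paper)
text_logits = torch.randn(10, 50)       # (text steps, text vocab), dummy values
text_targets = torch.randint(0, 50, (10,))
srh_logits = torch.randn(50, 100)       # (speech tokens, speech vocab), dummy values
srh_targets = torch.randint(0, 100, (50,))

loss_th = F.cross_entropy(text_logits, text_targets)    # L_TH
loss_srh = F.cross_entropy(srh_logits, srh_targets)     # L_SRH
loss_mllm = lam * loss_th + mu * loss_srh                # L_MLLM
```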
4.2.5. Training Strategy
- Initialization: To leverage existing knowledge, components are initialized with pre-trained weights: the speech encoder from Whisper-Large-v3, the shared LLM layers from Qwen2.5-Instruct, and the SRH from a pre-trained TTS model. The speech tokenizer/detokenizer from CosyVoice are used as-is and kept frozen.
- CoM-Mixing Training: The model is trained on a mixed dataset containing seven different interaction patterns (e.g., Speech-to-Text, Speech-to-Multimodal, etc., see Table 1). Some of these patterns follow a Chain-of-Modality (CoM) structure, where the model is prompted to first generate a text-only plan or transcription before producing the final multimodal output. This acts as a form of curriculum learning, guiding the model to structure its thoughts.
- Core-Cocktail Training: This two-stage strategy addresses the challenge of adding speech capabilities without damaging the base LLM's performance (known as "catastrophic forgetting").
  - Stage 1: The entire model is fine-tuned with a relatively high learning rate to quickly adapt the parameters for the new multimodal task.
  - Merging & Stage 2: The weights of the model from Stage 1 ($M_1$) are merged with the weights of the original, pre-trained base LLM ($M_0$) using an interpolation formula: $ M_r \gets \alpha M_1 + (1-\alpha)M_0 $ The merged model $M_r$ is then fine-tuned in Stage 2 with a small learning rate for stable and precise optimization. This step re-integrates the robust knowledge of the base LLM, mitigating performance degradation from the aggressive first stage. The paper notes using $\alpha = 0$ in its main experiment, which is an extreme case of preserving the base model's capabilities (see the weight-merging sketch below).
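A minimal sketch of the weight-interpolation step, assuming the Stage-1 model and the base LLM share identical parameter names; with alpha = 0 the merge simply returns the base weights, matching the "extreme case" noted above. The variable names are illustrative.

```python
import torch

def merge_weights(m0_state, m1_state, alpha):
    """Interpolate between the base LLM (M_0) and the Stage-1 model (M_1)."""
    return {
        name: alpha * m1_state[name] + (1.0 - alpha) * m0_state[name]
        for name in m0_state
    }

# Hypothetical usage (model objects are placeholders):
# merged = merge_weights(base_llm.state_dict(), stage1_model.state_dict(), alpha=0.0)
# model.load_state_dict(merged)   # then fine-tune with a small learning rate (Stage 2)
```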
5. Experimental Setup
5.1. Datasets
- SRH Pre-training: Approximately 100,000 hours of paired audio-text data were used to pre-train the Speech Refined Head.
- DrVoice Post-training: The main training data was created by synthesizing speech for ~3 billion text tokens using CosyVoice. From this, a curated dataset was selected based on the Word Error Rate (WER) of the synthesized speech:
  - ~26,000 hours for speech-to-speech conversation.
  - ~20,000 hours of user speech plus 1.3 billion assistant tokens for speech-to-text conversation.
- Real-world Data Augmentation: To improve robustness, ~10,000 hours of English Automatic Speech Recognition (ASR) data from various corpora were added, including Common Voice, MELD, LibriSpeech, SPGISpeech, and Voxpopuli.
- Evaluation Benchmarks: The model was evaluated on several standard benchmarks: OpenAudioBench, VoiceBench, UltraEval-Audio, and Big Bench Audio.
5.2. Evaluation Metrics
- G-Eval: Used for open-ended question-answering tasks (AlpacaEval, CommonEval). It leverages a powerful LLM like GPT-4 to score the quality of a generated response based on a given prompt and criteria, providing a score that aligns well with human judgment.
- Accuracy: The percentage of correctly answered questions. Used for multiple-choice or extractive QA tasks like SD-QA, MMSU, OpenBookQA, TriviaQA, and Web Q.
- Refusal Rate: Used for AdvBench. It measures the percentage of times the model refuses to comply with harmful or adversarial requests. A higher score is better, indicating greater safety alignment.
- UTMOS (Universal MOS Predictor):
- Conceptual Definition: A machine-learning-based objective metric that predicts the Mean Opinion Score (MOS) of synthesized speech. MOS is a subjective quality score, typically rated by humans on a scale of 1 to 5. UTMOS aims to automate this process, evaluating the overall perceptual quality of speech (naturalness, clarity, etc.).
- Mathematical Formula: UTMOS is a deep learning model itself, so it doesn't have a simple formula. It is a supervised model trained on large datasets of speech samples and their corresponding human-rated MOS scores.
- ASR-WER (Word Error Rate):
- Conceptual Definition: This metric measures the dissimilarity between the text generated by the model and a transcription of the speech generated by the model (obtained via an ASR system). It quantifies how well the generated speech aligns with the intended text content. A lower ASR-WER is better.
- Mathematical Formula: $ \text{WER} = \frac{S + D + I}{N} $
- Symbol Explanation:
    - $S$: The number of substitutions (words incorrectly transcribed).
    - $D$: The number of deletions (words missing from the transcription).
    - $I$: The number of insertions (words added to the transcription).
    - $N$: The total number of words in the reference text. (A minimal computation sketch follows this list.)
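A minimal sketch of how WER can be computed, using a standard Levenshtein alignment between a reference transcript and the ASR hypothesis of the generated speech; the helper function is illustrative, not the benchmark's official scorer.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimal edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])      # substitution or match
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion, insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ~= 0.167
```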
5.3. Baselines
The paper compares DrVoice against a comprehensive set of recent open-source audio language models:
- Text-Driven Models: MiniCPM-o 2.6 (8B), Qwen2.5-Omni (7B)
- Joint Speech-Text Models (Interleaved): GLM-4-Voice (9B), Baichuan-Omni-1.5 (7B), Step-Audio2-Mini (8B)
- Joint Speech-Text Models (Parallel): Kimi-Audio (7B)

This selection covers the main competing paradigms and provides a strong basis for comparison.
6. Results & Analysis
6.1. Core Results Analysis
The main results demonstrate DrVoice's strong performance and exceptional efficiency.
- Overall SOTA Performance: As shown in Table 2, DrVoice-7B achieves the highest overall scores on OpenAudioBench (69.24) and Big Bench Audio (66.3), significantly outperforming all other models, including the previous SOTA, Kimi-Audio. On VoiceBench, its score of 76.02 is highly competitive and nearly identical to the top score of 76.93. This balanced, top-tier performance across benchmarks for understanding, reasoning, and conversational ability validates the effectiveness of the proposed architecture.
- Computational Efficiency: The most striking result is DrVoice's efficiency. The frame-rate (FR In/Out) row in Table 2 and Table 3 shows that DrVoice's LLM backbone operates at a 5/5 frame rate (5 tokens per second for input and output). This is 2.5x to 5x lower than competing models that run at 12.5Hz or 25Hz. This translates to substantial reductions in computational requirements and latency, making the model more practical for real-world deployment.
- Speech Quality and Alignment: Despite its very low internal frame rate, DrVoice does not sacrifice output quality. Table 3 shows it achieves a UTMOS score of 4.29, indicating high-quality, natural-sounding speech, on par with models like Qwen2.5-Omni (4.28) and significantly better than Kimi-Audio (3.06). Its ASR-WER of 11.2 demonstrates good alignment between the generated speech and text, though it is higher than Qwen2.5-Omni's 3.48. The authors hypothesize this gap is because Qwen2.5-Omni's "Talker" module receives direct text input, ensuring perfect textual fidelity, whereas DrVoice's SRH only receives hidden states.
6.2. Data Presentation (Tables)
The following are the results from Table 2 of the original paper:
| Model | GLM4-Voice | MiniCPM-o 2.6 | Baichuan-Omni-1.5 | Qwen2.5-Omni | Kimi-Audio | Step-Audio2-Mini | DrVoice |
|---|---|---|---|---|---|---|---|
| FR (In/Out) | 12.5/12.5+T | 25/T | 12.5/12.5+T | 25/T | 12.5/12.5 | 12.5/25+T | 5/5 |
| OpenAudioBench (S2T) | | | | | | | |
| AlpacaEval | 57.89 | 64.10 | 77.90 | 72.76 | 75.73 | 59.60 | 78.34 |
| Llama Q. | 76.00 | 78.00 | 78.50 | 75.33 | 79.33 | 75.00 | 80.33 |
| Reasoning QA | 47.43 | 38.60 | 50.00 | 63.76 | 58.02 | 46.04 | 57.92 |
| TriviaQA | 51.80 | 63.00 | 57.20 | 57.06 | 62.10 | 57.70 | 61.50 |
| Web Q. | 55.40 | 69.20 | 59.10 | 62.80 | 70.20 | 65.10 | 68.10 |
| Overall | 57.70 | 62.58 | 64.54 | 66.34 | 69.08 | 60.69 | 69.24 |
| VoiceBench (S2T) | | | | | | | |
| AlpacaEval | 3.97 | 4.42 | 4.50 | 4.33 | 4.46 | 4.17 | 4.52 |
| CommonEval | 3.42 | 4.15 | 4.05 | 3.84 | 3.97 | 3.00 | 3.77 |
| SD-QA | 36.98 | 50.72 | 43.40 | 57.41 | 63.12 | 56.06 | 68.54 |
| MMSU | 39.75 | 54.78 | 57.25 | 56.38 | 62.17 | 52.18 | 60.31 |
| OpenBookQA | 53.41 | 78.02 | 74.51 | 79.12 | 83.52 | 64.18 | 79.56 |
| IFEval | 52.80 | 49.25 | 54.54 | 53.88 | 61.10 | 38.01 | 59.30 |
| AdvBench | 88.08 | 97.69 | 97.31 | 99.62 | 100.00 | 93.08 | 98.65 |
| Overall | 59.83 | 71.69 | 71.14 | 72.83 | 76.93 | 63.84 | 76.02 |
| UltraEval-Audio (S2S) | | | | | | | |
| AlpacaEval | 51.00 | 51.00 | 58.69 | 56.10 | 44.20 | 51.72 | 49.65 |
| Llama Q. | 50.00 | 61.00 | 67.33 | 66.30 | 57.33 | 67.67 | 68.00 |
| TriviaQA | 36.40 | 40.20 | 30.57 | 40.52 | 35.71 | 33.50 | 35.35 |
| Web Q. | 32.00 | 40.00 | 38.09 | 38.93 | 33.90 | 34.65 | 37.65 |
| Overall | 42.35 | 48.05 | 48.67 | 50.46 | 42.79 | 46.89 | 47.66 |
| Big Bench Audio (S2T & S2S) | | | | | | | |
| S2T | 44.8 | 56.2 | 47.1 | 54.2 | 59.4 | 50.9 | 71.6 |
| S2S | 42.7 | 55.4 | 44.6 | 53.6 | 51.0 | 47.5 | 60.9 |
| Overall | 43.8 | 55.8 | 45.8 | 53.9 | 55.2 | 49.2 | 66.3 |
The following are the results from Table 3 of the original paper:
| Model | FR(In/Out)↓ | UTMOS↑ | ASR-WER↓ |
|---|---|---|---|
| MiniCPM-o 2.6 (2025) | 25/T | 4.18 | 13.17 |
| Baichuan-Omni-1.5 (2025) | 12.5/12.5+T | 4.27 | 23.38 |
| Qwen2.5-Omni (2025) | 25/T | 4.28 | 3.48 |
| Kimi-Audio (2025) | 12.5/12.5 | 3.06 | 21.06 |
| Step-Audio2-mini (2025) | 12.5/25+T | 4.53 | 9.5 |
| DrVoice | 5/5 | 4.29 | 11.2 |
6.3. Ablation Studies / Parameter Analysis
The paper conducts thorough ablation studies to validate each component of DrVoice.
The following are the results from Table 4 of the original paper:
| Model | S2M (T/S) | S2T | T2M (T/S) | T2T | STC (T/S) | SAC (T/S) | SUC (T/S) |
|---|---|---|---|---|---|---|---|
| DrVoice-Small | 68.67 / 56.00 | 72.33 | 72.33 / 56.00 | 75.33 | 75.67 / 68.33 | 71.67 / 62.67 | 73.33 / 62.00 |
| w/o. CSE | 61.67 / 53.00 | 62.33 | 70.00 / 60.00 | 74.00 | 69.33 / 61.00 | 63.00 / 55.00 | 66.33 / 58.67 |
| w/o. SRH-Pretraining | 38.33 / 30.33 | 56.00 | 59.33 / 46.33 | 73.33 | 67.33 / 57.67 | 54.00 / 42.33 | 54.33 / 42.67 |
| w/o. SRH | 21.67 / 15.33 | 56.00 | 45.22 / 35.00 | 73.00 | 64.33 / 50.67 | 55.67 / 42.33 | 40.33 / 27.67 |
| w/o. CoM-Mixing | 58.00 / 49.00 | 58.00 | 69.33 / 55.00 | 68.33 | -/- | -/- | -/- |
- Continuous Speech Encoder (CSE): Removing the Whisper encoder (w/o. CSE) causes a significant performance drop on all tasks involving speech input. The S2T score falls from 72.33 to 62.33. This confirms that using a powerful, pre-trained speech encoder is crucial for audio understanding.
- Dual-Resolution (DRSR) - Grouping & SRH: The ablation for DRSR is multifaceted:
  - Grouping Factor: Appendix C (Table 7) shows that increasing the grouping factor from 1 (no grouping) to 5 dramatically improves performance on S2M (Speech-to-Multimodal) tasks from 4.00 to 37.67. This confirms that grouping is not just an efficiency trick but actively helps the model by aligning speech and text representation rates.
  - Speech Refined Head (SRH): Removing the SRH (w/o. SRH) devastates speech generation performance. The S2M (Text) score plummets from 38.33 to 21.67 (a 76.9% relative improvement when SRH is added). This highlights the necessity of the SRH for reconstructing high-quality speech from the LLM's coarse-grained internal representations.
- SRH-Pretraining: Removing the pre-training step for the SRH (w/o. SRH-Pretraining) also leads to a major drop in speech generation capabilities (e.g., the S2M score falls from 61.67 to 38.33 in the relevant comparison). This shows that initializing the SRH with a capable TTS model is vital for its effectiveness.
- CoM-Mixing Training: Removing this strategy (w/o. CoM-Mixing) hurts performance across the board, with the S2M score dropping from 68.67 to 58.00. Furthermore, the model without this training cannot perform the CoM-guided generation patterns (STC, SAC, SUC). This validates the strategy's role in improving both direct generation and enabling more complex, structured reasoning patterns.
- Core-Cocktail Training: Appendix C (Table 6) shows that after the aggressive Stage 1 training, the model's performance on text tasks drops significantly (from 81.77 to 70.19). However, after the merging and low-learning-rate Stage 2, the performance recovers substantially to 74.73, demonstrating the strategy's success in balancing adaptation and knowledge retention.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces DrVoice, a parallel speech-text conversational model that excels in both performance and efficiency. Its core contribution is the Dual-Resolution Speech Representation (DRSR) mechanism, which processes speech at a low 5Hz frequency inside the LLM and generates it at a high 25Hz frequency. This innovative design effectively solves the token rate mismatch between speech and text, leading to reduced computational costs and better utilization of the LLM's semantic capabilities. Complemented by novel training strategies like CoM-Mixing and Core-Cocktail, DrVoice-7B establishes itself as a leading open-source speech foundation model, achieving state-of-the-art results on several key audio understanding and reasoning benchmarks.
7.2. Limitations & Future Work
The authors identify three main areas for future improvement:
- Enhancing Speech Generation Quality: The model's speech-text alignment (ASR-WER) is weaker than some text-driven models. Future work will explore feeding text directly into the Speech Refined Head (SRH) alongside the hidden states to provide more explicit textual guidance and improve fidelity.
- Enabling Full-Duplex Interaction: To create more natural, interruptible conversations, the authors plan to investigate architectures that can process user speech input even while the model is generating its own response, inspired by time-division multiplexing approaches.
- Expanding to General Audio and Multimodality: The long-term vision is to extend the model beyond speech to comprehend and generate other audio types like music and environmental sounds, and eventually integrate the visual modality to build a truly comprehensive multimodal conversational AI.
7.3. Personal Insights & Critique
- Insights:
- The dual-resolution approach is an elegant and powerful concept. It addresses a fundamental engineering trade-off in multimodal AI: the tension between the high data rates of sensory inputs (like audio) and the more abstract, lower-rate processing of symbolic reasoning systems (like LLMs). This idea could be highly transferable to other domains, such as video processing, where a similar frequency mismatch exists between video frames and textual descriptions.
  - The Core-Cocktail training strategy is a very practical and insightful method for adapting powerful, pre-trained foundation models to new modalities. It provides a structured way to navigate the "learning rate dilemma" and mitigate catastrophic forgetting, which is a common and critical problem in transfer learning.
- Critique & Potential Issues:
  - Ambiguity in Core-Cocktail Implementation: The paper states that the interpolation weight $\alpha$ was set to 0. If taken literally, this would mean $M_r = M_0$, implying that the entire Stage 1 training was discarded and Stage 2 was simply a standard fine-tuning of the base model. This seems to contradict the "cocktail" or "merging" concept. It is possible this is a simplification in the text and a very small non-zero $\alpha$ was used, or that the benefit came from the optimizer's state being "warmed up" in Stage 1. This crucial detail is unclear and warrants further clarification, as it significantly impacts the interpretation of the strategy's effectiveness.
- Information Loss in Grouping: While the 5Hz grouping is highly efficient, there is an inherent risk of losing subtle but important paralinguistic cues in the speech signal (e.g., slight hesitations, rapid emotional shifts, nuanced intonation). The paper demonstrates strong benchmark performance, but a deeper analysis of what specific acoustic information is lost and how that might affect more nuanced tasks (like fine-grained emotion detection or sarcasm) would be valuable.
  - Dependency on External Modules: The model's performance is tightly coupled to the quality of the frozen S3Tokenizer and CosyVoice detokenizer. While this is a practical choice, it means that any limitations or biases in those external components are inherited by DrVoice. A fully end-to-end trained system without this dependency might offer more flexibility and potentially higher performance in the long run.