Abstract
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders use a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position-embedding approach named TMRoPE (Time-aligned Multimodal RoPE). To generate text and speech concurrently while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model responsible for text generation, while Talker is a dual-track autoregressive model that directly consumes the hidden representations of the Thinker to produce audio tokens. Both the Thinker and the Talker are designed to be trained and run in an end-to-end manner. To decode audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial packet delay. Qwen2.5-Omni is comparable to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as OmniBench. Notably, Qwen2.5-Omni's performance on end-to-end speech instruction following is comparable to its performance on text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
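The abstract describes interleaving audio and video tokens chunk by chunk while giving co-occurring tokens a shared temporal position (the core idea behind TMRoPE). Below is a minimal, illustrative Python sketch of that interleaving, assuming hypothetical chunk sizes and token counts; the function and field names are invented for illustration and are not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    modality: str   # "audio" or "video"
    index: int      # token index within its own modality stream
    t_pos: int      # temporal position id shared across modalities (time-aligned)


def interleave_time_aligned(num_audio: int, num_video: int,
                            audio_tokens_per_chunk: int = 50,
                            video_tokens_per_chunk: int = 4) -> List[Token]:
    """Interleave audio and video tokens block by block, assigning both
    modalities the same temporal position id within each block so that
    tokens from the same time window share one time axis.
    (Chunk sizes here are illustrative assumptions, not the paper's values.)"""
    tokens: List[Token] = []
    a, v, chunk = 0, 0, 0
    while a < num_audio or v < num_video:
        t_pos = chunk  # one temporal position per block, shared by both modalities
        for _ in range(video_tokens_per_chunk):
            if v < num_video:
                tokens.append(Token("video", v, t_pos))
                v += 1
        for _ in range(audio_tokens_per_chunk):
            if a < num_audio:
                tokens.append(Token("audio", a, t_pos))
                a += 1
        chunk += 1
    return tokens


if __name__ == "__main__":
    seq = interleave_time_aligned(num_audio=150, num_video=12)
    print(seq[:6])  # first block: video tokens followed by audio tokens, same t_pos
```

In this toy version the block boundary stands in for a fixed-duration window (e.g., a few seconds of input); the actual model additionally applies rotary position embeddings over these time-aligned ids.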
URL
https://arxiv.org/abs/2503.20215