Qwen2.5-Omni Technical Report

2025-03-26 04:17:55
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin

Abstract

In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable streaming of multimodal inputs, both the audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. To decode audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial packet delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as OmniBench. Notably, Qwen2.5-Omni's performance on end-to-end speech instruction following is comparable to its performance on text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
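As a rough illustration of the time-aligned interleaving behind TMRoPE, the sketch below merges audio and video tokens into one sequence in fixed-length time chunks and assigns each token a temporal position index derived from its timestamp. The chunk length, the per-index time granularity, and the function names are assumptions for illustration, not details taken from the report.

```python
def interleave_av(audio, video, chunk_s=2.0, step_s=0.04):
    """Hypothetical sketch of TMRoPE-style interleaving.

    audio, video: time-sorted lists of (timestamp_seconds, token) pairs.
    Returns one interleaved token sequence plus a temporal position ID per
    token, where IDs from both streams are aligned to absolute time
    (here: one ID per step_s seconds, an assumed granularity).
    """
    end = max((t for t, _ in audio + video), default=0.0)
    sequence, temporal_ids = [], []
    t0 = 0.0
    while t0 <= end:
        t1 = t0 + chunk_s
        # Within each chunk, visual tokens precede audio tokens; tokens that
        # co-occur in time receive the same temporal position index.
        for stream in (video, audio):
            for t, tok in stream:
                if t0 <= t < t1:
                    sequence.append(tok)
                    temporal_ids.append(int(t / step_s))
        t0 = t1
    return sequence, temporal_ids
```

Under these assumed settings, a video frame at t = 2.0 s and the audio sampled at the same instant both receive temporal ID 50, which is the kind of cross-modal alignment the abstract's position embedding encodes.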

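To make the Thinker-Talker split above concrete, here is a minimal, hedged sketch in which a stand-in Talker conditions on the Thinker's hidden states rather than on its decoded text. Every module, dimension, and codebook size below is an illustrative assumption, not the report's actual architecture.

```python
import torch
import torch.nn as nn

class TalkerStub(nn.Module):
    """Toy dual-track step: consume one Thinker hidden state, emit logits
    over audio tokens. A GRU cell stands in for the real AR decoder."""

    def __init__(self, thinker_dim=3584, talker_dim=1024, n_audio_codes=8192):
        super().__init__()
        self.proj = nn.Linear(thinker_dim, talker_dim)    # adapt Thinker states
        self.cell = nn.GRUCell(talker_dim, talker_dim)    # stand-in AR decoder
        self.head = nn.Linear(talker_dim, n_audio_codes)  # audio-token logits

    def forward(self, thinker_hidden, state):
        # Conditioning on hidden representations (not on sampled text) lets
        # the Talker start emitting audio tokens while the Thinker is still
        # decoding, which is what enables the streaming behavior described.
        x = self.proj(thinker_hidden)
        state = self.cell(x, state)
        return self.head(state), state

talker = TalkerStub()
state = torch.zeros(1, 1024)
for _ in range(3):  # pretend the Thinker streams three hidden states
    hidden = torch.randn(1, 3584)
    logits, state = talker(hidden, state)
    audio_token = logits.argmax(dim=-1)  # greedy pick, demo only
```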
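Finally, a small sketch of the sliding-window idea mentioned for the streaming DiT decoder: each block of audio codes attends only to a bounded window of neighboring blocks, so the first packet of audio can be synthesized before the full token sequence exists. The window and block sizes here are illustrative guesses, not the report's settings.

```python
import numpy as np

def sliding_window_mask(num_blocks, block_size, lookback=2, lookahead=1):
    """Boolean (T, T) attention mask restricting each code block to a fixed
    window of blocks (True = attention allowed). Bounding the receptive
    field this way bounds how much future audio each block must wait for."""
    T = num_blocks * block_size
    mask = np.zeros((T, T), dtype=bool)
    for b in range(num_blocks):
        lo = max(0, b - lookback) * block_size
        hi = min(num_blocks, b + lookahead + 1) * block_size
        mask[b * block_size:(b + 1) * block_size, lo:hi] = True
    return mask

# With 4 blocks of 3 tokens, block 0 attends to blocks 0-1 only, so its
# audio can be decoded after just one block of lookahead arrives.
print(sliding_window_mask(4, 3).astype(int))
```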

URL

https://arxiv.org/abs/2503.20215

PDF

https://arxiv.org/pdf/2503.20215.pdf
