We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, ours is training-free and does not introduce embeddings from a CLIP/BERT text encoder; instead, we perform text-to-text search directly with MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly on complex RES tasks. The code will be released.
https://arxiv.org/abs/2601.09575
This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) to Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstruction, V-JEPAs learn by predicting the embeddings of masked regions from the embeddings of unmasked regions. This allows the trained encoder to avoid capturing irrelevant information about a given video, such as the color of a region of background pixels. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at this https URL.
https://arxiv.org/abs/2601.09524
The recent surge of open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models and forces researchers to spend considerable effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some of the underlying techniques have been published in academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing audio-based MLLMs forward through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.
https://arxiv.org/abs/2601.09385
Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to the memory constraints of dense frame processing. Existing Video Moment Retrieval (VMR) methods rely on sparse frame sampling, risking information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that improves memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation shows that SMORE achieves state-of-the-art performance on the QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
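As a rough illustration of the query-aware compression described above, the sketch below keeps only the highest-scoring frames under a fixed budget; the scores, function name, and numbers are hypothetical stand-ins for SMORE's learned importance modulation, not its actual implementation.

```python
# Hypothetical sketch of query-aware frame selection under a memory budget.
# `scores` stand in for SMORE's learned query-aware importance values.

def select_frames(scores, budget):
    """Keep the indices of the `budget` highest-scoring frames.

    scores: per-frame relevance scores (higher = more relevant to the query).
    budget: maximum number of frames the memory budget allows.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # restore temporal order for the decoder

# Toy example: 6 frames, query relevance peaks at frames 2 and 4.
kept = select_frames([0.1, 0.3, 0.9, 0.2, 0.8, 0.05], budget=3)
print(kept)  # [1, 2, 4]
```

Re-sorting the surviving indices matters: the decoder still sees frames in temporal order, only fewer of them.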
https://arxiv.org/abs/2601.09350
In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge is hidden not only in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but lack image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address these issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) for ICT and constructs a standard evaluation system to validate its performance. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering examples for instruction-based SFT. Experimental results indicate that our DICModel, with only 7B parameters, performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On objective questions constructed by ICT domain experts, DICModel outperforms Qwen2.5-VL 32B by 1% in accuracy. In summary, this work efficiently and accurately extracts logical text from images, which is expected to promote the development of multimodal models in the ICT domain.
https://arxiv.org/abs/2601.09298
With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on the text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through an evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: this https URL
https://arxiv.org/abs/2601.09270
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such an opaque process offers no reliable basis for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce M3CoTBench, a new benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at this https URL.
https://arxiv.org/abs/2601.08758
We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe for deriving the Ministral 3 models through Cascade Distillation, a technique that iterates pruning and continued training with distillation. Each model comes with image understanding capabilities, all under the Apache 2.0 license.
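The abstract gives no implementation details, but the iterate-pruning-then-continue-training loop can be caricatured as follows; `cascade_distill`, the magnitude-based pruning criterion, the prune fraction, and the toy "weights" are all our own illustrative assumptions, not the paper's recipe.

```python
# Illustrative caricature of a prune-then-continue-training loop; continued
# training with a distillation loss is left as a placeholder comment.

def cascade_distill(params, target, prune_frac=0.25):
    """Iteratively drop the smallest-magnitude parameters until `target` remain."""
    while len(params) > target:
        keep = max(int(len(params) * (1 - prune_frac)), target)
        params = sorted(params, key=abs, reverse=True)[:keep]
        # a real pipeline would now run continued training with a
        # distillation loss against the larger teacher model
    return params

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.3, 0.02, -0.6]
print(cascade_distill(weights, target=4))  # [0.9, -0.7, -0.6, 0.4]
```

The point of the cascade is that each pruning step is small and followed by recovery training, rather than cutting from 14B to 3B in one shot.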
https://arxiv.org/abs/2601.08584
The demand for real-time visual understanding and interaction in complex scenarios is increasingly critical for unmanned aerial vehicles (UAVs). However, a significant challenge arises from the contradiction between the high computational cost of large Vision-Language Models and the limited computing resources available on UAV edge devices. To address this challenge, this paper proposes a lightweight multimodal task platform based on BLIP-2, integrated with the YOLO-World and YOLOv8-Seg models. This integration extends the multi-task capabilities of BLIP-2 for UAV applications with minimal adaptation and without requiring task-specific fine-tuning on drone data. Firstly, the deep integration of BLIP-2 with the YOLO models enables it to leverage YOLO's precise perceptual results for fundamental tasks such as object detection and instance segmentation, thereby facilitating deeper visual-attention understanding and reasoning. Secondly, a content-aware key frame sampling mechanism based on K-Means clustering is designed, incorporating intelligent frame selection and temporal feature concatenation. This equips the lightweight BLIP-2 architecture to handle video-level interactive tasks effectively. Thirdly, a unified prompt optimization scheme for multi-task adaptation is implemented. This scheme strategically injects structured event logs from the YOLO models as contextual information into BLIP-2's input. Combined with output constraints designed to filter out technical details, this approach effectively guides the model to generate accurate and contextually relevant outputs for various tasks.
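A minimal sketch of K-Means-based key frame sampling as described: cluster per-frame descriptors and keep the frame nearest each centroid. Real systems cluster high-dimensional visual features; the scalar values and the tiny K-Means below are illustrative simplifications.

```python
# Toy content-aware key frame sampling via K-Means on scalar frame descriptors.

def kmeans_keyframes(features, k, iters=10):
    """features: one scalar descriptor per frame; returns key frame indices."""
    # initialise centroids on evenly spaced frames
    centroids = [features[i * len(features) // k] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, f in enumerate(features):
            j = min(range(k), key=lambda c: abs(f - centroids[c]))
            groups[j].append(i)
        for j, g in enumerate(groups):
            if g:
                centroids[j] = sum(features[i] for i in g) / len(g)
    # one representative per cluster: the frame closest to its centroid
    return sorted(min(g, key=lambda i: abs(features[i] - centroids[j]))
                  for j, g in enumerate(groups) if g)

print(kmeans_keyframes([0.1, 0.12, 0.11, 0.9, 0.88, 0.5], 2))  # [1, 3]
```

Picking the frame nearest each centroid (rather than the centroid itself) keeps the sampled frames real, which matters when they are fed back into a captioning model.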
https://arxiv.org/abs/2601.08408
With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing via natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by the VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation in both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute UM-DATA-200K, a large-scale visual text image dataset covering diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
https://arxiv.org/abs/2601.08321
Human Activity Recognition (HAR) in smart homes is critical for health monitoring and assistive living. While vision-based systems are common, they face privacy concerns and environmental limitations (e.g., occlusion). In this work, we present MobiDiary, a framework that generates natural language descriptions of daily activities directly from heterogeneous physical signals (specifically IMU and Wi-Fi). Unlike conventional approaches that restrict outputs to pre-defined labels, MobiDiary produces expressive, human-readable summaries. To bridge the semantic gap between continuous, noisy physical signals and discrete linguistic descriptions, we propose a unified sensor encoder. Instead of relying on modality-specific engineering, we exploit the shared inductive biases of motion-induced signals--where both inertial and wireless data reflect underlying kinematic dynamics. Specifically, our encoder utilizes a patch-based mechanism to capture local temporal correlations and integrates heterogeneous placement embedding to unify spatial contexts across different sensors. These unified signal tokens are then fed into a Transformer-based decoder, which employs an autoregressive mechanism to generate coherent action descriptions word-by-word. We comprehensively evaluate our approach on multiple public benchmarks (XRF V2, UWash, and WiFiTAD). Experimental results demonstrate that MobiDiary effectively generalizes across modalities, achieving state-of-the-art performance on captioning metrics (e.g., BLEU@4, CIDEr, RMC) and outperforming specialized baselines in continuous action understanding.
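The patch-based mechanism can be illustrated with a toy tokenizer that splits a 1-D sensor stream into fixed-length patches; the patch length and data are arbitrary choices, and the real encoder operates on multi-channel features combined with placement embeddings.

```python
# Toy patch tokenizer for a 1-D sensor stream: fixed-length, non-overlapping
# patches capture local temporal correlations before encoding.

def patchify(signal, patch_len):
    """Split a 1-D signal into non-overlapping patches, dropping any remainder."""
    return [signal[i:i + patch_len]
            for i in range(0, len(signal) - patch_len + 1, patch_len)]

imu = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(patchify(imu, 3))  # [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]]
```

Each patch becomes one token for the Transformer decoder, so patch length directly trades temporal resolution against sequence length.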
https://arxiv.org/abs/2601.08204
Explainable video anomaly detection (VAD) is crucial for safety-critical applications, yet even with recent progress, much of the research still lacks spatial grounding, making the explanations unverifiable. This limitation is especially pronounced in multi-entity interactions, where existing explainable VAD methods often produce incomplete or visually misaligned descriptions, reducing their trustworthiness. To address these challenges, we introduce instance-aligned captions that link each textual claim to specific object instances with appearance and motion attributes. Our framework captures who caused the anomaly, what each entity was doing, whom it affected, and where the explanation is grounded, enabling verifiable and actionable reasoning. We annotate eight widely used VAD benchmarks and extend the 360-degree egocentric dataset, VIEW360, with 868 additional videos, eight locations, and four new anomaly types, creating VIEW360+, a comprehensive testbed for explainable VAD. Experiments show that our instance-level spatially grounded captions reveal significant limitations in current LLM- and VLM-based methods while providing a robust benchmark for future research in trustworthy and interpretable anomaly detection.
https://arxiv.org/abs/2601.08155
Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
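As a hedged sketch of what a noise-aware gate might look like, the snippet below down-weights each token feature by a sigmoid of its noise score; the scalar features and this particular gating form are our assumptions, not the paper's actual module.

```python
# Assumed form of a noise-aware gate: weight each token feature by
# sigmoid(-noise), so tokens from noisy phrasings are suppressed before
# they reach the detection queries.
import math

def gated_fusion(token_feats, noise_scores):
    """token_feats: per-token scalar features; noise_scores: higher = noisier."""
    gates = [1.0 / (1.0 + math.exp(s)) for s in noise_scores]  # sigmoid(-s)
    return [g * f for g, f in zip(gates, token_feats)]

print(gated_fusion([1.0, 2.0], [0.0, 0.0]))  # neutral scores halve each feature
```

A large noise score drives its gate toward zero, so that token contributes almost nothing to the fused representation.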
https://arxiv.org/abs/2601.08026
Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
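The trajectory-aggregation idea can be approximated, in its simplest form, by majority voting over the final answers of sampled runs; CASHEW's actual aggregation also verifies intermediate steps against visual evidence and merges trajectories, which this toy sketch omits.

```python
# Simplest possible aggregation of sampled reasoning trajectories:
# majority vote over final answers, plus an agreement rate as a rough
# stability signal.
from collections import Counter

def aggregate_answers(trajectories):
    """trajectories: list of (reasoning_steps, final_answer) pairs."""
    votes = Counter(answer for _, answer in trajectories)
    best, count = votes.most_common(1)[0]
    return best, count / len(trajectories)  # winning answer and agreement rate

runs = [(["look", "count"], "B"), (["look"], "B"), (["guess"], "C"), (["scan"], "B")]
answer, agreement = aggregate_answers(runs)
print(answer, agreement)  # B 0.75
```

The agreement rate is exactly the instability the abstract describes: repeated sampling over the same input yielding divergent predictions shows up here as low agreement.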
https://arxiv.org/abs/2601.08010
Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
https://arxiv.org/abs/2601.07761
Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories--capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations and a Chapter Summary that composes them into concise, story-centric summaries. Rather than prompting for chapters directly, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built upon these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens than existing methods.
https://arxiv.org/abs/2601.07366
6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the needs of ever-changing object sets in such contexts, modern zero-shot object pose estimators were developed that require no object-specific training and rely only on CAD models. Such models are hard to obtain once deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the Region-of-Interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments, we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability to automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation when the exact instance is not available and show that OSCAR achieves an average precision of 90.48% during object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be used for pose estimation with Megapose, achieving better results than a reconstruction-based approach.
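OSCAR's two-stage retrieval can be sketched with toy embeddings: a CLIP-style text filter narrows the database, then a DINOv2-style image comparison picks the winner. The 2-D vectors, database entries, and `top_k` value below are purely illustrative; real embeddings are high-dimensional.

```python
# Two-stage retrieval sketch: text filtering, then image-based refinement.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_text, query_image, db, top_k=2):
    """db: list of (name, text_emb, image_emb) entries."""
    # stage 1: text-based filtering keeps the top-k candidates
    candidates = sorted(db, key=lambda e: cosine(query_text, e[1]),
                        reverse=True)[:top_k]
    # stage 2: image-based refinement picks the visually closest candidate
    best = max(candidates, key=lambda e: cosine(query_image, e[2]))
    return best[0]

db = [("mug",    [1.0, 0.1], [0.9, 0.2]),
      ("bottle", [0.9, 0.3], [0.1, 0.9]),
      ("chair",  [0.0, 1.0], [0.5, 0.5])]
print(retrieve([1.0, 0.2], [0.8, 0.3], db))  # mug
```

The cheap text stage prunes the unlabeled database so the more expensive image comparison only runs on a handful of candidates.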
https://arxiv.org/abs/2601.07333
While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges, including complex inter-relationships between images and critical information scattered across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with a Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.
https://arxiv.org/abs/2601.07298
This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.
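The R1@0.7 figure cited for Charades-STA can be made concrete: a query counts as a hit when the top-1 predicted segment overlaps the ground-truth segment with temporal IoU of at least 0.7, and R1@0.7 is the fraction of hits. A minimal sketch (function names are ours):

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # For two overlapping intervals the union equals their hull; when
    # they are disjoint the intersection is 0, so the result is 0 anyway.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold=0.7):
    """R1@threshold: share of queries whose top-1 predicted segment
    reaches the IoU threshold against the ground truth."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a prediction of (5, 15) against a ground truth of (0, 10) has IoU 1/3 and misses at the 0.7 threshold, while an exact match has IoU 1.0 and hits.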
https://arxiv.org/abs/2601.07290
Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of Large Multimodal Models (LMMs), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.
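The image formation that makes SIRR hard is commonly approximated by a linear mixture I = t·T + r·R, where T is the transmitted background layer and R the reflected layer. A minimal sketch with illustrative mixing coefficients (the paper's path tracer models glass far more faithfully, with view-dependent reflectance rather than fixed scalars):

```python
import numpy as np

def composite_glass(transmission, reflection, t=0.85, r=0.15):
    """Blend a transmitted (background) layer and a reflected layer as a
    crude linear approximation of light seen through glass: I = t*T + r*R.
    `t` and `r` are illustrative transmittance/reflectance factors.
    Both inputs are float images in [0, 1] with identical shapes."""
    mixed = t * transmission + r * reflection
    return np.clip(mixed, 0.0, 1.0)
```

Separating such a composite back into T and R is exactly the inverse problem the dataset is designed to supervise; generating the mixture synthetically means both ground-truth layers are available for free.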
https://arxiv.org/abs/2601.07209