Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multi-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50%, and that accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.
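A minimal sketch of how such a multi-image benchmark could be scored, assuming a simple item schema (list of frame paths, question, lettered options, gold key) and an ask_model callable; neither is the released MedFrameQA format.

def evaluate_multi_image_vqa(items, ask_model):
    # items: dicts with 2-5 image paths, a question, lettered options, and a gold key
    correct = 0
    for item in items:
        prediction = ask_model(images=item["images"],
                               question=item["question"],
                               options=item["options"])
        correct += int(prediction.strip().upper() == item["answer"])
    return correct / len(items)

# Dummy usage with a stand-in model:
demo_items = [{
    "images": ["frame_012.png", "frame_047.png"],
    "question": "Which finding progresses across the frames?",
    "options": {"A": "Pleural effusion", "B": "Pulmonary nodule"},
    "answer": "A",
}]
print(evaluate_multi_image_vqa(demo_items, ask_model=lambda **kw: "A"))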
https://arxiv.org/abs/2505.16964
Recent advances in scene-based video generation have enabled systems to synthesize coherent visual narratives from structured prompts. However, a crucial dimension of storytelling -- character-driven dialogue and speech -- remains underexplored. In this paper, we present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue, enriching visual storytelling with natural voice and character expression. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. While a story generation model such as Text2Story generates the corresponding visual scene, we focus on generating expressive character utterances from these prompts and the scene image. We apply a pretrained vision-language encoder to extract a high-level semantic feature from the representative frame, capturing salient visual context. This feature is then combined with the structured prompts and used to guide a large language model in synthesizing natural, character-consistent dialogue. To ensure contextual consistency across scenes, we introduce a Recursive Narrative Bank that conditions each dialogue generation on the accumulated dialogue history from prior scenes. This approach enables characters to speak in ways that reflect their evolving goals and interactions throughout a story. Finally, we render each utterance as expressive, character-consistent speech, resulting in fully-voiced video narratives. Our framework requires no additional training and demonstrates applicability across a variety of story settings, from fantasy adventures to slice-of-life episodes.
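A minimal sketch of the Recursive Narrative Bank idea: each scene's dialogue is generated conditioned on all previously produced utterances. The generate callable and the caption string standing in for the vision-language feature are placeholders, not the paper's API.

def build_dialogue(scenes, generate):
    narrative_bank = []                       # accumulated dialogue history
    for setting_prompt, action_prompt, scene_caption in scenes:
        history = "\n".join(narrative_bank)
        prompt = (
            f"Story so far:\n{history}\n\n"
            f"Scene: {setting_prompt}\nAction: {action_prompt}\n"
            f"Visual context: {scene_caption}\n"
            "Write the character's next line, consistent with their goals:"
        )
        utterance = generate(prompt)
        narrative_bank.append(utterance)      # next scene sees this line too
    return narrative_bank

# Example with a dummy generator:
lines = build_dialogue(
    [("a moonlit forest", "the knight kneels by a stream", "knight at water's edge")],
    generate=lambda p: "I have come too far to turn back now.",
)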
https://arxiv.org/abs/2505.16819
High-quality machine translation systems based on large language models (LLMs) have simplified the production of personalized translations reflecting specific stylistic constraints. However, these systems still struggle in settings where stylistic requirements are less explicit and might be harder to convey via prompting. We explore various strategies for personalizing LLM-generated translations in low-resource settings, focusing on the challenging literary translation domain. We explore prompting strategies and inference-time interventions for steering model generations towards a personalized style, and propose a contrastive framework that exploits latent concepts extracted from sparse autoencoders to identify salient personalization properties. Our results show that steering achieves strong personalization while preserving translation quality. We further examine the impact of steering on LLM representations, finding that the layers most relevant for personalization are affected similarly by multi-shot prompting and by our steering method, suggesting that similar mechanisms are at play.
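A minimal sketch of the inference-time intervention flavor described above, assuming a style direction has already been extracted offline (e.g., from sparse-autoencoder latents); the hook, layer choice, and scale are illustrative, not the authors' implementation.

import torch
import torch.nn as nn

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float = 4.0):
    """Register a forward hook that shifts the layer's output along `direction`."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Toy demonstration on a stand-in "layer"; with a real LLM the hook would be
# attached to a mid-depth transformer block instead.
layer = nn.Linear(16, 16)
style_direction = torch.randn(16)          # stand-in for an SAE-derived style concept
handle = add_steering_hook(layer, style_direction, alpha=2.0)
out = layer(torch.randn(1, 16))            # output is now shifted toward the style
handle.remove()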
https://arxiv.org/abs/2505.16612
According to the EPA, only 25% of waste is recycled, and just 60% of U.S. municipalities offer curbside recycling. Plastics fare worse, with a recycling rate of only 8%; an additional 16% is incinerated, while the remaining 76% ends up in landfills. The low plastic recycling rate stems from contamination, poor economic incentives, and technical difficulties, making efficient recycling a challenge. To improve recovery, automated sorting plays a critical role. Companies like AMP Robotics and Greyparrot utilize optical systems for sorting, while Materials Recovery Facilities (MRFs) employ Near-Infrared (NIR) sensors to detect plastic types. Modern optical sorting uses advances in computer vision such as object recognition and instance segmentation, powered by machine learning. Two-stage detectors like Mask R-CNN use region proposals and classification with deep backbones like ResNet. Single-stage detectors like YOLO handle detection in one pass, trading some accuracy for speed. While such methods excel under ideal conditions with a large volume of labeled training data, challenges arise in realistic scenarios, emphasizing the need to further examine the efficacy of optical detection for automated sorting. In this study, we compiled novel datasets totaling 20,000+ images from varied sources. Using both public and custom machine learning pipelines, we assessed the capabilities and limitations of optical recognition for sorting. Grad-CAM, saliency maps, and confusion matrices were employed to interpret the behavior of the models we custom-trained on the compiled datasets. We conclude that optical recognition methods have limited success in accurately sorting real-world plastics at MRFs, primarily because they rely on physical properties such as color and shape.
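A hedged sketch of the kind of Grad-CAM inspection used here: highlight which image regions drive a classifier's prediction. The ResNet backbone and layer choice are assumptions; in practice the custom-trained sorter would be loaded instead of a random-weight model.

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()       # stand-in; load the trained sorter in practice
activations, gradients = {}, {}

target_layer = model.layer4[-1]
target_layer.register_forward_hook(lambda m, i, o: activations.update(feat=o))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(grad=go[0]))

x = torch.randn(1, 3, 224, 224, requires_grad=True)   # stand-in for a waste image
logits = model(x)
logits[0, logits.argmax()].backward()                  # backprop the top class score

weights = gradients["grad"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled grads
cam = F.relu((weights * activations["feat"]).sum(dim=1))     # weighted sum over channels
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalized saliency map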
https://arxiv.org/abs/2505.16513
Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods -- dividing the video into clips and processing each clip via a full-scale expert encoder -- is challenging to scale due to the prohibitive computational cost of processing the large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computational efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from the sidekick and expert encoders, which exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at this https URL.
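A minimal sketch of the delegate-and-conquer idea, assuming a cheap sidekick encoder scores every clip and an expensive expert encoder only runs on the top-scoring ones; both encoders and the 30% budget below are stand-ins.

import torch
import torch.nn as nn

sidekick = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))  # cheap scorer
expert = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))  # costly encoder

clips = torch.randn(1000, 512)               # e.g., pre-pooled features for 1000 clips
with torch.no_grad():
    saliency = sidekick(clips).squeeze(-1)   # one relevance score per clip
    keep = saliency.topk(k=int(0.3 * len(clips))).indices  # expert budget: ~30% of clips
    expert_feats = expert(clips[keep])       # expert runs only on selected clips

# Downstream grounding would fuse the dense sidekick features with the sparse
# expert_feats, as DeCaf-Grounder does across temporal scales.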
https://arxiv.org/abs/2505.16376
The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications. Code will be available at this https URL.
https://arxiv.org/abs/2505.15755
While eXplainable AI (XAI) has advanced significantly, few methods address interpretability in embedded vector spaces where dimensions represent complex abstractions. We introduce Distance Explainer, a novel method for generating local, post-hoc explanations of embedded spaces in machine learning models. Our approach adapts saliency-based techniques from RISE to explain the distance between two embedded data points by assigning attribution values through selective masking and distance-ranked mask filtering. We evaluate Distance Explainer on cross-modal embeddings (image-image and image-caption pairs) using established XAI metrics including Faithfulness, Sensitivity/Robustness, and Randomization. Experiments with ImageNet and CLIP models demonstrate that our method effectively identifies features contributing to similarity or dissimilarity between embedded data points while maintaining high robustness and consistency. We also explore how parameter tuning, particularly mask quantity and selection strategy, affects explanation quality. This work addresses a critical gap in XAI research and enhances transparency and trustworthiness in deep learning applications utilizing embedded spaces.
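A sketch of RISE-style attribution applied to a distance between two embeddings: randomly mask one input, re-embed it, and accumulate per-pixel credit according to how much the distance to the reference embedding changes. The embed_fn and mask parameters are placeholders for whatever embedding model is being explained; height and width are assumed divisible by the cell size.

import numpy as np

def explain_distance(image, reference_emb, embed_fn, n_masks=500, p_keep=0.5, cell=8):
    h, w = image.shape[:2]
    base = np.linalg.norm(embed_fn(image) - reference_emb)
    attribution = np.zeros((h, w))
    total = np.zeros((h, w)) + 1e-8
    rng = np.random.default_rng(0)
    for _ in range(n_masks):
        grid = rng.random((cell, cell)) < p_keep               # coarse random mask
        mask = np.kron(grid, np.ones((h // cell, w // cell)))  # upsample to image size
        d = np.linalg.norm(embed_fn(image * mask[..., None]) - reference_emb)
        attribution += (d - base) * mask                       # credit visible regions
        total += mask
    return attribution / total          # positive values: region pushes the pair apart

# Dummy usage so the sketch runs end-to-end (a trivial stand-in "embedding"):
img = np.random.rand(64, 64, 3)
ref = np.random.rand(64, 64, 3)
embed_fn = lambda im: im.mean(axis=(0, 1))
attr = explain_distance(img, embed_fn(ref), embed_fn, n_masks=50)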
https://arxiv.org/abs/2505.15516
Deep neural network (DNN)-based policy models, such as vision-language-action (VLA) models, excel at automating complex decision-making from multi-modal inputs. However, scaling these models greatly increases computational overhead, complicating deployment in resource-constrained settings like robot manipulation and autonomous driving. To address this, we propose Saliency-Aware Quantized Imitation Learning (SQIL), which combines quantization-aware training with a selective loss-weighting strategy for mission-critical states. By identifying these states via saliency scores and emphasizing them in the training loss, SQIL preserves decision fidelity under low-bit precision. We validate SQIL's generalization capability across extensive simulation benchmarks with environment variations, real-world tasks, and cross-domain tasks (self-driving, physics simulation), consistently recovering full-precision performance. Notably, a 4-bit weight-quantized VLA model for robotic manipulation achieves up to 2.5x speedup and 2.5x energy savings on an edge GPU with minimal accuracy loss. These results underline SQIL's potential for efficiently deploying large IL-based policy models on resource-limited devices.
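A minimal sketch of the selective loss-weighting idea: states flagged as mission-critical by a saliency score contribute more to the imitation loss during quantization-aware training. The saliency source, threshold, and boost factor are illustrative assumptions.

import torch
import torch.nn.functional as F

def weighted_bc_loss(pred_actions, expert_actions, saliency, boost=3.0, thresh=0.8):
    per_state = F.mse_loss(pred_actions, expert_actions, reduction="none").mean(dim=-1)
    weights = 1.0 + (boost - 1.0) * (saliency > thresh).float()  # emphasize critical states
    return (weights * per_state).mean()

pred = torch.randn(64, 7)        # e.g., 7-DoF actions from the quantized policy
expert = torch.randn(64, 7)      # expert demonstrations
scores = torch.rand(64)          # per-state saliency scores in [0, 1]
loss = weighted_bc_loss(pred, expert, scores)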
https://arxiv.org/abs/2505.15304
Learned Sparse Retrieval (LSR) models encode text as weighted term vectors, which need to be sparse to leverage inverted index structures during retrieval. SPLADE, the most popular LSR model, uses FLOPS regularization to encourage vector sparsity during training. However, FLOPS regularization does not ensure sparsity among terms - only within a given query or document. Terms with very high Document Frequencies (DFs) substantially increase latency in production retrieval engines, such as Apache Solr, due to their lengthy posting lists. To address the issue of high DFs, we present a new variant of FLOPS regularization: DF-FLOPS. This new regularization technique penalizes the usage of high-DF terms, thereby shortening posting lists and reducing retrieval latency. Unlike other inference-time sparsification methods, such as stopword removal, DF-FLOPS regularization allows for the selective inclusion of high-frequency terms in cases where the terms are truly salient. We find that DF-FLOPS successfully reduces the prevalence of high-DF terms and lowers retrieval latency (around 10x faster) in a production-grade engine while maintaining effectiveness both in-domain (only a 2.2-point drop in MRR@10) and cross-domain (improved performance in 12 out of 13 tasks on which we tested). With retrieval latencies on par with BM25, this work provides an important step towards making LSR practical for deployment in production-grade search engines.
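A hedged sketch contrasting plain FLOPS regularization with a DF-weighted variant in which high-document-frequency terms incur a larger penalty, discouraging long posting lists. The per-term weighting below is an illustrative assumption, not the paper's exact formulation.

import torch

def flops_reg(term_weights: torch.Tensor) -> torch.Tensor:
    # term_weights: (batch, vocab) non-negative term scores from the encoder
    return (term_weights.mean(dim=0) ** 2).sum()

def df_flops_reg(term_weights: torch.Tensor, doc_freq: torch.Tensor) -> torch.Tensor:
    # doc_freq: (vocab,) document frequencies measured on the indexing corpus
    df_penalty = doc_freq / doc_freq.sum()        # heavier penalty for high-DF terms
    return (df_penalty * term_weights.mean(dim=0) ** 2).sum()

batch = torch.rand(32, 30522)                     # e.g., a BERT-sized vocabulary
dfs = torch.randint(1, 1_000_000, (30522,)).float()
vanilla = flops_reg(batch)
df_aware = df_flops_reg(batch, dfs)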
https://arxiv.org/abs/2505.15070
Supervised pretrained models have become widely used in deep learning, especially for image segmentation tasks. However, when applied to specialized datasets such as biomedical imaging, pretrained weights often introduce unintended biases. These biases cause models to assign different levels of importance to different slices, leading to inconsistencies in feature utilization, which can be observed as asymmetries in saliency map distributions. This transfer of color distributions from natural images to non-natural datasets can compromise model performance and reduce the reliability of results. In this study, we investigate the effects of these biases and propose strategies to mitigate them. Through a series of experiments, we test both pretrained and randomly initialized models, comparing their performance and saliency map distributions. Our proposed methods, which aim to neutralize the bias introduced by pretrained color channel weights, demonstrate promising results, offering a practical approach to improving model explainability while maintaining the benefits of pretrained models. This publication presents our findings, providing insights into addressing pretrained weight biases across various deep learning tasks.
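One simple mitigation consistent with this goal (an illustrative assumption, not necessarily the authors' exact strategy) is to average the pretrained first-layer filters across the RGB channels so that no single color channel is favored when the network is applied to grayscale biomedical slices.

import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
with torch.no_grad():
    w = model.conv1.weight                       # (64, 3, 7, 7) pretrained RGB filters
    model.conv1.weight.copy_(w.mean(dim=1, keepdim=True).expand_as(w))
# Every input channel now receives identical weights, so first-layer saliency
# asymmetries traceable to ImageNet color statistics are removed.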
https://arxiv.org/abs/2505.14105
Warning: This paper contains examples of harmful language and images. Reader discretion is advised. Recently, vision-language models have demonstrated increasing influence in morally sensitive domains such as autonomous driving and medical analysis, owing to their powerful multimodal reasoning capabilities. As these models are deployed in high-stakes real-world applications, it is of paramount importance to ensure that their outputs align with human moral values and remain within moral boundaries. However, existing work on moral alignment either focuses solely on textual modalities or relies heavily on AI-generated images, leading to distributional biases and reduced realism. To overcome these limitations, we introduce MORALISE, a comprehensive benchmark for evaluating the moral alignment of vision-language models (VLMs) using diverse, expert-verified real-world data. We begin by proposing a comprehensive taxonomy of 13 moral topics grounded in Turiel's Domain Theory, spanning the personal, interpersonal, and societal moral domains encountered in everyday life. Built on this framework, we manually curate 2,481 high-quality image-text pairs, each annotated with two fine-grained labels: (1) topic annotation, identifying the violated moral topic(s), and (2) modality annotation, indicating whether the violation arises from the image or the text. For evaluation, we encompass two tasks, moral judgment and moral norm attribution, to assess models' awareness of moral violations and their reasoning ability on morally salient content. Extensive experiments on 19 popular open- and closed-source VLMs show that MORALISE poses a significant challenge, revealing persistent moral limitations in current state-of-the-art models. The full benchmark is publicly available at this https URL.
https://arxiv.org/abs/2505.14728
Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios. This is because RGB features are not sufficiently distinctive to set apart categories such as shoplifting from visually similar events. Therefore, towards robust complex real-world VAD, it is essential to augment RGB spatio-temporal features with additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: "PI-VAD", a novel approach that augments RGB representations with five additional modalities. Specifically, the modalities include sensitivity to fine-grained motion (Pose), three-dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), as well as language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. PI-VAD includes two plug-in modules, namely the Pseudo-modality Generation module and the Cross-Modal Induction module, which generate modality-specific prototypical representations and thereby induce multi-modal information into the RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and require the five modality backbones only during training. Notably, PI-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones at inference.
https://arxiv.org/abs/2505.13123
Efficient convolutional neural network (CNN) architecture designs have attracted growing research interest. However, they usually apply a single receptive field (RF), small asymmetric RFs, or pyramid RFs to learn different feature representations, and still encounter two significant challenges in medical image classification tasks: 1) They have limitations in efficiently capturing diverse lesion characteristics, e.g., tiny, coordinated, small, and salient lesions, which play unique roles in the results, especially in imbalanced medical image classification. 2) The predictions generated by such CNNs are often unfair/biased, posing a high risk when they are employed in real-world medical diagnosis settings. To tackle these issues, we develop a new concept, Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields (ERoHPRF), to simultaneously boost medical image classification performance and fairness. This concept aims to mimic the multi-expert consultation mode by applying well-designed heterogeneous pyramid RF bags to capture different lesion characteristics effectively via convolution operations with multiple heterogeneous kernel sizes. Additionally, ERoHPRF introduces an expert-like structural reparameterization technique that merges its parameters with a two-stage strategy, ensuring competitive computation cost and inference speed compared to a single RF. To demonstrate the effectiveness and generalization ability of ERoHPRF, we incorporate it into mainstream efficient CNN architectures. Extensive experiments show that our method maintains a better trade-off than state-of-the-art methods in terms of medical image classification, fairness, and computation overhead. The code for this paper will be released soon.
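A sketch of the structural-reparameterization step for parallel convolutions with heterogeneous kernel sizes (RepVGG-style): zero-pad the smaller kernel to the larger size and add the weights, yielding a single convolution for inference. The 3x3/5x5 branches and channel counts are illustrative, not the paper's exact RF bags.

import torch
import torch.nn as nn
import torch.nn.functional as F

conv3 = nn.Conv2d(16, 16, kernel_size=3, padding=1, bias=True)
conv5 = nn.Conv2d(16, 16, kernel_size=5, padding=2, bias=True)

merged = nn.Conv2d(16, 16, kernel_size=5, padding=2, bias=True)
with torch.no_grad():
    w3_padded = F.pad(conv3.weight, (1, 1, 1, 1))      # pad the 3x3 kernel to 5x5
    merged.weight.copy_(conv5.weight + w3_padded)       # sum the parallel branches
    merged.bias.copy_(conv5.bias + conv3.bias)

x = torch.randn(2, 16, 32, 32)
assert torch.allclose(conv3(x) + conv5(x), merged(x), atol=1e-5)  # same output, one conv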
https://arxiv.org/abs/2505.13039
Explainable AI (XAI) methods generally fall into two categories. Post-hoc approaches generate explanations for pre-trained models and are compatible with various neural network architectures. These methods often use feature importance visualizations, such as saliency maps, to indicate which input regions influenced the model's prediction. Unfortunately, they typically offer a coarse understanding of the model's decision-making process. In contrast, ante-hoc (inherently explainable) methods rely on specially designed model architectures trained from scratch. A notable subclass of these methods provides explanations through prototypes, representative patches extracted from the training data. However, prototype-based approaches have limitations: they require dedicated architectures, involve specialized training procedures, and perform well only on specific datasets. In this work, we propose EPIC (Explanation of Pretrained Image Classification), a novel approach that bridges the gap between these two paradigms. Like post-hoc methods, EPIC operates on pre-trained models without architectural modifications. Simultaneously, it delivers intuitive, prototype-based explanations inspired by ante-hoc techniques. To the best of our knowledge, EPIC is the first post-hoc method capable of fully replicating the core explanatory power of inherently interpretable models. We evaluate EPIC on benchmark datasets commonly used in prototype-based explanations, such as CUB-200-2011 and Stanford Cars, alongside large-scale datasets like ImageNet, typically employed by post-hoc methods. EPIC uses prototypes to explain model decisions, providing a flexible and easy-to-understand tool for creating clear, high-quality explanations.
https://arxiv.org/abs/2505.12897
Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, and has received great attention in recent years. However, existing datasets remain focused on short video clips of a few seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changes. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on per-frame spatial evaluation, we introduce two new metrics to assess temporal and spatiotemporal consistency. We benchmark 6 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. Despite its simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.
https://arxiv.org/abs/2505.12702
Large language models (LLMs) increasingly power mental-health chatbots, yet the field still lacks a scalable, theory-grounded way to decide which model is most effective to deploy. We present ESC-Judge, the first end-to-end evaluation framework that (i) grounds head-to-head comparisons of emotional-support LLMs in Clara Hill's established Exploration-Insight-Action counseling model, providing a structured and interpretable view of performance, and (ii) fully automates the evaluation pipeline at scale. ESC-Judge operates in three stages: first, it synthesizes realistic help-seeker roles by sampling empirically salient attributes such as stressors, personality, and life history; second, it has two candidate support agents conduct separate sessions with the same role, isolating model-specific strategies; and third, it asks a specialized judge LLM to express pairwise preferences across rubric-anchored skills that span the Exploration, Insight, and Action spectrum. In our study, ESC-Judge matched PhD-level annotators on 85 percent of Exploration, 83 percent of Insight, and 86 percent of Action decisions, demonstrating human-level reliability at a fraction of the cost. All code, prompts, synthetic roles, transcripts, and judgment scripts are released to promote transparent progress in emotionally supportive AI.
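A sketch of the pairwise-judgment stage, assuming two session transcripts and a judge-LLM callable that answers "A" or "B" per rubric skill; the rubric names below are illustrative, not the exact ESC-Judge rubric.

RUBRIC = ["exploration.restatement", "insight.reframing", "action.goal_setting"]

def judge_sessions(transcript_a, transcript_b, judge_llm):
    wins = {"A": 0, "B": 0}
    for skill in RUBRIC:
        prompt = (
            f"Skill under evaluation: {skill}\n"
            f"Session A:\n{transcript_a}\n\nSession B:\n{transcript_b}\n"
            "Which session demonstrates this skill better? Answer 'A' or 'B'."
        )
        choice = judge_llm(prompt).strip().upper()
        if choice in wins:
            wins[choice] += 1   # tally rubric-anchored pairwise preferences
    return wins

# Dummy judge for illustration:
print(judge_sessions("A: ...", "B: ...", judge_llm=lambda p: "A"))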
https://arxiv.org/abs/2505.12531
Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset -- the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process -- separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: this https URL.
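An illustrative sketch of the three-step pseudo-data recipe (separate, adjust, remix), assuming stems have already been separated by an off-the-shelf source-separation model; the gain range is a made-up value.

import numpy as np

def make_poorly_mixed(stems, seed=0):
    """stems: dict of name -> mono waveform (same length, same sample rate)."""
    rng = np.random.default_rng(seed)
    degraded = []
    for wav in stems.values():
        gain_db = rng.uniform(-12.0, 6.0)        # mis-set level per separated stem
        degraded.append(wav * 10 ** (gain_db / 20))
    return sum(degraded), sum(stems.values())    # (poorly mixed input, well-mixed target)

stems = {"speech": np.random.randn(16000), "music": np.random.randn(16000)}
noisy_mix, clean_target = make_poorly_mixed(stems)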
https://arxiv.org/abs/2505.12154
Data augmentation for domain-specific image classification tasks often struggles to simultaneously address diversity, faithfulness, and label clarity of generated data, leading to suboptimal performance in downstream tasks. While existing generative diffusion model-based methods aim to enhance augmentation, they fail to cohesively tackle these three critical aspects and often overlook intrinsic challenges of diffusion models, such as sensitivity to model characteristics and stochasticity under strong transformations. In this paper, we propose a novel framework that explicitly integrates diversity, faithfulness, and label clarity into the augmentation process. Our approach employs saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency, while mitigating diffusion model limitations. Extensive experiments across fine-grained, long-tail, few-shot, and background robustness tasks demonstrate our method's superior performance over state-of-the-art approaches.
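A hedged sketch of saliency-guided mixing: keep the salient foreground from the real image and take the background from a generated image, so the label follows the preserved foreground. The saliency map would come from a saliency model; here it is a stand-in array.

import numpy as np

def saliency_mix(real_img, generated_bg, saliency, thresh=0.5):
    mask = (saliency > thresh).astype(real_img.dtype)[..., None]   # 1 = salient foreground
    return mask * real_img + (1.0 - mask) * generated_bg

real = np.random.rand(224, 224, 3)        # labeled real image
generated = np.random.rand(224, 224, 3)   # diffusion-generated background
sal = np.random.rand(224, 224)            # stand-in saliency map
augmented = saliency_mix(real, generated, sal)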
https://arxiv.org/abs/2505.11813
We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample efficiency in small language models (SLMs). Concretely, we train Cognivolve, a 124M-parameter GPT-2 model, on a four-stage syllabus that ascends from lexical matching to multi-step symbolic inference, and then evaluate it without any task-specific fine-tuning. Cognivolve reaches target accuracy in half the optimization steps of a single-phase baseline, activates an order of magnitude more gradient-salient reasoning heads, and shifts those heads toward deeper layers, yielding higher-entropy attention that balances local and long-range context. The same curriculum applied out of order or with optimizer resets fails to reproduce these gains, confirming that progression -- not extra compute -- drives the effect. We also identify open challenges: final-answer success still lags a conventional run by about 30%, and our saliency probe under-detects verbal-knowledge heads in the hardest stage, suggesting directions for mixed-stage fine-tuning and probe expansion.
https://arxiv.org/abs/2505.11643
Incorporating an autonomous auxiliary camera into robot-assisted minimally invasive surgery (RAMIS) enhances spatial awareness and eliminates manual viewpoint control. Existing path planning methods for auxiliary cameras track two-dimensional surgical features but do not simultaneously account for camera orientation, workspace constraints, and robot joint limits. This study presents AutoCam: an automatic auxiliary camera placement method to improve visualization in RAMIS. Implemented on the da Vinci Research Kit, the system uses a priority-based, workspace-constrained control algorithm that combines heuristic geometric placement with nonlinear optimization to ensure robust camera tracking. A user study (N=6) demonstrated that the system maintained 99.84% visibility of a salient feature and achieved a pose error of 4.36 ± 2.11 degrees and 1.95 ± 5.66 mm. The controller was computationally efficient, with a loop time of 6.8 ± 12.8 ms. An additional pilot study (N=6), where novices completed a Fundamentals of Laparoscopic Surgery training task, suggests that users can teleoperate just as effectively from AutoCam's viewpoint as from the endoscope's while still benefiting from AutoCam's improved visual coverage of the scene. These results indicate that an auxiliary camera can be autonomously controlled using the da Vinci patient-side manipulators to track a salient feature, laying the groundwork for new multi-camera visualization methods in RAMIS.
https://arxiv.org/abs/2505.10398