We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, existing ZS-CAR models increasingly ignore visual evidence and overfit to co-occurrence statistics. Consequently, they fail to realize the benefits of compositional recognition on unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behavior by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen-composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
https://arxiv.org/abs/2601.16211
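The abstract does not specify the form of RCORE's temporal order regularization loss. As an illustration only, here is a minimal sketch of one way such a penalty could look, assuming hypothetical per-clip verb logits computed on ordered and on shuffled frames, and a margin hyperparameter of our own choosing: a model whose verb prediction is unchanged under frame shuffling (the shortcut behavior) pays a penalty.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def order_sensitivity_penalty(logits_ordered, logits_shuffled, margin=1.0):
    """Illustrative penalty for object-driven verb shortcuts.

    KL divergence between the verb prediction on ordered frames and on the
    same frames shuffled; a shortcut model that ignores temporal structure
    yields KL close to 0 and is penalized up to `margin`.
    """
    p = softmax(logits_ordered)
    q = softmax(logits_shuffled)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    # Positive loss whenever the prediction is too order-insensitive.
    return max(0.0, margin - kl)
```

An order-blind model (identical logits in both passes) receives the full margin as penalty, while an order-sensitive model pays nothing.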
Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, thereby reducing overall computational complexity. However, existing $\mathcal{LD}$ methods severely degrade the dynamic model's performance in both the low- and high-dropping regimes, worsening the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation with well-known speech recognition models, including Conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for the high- and no-dropping cases, respectively, with a $33.3\%$ reduction in training time.
https://arxiv.org/abs/2601.16117
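The combination of layer dropping and distillation described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the "layers" are hypothetical scalar residual functions, and the distillation term is a simple MSE between the dropped-layer student pass and the full-depth teacher pass.

```python
import random

def forward(x, layers, keep_prob=1.0, rng=None):
    """Run x through a residual stack, skipping each layer
    with probability 1 - keep_prob (layer dropping)."""
    rng = rng or random.Random(0)
    for layer in layers:
        if rng.random() < keep_prob:
            x = x + layer(x)  # residual branch, skippable at runtime
    return x

def distillation_loss(student_out, teacher_out):
    """MSE between dropped-layer student and full-depth teacher outputs."""
    return (student_out - teacher_out) ** 2

# Hypothetical toy stack: each "layer" is a scalar function.
layers = [lambda x: 0.1 * x, lambda x: 0.2 * x, lambda x: 0.3 * x]
teacher = forward(1.0, layers, keep_prob=1.0)                    # all layers kept
student = forward(1.0, layers, keep_prob=0.5, rng=random.Random(42))
loss = distillation_loss(student, teacher)
```

Training against this loss pushes the shallow (dropped) pass toward the full-depth output, which is the intuition behind combining distillation with $\mathcal{LD}$ end-to-end.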
Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
https://arxiv.org/abs/2601.16113
In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP and SigLIP, among others. Results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
https://arxiv.org/abs/2601.15914
A convolutional neural network (CNN) is a deep learning algorithm designed specifically for computer vision applications. CNNs have proved successful in handling the increasing amount of data in many computer vision problems where classical machine learning algorithms were insufficient. Flowers have many uses in our daily lives, from decoration to making medicines to detoxifying the environment. Identifying flower types requires expert knowledge; however, accessing experts at any time and in any location may not always be feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers, providing non-specialists with quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet-121, and Xception, to determine the most suitable model for the mobile application. The classification performance of the models was evaluated by training them with seven different optimization algorithms. The DenseNet-121 architecture trained with the stochastic gradient descent (SGD) optimization algorithm was the most successful, achieving 95.84% accuracy and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.
https://arxiv.org/abs/2601.15810
An utterance-level speaker embedding is typically obtained by aggregating a sequence of frame-level representations. However, in real-world scenarios, individual frames encode not only speaker-relevant information but also various nuisance factors. As a result, different frames contribute unequally to the final utterance-level speaker representation for Automatic Speaker Verification systems. To address this issue, we propose to estimate the inherent uncertainty of each frame and assign adaptive weights accordingly, where frames with higher uncertainty receive lower attention. Based on this idea, we present U3-xi, a comprehensive framework designed to produce more reliable and interpretable uncertainty estimates for speaker embeddings. Specifically, we introduce several strategies for uncertainty supervision. First, we propose speaker-level uncertainty supervision via a Stochastic Variance Loss, where the distance between an utterance embedding and its corresponding speaker centroid serves as a pseudo ground truth for uncertainty learning. Second, we incorporate global-level uncertainty supervision by injecting the predicted uncertainty into the softmax scale during training. This adaptive scaling mechanism adjusts the sharpness of the decision boundary according to sample difficulty, providing global guidance. Third, we redesign the uncertainty estimation module by integrating a Transformer encoder with multi-view self-attention, enabling the model to capture rich local and long-range temporal dependencies. Comprehensive experiments demonstrate that U3-xi is model-agnostic and can be seamlessly applied to various speaker encoders. In particular, when applied to ECAPA-TDNN, it achieves 21.1% and 15.57% relative improvements on the VoxCeleb1 test sets in terms of EER and minDCF, respectively.
https://arxiv.org/abs/2601.15719
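The core idea of down-weighting uncertain frames can be sketched directly. This is a minimal illustration, not U3-xi's actual module: weights are a softmax over negated per-frame uncertainty scores (both the frame embeddings and the uncertainty values here are hypothetical inputs).

```python
import math

def uncertainty_weighted_pool(frames, uncertainties):
    """Aggregate frame-level embeddings into one utterance embedding,
    down-weighting frames with high estimated uncertainty.

    frames: list of equal-length embedding vectors
    uncertainties: one non-negative scalar per frame
    Weights are softmax(-uncertainty), so more uncertain frames
    contribute exponentially less.
    """
    neg = [-u for u in uncertainties]
    m = max(neg)
    exps = [math.exp(v - m) for v in neg]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
```

With equal uncertainties this reduces to plain mean pooling; as one frame's uncertainty grows, the pooled embedding converges to the reliable frames alone.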
Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving highly competitive accuracies of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: this https URL.
https://arxiv.org/abs/2601.15681
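The channel-wise feature interpolation strategy mentioned above can be illustrated with a toy sketch. This is an assumption-laden simplification, not Cr-GAN's implementation: features are flat per-channel lists, and "interpolation" is shown as a random binary channel swap between two latent features.

```python
import random

def channel_interpolate(feat_a, feat_b, rng=None):
    """Toy channel-wise interpolation: build a novel latent feature by
    taking each channel from feat_a or feat_b at random.

    feat_a, feat_b: per-channel feature lists of equal length.
    """
    rng = rng or random.Random(0)
    mask = [rng.random() < 0.5 for _ in feat_a]
    # Each output channel comes verbatim from one of the two parents,
    # producing a combination never seen in the training set.
    return [b if m else a for a, b, m in zip(feat_a, feat_b, mask)]
```

The intuition is that mixing channels of two real samples' features yields novel-but-plausible latents, enlarging the effective training distribution without new data.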
Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base, and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Different from standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates the overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models both in emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: this https URL
https://arxiv.org/abs/2601.15668
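The GRPO-PTR reward shaping described above, an outcome reward plus a progressively introduced, trust-weighted reasoning reward, can be sketched as follows. The function name, the linear ramp schedule, and all argument ranges are our own illustrative assumptions; the paper's exact schedule and reward model are not specified in the abstract.

```python
def grpo_ptr_reward(outcome, reasoning_score, trust, step, ramp_steps=1000):
    """Illustrative GRPO-PTR-style reward shaping.

    outcome:         rule-based correctness reward (e.g. 0 or 1)
    reasoning_score: reward-model score of the explanation, in [0, 1]
    trust:           alignment between reasoning and outcome, in [0, 1]
    step:            current training step; the reasoning term is
                     introduced progressively over ramp_steps
    """
    ramp = min(1.0, step / ramp_steps)
    # Early in training only the outcome reward matters; later, the
    # reasoning reward counts, scaled by how trustworthy it is.
    return outcome + ramp * trust * reasoning_score
```

Contrast with standard GRPO, which would return `outcome` alone: here a correct answer with a well-aligned, high-quality explanation earns strictly more reward once the ramp completes.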
Cross-subject EEG-based emotion recognition (EER) remains challenging due to strong inter-subject variability, which induces substantial distribution shifts in EEG signals, as well as the high complexity of emotion-related neural representations in both spatial organization and temporal evolution. Existing approaches typically improve spatial modeling, temporal modeling, or generalization strategies in isolation, which limits their ability to align representations across subjects while capturing multi-scale dynamics and suppressing subject-specific bias within a unified framework. To address these gaps, we propose a Region-aware Spatiotemporal Modeling framework with Collaborative Domain Generalization (RSM-CoDG) for cross-subject EEG emotion recognition. RSM-CoDG incorporates neuroscience priors derived from functional brain region partitioning to construct region-level spatial representations, thereby improving cross-subject comparability. It also employs multi-scale temporal modeling to characterize the dynamic evolution of emotion-evoked neural activity. In addition, the framework employs a collaborative domain generalization strategy, incorporating multidimensional constraints to reduce subject-specific bias in a fully unseen target subject setting, which enhances the generalization to unknown individuals. Extensive experimental results on SEED series datasets demonstrate that RSM-CoDG consistently outperforms existing competing methods, providing an effective approach for improving robustness. The source code is available at this https URL.
https://arxiv.org/abs/2601.15615
Bias in chest X-ray classifiers frequently stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Previous pixel-space attribute neutralizers, which rely on convolutional encoders, lessen but do not fully remove this attribute leakage at clinically usable edit strengths. This study evaluates whether substituting the U-Net convolutional encoder with a Vision Transformer backbone in the Attribute-Neutral Framework can reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient Image Transformer Small (DeiT-S) neutralizer was trained on the ChestX-ray14 dataset. Its edited images, generated across eleven edit-intensity levels, were evaluated with an independent AI judge for attribute leakage and with a convolutional neural network (ConvNet) for disease prediction. At a moderate edit level (alpha = 0.5), the Vision Transformer (ViT) neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework's convolutional U-Net encoder, despite being trained for only half as many epochs. Meanwhile, macro receiver operating characteristic area under the curve (ROC AUC) across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70. These results indicate that global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, suggesting a practical route toward fairer chest X-ray AI.
https://arxiv.org/abs/2601.15490
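The edit-intensity parameter alpha used above (eleven levels, with alpha = 0.5 as the moderate setting) can be read as a linear interpolation between an original representation and an attribute-neutral one. This is a conceptual sketch under that assumption, not the framework's actual editing operator.

```python
def neutralize(embedding, neutral_embedding, alpha=0.5):
    """Move an image representation toward an attribute-neutral point
    with edit strength alpha.

    alpha = 0.0 keeps the original; alpha = 1.0 is fully neutralized.
    """
    return [(1 - alpha) * e + alpha * n
            for e, n in zip(embedding, neutral_embedding)]
```

Sweeping alpha trades attribute leakage against diagnostic fidelity, which is exactly the trade-off the eleven edit-intensity levels probe.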
Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks, raising interest in their potential use for biometric applications. In this paper, we conduct a systematic evaluation of state-of-the-art MLLMs for heterogeneous face recognition (HFR), where enrollment and probe images are from different sensing modalities, including visual (VIS), near infrared (NIR), short-wave infrared (SWIR), and thermal camera. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. The recognition performance of MLLMs is evaluated using biometric protocols and based on different metrics, including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR). Our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions, in spite of recent advances in MLLMs. Our findings highlight the limitations of current MLLMs for HFR and also the importance of rigorous biometric evaluation when considering their deployment in face recognition systems.
https://arxiv.org/abs/2601.15406
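Of the biometric metrics listed above, the Equal Error Rate (EER) is the least self-explanatory; it is the operating point where the false accept rate equals the false reject rate. A minimal threshold-sweep estimator (a standard construction, not code from the paper) looks like this:

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from genuine and impostor similarity scores by
    sweeping a threshold over all observed scores and returning the
    point where false accept and false reject rates are closest."""
    best = None
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2
```

Perfectly separated score distributions give an EER of 0; fully overlapping ones approach 0.5, which is why EER is a compact single-number summary for cross-spectral comparisons like VIS-THERMAL.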
We investigate the accessibility of intelligent personal assistants (IPAs) for deaf and hard of hearing (DHH) people who can use their voice in everyday communication. The inability of IPAs to understand diverse accents, including deaf speech, renders them largely inaccessible to non-signing, speaking DHH individuals. Using an Echo Show, we conduct a mixed-methods study comparing the usability of natural language input via spoken English, both with Alexa's automatic speech recognition and in a Wizard-of-Oz setting where a trained facilitator re-spoke commands, against that of a large language model (LLM)-assisted touch interface. The touch method was navigated through an LLM-powered "task prompter," which integrated the user's history and smart environment to suggest contextually appropriate commands. Quantitative results showed no significant differences between either spoken-English condition and LLM-assisted touch. Qualitative results showed variability in opinions on the usability of each method. Ultimately, robust recognition of deaf-accented speech will need to be supported natively by IPAs.
https://arxiv.org/abs/2601.15209
Despite tremendous improvements in tasks such as image classification, object detection, and segmentation, the recognition of visual relationships, commonly modeled as the extraction of a graph from an image, remains a challenging task. We believe that this mainly stems from the fact that there is no canonical way to approach the visual graph recognition task. Most existing solutions are specific to a problem and cannot be transferred between different contexts out of the box, even though the conceptual problem remains the same. With broad applicability and simplicity in mind, in this paper we develop a method, \textbf{Gra}ph Recognition via \textbf{S}ubgraph \textbf{P}rediction (\textbf{GraSP}), for recognizing graphs in images. We show across several synthetic benchmarks and one real-world application that our method works with a set of diverse types of graphs and their drawings, and can be transferred between tasks without task-specific modifications, paving the way to a more unified framework for visual graph recognition.
https://arxiv.org/abs/2601.15133
The ubiquity of Large Language Models (LLMs) is driving a paradigm shift where user convenience supersedes computational efficiency. This article defines the "Plausibility Trap": a phenomenon where individuals with access to Artificial Intelligence (AI) models deploy expensive probabilistic engines for simple deterministic tasks, such as Optical Character Recognition (OCR) or basic verification, resulting in significant resource waste. Through micro-benchmarks and case studies on OCR and fact-checking, we quantify the "efficiency tax" (demonstrating a ~6.5x latency penalty) and the risks of algorithmic sycophancy. To counter this, we introduce Tool Selection Engineering and the Deterministic-Probabilistic Decision Matrix, a framework to help developers determine when to use Generative AI and, crucially, when to avoid it. We argue for a curriculum shift, emphasizing that true digital literacy lies not only in knowing how to use Generative AI but also in knowing when not to use it.
https://arxiv.org/abs/2601.15130
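The Deterministic-Probabilistic Decision Matrix reduces, at its simplest, to a two-axis routing rule. The sketch below is our own reading of the idea (function name and return labels are illustrative, not the article's API): deterministic tasks go to conventional tools, and generative models are reserved for genuinely open-ended work.

```python
def choose_tool(task_is_deterministic, needs_open_ended_reasoning):
    """Route a task according to a simplified decision matrix.

    task_is_deterministic:      the correct output is fully specified
                                (OCR, checksums, exact lookups)
    needs_open_ended_reasoning: the task requires synthesis or judgment
    """
    if task_is_deterministic and not needs_open_ended_reasoning:
        return "deterministic tool"   # e.g. a dedicated OCR engine
    if needs_open_ended_reasoning:
        return "generative model"
    return "either"
```

Routing OCR through the first branch avoids both the ~6.5x latency tax and the sycophancy risk of asking a probabilistic engine a question with one exact answer.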
While deep learning has significantly advanced robotic object recognition, purely data-driven approaches often lack semantic consistency and fail to leverage valuable, pre-existing knowledge about the environment. This report presents the ExPrIS project, which addresses this challenge by investigating how knowledge-level expectations can serve to improve object interpretation from sensor data. Our approach is based on the incremental construction of a 3D Semantic Scene Graph (3DSSG). We integrate expectations from two sources: contextual priors from past observations and semantic knowledge from external graphs like ConceptNet. These are embedded into a heterogeneous Graph Neural Network (GNN) to create an expectation-biased inference process. This method moves beyond static, frame-by-frame analysis to enhance the robustness and consistency of scene understanding over time. The report details this architecture, its evaluation, and outlines its planned integration on a mobile robotic platform.
https://arxiv.org/abs/2601.15025
Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.
https://arxiv.org/abs/2601.14978
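The noise-aware curation step above, filtering unreliable image-text pairs before merging datasets, can be sketched as a similarity-threshold filter. This is an illustrative simplification under our own assumptions (the `sim` scorer is a stand-in for whatever cross-modal model the curation actually uses):

```python
def curate_pairs(pairs, sim, threshold=0.3):
    """Noise-aware curation sketch: keep only image-text pairs whose
    cross-modal similarity clears a threshold, discarding likely
    mismatched captions before unified joint training.

    pairs:     list of (image, caption) items
    sim:       callable scoring how well a caption matches an image
    threshold: minimum acceptable similarity (hypothetical value)
    """
    return [(img, txt) for img, txt in pairs if sim(img, txt) >= threshold]
```

Applied across CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 before merging, such a filter removes the noisy pairs that make naive joint training sub-optimal.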
Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate the annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework leveraging ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve ground-truth distribution reliability. We validate these proxies through statistical analysis of their alignment with human distributions and evaluate their impact by fine-tuning ALMs with the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased evaluation, we propose DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy. Experiments on IEMOCAP and MSP-Podcast show that synthetic annotations enhance emotion distribution, especially in low-ambiguity regions where annotation agreement is high. However, benefits diminish for highly ambiguous emotions with greater human disagreement. This work provides the first evidence that ALMs could address annotation scarcity in ambiguous emotion recognition, but highlights the need for more advanced prompting or generation strategies to handle highly ambiguous cases.
https://arxiv.org/abs/2601.14620
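The augmentation of human annotations with Synthetic Perceptual Proxies can be sketched as blending two sets of label counts into one normalized emotion distribution. The blending rule and the `proxy_weight` knob are our own illustrative assumptions, not the paper's exact formulation.

```python
def augment_distribution(human_counts, proxy_counts, proxy_weight=0.5):
    """Blend sparse human annotation counts with ALM-generated proxy
    counts into a smoothed ground-truth emotion distribution.

    human_counts / proxy_counts: dict mapping emotion label -> count
    proxy_weight: how much one synthetic vote counts relative to a
                  human vote (hypothetical hyperparameter)
    """
    classes = sorted(set(human_counts) | set(proxy_counts))
    raw = {c: human_counts.get(c, 0) + proxy_weight * proxy_counts.get(c, 0)
           for c in classes}
    z = sum(raw.values())
    return {c: v / z for c, v in raw.items()}
```

With few human annotators, even a handful of synthetic votes noticeably smooths the distribution, which matches the paper's finding that gains concentrate in low-ambiguity regions where proxies agree with humans.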
Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at this https URL.
https://arxiv.org/abs/2601.14256
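The hyper-network idea above can be sketched in a few lines. This is a toy forward pass only, assuming an already-computed image embedding: the embedding size, the tiny (x, y) → RGB coordinate MLP, and the single linear hyper-network are all hypothetical stand-ins for the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, HID = 16, 8  # hypothetical embedding / INR hidden sizes

# Hyper-network: one linear map from an image embedding to the flattened
# weights of a tiny coordinate MLP mapping (x, y) -> RGB.
N_W = 2 * HID + HID + HID * 3 + 3
W_hyper = rng.normal(scale=0.1, size=(EMB, N_W))

def inr_decode(embedding, coords):
    """Predict INR weights from the embedding, then evaluate the
    coordinate MLP at each (x, y) to reconstruct pixel colours."""
    w = embedding @ W_hyper
    W1, b1 = w[:2 * HID].reshape(2, HID), w[2 * HID:3 * HID]
    W2 = w[3 * HID:3 * HID + HID * 3].reshape(HID, 3)
    b2 = w[-3:]
    h = np.tanh(coords @ W1 + b1)
    return h @ W2 + b2

emb = rng.normal(size=EMB)  # stand-in for an encoder's output embedding
xy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
rgb = inr_decode(emb, xy)
print(rgb.shape)  # (4, 3)
```

The key property is that the image is represented only by the embedding fed to the hyper-network: reconstruction at any coordinate grid is then a cheap forward pass, which is what makes the compressed embedding usable for both recognition and generation.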
Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., settings where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty and diverse categories, with person, face, pet, and general objects as the instances at hand.
https://arxiv.org/abs/2601.14188
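The one-shot in-context matching step can be sketched as nearest-neighbour retrieval over expert features. This is a schematic sketch, not the paper's method: the features below are toy vectors standing in for outputs of a (hypothetical) frozen ILR expert encoder.

```python
import numpy as np

def l2n(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def match_instance(ref_feats, query_feat):
    """One-shot in-context matching: cosine similarity between the
    query feature and one reference feature per enrolled instance."""
    sims = l2n(ref_feats) @ l2n(query_feat)
    return int(sims.argmax()), sims

# Toy stand-ins for ILR expert features: one reference embedding per
# enrolled instance, plus a query that resembles instance 2.
refs = np.eye(3, 32)                   # orthonormal toy references
query = 0.9 * refs[2] + 0.1 * refs[0]  # perturbed view of instance 2
idx, sims = match_instance(refs, query)
print(idx)  # 2
```

Because each instance is represented by a single reference embedding, enrolling a new instance requires no gradient updates, which is the sense in which the recognition is one-shot and in-context.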
Despite significant progress in human action recognition, generalizing to diverse viewpoints remains a challenge. Most existing datasets are captured from ground-level perspectives, and models trained on them often struggle to transfer to drastically different domains such as aerial views. This paper examines how curriculum-based training strategies can improve generalization to unseen real aerial-view data without using any real aerial data during training. We explore curriculum learning for cross-view action recognition using two out-of-domain sources: synthetic aerial-view data and real ground-view data. Our evaluation of training order (fine-tuning on synthetic aerial data vs. real ground data) shows that both strategies transition from synthetic to real data but differ in how they do so. The first uses a two-stage curriculum with direct fine-tuning, while the second applies a progressive curriculum that expands the dataset in multiple stages before fine-tuning. We evaluate both methods on the REMAG dataset using SlowFast (CNN-based) and MViTv2 (Transformer-based) architectures. Results show that combining the two out-of-domain datasets clearly outperforms training on a single domain, whether real ground-view or synthetic aerial-view. Both curriculum strategies match the top-1 accuracy of simple dataset combination while offering efficiency gains. With the two-step fine-tuning method, SlowFast achieves up to a 37% reduction in iterations and MViTv2 up to a 30% reduction compared to simple combination. The multi-step progressive approach further reduces iterations, by up to 9% for SlowFast and 30% for MViTv2, relative to the two-step method. These findings demonstrate that curriculum-based training can maintain comparable performance (top-1 accuracy within a 3% range) while improving training efficiency in cross-view action recognition.
https://arxiv.org/abs/2601.14101
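The two curriculum schedules above can be sketched as stage generators over the two out-of-domain pools. This is an illustrative sketch, not the paper's training code: the clip IDs, stage counts, and mixing policy are hypothetical.

```python
def two_stage_curriculum(synthetic, real):
    """Two-stage: pretrain on synthetic aerial clips,
    then fine-tune directly on real ground-view clips."""
    yield "pretrain", list(synthetic)
    yield "finetune", list(real)

def progressive_curriculum(synthetic, real, steps=3):
    """Progressive: grow the training set from synthetic-only toward
    the full synthetic+real mixture over several stages."""
    for k in range(1, steps + 1):
        n_real = len(real) * k // steps
        yield f"stage_{k}", list(synthetic) + list(real[:n_real])

synthetic = [f"syn_{i}" for i in range(4)]  # placeholder clip IDs
real = [f"real_{i}" for i in range(6)]
for name, batch in progressive_curriculum(synthetic, real):
    print(name, len(batch))  # stage_1 6, stage_2 8, stage_3 10
```

In this sketch the progressive schedule never drops the synthetic pool; it only enlarges the real portion, which mirrors the finding that mixing both out-of-domain sources beats training on either one alone.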