Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs distilled from a teacher, driven by a preference-guided trajectory-alignment objective that transfers both linguistic and visual planning capabilities to embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% lower inference latency than state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
https://arxiv.org/abs/2601.09708
Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
https://arxiv.org/abs/2601.09706
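The value-conditioned prefix embedding described above could be sketched as follows. This is a minimal illustration, assuming a sinusoidal expansion of the signed log-magnitude (in the spirit of positional encodings); the scheme, dimension, and function name are assumptions, not the paper's actual implementation:

```python
import numpy as np

def value_embedding(value: float, dim: int = 8) -> np.ndarray:
    """Map a number to a dense vector encoding its magnitude and sign.

    Hypothetical scheme: signed log-magnitude expanded with sinusoidal
    features over a geometric ladder of frequencies.
    """
    m = np.sign(value) * np.log1p(abs(value))  # compress dynamic range
    freqs = 2.0 ** np.arange(dim // 2)         # [1, 2, 4, 8] for dim=8
    angles = m / freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Numerically close values get similar embeddings; distant ones diverge,
# unlike symbolic token IDs where "100" and "101" share no structure.
e100, e101, e9 = (value_embedding(v) for v in (100.0, 101.0, 9.0))
d_near = np.linalg.norm(e100 - e101)
d_far = np.linalg.norm(e100 - e9)
```

Such a vector would be prepended (as the dedicated prefix token's embedding) to the usual tokenization of the number, so magnitude information enters the input space without changing the tokenizer.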
LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries by the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and supports empathetic communication in asynchronous healthcare.
https://arxiv.org/abs/2601.09696
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
https://arxiv.org/abs/2601.09694
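The layer-wise sensitivity profile described above, combining two importance signals as z-scores, could be sketched like this. The inputs, equal weighting, and function name are illustrative assumptions; the paper's exact Wanda-style and gradient metrics are not reproduced:

```python
import numpy as np

def sensitivity_zscores(weight_act: np.ndarray, grad_imp: np.ndarray) -> np.ndarray:
    """Combine two per-layer importance signals into model-agnostic z-scores.

    Each array holds one hypothetical score per Transformer layer.
    Z-normalization puts both signals on a common scale regardless of the
    model, so the pruning agent can compare layers across architectures.
    """
    def z(x: np.ndarray) -> np.ndarray:
        return (x - x.mean()) / (x.std() + 1e-8)
    return 0.5 * z(weight_act) + 0.5 * z(grad_imp)

scores = sensitivity_zscores(
    np.array([0.2, 0.9, 0.4, 0.1]),   # e.g. Wanda-style |W|*activation summaries
    np.array([0.3, 0.8, 0.5, 0.2]),   # e.g. gradient-magnitude summaries
)
prune_first = int(np.argmin(scores))  # least-sensitive layer pruned first
```

In the paper's framework, profiles like `scores` are what the LLM agent reads (alongside its reflection on past iterations) before choosing which layer to prune next.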
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
https://arxiv.org/abs/2601.09668
Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available at this https URL.
https://arxiv.org/abs/2601.09663
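The in-batch pseudo-label assignment described above can be sketched as a globally optimal one-to-one matching between detections and identity prototypes. In practice one would use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`); for a tiny toy K we brute-force the same optimum. The cosine-similarity cost and all shapes here are illustrative assumptions:

```python
import numpy as np
from itertools import permutations

def assign_pseudo_labels(embeddings: np.ndarray, prototypes: np.ndarray) -> list:
    """Match K detection embeddings to K identity prototypes one-to-one.

    Maximizes total cosine similarity; for small K, exhaustive search over
    permutations returns the same optimum the Hungarian algorithm would.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = e @ p.T  # (K, K) pairwise cosine similarities
    best = max(permutations(range(len(e))),
               key=lambda perm: sum(sim[i, j] for i, j in enumerate(perm)))
    return list(best)

# Three detections matched against three identity prototypes (toy 2-D features).
emb = np.array([[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]])
proto = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]])
labels = assign_pseudo_labels(emb, proto)
```

The resulting labels act as self-bootstrapped supervision: the known animal count fixes K, so every frame pair yields a consistent in-batch assignment without any identity annotation.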
Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
https://arxiv.org/abs/2601.09661
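The PCA-based decomposition described above, separating coarse semantic directions from fine-grained variation, could be sketched as follows. This is only the subspace split; the paper's coarse-alignment and fine-separation objectives built on top of it are not reproduced, and the function name and rank choice are assumptions:

```python
import numpy as np

def split_coarse_fine(text_embs: np.ndarray, k: int = 2):
    """Split class text embeddings into a rank-k coarse component and a
    fine-grained residual.

    The top-k principal directions of the embedding matrix stand in for the
    coarse semantic subspace; what projection leaves behind is the
    fine-grained variation.
    """
    mu = text_embs.mean(axis=0)
    X = text_embs - mu
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # principal directions
    basis = vt[:k]                          # (k, D) coarse subspace
    coarse = X @ basis.T @ basis + mu       # projection onto the subspace
    fine = text_embs - coarse               # residual fine-grained variation
    return coarse, fine

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 6))                # 10 toy class embeddings, D=6
coarse, fine = split_coarse_fine(E, k=2)
```

Optimizing within such a split lets one nudge a new class's embedding along fine-grained directions (for discriminability) while keeping its coarse component anchored (for global semantic consistency).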
Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.
https://arxiv.org/abs/2601.09658
While GUI agents have shown strong performance under explicit, fully specified instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability to resolve vague instructions and provide proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS. Results show that HIM-Agent significantly improves execution and proactive performance by 15.7% and 7.3%, respectively.
https://arxiv.org/abs/2601.09636
Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
https://arxiv.org/abs/2601.09631
In recent years, foundation models have become very popular due to their exceptional performance, mainly in natural language processing (NLP) tasks, where they were first introduced. These models usually consist of hundreds of millions, or even billions, of parameters, making them resource-intensive during training and in production systems, leading to increased costs. This paper focuses on reducing a foundation model's size when applied to music information retrieval (MIR) tasks. Our research combines the Branchformer architecture with SummaryMixing, which were first applied in speech recognition, along with a random quantization process. To facilitate reproducibility, we conduct pre-training on publicly available datasets, complemented by a proprietary dataset comparable in scale to other private datasets reported in the literature. We ensure robust evaluation by using a framework consisting of a variety of downstream MIR tasks. Our results show that our architecture achieves competitive performance when compared with other state-of-the-art models that use multi-head self-attention, while reducing the model size by 8.5% to 12.3%.
https://arxiv.org/abs/2601.09603
This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This allows the trained encoder to avoid capturing information irrelevant to a given video, such as the color of a patch of background pixels. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at this https URL.
https://arxiv.org/abs/2601.09524
To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for exemplar-free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected feedforward layers and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods. Code and data are available at this https URL.
https://arxiv.org/abs/2601.09512
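The autoencoder-based routing described above can be sketched as follows: each task gets a small autoencoder over backbone features, and at deployment the adapter whose autoencoder reconstructs the current feature best is activated, with no task label needed. The linear encode/decode pairs and shapes here are illustrative assumptions:

```python
import numpy as np

def route_adapter(feature: np.ndarray, autoencoders: list) -> int:
    """Pick the adapter whose per-task autoencoder reconstructs `feature`
    with the lowest error.

    Each autoencoder is sketched as an (encode, decode) pair of linear maps;
    features from a task's distribution reconstruct well under that task's
    autoencoder, which is what makes label-free routing possible.
    """
    errors = []
    for enc, dec in autoencoders:
        recon = dec @ (enc @ feature)
        errors.append(np.linalg.norm(feature - recon))
    return int(np.argmin(errors))

# Two toy "tasks": each autoencoder preserves a different feature subspace.
ae_a = (np.array([[1.0, 0.0]]), np.array([[1.0], [0.0]]))  # keeps x-axis
ae_b = (np.array([[0.0, 1.0]]), np.array([[0.0], [1.0]]))  # keeps y-axis
task = route_adapter(np.array([0.9, 0.1]), [ae_a, ae_b])
```

A feature lying mostly along the x-axis routes to the first adapter, since the second autoencoder discards nearly all of it and incurs a large reconstruction error.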
We present PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. PrivLEX is the first interpretable privacy classifier aligned with legal concepts that leverages the recognition capabilities of Vision-Language Models (VLMs). PrivLEX relies on zero-shot VLM concept detection to provide interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training. We demonstrate PrivLEX's ability to identify personal data concepts that are present in images. We further analyse the sensitivity of such concepts as perceived by human annotators of image privacy datasets.
https://arxiv.org/abs/2601.09449
In language models (LMs), intra-memory knowledge conflict largely arises when inconsistent information about the same event is encoded within the model's parametric knowledge. While prior work has primarily focused on resolving conflicts between a model's internal knowledge and external resources through approaches such as fine-tuning or knowledge editing, the problem of localizing conflicts that originate during pre-training within the model's internal representations remains unexplored. In this work, we design a framework based on mechanistic interpretability methods to identify where and how conflicting knowledge from the pre-training data is encoded within LMs. Our findings contribute to a growing body of evidence that specific internal components of a language model are responsible for encoding conflicting knowledge from pre-training, and we demonstrate how mechanistic interpretability methods can be leveraged to causally intervene in and control conflicting knowledge at inference time.
https://arxiv.org/abs/2601.09445
Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identifying semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coin analysis, and provides an evaluation of their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.
https://arxiv.org/abs/2601.09433
Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.
https://arxiv.org/abs/2601.09421
With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but often restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than encoded solely in the final layers. To leverage this distribution of information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks different from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate layer information and demonstrate a principled, task-aware approach for unlocking their potential in probing-based adaptation.
https://arxiv.org/abs/2601.09322
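The attentive fusion over layers described above can be sketched as a single softmax attention step: one pooled feature per Transformer layer, weighted by similarity to a learned query. The scaled dot-product form and the random stand-in for the learned query are assumptions:

```python
import numpy as np

def attentive_layer_fusion(layer_feats: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Fuse per-layer features with a query via softmax attention.

    `layer_feats` is (L, D): one pooled feature per Transformer layer.
    `query` (D,) stands in for the probe's learned attention query; in
    training, it is what learns which layers matter for the target task.
    """
    scores = layer_feats @ query / np.sqrt(layer_feats.shape[1])
    w = np.exp(scores - scores.max())
    w = w / w.sum()                        # attention weights over layers
    return w @ layer_feats                 # weighted combination, shape (D,)

L, D = 12, 4
rng = np.random.default_rng(1)
feats = rng.normal(size=(L, D))            # toy ViT with 12 layers
fused = attentive_layer_fusion(feats, rng.normal(size=D))
```

The fused vector then feeds the lightweight head exactly as a last-layer feature would in standard linear probing, so the backbone stays frozen throughout.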
Whole-brain parcellation from MRI is a critical yet challenging task due to the complexity of subdividing the brain into numerous small, irregularly shaped regions. Traditionally, template-registration methods were used, but recent advances have shifted to deep learning for faster workflows. While large models like the Segment Anything Model (SAM) offer transferable feature representations, they are not tailored for the high precision required in brain parcellation. To address this, we propose BrainSegNet, a novel framework that adapts SAM for accurate whole-brain parcellation into 95 regions. We enhance SAM by integrating U-Net skip connections and specialized modules into its encoder and decoder, enabling fine-grained anatomical precision. Key components include a hybrid encoder combining U-Net skip connections with SAM's transformer blocks, a multi-scale attention decoder with pyramid pooling for varying-sized structures, and a boundary refinement module to sharpen edges. Experimental results on the Human Connectome Project (HCP) dataset demonstrate that BrainSegNet outperforms several state-of-the-art methods, achieving higher accuracy and robustness in complex, multi-label parcellation.
https://arxiv.org/abs/2601.09263
Infrared object detection focuses on identifying and locating objects in complex environments (e.g., dark, snow, and rain) where visible imaging cameras are disabled by poor illumination. However, due to low contrast and weak edge information in infrared images, it is challenging to extract discriminative object features for robust detection. To deal with this issue, we propose a novel vision-language representation learning paradigm for infrared object detection. An additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module to align the object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles text-aligned object features and non-object features by minimizing their correlation. Finally, the disentangled object features are fed into the detection head. In this manner, the detection performance can be remarkably enhanced via more discriminative and less noisy features. Extensive experimental results demonstrate that our approach achieves superior performance on two benchmarks: M3FD (83.7% mAP) and FLIR (86.1% mAP). Our code will be publicly available once the paper is accepted.
https://arxiv.org/abs/2601.09228
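The correlation-minimizing disentanglement described above could be sketched as a loss over two feature batches: standardize both, form their cross-correlation matrix, and penalize its entries. This is a sketch of the idea only; the paper's exact OFD objective, batch shapes, and function name are assumptions:

```python
import numpy as np

def correlation_loss(obj: np.ndarray, non_obj: np.ndarray) -> float:
    """Penalize cross-correlation between object and non-object features.

    Both inputs are (N, D) feature batches. Standardizing each dimension and
    averaging the squared entries of the (D, D) cross-correlation matrix
    yields a loss that is zero only when the two feature sets are
    decorrelated dimension-by-dimension.
    """
    a = (obj - obj.mean(axis=0)) / (obj.std(axis=0) + 1e-8)
    b = (non_obj - non_obj.mean(axis=0)) / (non_obj.std(axis=0) + 1e-8)
    corr = a.T @ b / len(a)                # (D, D) cross-correlation
    return float((corr ** 2).mean())

rng = np.random.default_rng(2)
x = rng.normal(size=(256, 8))
independent = correlation_loss(x, rng.normal(size=(256, 8)))  # ~0
entangled = correlation_loss(x, x + 0.1 * rng.normal(size=(256, 8)))
```

Minimizing such a term pushes the non-object branch away from the text-aligned object features, so the detection head sees features that are discriminative rather than contaminated by background structure.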