Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods, outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.
https://arxiv.org/abs/2601.08832
Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and its extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: this https URL
https://arxiv.org/abs/2601.08831
In this work, we explore the Large Language Model (LLM) agent reviewer dynamics in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as an adaptive review strategy in which reviewers exploit our Elo system without increasing review effort. Our code is available at this https URL.
https://arxiv.org/abs/2601.08829
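The adaptive strategy above hinges on how Elo ratings move after each pairwise outcome. For reference, a minimal sketch of the standard logistic Elo update; the K-factor and 400-point scale are conventional defaults, not values from the paper:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b); score_a is 1 for a win, 0.5 draw, 0 loss."""
    e_a = elo_expected(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# An even matchup transfers k/2 = 16 points to the winner.
print(elo_update(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

Because the update is zero-sum and depends only on observed outcomes, a reviewer can raise its rating by winning comparisons without the underlying review quality changing, which is exactly the exploitation risk the study examines.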
Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
https://arxiv.org/abs/2601.08828
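The key mechanism above, separating temporal dynamics from static appearance with a motion-weighted loss mask, can be sketched generically. The mask below is simply a normalized frame difference; this is an assumption of the sketch, not Motive's published construction:

```python
import numpy as np

def motion_weighted_loss(pred, target, prev_target, eps=1e-6):
    """Per-pixel squared error reweighted toward regions that move between frames.

    `target` and `prev_target` are consecutive ground-truth frames (2D float
    arrays). The mask is the normalized absolute frame difference, so static
    background contributes almost nothing and moving content dominates the loss.
    """
    motion = np.abs(target - prev_target)
    mask = motion / (motion.sum() + eps)          # normalize to a distribution
    return float((mask * (pred - target) ** 2).sum())
```

With a fully static pair of frames the mask is zero everywhere and the loss vanishes, which is the sense in which appearance is excluded from the motion-attribution signal.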
People can respond to feedback and guidance in different ways, and it is important for robots to personalize their interactions and utilize verbal and nonverbal communication cues. We aim to understand how older adults respond to different cadences of verbal and nonverbal feedback of a robot exercise coach. We conducted an online study of older adults, where participants evaluated videos of the robot giving feedback at different cadences for each modality. The results indicate that changing the cadence of one modality affects the perception of both it and the other modality. We can use the results from this study to better design the frequency of the robot coach's feedback during an exercise session with this population.
https://arxiv.org/abs/2601.08819
The evolution of recommender systems has shifted preference storage from rating matrices and dense embeddings to semantic memory in the agentic era. Yet existing agents rely on isolated memory, overlooking crucial collaborative signals. Bridging this gap is hindered by the dual challenges of distilling vast graph contexts without overwhelming reasoning agents with cognitive load, and evolving the collaborative memory efficiently without incurring prohibitive computational costs. To address this, we propose MemRec, a framework that architecturally decouples reasoning from memory management to enable efficient collaborative augmentation. MemRec introduces a dedicated, cost-effective LM_Mem to manage a dynamic collaborative memory graph, serving synthesized, high-signal context to a downstream LLM_Rec. The framework operates via a practical pipeline featuring efficient retrieval and cost-effective asynchronous graph propagation that evolves memory in the background. Extensive experiments on four benchmarks demonstrate that MemRec achieves state-of-the-art performance. Furthermore, architectural analysis confirms its flexibility, establishing a new Pareto frontier that balances reasoning quality, cost, and privacy through support for diverse deployments, including local open-source models. Code: this https URL and Homepage: this https URL
https://arxiv.org/abs/2601.08816
The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, as a fundamental task in 3D understanding, still remains challenging due to the limited reasoning ability of recent 3D visual grounding models. Most current methods incorporate a text encoder and a visual feature encoder to generate cross-modal fused features and predict the referring object. These models often require supervised training on extensive 3D annotation data. On the other hand, recent research also focuses on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and disproportionate to the data collection cost. In this work, we propose a 3D visual grounding data pipeline that automatically synthesizes 3D visual grounding data along with the corresponding reasoning process. Additionally, we leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.
https://arxiv.org/abs/2601.08811
Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at this https URL.
https://arxiv.org/abs/2601.08808
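The multiplex token itself is easy to picture: draw K candidates from the next-token distribution and form a convex combination of their embeddings. The probability-renormalized weighting below is an assumption of this sketch; the abstract does not pin down the exact aggregation:

```python
import numpy as np

def multiplex_token(logits, embedding, k, rng):
    """Sample k distinct candidate tokens, then merge their embeddings into one
    continuous 'multiplex' token via a probability-weighted mean.

    The weighting scheme (renormalized softmax probabilities over the k
    sampled candidates) is an assumption of this sketch.
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()
    idx = rng.choice(len(logits), size=k, replace=False, p=p)
    w = p[idx] / p[idx].sum()        # renormalize over the k sampled candidates
    return embedding[idx].T @ w      # convex combination, shape of one embedding
```

When the distribution is sharply peaked, one weight dominates and the multiplex token is nearly the discrete token embedding; when the model is uncertain, the token blends several plausible continuations, matching the self-adaptive behavior described above.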
Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation, posing challenges when deploying ReID systems in real-world, difficult scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.
https://arxiv.org/abs/2601.08807
We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
https://arxiv.org/abs/2601.08806
Accurate individual identification is essential for monitoring rare amphibians, yet invasive marking is often unsuitable for critically endangered species. We evaluate state-of-the-art computer-vision methods for photographic re-identification of the Hula painted frog (Latonia nigriventer) using 1,233 ventral images from 191 individuals collected during 2013-2020 capture-recapture surveys. We compare deep local-feature matching in a zero-shot setting with deep global-feature embedding models. The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; fine-tuning improves the best global-feature model to 60% top-1 (91% top-10) but remains below local matching. To combine scalability with accuracy, we implement a two-stage workflow in which a fine-tuned global-feature model retrieves a short candidate list that is re-ranked by local-feature matching, reducing end-to-end runtime from 6.5-7.8 hours to ~38 minutes while maintaining ~96% top-1 closed-set accuracy on the labeled dataset. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, enabling practical handling of novel individuals. We deploy this pipeline as a web application for routine field use, providing rapid, standardized, non-invasive identification to support conservation monitoring and capture-recapture analyses. Overall, in this species, zero-shot deep local-feature matching outperformed global-feature embedding and provides a strong default for photo-identification.
https://arxiv.org/abs/2601.08798
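The two-stage workflow above saves runtime because the expensive local-feature matcher only sees a shortlist retrieved by cheap global embeddings. A minimal sketch; the cosine retrieval, the shortlist size, and the `local_score` callback are placeholders for the deployed system's components:

```python
import numpy as np

def two_stage_identify(query_vec, gallery_vecs, local_score, gallery_ids,
                       shortlist=10):
    """Stage 1: cosine similarity on global embeddings retrieves a shortlist.
    Stage 2: an expensive local-feature matcher re-ranks only that shortlist.

    `local_score(i)` stands in for local-feature matching of the query against
    gallery item i; in the real pipeline this is the slow, accurate step.
    """
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    top = np.argsort(g @ q)[::-1][:shortlist]    # cheap global retrieval
    best = max(top, key=local_score)             # costly re-ranking, shortlist only
    return gallery_ids[best]
```

The accuracy/speed trade-off reported in the abstract comes from this structure: local matching cost scales with the shortlist length rather than with the full gallery.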
Diagnosing dental diseases from radiographs is time-consuming and challenging due to the subtle nature of diagnostic evidence. Existing methods, which rely on object detection models designed for natural images with more distinct target patterns, struggle to detect dental diseases that present with far less visual support. To address this challenge, we propose DentalX, a novel context-aware dental disease detection approach that leverages oral structure information to mitigate the visual ambiguity inherent in radiographs. Specifically, we introduce a structural context extraction module that learns an auxiliary task: semantic segmentation of dental anatomy. The module extracts meaningful structural context and integrates it into the primary disease detection task to enhance the detection of subtle dental diseases. Extensive experiments on a dedicated benchmark demonstrate that DentalX significantly outperforms prior methods in both tasks. This mutual benefit arises naturally during model optimization, as the correlation between the two tasks is effectively captured. Our code is available at this https URL.
https://arxiv.org/abs/2601.08797
The rapid emergence of image synthesis models poses challenges to the generalization of AI-generated image detectors. However, existing methods often rely on model-specific features, leading to overfitting and poor generalization. In this paper, we introduce the Multi-Cue Aggregation Network (MCAN), a novel framework that integrates different yet complementary cues in a unified network. MCAN employs a mixture-of-encoders adapter to dynamically process these cues, enabling more adaptive and robust feature representation. Our cues include the input image itself, which represents the overall content, and high-frequency components that emphasize edge details. Additionally, we introduce a Chromatic Inconsistency (CI) cue, which normalizes intensity values and captures noise information introduced during the image acquisition process in real images, making these noise patterns more distinguishable from those in AI-generated content. Unlike prior methods, MCAN's novelty lies in its unified multi-cue aggregation framework, which integrates spatial, frequency-domain, and chromaticity-based information for enhanced representation learning. These cues are intrinsically more indicative of real images, enhancing cross-model generalization. Extensive experiments on the GenImage, Chameleon, and UniversalFakeDetect benchmarks validate the state-of-the-art performance of MCAN. On the GenImage dataset, MCAN outperforms the best state-of-the-art method by up to 7.4% in average ACC across eight different image generators.
https://arxiv.org/abs/2601.08790
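The two auxiliary cues described above can be illustrated concretely. The 3x3 box-blur high-pass below and the intensity normalization are generic stand-ins for whatever extraction MCAN actually uses; both are assumptions of this sketch:

```python
import numpy as np

def highfreq_cue(img):
    """High-frequency residual of a 2D float image: the image minus a 3x3
    box blur, emphasizing edges. The specific high-pass filter is a generic
    choice, not MCAN's exact one."""
    h, w = img.shape
    pad = np.pad(img, 1, mode="edge")
    blur = sum(pad[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    return img - blur

def chromatic_cue(rgb, eps=1e-6):
    """Chromaticity: divide each channel by total intensity so per-channel
    noise introduced at acquisition is compared independently of brightness
    (a sketch of the Chromatic Inconsistency idea)."""
    intensity = rgb.sum(axis=-1, keepdims=True) + eps
    return rgb / intensity
```

A flat image yields a zero high-frequency cue, and the chromaticity channels of any positive image sum to (approximately) one per pixel, so both cues discard overall content and keep only the edge and noise structure the detector aggregates.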
Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks in various domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations primarily focus on simple, daily assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behaviors in professional settings. To address this gap, we introduce SafePro, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity tasks across diverse professional domains with safety risks, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers new unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate safety mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust safety mechanisms tailored to the next generation of professional AI agents.
https://arxiv.org/abs/2601.06663
As large language models (LLMs) become deeply embedded in digital platforms and decision-making systems, concerns about their political biases have grown. While substantial work has examined social biases such as gender and race, systematic studies of political bias remain limited, despite their direct societal impact. This paper introduces a general methodology for constructing political bias benchmarks by aligning model-generated voting predictions with verified parliamentary voting records. We instantiate this methodology in three national case studies: PoliBiasNL (2,701 Dutch parliamentary motions and votes from 15 political parties), PoliBiasNO (10,584 motions and votes from 9 Norwegian parties), and PoliBiasES (2,480 motions and votes from 10 Spanish parties). Across these benchmarks, we assess ideological tendencies and political entity bias in LLM behavior. As part of our evaluation framework, we also propose a method to visualize the ideology of LLMs and political parties in a shared two-dimensional CHES (Chapel Hill Expert Survey) space by linking their voting-based positions to the CHES dimensions, enabling direct and interpretable comparisons between models and real-world political actors. Our experiments reveal fine-grained ideological distinctions: state-of-the-art LLMs consistently display left-leaning or centrist tendencies, alongside clear negative biases toward right-conservative parties. These findings highlight the value of transparent, cross-national evaluation grounded in real parliamentary behavior for understanding and auditing political bias in modern LLMs.
https://arxiv.org/abs/2601.08785
Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of database-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial. In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. We show that performance changes range from -7% to 31% (in relative terms) and rank changes range from $-9$ to $+9$ positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman's $r_s$=0.85, $p$=3.26e-5), whereas they correlate weakly with those on the corrected subset (Spearman's $r_s$=0.32, $p$=0.23). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions or deployment choices. Our code and data are available at this https URL.
https://arxiv.org/abs/2601.08778
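The leaderboard comparisons above rest on Spearman's rank correlation. A minimal tie-free implementation of the standard formula r_s = 1 - 6*sum(d_i^2)/(n(n^2-1)); the paper's exact tie handling is not stated in the abstract:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two score lists (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0
```

Identical orderings give r_s = 1 and fully reversed orderings give r_s = -1, which is why the drop from 0.85 to 0.32 between the uncorrected and corrected subsets signals substantial rank churn.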
Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through test-time scaling: for each prompt, the model produces $k\ge 1$ candidate responses and a user selects their preferred one. We introduce $(k,f(k))$-robust alignment, which requires the $k$-output model to have win rate $f(k)$ against any other single-output model, and asymptotic universal alignment (U-alignment), which requires $f(k)\to 1$ as $k\to\infty$. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose $k$-sample product policies achieve U-alignment at rate $f(k)=\frac{k}{k+1}$, and no method can achieve a faster rate in general. We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for $k=1$, sampling from the resulting (often deterministic) policy cannot guarantee win rates above $\tfrac{1}{2}$ except for an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the $(k+1)$-player alignment game achieves the optimal $(k,\frac{k}{k+1})$-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.
https://arxiv.org/abs/2601.08777
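The optimal rate f(k) = k/(k+1) admits a short heuristic reading for the equilibrium-versus-equilibrium case; the abstract's stronger guarantee against arbitrary single-output opponents requires the paper's full analysis, so the following is only the symmetry intuition:

```latex
% Heuristic: at a symmetric Nash equilibrium \pi^* of the (k+1)-player
% alignment game, draw k i.i.d. responses for the k-sample product policy
% and one for the opponent. By exchangeability (assuming preference ties
% have probability zero), each of the k+1 draws is equally likely to be
% the user's preferred response, so
\Pr[\text{opponent's single draw is preferred}] = \frac{1}{k+1},
\qquad
f(k) = 1 - \frac{1}{k+1} = \frac{k}{k+1} \longrightarrow 1
\quad (k \to \infty).
```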
Histopathology analysis relies on Hematoxylin and Eosin (H&E) staining, but fluorescence microscopy offers complementary information. Converting fluorescence images to H&E-like appearance can aid interpretation and integration with standard workflows. We present a Cycle-Consistent Adversarial Network (CycleGAN) approach for unpaired image-to-image translation from multi-channel fluorescence microscopy to pseudo H&E stained histopathology images. The method combines C01 and C02 fluorescence channels into RGB and learns a bidirectional mapping between fluorescence and H&E domains without paired training data. The architecture uses ResNet-based generators with residual blocks and PatchGAN discriminators, trained with adversarial, cycle-consistency, and identity losses. Experiments on fluorescence microscopy datasets show the model generates realistic pseudo H&E images that preserve morphological structures while adopting H&E-like color characteristics. This enables visualization of fluorescence data in a format familiar to pathologists and supports integration with existing H&E-based analysis pipelines.
https://arxiv.org/abs/2601.08776
Retrieval-Augmented Generation for software engineering often relies on vector similarity search, which captures topical similarity but can fail on multi-hop architectural reasoning such as controller to service to repository chains, interface-driven wiring, and inheritance. This paper benchmarks three retrieval pipelines on Java codebases (Shopizer, with additional runs on ThingsBoard and OpenMRS Core): (A) vector-only No-Graph RAG, (B) an LLM-generated knowledge graph RAG (LLM-KB), and (C) a deterministic AST-derived knowledge graph RAG (DKB) built with Tree-sitter and bidirectional traversal. Using 15 architecture and code-tracing queries per repository, we measure indexing time, query latency, corpus coverage, cost, and answer correctness. DKB builds its graph in seconds, while LLM-KB requires much longer graph generation. LLM-KB also shows indexing incompleteness: on Shopizer, 377 files are skipped or missed, reducing embedded chunk coverage and graph size compared to DKB. End-to-end cost is modest for DKB relative to the vector-only baseline but much higher for LLM-KB, especially as repository scale increases. Query latency is similar for No-Graph and DKB, while LLM-KB is slower and more variable. On the Shopizer question suite, DKB achieves the highest correctness, LLM-KB is close behind, and the vector-only baseline performs worst on upstream architectural queries and has the highest hallucination risk. Overall, deterministic AST-derived graphs provide more reliable coverage and multi-hop grounding than LLM-extracted graphs at substantially lower indexing cost.
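The bidirectional traversal over a deterministic code graph can be sketched as below. This is a toy illustration, not the paper's DKB implementation: the node and edge names are hypothetical stand-ins for an AST-derived graph (the real graph is extracted with Tree-sitter), and only the traversal idea is shown: from a seed class, follow static relations in both directions to gather multi-hop context such as a controller-to-service-to-repository chain.

```python
from collections import deque

# Toy AST-derived code graph: nodes are Java types, edges are static
# relations (calls, implements). Names are illustrative, not Shopizer's.
EDGES = [
    ("OrderController", "calls", "OrderService"),
    ("OrderServiceImpl", "implements", "OrderService"),
    ("OrderServiceImpl", "calls", "OrderRepository"),
]

def expand(seed, edges, hops=2):
    """Collect all nodes within `hops` steps of `seed`, following edges in
    both directions -- the bidirectional traversal that lets a graph RAG
    ground multi-hop architectural questions."""
    fwd, rev = {}, {}
    for src, _, dst in edges:
        fwd.setdefault(src, set()).add(dst)
        rev.setdefault(dst, set()).add(src)
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in fwd.get(node, set()) | rev.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen
```

With two hops from `OrderController`, the traversal reaches `OrderService` and then, via the reversed `implements` edge, `OrderServiceImpl`; a third hop adds `OrderRepository`. A vector-only retriever has no such structural path to follow.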
https://arxiv.org/abs/2601.08773
Generative AI systems are predominantly designed, evaluated, and marketed as intelligent systems which will benefit society by augmenting or automating human cognitive labor, promising to increase personal, corporate, and macroeconomic productivity. But this mainstream narrative about what AI is and what it can do is in tension with another emerging use case: entertainment. We argue that the field of AI is unprepared to measure or respond to how the proliferation of entertaining AI-generated content will impact society. Emerging data suggest AI is already widely adopted for entertainment purposes -- especially by young people -- and represents a large potential source of revenue. We contend that entertainment will become a primary business model for major AI corporations seeking returns on massive infrastructure investments; this will exert a powerful influence on the technology these companies produce in the coming years. Examining current evaluation practices, we identify a critical asymmetry: while AI assessments rigorously measure both benefits and harms of intelligence, they focus almost exclusively on cultural harms. We lack frameworks for articulating how cultural outputs might be actively beneficial. Drawing on insights from the humanities, we propose "thick entertainment" as a framework for evaluating AI-generated cultural content -- one that considers entertainment's role in meaning-making, identity formation, and social connection rather than simply minimizing harm. While AI is often touted for its potential to revolutionize productivity, in the long run we may find that AI turns out to be as much about "intelligence" as social media is about social connection.
https://arxiv.org/abs/2601.08768