Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
https://arxiv.org/abs/2601.16108
The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.
https://arxiv.org/abs/2601.16027
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
https://arxiv.org/abs/2601.15918
In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and this http URL. Results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
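The latency constraint above can be made concrete with a small budget check. This is an illustrative sketch, not code from the paper: the 140 ms MTP budget and the ~54 ms / >150 ms stage timings come from the abstract, while the pipeline stage names and the rendering figure are assumptions.

```python
# Hypothetical latency-budget check for a real-time FER loop (sketch).
# The 140 ms motion-to-photon (MTP) budget is from the abstract; the
# per-stage timings below are illustrative, not measured values.

MTP_BUDGET_MS = 140.0

def fits_budget(stage_latencies_ms, budget_ms=MTP_BUDGET_MS):
    """Return (total_latency, within_budget) for a sequential pipeline."""
    total = sum(stage_latencies_ms.values())
    return total, total <= budget_ms

# Detection at ~54 ms (YOLOv11n, per the abstract) leaves ~86 ms for
# classification and rendering; a >150 ms classifier cannot fit.
pipeline = {"detection": 54.0, "classification": 150.0, "render": 10.0}
total_ms, ok = fits_budget(pipeline)
```

With the hypothetical numbers above, the classifier alone overshoots the remaining budget, which is the "Latency Wall" the abstract describes.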
https://arxiv.org/abs/2601.15914
Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient's health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aiming to reduce side effects and streamline clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation. We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at this https URL.
https://arxiv.org/abs/2601.15884
This paper introduces a novel approach to securing machine learning model deployments against potential distribution shifts in practical applications: the Total Variation Out-of-Distribution (TV-OOD) detection method. Existing methods have produced satisfactory results, but TV-OOD improves upon them by leveraging the Total Variation Network Estimator to calculate each input's contribution to the overall total variation. By defining this contribution as the total variation score, TV-OOD discriminates between in- and out-of-distribution data. The method's efficacy was tested across a range of models and datasets, consistently yielding results in image classification tasks that were either comparable or superior to those achieved by leading-edge out-of-distribution detection techniques across all evaluation metrics.
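The scoring-and-thresholding idea can be sketched in a few lines. This is a toy stand-in, not the paper's method: the Total Variation Network Estimator is approximated here by summed absolute differences between adjacent feature activations, and the threshold would in practice be calibrated on in-distribution data.

```python
# Illustrative sketch of a total-variation-style OOD score (assumption:
# the paper's Total Variation Network Estimator is replaced by the summed
# absolute differences between adjacent entries of a feature vector).

def tv_score(features):
    """Total-variation score of a 1-D feature vector."""
    return sum(abs(b - a) for a, b in zip(features, features[1:]))

def is_ood(features, threshold):
    """Flag an input as out-of-distribution when its TV score exceeds
    a threshold calibrated on in-distribution data."""
    return tv_score(features) > threshold

smooth = [0.1, 0.12, 0.11, 0.13]   # ID-like: low variation
jagged = [0.9, -0.8, 1.1, -1.2]    # OOD-like: high variation
```

Under this stand-in, the smooth vector scores 0.05 while the jagged one scores 5.9, so a threshold anywhere between the two separates them.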
https://arxiv.org/abs/2601.15867
The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.
https://arxiv.org/abs/2601.15820
Autonomous Unmanned Underwater Vehicles (UUVs) enable military and civilian covert operations in coastal areas without relying on support vessels or Global Navigation Satellite Systems (GNSS). Such operations are critical when surface access is not possible and stealthy navigation is required in restricted environments such as protected zones or dangerous areas under access bans. GNSS-denied navigation is then essential to maintaining concealment, as surfacing could expose UUVs to detection. To ensure precise fleet positioning, a constellation of beacons deployed by aerial or surface drones establishes a synthetic landmark network that guides the fleet of UUVs along an optimized path from the continental shelf to the goal on the shore. These beacons, either submerged or floating, emit acoustic signals for UUV localisation and navigation. A hierarchical planner generates an adaptive route for the drones executing primitive actions, while continuously monitoring and replanning as needed to maintain trajectory accuracy.
https://arxiv.org/abs/2601.15802
High-dimensional malware datasets often exhibit feature redundancy, instability, and scalability limitations, which hinder the effectiveness and interpretability of machine learning-based malware detection systems. Although feature selection is commonly employed to mitigate these issues, many existing approaches lack robustness when applied to large-scale and heterogeneous malware data. To address this gap, this paper proposes CAFE-GB (Chunk-wise Aggregated Feature Estimation using Gradient Boosting), a scalable feature selection framework designed to produce stable and globally consistent feature rankings for high-dimensional malware detection. CAFE-GB partitions training data into overlapping chunks, estimates local feature importance using gradient boosting models, and aggregates these estimates to derive a robust global ranking. Feature budget selection is performed separately through a systematic k-selection and stability analysis to balance detection performance and robustness. The proposed framework is evaluated on two large-scale malware datasets: BODMAS and CIC-AndMal2020, representing large and diverse malware feature spaces. Experimental results show that classifiers trained on CAFE-GB-selected features achieve performance parity with full-feature baselines across multiple metrics, including Accuracy, F1-score, MCC, ROC-AUC, and PR-AUC, while reducing feature dimensionality by more than 95%. Paired Wilcoxon signed-rank tests confirm that this reduction does not introduce statistically significant performance degradation. Additional analyses demonstrate low inter-feature redundancy and improved interpretability through SHAP-based explanations. Runtime and memory profiling further indicate reduced downstream classification overhead. Overall, CAFE-GB provides a stable, interpretable, and scalable feature selection strategy for large-scale malware detection.
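The chunk-then-aggregate scheme can be sketched with a pluggable importance estimator. This is a hedged sketch, not CAFE-GB itself: the gradient-boosting importance estimator is replaced by a caller-supplied function (a per-feature variance scorer stands in below), and the overlap scheme and mean aggregation are illustrative choices.

```python
# Sketch of chunk-wise aggregated feature estimation (assumptions: the
# estimator argument stands in for the paper's gradient-boosting model;
# overlap must be smaller than chunk_size).

def make_chunks(n_rows, chunk_size, overlap):
    """Yield (start, end) index ranges of overlapping chunks."""
    step = chunk_size - overlap
    start = 0
    while start < n_rows:
        yield (start, min(start + chunk_size, n_rows))
        start += step

def aggregate_importance(data, chunk_size, overlap, estimator):
    """Average per-chunk importance vectors into a global feature ranking."""
    sums, count = None, 0
    for lo, hi in make_chunks(len(data), chunk_size, overlap):
        imp = estimator(data[lo:hi])  # local importances for this chunk
        sums = imp if sums is None else [s + i for s, i in zip(sums, imp)]
        count += 1
    mean = [s / count for s in sums]
    return sorted(range(len(mean)), key=lambda j: -mean[j])  # best first

# Toy stand-in estimator: per-feature variance within the chunk.
def variance_estimator(rows):
    cols = list(zip(*rows))
    return [sum((x - sum(c) / len(c)) ** 2 for x in c) / len(c) for c in cols]

# Feature 0 is constant, feature 1 varies, so feature 1 ranks first.
ranking = aggregate_importance([[0, 1], [0, 3], [0, 5], [0, 7]],
                               chunk_size=2, overlap=1,
                               estimator=variance_estimator)
```

A k-sized feature budget, as in the paper, would then keep the first k indices of the aggregated ranking.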
https://arxiv.org/abs/2601.15754
Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn't exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
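The three-tier decomposition can be sketched as metric computations over gold/predicted labels. This is an illustrative reading of the abstract, not the paper's evaluation code: the label "NA" marks non-applicable attributes, and the specific metric implementations below are assumptions.

```python
# Sketch of the three-tier evaluation (assumptions: "NA" marks a
# non-applicable attribute; Tier 1 = overall accuracy incl. NA,
# Tier 2 = NA-detection F1, Tier 3 = accuracy on applicable items).

def f1_for_label(gold, pred, label):
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tier_metrics(gold, pred):
    tier1 = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    tier2 = f1_for_label(gold, pred, "NA")
    applicable = [(g, p) for g, p in zip(gold, pred) if g != "NA"]
    tier3 = sum(g == p for g, p in applicable) / len(applicable)
    return tier1, tier2, tier3

# Hypothetical "outer fabric" labels: the model classifies applicable
# items perfectly but misses one non-applicable case.
gold = ["denim", "NA", "wool", "NA", "silk"]
pred = ["denim", "NA", "wool", "silk", "silk"]
t1, t2, t3 = tier_metrics(gold, pred)
```

This toy run reproduces the paper's qualitative finding in miniature: fine-grained classification (Tier 3) can be perfect while applicability detection (Tier 2) lags.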
https://arxiv.org/abs/2601.15711
This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes: boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels, without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an $\alpha$-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
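The $\alpha$-scale tempering of class weights can be sketched as follows. The abstract does not give the exact formula, so this uses a common form (assumed, not taken from the paper): inverse-frequency weights raised to a power $\alpha \in [0, 1]$ and normalized to unit mean.

```python
# Sketch of alpha-tempered class reweighting (assumption: the exact
# formula is not in the abstract; the inverse-frequency-to-the-alpha
# form below is one standard choice).

def tempered_weights(class_counts, alpha):
    """Inverse-frequency class weights, tempered by alpha.

    alpha = 1 reproduces full inverse-frequency reweighting;
    alpha = 0 collapses to uniform weights.
    """
    total = sum(class_counts)
    raw = [(total / c) ** alpha for c in class_counts]
    norm = sum(raw)
    return [len(class_counts) * w / norm for w in raw]  # mean weight = 1
```

For a 90/10 class imbalance, $\alpha = 0$ gives uniform weights [1.0, 1.0], $\alpha = 1$ gives [0.2, 1.8], and intermediate $\alpha$ interpolates between the two, which is the "tempering" role the abstract describes.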
https://arxiv.org/abs/2601.15705
Active learning (AL) strategies aim to train high-performance models with minimal labeling effort, selecting only the most informative instances for annotation. Current approaches to evaluating data informativeness predominantly focus on the data's distribution or intrinsic information content and do not directly correlate with downstream task performance, such as mean average precision (mAP) in object detection. Thus, we propose Performance-guided (i.e., mAP-guided) Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages expected model output change as the measure of informativeness. To address the combinatorial explosion of batch sample selection and the non-differentiable relationship between model performance and the selected batch, MGRAL employs a reinforcement-learning-based sampling agent that optimizes selection via policy gradient, with mAP improvement as the reward. Moreover, to reduce the computational overhead of estimating mAP on unlabeled samples, MGRAL uses an unsupervised estimate backed by fast look-up tables, ensuring feasible deployment. We evaluate MGRAL's active learning performance on detection tasks over the PASCAL VOC and COCO benchmarks. Our approach achieves the highest AL curve, supported by convincing visualizations, establishing a new paradigm in reinforcement-learning-driven active object detection.
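The policy-gradient loop can be sketched with a minimal REINFORCE update. This is a bandit-style toy, not MGRAL: single-sample actions stand in for batch selection, and the true mAP-improvement reward is replaced by a caller-supplied function (here, a hypothetical reward of 1 for picking the one "informative" sample).

```python
# Minimal REINFORCE sketch for reward-guided sample selection
# (assumptions: one logit per candidate sample; reward_fn stands in
# for the paper's mAP-improvement signal).

import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, rng=random):
    """One policy-gradient update: sample an action, observe its reward,
    and push the action's log-probability up in proportion to the reward."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # Gradient of log pi(action) w.r.t. logit_j: 1[j == action] - probs[j]
    new_logits = [l + lr * reward * ((1.0 if j == action else 0.0) - p)
                  for j, (l, p) in enumerate(zip(logits, probs))]
    return new_logits, action, reward

# Toy run: only candidate 0 yields reward, so its selection probability grows.
logits = [0.0, 0.0, 0.0]
rng = random.Random(0)
for _ in range(200):
    logits, action, reward = reinforce_step(
        logits, lambda a: 1.0 if a == 0 else 0.0, rng=rng)
```

After the loop, the policy concentrates on the rewarded candidate, which is the mechanism MGRAL uses (with mAP gains as reward) to learn which batches are worth labeling.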
https://arxiv.org/abs/2601.15688
Realistic network traffic simulation is critical for evaluating intrusion detection systems, stress-testing network protocols, and constructing high-fidelity environments for cybersecurity training. While attack traffic can often be layered into training environments using red-teaming or replay methods, generating authentic benign background traffic remains a core challenge -- particularly in simulating the complex temporal and communication dynamics of real-world networks. This paper introduces TempoNet, a novel generative model that combines multi-task learning with multi-mark temporal point processes to jointly model inter-arrival times and all packet- and flow-header fields. TempoNet captures fine-grained timing patterns and higher-order correlations such as host-pair behavior and seasonal trends, addressing key limitations of GAN-, LLM-, and Bayesian-based methods that fail to reproduce structured temporal variation. TempoNet produces temporally consistent, high-fidelity traces, validated on real-world datasets. Furthermore, we show that intrusion detection models trained on TempoNet-generated background traffic perform comparably to those trained on real data, validating its utility for real-world security applications.
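The marked-point-process view of traffic can be sketched with a drastically simplified sampler. This is a toy, not TempoNet: the learned multi-mark intensity is replaced by a constant-rate (homogeneous Poisson) process with an independent categorical mark distribution, and the protocol marks are illustrative.

```python
# Toy marked temporal point process sampler (assumptions: constant
# event rate stands in for TempoNet's learned intensity; marks are
# drawn independently of timestamps, unlike in the real model).

import random

def sample_stream(rate, mark_probs, n_events, rng):
    """Draw n_events (timestamp, mark) pairs from a homogeneous process."""
    marks = list(mark_probs)
    weights = list(mark_probs.values())
    t, events = 0.0, []
    for _ in range(n_events):
        t += rng.expovariate(rate)                  # inter-arrival gap
        mark = rng.choices(marks, weights=weights)[0]
        events.append((t, mark))
    return events

rng = random.Random(42)
stream = sample_stream(rate=2.0, mark_probs={"TCP": 0.7, "UDP": 0.3},
                       n_events=5, rng=rng)
```

TempoNet's contribution is precisely what this toy omits: conditioning both the inter-arrival intensity and the packet/flow-header marks on history, so that host-pair behavior and seasonal trends are reproduced.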
https://arxiv.org/abs/2601.15663
Hallucinations in Large Language Models (LLMs) -- generations that are plausible but factually unfaithful -- remain a critical barrier to high-stakes deployment. Current detection methods typically rely on computationally expensive external retrieval loops or opaque black-box LLM judges requiring 70B+ parameters. In this work, we introduce [Model Name], a hybrid detection framework that combines neuroscience-inspired signal design with supervised machine learning. We extract interpretable signals grounded in Predictive Coding (quantifying surprise against internal priors) and the Information Bottleneck (measuring signal retention under perturbation). Through systematic ablation, we demonstrate three key enhancements: Entity-Focused Uptake (concentrating on high-value tokens), Context Adherence (measuring grounding strength), and Falsifiability Score (detecting confident but contradictory claims). On HaluBench (n=200, perfectly balanced), our theory-guided baseline achieves 0.8017 AUROC. BASE supervised models reach 0.8274 AUROC, while IMPROVED features boost performance to 0.8669 AUROC (4.95% gain), demonstrating consistent improvements across architectures. This competitive performance is achieved while using 75x less training data than Lynx (200 vs 15,000 samples), 1000x faster inference (5ms vs 5s), and remaining fully interpretable. Crucially, we report a negative result: the Rationalization signal fails to distinguish hallucinations, suggesting that LLMs generate coherent reasoning for false premises ("Sycophancy"). This work demonstrates that domain knowledge encoded in signal architecture provides superior data efficiency compared to scaling LLM judges, achieving strong performance with lightweight (fewer than 1M parameters), explainable models suitable for production deployment.
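The Predictive Coding "surprise" signal can be sketched very simply. This is an assumed operationalization, not the paper's exact feature: surprise is taken here as mean token-level negative log-probability under the model's own prior, and the Information Bottleneck perturbation term is omitted.

```python
# Minimal sketch of a surprise signal (assumption: surprise = mean
# negative log-probability of generated tokens; the paper's full
# signal set is richer than this).

import math

def token_surprise(token_probs):
    """Per-token surprise in nats: -log p(token)."""
    return [-math.log(p) for p in token_probs]

def mean_surprise(token_probs):
    s = token_surprise(token_probs)
    return sum(s) / len(s)

# A confidently grounded generation (high token probabilities) yields
# low surprise; a shaky one yields high surprise.
grounded = [0.9, 0.8, 0.95]
shaky = [0.3, 0.2, 0.4]
```

A detector would feed signals of this kind, alongside the entity, adherence, and falsifiability features, into a lightweight supervised classifier rather than using a raw threshold.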
https://arxiv.org/abs/2601.15652
Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging, particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at this https URL.
https://arxiv.org/abs/2601.15624
The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (a 13 percent improvement over BERT, with an F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.
https://arxiv.org/abs/2601.15605
Continual Test-Time Adaptation (CTTA) seeks to update a pretrained model during deployment using only the incoming, unlabeled data stream. Although prior approaches such as Tent and EATA provide meaningful improvements under short evolving shifts, they struggle when the test distribution changes rapidly or over extremely long horizons. This challenge is exemplified by the CCC benchmark, where models operate over streams of 7.5M samples with continually changing corruption types and severities. We propose RDumb++, a principled extension of RDumb that introduces two drift-detection mechanisms, entropy-based drift scoring and KL-divergence drift scoring, together with adaptive reset strategies. These mechanisms allow the model to detect when accumulated adaptation becomes harmful and to recover before prediction collapse occurs. Across CCC-medium with three speeds and three seeds (nine runs, each containing one million samples), RDumb++ consistently surpasses RDumb, yielding approximately 3% absolute accuracy gains while maintaining stable adaptation throughout the entire stream. Ablation experiments on drift thresholds and reset strengths further show that drift-aware resetting is essential for preventing collapse and achieving reliable long-horizon CTTA.
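The entropy-based drift-and-reset mechanism can be sketched as a small stateful detector. This is an assumed reading of the abstract, not RDumb++'s exact rule: the detector tracks a running mean of prediction entropy and triggers a reset when the current value drifts past a relative threshold; the threshold and momentum values are illustrative.

```python
# Sketch of entropy-based drift scoring with an adaptive reset
# (assumptions: relative-threshold trigger and EMA baseline are
# illustrative choices; a trigger would reset adaptation state).

import math

def entropy(probs):
    """Shannon entropy of a probability vector (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class DriftDetector:
    def __init__(self, threshold=1.5, momentum=0.9):
        self.threshold = threshold   # relative drift trigger
        self.momentum = momentum     # smoothing for the running baseline
        self.baseline = None

    def update(self, batch_entropy):
        """Return True (signalling a reset) when drift is detected."""
        if self.baseline is None:
            self.baseline = batch_entropy
            return False
        if batch_entropy > self.threshold * self.baseline:
            self.baseline = None     # reset: re-anchor on the next batch
            return True
        self.baseline = (self.momentum * self.baseline
                         + (1 - self.momentum) * batch_entropy)
        return False
```

In a CTTA loop, a True return would restore the pretrained weights (as in RDumb's periodic reset) instead of waiting for collapse, which is the "drift-aware resetting" the ablations show to be essential.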
https://arxiv.org/abs/2601.15544
Early detection of malignant skin lesions is critical for improving patient outcomes in aggressive, metastatic skin cancers. This study evaluates a comprehensive system for preliminary skin lesion assessment that combines the clinically established ABCD rule of dermoscopy (analyzing Asymmetry, Borders, Color, and Dermoscopic Structures) with machine learning classification. Using a 1,000-image subset of the HAM10000 dataset, the system implements an automated, rule-based pipeline to compute a Total Dermoscopy Score (TDS) for each lesion. This handcrafted approach is compared against various machine learning solutions, including traditional classifiers (Logistic Regression, Random Forest, and SVM) and deep learning models. While the rule-based system provides high clinical interpretability, results indicate a performance bottleneck when reducing complex morphology to five numerical features. Experimental findings show that transfer learning with EfficientNet-B0 failed significantly due to domain shift between natural and medical images. In contrast, a custom three-layer Convolutional Neural Network (CNN) trained from scratch achieved 78.5% accuracy and 86.5% recall on median-filtered images, representing a 19-point accuracy improvement over traditional methods. The results demonstrate that direct pixel-level learning captures diagnostic patterns beyond handcrafted features and that purpose-built lightweight architectures can outperform large pretrained models for small, domain-specific medical datasets.
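The rule-based half of the pipeline rests on the standard Total Dermoscopy Score. The weights below (1.3/0.1/0.5/0.5) and the commonly cited Stolz cut-offs are from the established ABCD rule of dermoscopy, not from the paper, whose automated pipeline may tune them differently.

```python
# Total Dermoscopy Score per the ABCD rule of dermoscopy
# (standard weights; thresholds are the commonly cited Stolz cut-offs,
# which the paper's pipeline may adjust).

def total_dermoscopy_score(asymmetry, border, color, structures):
    """TDS from the four ABCD sub-scores.

    asymmetry: 0-2, border: 0-8, color: 1-6, structures: 1-5.
    """
    return 1.3 * asymmetry + 0.1 * border + 0.5 * color + 0.5 * structures

def classify_tds(tds):
    if tds < 4.75:
        return "benign"
    if tds <= 5.45:
        return "suspicious"
    return "malignant"
```

Reducing a lesion to these four numbers is exactly the bottleneck the abstract reports: the CNN trained on raw pixels outperforms classifiers trained on the handcrafted score components.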
https://arxiv.org/abs/2601.15539
Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and diminished public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model's alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model's ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality across Health, Finance, and Law. The benchmark includes two difficulty levels to test LLMs' defensive capabilities across varying knowledge depths. We propose two automated methods for evaluating adversarial attack success and long-form factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates. Long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions. Results show that after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domain, and gaps between difficulty levels narrow as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the model's factual output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.
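The "varying levels of expressed confidence" setup can be sketched as a simple prompt-construction and scoring harness. Everything here is illustrative: the confidence framings, the `build_adversarial_prompt` helper, and the keyword-based refutation check are hypothetical stand-ins for the benchmark's automated evaluation methods.

```python
# Illustrative confidence framings for injecting a false claim into a prompt.
CONFIDENCE_FRAMES = {
    "low": "I might be wrong, but I think {claim}.",
    "medium": "I'm fairly sure that {claim}.",
    "high": "It is a well-established fact that {claim}.",
}

def build_adversarial_prompt(question, false_claim, confidence):
    """Prepend a confidently (or tentatively) framed falsehood to a question."""
    frame = CONFIDENCE_FRAMES[confidence].format(claim=false_claim)
    return f"{frame} Given that, {question}"

def detection_rate(responses, markers=("incorrect", "actually", "not true")):
    """Crude proxy for misinformation detection: fraction of responses
    that push back on the injected claim (keyword match only)."""
    hits = sum(any(m in r.lower() for m in markers) for r in responses)
    return hits / len(responses)
```

In practice a benchmark like this would replace the keyword check with an LLM-as-judge or entailment model, but the attack-success metric keeps the same shape: detections divided by total adversarial prompts.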
https://arxiv.org/abs/2601.15511
The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) \textbf{Causal Detection} (identifying if a text contains a causal link) and 2) \textbf{Causal Extraction} (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, only achieved a mean score of 49.57\% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12\% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. \href{this https URL}{Code available here: this https URL}
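The two skills above suggest two different scorers: detection is label accuracy, while extraction needs the predicted cause and effect phrases to match the gold ones. A minimal sketch under strong assumptions: exact string match after whitespace/case normalization stands in for the paper's actual $C_{detect}$ and $C_{extract}$ metrics, and the function names are hypothetical.

```python
def score_detection(pred_labels, gold_labels):
    """Fraction of texts whose causal/non-causal label is predicted
    correctly (simplified stand-in for C_detect)."""
    correct = sum(p == g for p, g in zip(pred_labels, gold_labels))
    return correct / len(gold_labels)

def score_extraction(pred_pairs, gold_pairs):
    """Exact match over (cause, effect) phrase pairs after normalizing
    case and whitespace (simplified stand-in for C_extract)."""
    def norm(s):
        return " ".join(s.lower().split())
    hits = sum(
        1
        for (pc, pe), (gc, ge) in zip(pred_pairs, gold_pairs)
        if norm(pc) == norm(gc) and norm(pe) == norm(ge)
    )
    return hits / len(gold_pairs)
```

Exact match is deliberately strict; a real evaluation would likely soften it with token-overlap or span-alignment scoring, especially for multi-sentence and multi-pair cases where boundary disagreements dominate.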
https://arxiv.org/abs/2601.15479