Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth due to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective utilization of uncertainty. In this work, we propose a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in CV model predictions, utilizes this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework on the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56% in balanced accuracy (BA) and 3.66% in F1 score compared to existing methods. For the best-performing participants, the improvements reached 7.6% in BA and 6.66% in F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. Overall, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advancements in real-world COD applications and human-computer interaction. Our code and data are available at: this https URL.
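To make the deferral step concrete, here is a minimal sketch of confidence-based routing, assuming the CV model outputs per-image class probabilities; the threshold value and function names are illustrative, not taken from the paper.

```python
import numpy as np

def route_predictions(probs, threshold=0.9):
    """Split images into machine-decided and human-deferred sets.

    probs: (N, 2) class probabilities ("object present" vs. "absent")
    from the CV model. Returns indices decided automatically and indices
    deferred to the RSVP-based BCI stage.
    """
    confidence = probs.max(axis=1)                    # per-image confidence
    auto_idx = np.where(confidence >= threshold)[0]   # trust the CV model
    defer_idx = np.where(confidence < threshold)[0]   # route to human raters
    return auto_idx, defer_idx

# toy usage
probs = np.array([[0.98, 0.02], [0.55, 0.45], [0.10, 0.90]])
print(route_predictions(probs))  # (array([0, 2]), array([1]))
```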
https://arxiv.org/abs/2502.08373
Propaganda is a form of persuasion that has been used throughout history with the goal of influencing people's opinions through rhetorical and psychological persuasion techniques for specific ends. Although Arabic ranks as the fourth most-used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, this paper introduces MultiProSE, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion detection. MultiProSE is an open-source extension of the existing Arabic propaganda dataset, ArPro, with the addition of sentiment and emotion annotations for each text. The dataset comprises 8,000 annotated news articles, making it the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and to contribute to a deeper understanding of how various opinion dimensions interact in news media.
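For context, a minimal sketch of a multi-label PLM baseline of the kind mentioned above, using the Hugging Face transformers API; the checkpoint name and label count are placeholders, not details from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint and label count; the paper's exact PLMs and
# technique inventory may differ.
MODEL_NAME = "bert-base-multilingual-cased"
NUM_LABELS = 20  # assumed number of propaganda techniques

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # BCE-with-logits, one sigmoid per label
)

texts = ["نص خبري للتجربة"]  # example Arabic news text
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
predicted = (torch.sigmoid(logits) > 0.5).int()  # independent decision per technique
print(predicted.shape)  # (1, NUM_LABELS)
```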
https://arxiv.org/abs/2502.08319
Purpose: Surgical performance depends not only on surgeons' technical skills but also on team communication within and across the different professional groups present during the operation. Therefore, automatically identifying team communication in the OR is crucial for patient safety and for advancing computer-assisted surgical workflow analysis and intra-operative support systems. As a first step, we propose a new task of detecting communication briefings involving all OR team members, i.e., the team Time-out and the StOP?-protocol, by localizing their start and end times in video recordings of surgical operations. Methods: We generate an OR dataset of real surgeries, called Team-OR, with more than one hundred hours of surgical videos captured by the multi-view camera system in the OR. The dataset contains temporal annotations of 33 Time-out and 22 StOP?-protocol activities in total. We then propose a novel group activity detection approach that encodes both scene context and action features and uses an efficient neural network model to output the results. Results: Experimental results on the Team-OR dataset show that our approach outperforms existing state-of-the-art temporal action detection approaches. They also highlight the lack of research on group activities in the OR, underscoring the significance of our dataset. Conclusion: We investigate the team Time-out and the StOP?-protocol in the OR by presenting the first OR dataset with temporal annotations of group activity protocols, and by introducing a novel group activity detection approach that outperforms existing approaches. Code is available at this https URL.
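Temporal localization of this kind is often reduced to post-processing per-frame activity scores into start/end intervals; the sketch below shows that generic step only and is not the paper's detection model.

```python
def scores_to_segments(frame_scores, fps=25.0, threshold=0.5, min_len=2.0):
    """Convert per-frame activity probabilities into (start_s, end_s) segments.

    Consecutive frames above `threshold` form one candidate segment;
    segments shorter than `min_len` seconds are dropped.
    """
    segments, start = [], None
    for i, s in enumerate(frame_scores):
        if s >= threshold and start is None:
            start = i                                  # segment opens
        elif s < threshold and start is not None:
            if (i - start) / fps >= min_len:
                segments.append((start / fps, i / fps))
            start = None                               # segment closes
    if start is not None and (len(frame_scores) - start) / fps >= min_len:
        segments.append((start / fps, len(frame_scores) / fps))
    return segments

# toy usage: a burst of high scores between frames 50 and 200 at 25 fps
scores = [0.1] * 50 + [0.9] * 150 + [0.1] * 50
print(scores_to_segments(scores))  # [(2.0, 8.0)]
```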
https://arxiv.org/abs/2502.08299
Hate speech detection is a crucial task, especially on social media, where harmful content can spread quickly. Implementing machine learning models to automatically identify and address hate speech is essential for mitigating its impact and preventing its proliferation. The first step in developing an effective hate speech detection model is to acquire a high-quality dataset for training. Labeled data is foundational for most natural language processing tasks, but categorizing hate speech is difficult due to its diverse and often subjective nature, which can lead to varying interpretations and disagreements among annotators. This paper examines strategies for addressing annotator disagreement, an issue that has been largely overlooked. In particular, we evaluate different approaches to handling annotator disagreement in hate speech classification of Turkish tweets, based on a fine-tuned BERT model. Our work highlights the importance of this problem and provides state-of-the-art benchmark results for the detection and understanding of hate speech in online discourse.
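Two common ways of resolving annotator disagreement are hard majority voting and soft (distributional) labels; the sketch below illustrates both, without implying that these are exactly the strategies compared in the paper.

```python
import numpy as np

def majority_vote(label_matrix):
    """Hard labels: each row holds one annotator's 0/1 labels for each item."""
    return (label_matrix.mean(axis=0) >= 0.5).astype(int)

def soft_labels(label_matrix):
    """Soft labels: keep the empirical annotator distribution as the target."""
    return label_matrix.mean(axis=0)

# 3 annotators x 4 tweets (1 = hate speech)
annotations = np.array([
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
])
print(majority_vote(annotations))  # [1 0 1 0]
print(soft_labels(annotations))    # [0.667 0.333 1.    0.   ]
```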
https://arxiv.org/abs/2502.08266
Automatic monitoring of tree plantations plays a crucial role in agriculture. Reliable monitoring of tree health helps farmers make informed management decisions and take appropriate action. Using drone images for automatic plantation monitoring can enhance the accuracy of the monitoring process while remaining affordable for small farmers in developing countries such as India. Small, low-cost drones equipped with an RGB camera can capture high-resolution images of agricultural fields, allowing detailed analysis of the well-being of the plantations. Existing methods for automated plantation monitoring are mostly based on satellite images, which are difficult for farmers to obtain. We propose an automated system for plantation health monitoring using drone images, which are becoming easier for farmers to acquire. We introduce a dataset of tree images with three categories: "Good health", "Stunted", and "Dead", annotated with the CVAT annotation tool for research purposes. We experiment with several well-known CNN models to observe their performance on the proposed dataset. The initially low accuracy levels indicate the complexity of the proposed dataset. Our study further reveals that a depth-wise convolution operation embedded in a deep CNN model can enhance performance on the drone dataset. Finally, we apply state-of-the-art object detection models to identify individual trees so that they can be monitored automatically.
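A standard depthwise-separable convolution block of the kind the study found helpful; the layer sizes are arbitrary and the block is a generic PyTorch sketch, not the paper's model.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise conv (one filter per channel) followed by a 1x1 point-wise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# toy usage on a 256x256 RGB drone image crop
x = torch.randn(1, 3, 256, 256)
block = DepthwiseSeparableConv(3, 32)
print(block(x).shape)  # torch.Size([1, 32, 256, 256])
```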
https://arxiv.org/abs/2502.08233
The growing demand for efficient semantic communication systems capable of managing diverse tasks and adapting to fluctuating channel conditions has driven the development of robust, resource-efficient frameworks. This article introduces a novel channel-adaptive and multi-task-aware semantic communication framework based on a masked auto-encoder architecture. Our framework optimizes the transmission of meaningful information by incorporating a multi-task-aware scoring mechanism that identifies and prioritizes semantically significant data across multiple concurrent tasks. A channel-aware extractor is employed to dynamically select relevant information in response to real-time channel conditions. By jointly optimizing semantic relevance and transmission efficiency, the framework ensures minimal performance degradation under resource constraints. Experimental results demonstrate the superior performance of our framework compared to conventional methods in tasks such as image reconstruction and object detection. These results underscore the framework's adaptability to heterogeneous channel environments and its scalability for multi-task applications, positioning it as a promising solution for next-generation semantic communication networks.
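As a rough illustration of the channel-adaptive selection idea, the sketch below scores token embeddings and keeps only a top-k subset whose size shrinks as the channel degrades; the scoring input and the budget rule are placeholders, not the paper's mechanism.

```python
import torch

def select_tokens(tokens, task_scores, snr_db, min_ratio=0.25, max_ratio=0.75):
    """Keep only the highest-scoring token embeddings, shrinking the budget
    as the channel degrades.

    tokens:      (N, D) patch/token embeddings from the masked auto-encoder
    task_scores: (N,) task-aware importance score per token
    snr_db:      current channel SNR estimate in dB
    """
    # illustrative budget rule: linear in SNR, clipped to [min_ratio, max_ratio]
    ratio = min(max_ratio, max(min_ratio, min_ratio + snr_db / 40.0))
    k = max(1, int(ratio * tokens.shape[0]))
    keep = torch.topk(task_scores, k).indices   # semantically important tokens
    return tokens[keep], keep

tokens = torch.randn(196, 64)                   # e.g. 14x14 ViT patch tokens
scores = torch.rand(196)
kept, idx = select_tokens(tokens, scores, snr_db=10.0)
print(kept.shape)                               # fewer tokens sent when SNR is low
```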
https://arxiv.org/abs/2502.08221
Deepfake videos are causing growing concern among communities due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting a proportional amount of interest from researchers. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in manipulated videos. Moreover, they fail to attend to manipulation-specific, subtle, and well-localized pattern variations along both spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of forged videos at both the individual frame level and the frame sequence level. Using a ResNet backbone, it strengthens shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further aided by fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that further allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained as a classifier to detect forged content. We evaluate our method on two popular large datasets and achieve significant performance gains over the state-of-the-art. Moreover, our technique also provides memory and computational advantages over competitive techniques.
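A standard CBAM-style spatial attention block of the kind described for strengthening shallow frame-level features; this is a generic sketch, not the authors' exact module.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Weights each spatial location using channel-pooled statistics (CBAM-style)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, C, H, W) frame features
        avg_pool = x.mean(dim=1, keepdim=True)  # (B, 1, H, W)
        max_pool = x.amax(dim=1, keepdim=True)  # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn                         # emphasise manipulation-prone regions

# toy usage on shallow ResNet features
feat = torch.randn(2, 64, 56, 56)
print(SpatialAttention()(feat).shape)  # torch.Size([2, 64, 56, 56])
```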
https://arxiv.org/abs/2502.08216
Collaborative perception, which fuses information from multiple agents, can extend the perception range and thereby improve perception performance. However, temporal asynchrony in real-world environments, caused by communication delays, clock misalignment, or differences in sampling configuration, can lead to information mismatches. If not handled well, collaborative performance becomes patchy and, worse, safety accidents may occur. To tackle this challenge, we propose CoDynTrust, an uncertainty-encoded asynchronous fusion perception framework that is robust to the information mismatches caused by temporal asynchrony. CoDynTrust generates a dynamic feature trust modulus (DFTM) for each region of interest by modeling aleatoric and epistemic uncertainty and by selectively suppressing or retaining single-vehicle features, thereby mitigating information mismatches. We then design a multi-scale fusion module to handle the multi-scale feature maps processed by DFTM. Compared to existing works that also consider asynchronous collaborative perception, CoDynTrust combats various kinds of low-quality information in temporally asynchronous scenarios and allows uncertainty to be propagated to downstream tasks such as planning and control. Experimental results demonstrate that CoDynTrust significantly reduces the performance degradation caused by temporal asynchrony across multiple datasets, achieving state-of-the-art detection performance even under temporal asynchrony. The code is available at this https URL.
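The sketch below illustrates the general idea of trust-weighted fusion, scaling each collaborator's feature by a scalar that shrinks with its estimated uncertainty and latency; the trust formula here is illustrative and is not the paper's DFTM.

```python
import torch

def fuse_with_trust(ego_feat, collab_feats, variances, delays_ms, tau=200.0):
    """Fuse ego features with collaborator features, each scaled by a scalar
    trust weight that shrinks with predictive variance and message latency.

    ego_feat:     (D,) ego-vehicle feature for one region of interest
    collab_feats: (K, D) features received from K collaborators
    variances:    (K,) per-collaborator uncertainty estimates
    delays_ms:    (K,) communication delay of each message
    """
    trust = torch.exp(-variances) * torch.exp(-delays_ms / tau)   # in (0, 1]
    weighted = (trust.unsqueeze(1) * collab_feats).sum(dim=0)
    return (ego_feat + weighted) / (1.0 + trust.sum())

ego = torch.randn(128)
collab = torch.randn(3, 128)
var = torch.tensor([0.1, 0.5, 2.0])
delay = torch.tensor([50.0, 120.0, 400.0])    # stale messages get little weight
print(fuse_with_trust(ego, collab, var, delay).shape)  # torch.Size([128])
```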
https://arxiv.org/abs/2502.08169
In the field of synthetic aperture radar (SAR) remote sensing image interpretation, vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, yet their applications remain limited in professional domains due to insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. The dataset not only supports key tasks such as visual understanding and object detection, but also has unique innovative aspects: this study develops a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation and providing a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Experiments on 16 mainstream VLMs fully verify the effectiveness of the dataset, and the first multi-task dialogue benchmark in the SAR field has been established. The project will be released at this https URL, aiming to promote the in-depth development and wide application of SAR vision-language models.
https://arxiv.org/abs/2502.08168
Recent advances in large language models (LLMs) have shown promising improvements, often surpassing existing methods across a wide range of downstream tasks in natural language processing. However, these models still face challenges that may hinder their practical applicability. For example, the phenomenon of hallucination is known to compromise the reliability of LLMs, especially in fields that demand high factual precision. Current benchmarks primarily focus on hallucination detection and factuality evaluation but do not extend beyond identification. This paper proposes an explanation-enhanced hallucination-detection model, coined HuDEx, aimed at enhancing the reliability of LLM-generated responses by both detecting hallucinations and providing detailed explanations. The proposed model offers a novel approach to integrating detection with explanations, enabling both users and the LLM itself to understand and reduce errors. Our experimental results demonstrate that the proposed model surpasses larger LLMs, such as Llama3 70B and GPT-4, in hallucination detection accuracy while maintaining reliable explanations. Furthermore, the proposed model performs well in zero-shot and other test environments, showcasing its adaptability across diverse benchmark datasets. By integrating interpretability with hallucination detection, the proposed approach further improves the performance and reliability of evaluating hallucinations in language models.
https://arxiv.org/abs/2502.08109
We introduce Knowledge Swapping, a novel task designed to selectively regulate the knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By analyzing the knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction, starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of Learning Before Forgetting. Comprehensive experiments on various tasks such as image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at this https URL.
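A compressed sketch of what a learn-first, forget-second schedule could look like; the losses, data splits, and epoch counts are placeholders rather than the benchmark's actual recipe.

```python
import torch
import torch.nn.functional as F

def learning_before_forgetting(model, optimizer, new_loader, retain_loader,
                               forget_loader, epochs_learn=3, epochs_forget=1):
    """Phase 1: acquire new knowledge while rehearsing retained data.
    Phase 2: push predictions on the forget set toward uniform while keeping retention."""
    for _ in range(epochs_learn):                                   # learning phase
        for (x_new, y_new), (x_ret, y_ret) in zip(new_loader, retain_loader):
            loss = (F.cross_entropy(model(x_new), y_new)
                    + F.cross_entropy(model(x_ret), y_ret))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    for _ in range(epochs_forget):                                  # forgetting phase
        for (x_fgt, _), (x_ret, y_ret) in zip(forget_loader, retain_loader):
            logits = model(x_fgt)
            uniform = torch.full_like(logits, 1.0 / logits.shape[-1])
            forget_loss = F.kl_div(F.log_softmax(logits, dim=-1), uniform,
                                   reduction="batchmean")
            retain_loss = F.cross_entropy(model(x_ret), y_ret)
            loss = forget_loss + retain_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```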
https://arxiv.org/abs/2502.08075
Multimodal Large Language Models (MLLMs) represent the cutting edge of AI technology, with DeepSeek models emerging as a leading open-source alternative offering competitive performance to closed-source systems. While these models demonstrate remarkable capabilities, their vision-language integration mechanisms introduce specific vulnerabilities. We implement an adapted embedding manipulation attack on DeepSeek Janus that induces targeted visual hallucinations through systematic optimization of image embeddings. Through extensive experimentation across COCO, DALL-E 3, and SVIT datasets, we achieve hallucination rates of up to 98.0% while maintaining high visual fidelity (SSIM > 0.88) of the manipulated images on open-ended questions. Our analysis demonstrates that both 1B and 7B variants of DeepSeek Janus are susceptible to these attacks, with closed-form evaluation showing consistently higher hallucination rates compared to open-ended questioning. We introduce a novel multi-prompt hallucination detection framework using LLaMA-3.1 8B Instruct for robust evaluation. The implications of these findings are particularly concerning given DeepSeek's open-source nature and widespread deployment potential. This research emphasizes the critical need for embedding-level security measures in MLLM deployment pipelines and contributes to the broader discussion of responsible AI implementation.
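A generic gradient-based embedding-manipulation loop in the style described above; the encoder interface, objective, and hyperparameters are assumptions for illustration, not the exact attack applied to DeepSeek Janus.

```python
import torch
import torch.nn.functional as F

def manipulate_image(vision_encoder, image, target_embedding,
                     steps=100, lr=0.01, eps=0.03):
    """Optimise a small perturbation so the image embedding drifts toward a
    target embedding, which can induce hallucinated answers downstream.

    vision_encoder: callable mapping a (1, 3, H, W) image to an embedding tensor
    eps:            max per-pixel change, keeping the image visually unchanged
    """
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        emb = vision_encoder((image + delta).clamp(0, 1))
        loss = 1.0 - F.cosine_similarity(emb.flatten(1),
                                         target_embedding.flatten(1)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                   # keep the perturbation small
            delta.clamp_(-eps, eps)
    return (image + delta).detach().clamp(0, 1)
```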
https://arxiv.org/abs/2502.07905
Detecting AI-generated images is a challenging yet essential task. A primary difficulty arises from the detector's tendency to rely on spurious patterns, such as compression artifacts, which can influence its decisions. These issues often stem from specific patterns that the detector associates with the real data distribution, making it difficult to isolate the actual generative traces. We argue that an image should be classified as fake if and only if it contains artifacts introduced by the generative model. Based on this premise, we propose Stay Positive, an algorithm designed to constrain the detector's focus to generative artifacts while disregarding those associated with real data. Experimental results demonstrate that detectors trained with Stay Positive exhibit reduced susceptibility to spurious correlations, leading to improved generalization and robustness to post-processing. Additionally, unlike detectors that associate artifacts with real images, those that focus purely on fake artifacts are better at detecting inpainted real images.
https://arxiv.org/abs/2502.07778
We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open-domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.
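A toy version of the character-level name augmentations mentioned above (name swaps and substitutions with well-known entities); the replacement mapping is purely illustrative.

```python
import re

def swap_names(text, name_a, name_b):
    """Exchange two character names throughout a story (name-swap augmentation)."""
    placeholder = "\x00TMP\x00"
    text = re.sub(rf"\b{re.escape(name_a)}\b", placeholder, text)
    text = re.sub(rf"\b{re.escape(name_b)}\b", name_a, text)
    return text.replace(placeholder, name_b)

def substitute_names(text, mapping):
    """Replace original names with well-known real or fictional entities."""
    for original, famous in mapping.items():
        text = re.sub(rf"\b{re.escape(original)}\b", famous, text)
    return text

story = "Alice met Bob at the manor. Later, Bob vanished and Alice was blamed."
print(swap_names(story, "Alice", "Bob"))
print(substitute_names(story, {"Alice": "Sherlock Holmes", "Bob": "Napoleon"}))
```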
https://arxiv.org/abs/2502.07747
Prostate cancer is a leading health concern among men, requiring accurate and accessible methods for early detection and risk stratification. Prostate volume (PV) is a key parameter in multivariate risk stratification for early prostate cancer detection, commonly estimated using transrectal ultrasound (TRUS). While TRUS provides precise prostate volume measurements, its invasive nature often compromises patient comfort. Transabdominal ultrasound (TAUS) provides a non-invasive alternative but faces challenges such as lower image quality, complex interpretation, and reliance on operator expertise. This study introduces a new deep-learning-based framework for automatic PV estimation using TAUS, emphasizing its potential to enable accurate and non-invasive prostate cancer risk stratification. A dataset of TAUS videos from 100 individual patients was curated, with prostate boundaries manually delineated and diameters calculated by an expert clinician as ground truth. The introduced framework integrates deep-learning models for prostate segmentation in both axial and sagittal planes, automatic prostate diameter estimation, and PV calculation. Segmentation performance was evaluated using the Dice similarity coefficient (%) and the Hausdorff distance (mm), and the framework's volume estimation capability was evaluated in terms of volumetric error (mL). The framework estimates PV from TAUS videos with a mean volumetric error of -5.5 mL, corresponding to an average relative error between 5 and 15%. The introduced framework for automatic PV estimation from TAUS images, utilizing deep-learning models for prostate segmentation, shows promising results: it effectively segments the prostate and estimates its volume, offering potential for reliable, non-invasive risk stratification in early prostate cancer detection.
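Prostate volume is conventionally derived from three orthogonal diameters with the prolate-ellipsoid formula PV = π/6 · L · W · H; the sketch below assumes that convention (the abstract does not state the exact formula used) and shows how a signed volumetric error could be computed.

```python
import math

def ellipsoid_volume_ml(length_cm, width_cm, height_cm):
    """Prolate-ellipsoid prostate volume: PV = pi/6 * L * W * H (cm^3 ~ mL)."""
    return math.pi / 6.0 * length_cm * width_cm * height_cm

def volumetric_error_ml(estimated_ml, reference_ml):
    """Signed volumetric error between the automatic estimate and ground truth."""
    return estimated_ml - reference_ml

pv_est = ellipsoid_volume_ml(4.1, 4.8, 3.9)   # diameters estimated from TAUS
pv_ref = ellipsoid_volume_ml(4.3, 5.0, 4.0)   # clinician-derived reference
print(round(pv_est, 1), round(volumetric_error_ml(pv_est, pv_ref), 1))
```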
https://arxiv.org/abs/2502.07859
In this study, we introduce the process for creating BiaSWE, an expert-annotated dataset tailored for misogyny detection in the Swedish language. To address the cultural and linguistic specificity of misogyny in Swedish, we collaborated with experts from the social sciences and humanities. Our interdisciplinary team developed a rigorous annotation process, incorporating both domain knowledge and language expertise, to capture the nuances of misogyny in a Swedish context. This methodology ensures that the dataset is not only culturally relevant but also aligned with broader efforts in bias detection for low-resource languages. The dataset, along with the annotation guidelines, is publicly available for further research.
https://arxiv.org/abs/2502.07637
Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion tasks, such as prediction and planning, always impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method separating semantic and motion learning, similar to the Bayes filter. Specifically, we employ a set of learned motion queries that operate in parallel with the detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset show improvements of 5% in detection and 11% in tracking. Our method achieves state-of-the-art collision rates in open-loop planning evaluation without any modifications to the planning module.
https://arxiv.org/abs/2502.07631
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: this https URL
https://arxiv.org/abs/2502.07601
Mass-produced optical lenses often exhibit defects that alter their scattering properties and compromise quality standards. Manual inspection is usually adopted to detect such defects, but it suffers from low accuracy, high error rates, and limited scalability. To address these challenges, this study presents an automated defect detection system based on the YOLOv8 deep-learning model. A custom dataset of optical lenses, annotated with defect and lens regions, was created to train the model. Experimental results show that the system can efficiently and accurately detect defects in optical lenses. The proposed system can be deployed in real-time industrial environments to enhance quality control by enabling reliable and scalable defect detection in optical lens manufacturing.
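A minimal sketch of how such a system can be trained and queried with the public ultralytics YOLOv8 API; the dataset config, class names, and hyperparameters are placeholders, not the study's settings.

```python
from ultralytics import YOLO

# Fine-tune a pretrained YOLOv8 model on the custom lens-defect dataset.
# "lens_defects.yaml" is a placeholder data config listing the train/val
# image folders and the class names (e.g. "lens", "defect").
model = YOLO("yolov8n.pt")
model.train(data="lens_defects.yaml", epochs=100, imgsz=640, batch=16)

# Run inference on a new lens image and inspect the detected boxes.
results = model("sample_lens.jpg", conf=0.25)
for box in results[0].boxes:
    cls_id = int(box.cls)
    print(results[0].names[cls_id], float(box.conf), box.xyxy.tolist())
```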
https://arxiv.org/abs/2502.07592
Prior efforts in building computer-assisted pronunciation training (CAPT) systems often treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as separate fronts: the former aims to provide multiple pronunciation aspect scores across diverse linguistic levels, while the latter focuses on pinpointing the precise phonetic pronunciation errors made by non-native language learners. However, a full-fledged CAPT system is generally expected to perform both functionalities simultaneously and efficiently. In response to this surging demand, in this work we first propose HMamba, a novel CAPT approach that seamlessly integrates the APA and MDD tasks in parallel. In addition, we introduce a novel loss function, the decoupled cross-entropy loss (deXent), specifically tailored for MDD to facilitate better-supervised learning for detecting mispronounced phones, thereby enhancing overall performance. A comprehensive set of empirical results on the speechocean762 benchmark dataset demonstrates the effectiveness of our approach on APA. Notably, our proposed approach also yields a considerable improvement in MDD performance over a strong baseline, achieving an F1-score of 63.85%. Our code is made available at this https URL.
https://arxiv.org/abs/2502.07575