The measurements performed by particle physics experiments must account for the imperfect response of the detectors used to observe the interactions. One approach, unfolding, statistically adjusts the experimental data for detector effects. Recently, generative machine learning models have shown promise for performing unbinned unfolding in a high number of dimensions. However, all current generative approaches are limited to unfolding a fixed set of observables, making them unable to perform full-event unfolding in the variable dimensional environment of collider data. A novel modification to the variational latent diffusion model (VLD) approach to generative unfolding is presented, which allows for unfolding of high- and variable-dimensional feature spaces. The performance of this method is evaluated in the context of semi-leptonic top quark pair production at the Large Hadron Collider.
https://arxiv.org/abs/2404.14332
Quantitatively explaining the strength of arguments under gradual semantics has recently received increasing attention. Specifically, several works in the literature provide quantitative explanations by computing the attribution scores of arguments. These works disregard the importance of attacks and supports, even though they play an essential role when explaining arguments' strength. In this paper, we propose a novel theory of Relation Attribution Explanations (RAEs), adapting Shapley values from game theory to offer fine-grained insights into the role that attacks and supports play in determining arguments' strength in quantitative bipolar argumentation. We show that RAEs satisfy several desirable properties. We also propose a probabilistic algorithm to approximate RAEs efficiently. Finally, we show the application value of RAEs in fraud detection and large language model case studies.
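The Shapley adaptation can be illustrated on a toy example. The sketch below computes exact Shapley values over coalitions of relations (the paper instead uses a probabilistic approximation), and the `strength` function used in the usage note is a stand-in additive semantics, not a semantics from the paper.

```python
import math
from itertools import combinations

def relation_attribution(strength, relations):
    """Exact Shapley value of each attack/support relation with respect to the
    target argument's strength. `strength(S)` returns the target's strength
    when only the relations in set S are present in the graph."""
    n = len(relations)
    phi = {}
    for r in relations:
        others = [x for x in relations if x != r]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # Standard Shapley coalition weight |S|! (n-|S|-1)! / n!
                weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                total += weight * (strength(frozenset(S) | {r}) - strength(frozenset(S)))
        phi[r] = total
    return phi
```

For an additive toy semantics, each relation's attribution equals its direct contribution, and the attributions always sum to the difference between the strength with all relations and the strength with none (the efficiency property).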
https://arxiv.org/abs/2404.14304
Light Detection and Ranging (LiDAR) technology has proven to be an important part of many robotics systems. Surface normals estimated from LiDAR data are commonly used for a variety of tasks in such systems. As most of today's mechanical LiDAR sensors produce sparse data, estimating normals from a single scan in a robust manner poses difficulties. In this paper, we address the problem of estimating normals for sparse LiDAR data while avoiding the typical issue of smoothing out the normals in high curvature areas. Mechanical LiDARs rotate a set of rigidly mounted lasers. One firing of such a set of lasers produces an array of points where each point's neighbor is known due to the known firing pattern of the scanner. We use this knowledge to connect these points to their neighbors and label them using the angles of the lines connecting them. When estimating normals at these points, we only consider points with the same label as neighbors. This allows us to avoid estimating normals in high curvature areas. We evaluate our approach on a variety of data, both self-recorded and publicly available, acquired using various sparse LiDAR sensors. We show that using our method for normal estimation leads to normals that are more robust in areas with high curvature, which leads to maps of higher quality. We also show that our method only incurs a constant factor runtime overhead with respect to a lightweight baseline normal estimation procedure and is therefore suited for operation in computationally demanding environments.
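A minimal numpy sketch of the labeling idea (not the authors' implementation): quantize the direction of each segment connecting consecutive points of one firing into a label, then fit a plane only to same-label neighbors, so points across a sharp edge never enter the same fit. The bin count and neighborhood size are illustrative choices.

```python
import numpy as np

def angle_labels(scanline, bins=8):
    # scanline: (N, 3) points of one firing, ordered by the known firing pattern.
    d = np.diff(scanline, axis=0)
    # Elevation angle of each connecting segment, quantized into `bins` labels.
    ang = np.arctan2(d[:, 2], np.linalg.norm(d[:, :2], axis=1))
    labels = np.digitize(ang, np.linspace(-np.pi / 2, np.pi / 2, bins + 1))
    return np.append(labels, labels[-1])  # last point inherits the last label

def normal_at(scanline, labels, i, k=3):
    # Fit a plane to the 2k+1 same-label points nearest (in scan order) to point i.
    same = np.flatnonzero(labels == labels[i])
    nbrs = same[np.argsort(np.abs(same - i))[: 2 * k + 1]]
    pts = scanline[nbrs] - scanline[nbrs].mean(axis=0)
    # The right-singular vector of the smallest singular value is the plane normal.
    return np.linalg.svd(pts)[2][-1]
```

Because labels change abruptly at edges, a point on a corner only ever gathers neighbors from its own face, which is what keeps the estimate sharp there.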
https://arxiv.org/abs/2404.14281
Recent advances in the field of generative artificial intelligence (AI) have blurred the lines between authentic and machine-generated content, making it almost impossible for humans to distinguish between such media. One notable consequence is the use of AI-generated images for fake profiles on social media. While several types of disinformation campaigns and similar incidents have been reported in the past, a systematic analysis has been lacking. In this work, we conduct the first large-scale investigation of the prevalence of AI-generated profile pictures on Twitter. We tackle the challenges of a real-world measurement study by carefully integrating various data sources and designing a multi-stage detection pipeline. Our analysis of nearly 15 million Twitter profile pictures shows that 0.052% were artificially generated, confirming their notable presence on the platform. We comprehensively examine the characteristics of these accounts and their tweet content, and uncover patterns of coordinated inauthentic behavior. The results also reveal several motives, including spamming and political amplification campaigns. Our research reaffirms the need for effective detection and mitigation strategies to cope with the potential negative effects of generative AI in the future.
https://arxiv.org/abs/2404.14244
The rapidly developing Large Vision Language Models (LVLMs) have shown notable capabilities on a range of multi-modal tasks, but still face the hallucination phenomenon, where the generated texts do not align with the given contexts, significantly restricting their usage. Most previous work detects and mitigates hallucination at the coarse-grained level or requires expensive annotation (e.g., labeling by proprietary models or human experts). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is that we generate a small sentence-level hallucination annotation dataset with proprietary models, with which we train a hallucination detection model that performs sentence-level hallucination detection, covering the primary hallucination types (i.e., object, attribute, and relationship). Then, we propose a detect-then-rewrite pipeline to automatically construct a preference dataset for training a hallucination mitigation model. Furthermore, we propose differentiating the severity of hallucinations and introduce Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO), which mitigates hallucination in LVLMs by incorporating hallucination severity into preference learning. Extensive experiments demonstrate the effectiveness of our method.
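The detect-then-rewrite idea can be sketched as follows. Here `detect` and `rewrite` stand in for the trained sentence-level detector and a rewriting model, and the dictionary keys follow common preference-data conventions rather than the paper's exact format.

```python
def build_preference_pairs(responses, detect, rewrite):
    """detect(sentence) -> True if the sentence is hallucinated;
    rewrite(sentence) -> a corrected sentence. Each response whose rewrite
    differs from the original becomes one (rejected, chosen) preference pair."""
    pairs = []
    for response in responses:
        sentences = response.split(". ")
        fixed = [rewrite(s) if detect(s) else s for s in sentences]
        rewritten = ". ".join(fixed)
        if rewritten != response:
            pairs.append({"rejected": response, "chosen": rewritten})
    return pairs
```

Severity-aware preference learning would then weight each pair by how severe its detected hallucinations are, rather than treating all pairs equally.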
https://arxiv.org/abs/2404.14233
We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the change point within a text at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.
https://arxiv.org/abs/2404.14183
With the popularity of social media platforms such as Instagram and TikTok, and the widespread availability and convenience of retouching tools, an increasing number of individuals are utilizing these tools to beautify their facial photographs. This poses challenges for fields that place high demands on the authenticity of photographs, such as identity verification and social media. By altering facial images, users can easily create deceptive images, leading to the dissemination of false information. This may pose challenges to the reliability of identity verification systems and social media, and even lead to online fraud. To address this issue, some work has proposed makeup removal methods, but they still lack the ability to restore images involving geometric deformations caused by retouching. To tackle the problem of facial retouching restoration, we propose a framework, dubbed Face2Face, which consists of three components: a facial retouching detector, an image restoration model named FaceR, and a color correction module called Hierarchical Adaptive Instance Normalization (H-AdaIN). Firstly, the facial retouching detector predicts a retouching label containing three integers, indicating the retouching methods and their corresponding degrees. Then FaceR restores the retouched image based on the predicted retouching label. Finally, H-AdaIN is applied to address the issue of color shift arising from diffusion models. Extensive experiments demonstrate the effectiveness of our framework and each module.
https://arxiv.org/abs/2404.14177
In order to clear the world of the threat posed by landmines and other explosive devices, robotic systems can play an important role. However, the development of such field robots that need to operate in hazardous conditions requires the careful consideration of multiple aspects related to the perception, mobility, and collaboration capabilities of the system. In the framework of a European challenge, the Artificial Intelligence for Detection of Explosive Devices - eXtended (AIDEDeX) project proposes to design a heterogeneous multi-robot system with advanced sensor fusion algorithms. This system is specifically designed to detect and classify improvised explosive devices, explosive ordnances, and landmines. This project integrates specialised sensors, including electromagnetic induction, ground penetrating radar, X-Ray backscatter imaging, Raman spectrometers, and multimodal cameras, to achieve comprehensive threat identification and localisation. The proposed system comprises a fleet of unmanned ground vehicles and unmanned aerial vehicles. This article details the operational phases of the AIDEDeX system, from rapid terrain exploration using unmanned aerial vehicles to specialised detection and classification by unmanned ground vehicles equipped with a robotic manipulator. Initially focusing on a centralised approach, the project will also explore the potential of a decentralised control architecture, taking inspiration from swarm robotics to provide a robust, adaptable, and scalable solution for explosive detection.
https://arxiv.org/abs/2404.14167
Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often fail to specifically address the significance of low-level features, which are crucial for optimal performance on downstream scene text tasks. Further research is also hindered by the lack of extremely low-light text datasets. To address these limitations, we propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement. Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction. Additionally, we present a Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light images based on publicly available scene text datasets such as ICDAR15 (IC15). We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to allow for objective assessment of extremely low-light image enhancement through scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both image quality and scene text metrics on the widely-used LOL, SID, and synthetic IC15 datasets. Code and dataset will be released publicly at this https URL.
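Deep Curve Estimation models adjust images by iterating a per-pixel quadratic curve; the sketch below applies that curve with fixed scalar parameters, so synthesizing a low-light image (as Supervised-DCE does) simply corresponds to negative curve parameters. The fixed alphas here are illustrative, not learned values.

```python
import numpy as np

def apply_curve(img, alphas):
    # img in [0, 1]; each iteration applies LE(x) = x + a * x * (1 - x).
    # Positive `a` brightens, negative `a` darkens; repeated application
    # composes gentle curves into a stronger overall adjustment.
    x = np.asarray(img, dtype=float)
    for a in alphas:
        x = x + a * x * (1.0 - x)
    return np.clip(x, 0.0, 1.0)
```

Note that the curve fixes the endpoints (0 stays 0, 1 stays 1), which is why it preserves contrast ordering while shifting mid-tones.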
https://arxiv.org/abs/2404.14135
Quantum computing represents a cutting-edge frontier in artificial intelligence. It makes use of hybrid quantum-classical computation, which tries to leverage quantum-mechanical principles that allow us to take a different approach to deep learning classification problems. The work presented here falls within the context of the AGILE space mission, launched in 2007 by the Italian Space Agency. We implement different Quantum Convolutional Neural Networks (QCNNs) that analyze data acquired by the instruments onboard AGILE to detect Gamma-Ray Bursts from sky maps or light curves. We use several frameworks, such as TensorFlow-Quantum, Qiskit, and PennyLane, to simulate a quantum computer. We achieved an accuracy of 95.1% on sky maps with QCNNs, while the classical counterpart achieved 98.8% on the same data, albeit using hundreds of thousands more parameters.
https://arxiv.org/abs/2404.14133
In this paper, we present a simple yet effective contrastive knowledge distillation approach, which can be formulated as a sample-wise alignment problem with intra- and inter-sample constraints. Unlike traditional knowledge distillation methods that concentrate on maximizing feature similarities or preserving class-wise semantic correlations between teacher and student features, our method attempts to recover the "dark knowledge" by aligning sample-wise teacher and student logits. Specifically, our method first minimizes logit differences within the same sample by considering their numerical values, thus preserving intra-sample similarities. Next, we bridge semantic disparities by leveraging dissimilarities across different samples. Note that constraints on intra-sample similarities and inter-sample dissimilarities can be efficiently and effectively reformulated into a contrastive learning framework with newly designed positive and negative pairs. The positive pair consists of the teacher's and student's logits derived from an identical sample, while the negative pairs are formed by using logits from different samples. With this formulation, our method benefits from the simplicity and efficiency of contrastive learning through the optimization of InfoNCE, yielding a run-time complexity that is far less than $O(n^2)$, where $n$ represents the total number of training samples. Furthermore, our method eliminates the need for hyperparameter tuning, particularly for temperature parameters and large batch sizes. We conduct comprehensive experiments on three datasets including CIFAR-100, ImageNet-1K, and MS COCO. Experimental results clearly confirm the effectiveness of the proposed method on both image classification and object detection tasks. Our source code will be publicly available at this https URL.
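A numpy sketch of the contrastive formulation: within a batch, the diagonal entries (teacher and student logits of the same sample) are positives and the off-diagonal entries are negatives, with InfoNCE applied over the similarity matrix. Cosine-normalised logits are an illustrative choice here, not necessarily the paper's exact preprocessing.

```python
import numpy as np

def contrastive_kd_loss(student_logits, teacher_logits):
    # Positive pair: student and teacher logits of the same sample (diagonal).
    # Negative pairs: logits of different samples (off-diagonal).
    s = student_logits / np.linalg.norm(student_logits, axis=1, keepdims=True)
    t = teacher_logits / np.linalg.norm(teacher_logits, axis=1, keepdims=True)
    sim = s @ t.T
    sim = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # InfoNCE on the diagonal
```

Each positive only competes against the other samples in its own batch, which is why the cost stays far below comparing all $O(n^2)$ pairs over the full training set.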
https://arxiv.org/abs/2404.14109
Lunar exploration has become a key focus, driving scientific and technological advances. Ongoing missions are deploying rovers to the surface of the Moon, targeting the far side and south pole. However, these terrains pose challenges, emphasizing the need for precise obstacle and resource detection to avoid mission risks. This work proposes a novel system that integrates eXtended Reality (XR) and Artificial Intelligence (AI) to teleoperate lunar rovers. It is capable of autonomously detecting rocks and recreating an immersive 3D virtual environment of the location of the robot. This system has been validated in a lunar laboratory to observe its advantages over traditional 2D-based teleoperation approaches.
https://arxiv.org/abs/2404.14095
Purpose: The recent Segment Anything Model (SAM) has demonstrated impressive performance with point, text, or bounding box prompts in various applications. However, in safety-critical surgical tasks, prompting is not possible because (i) per-frame prompts are unavailable for supervised learning, (ii) it is unrealistic to prompt frame-by-frame in a real-time tracking application, and (iii) it is expensive to annotate prompts for offline applications. Methods: We develop Surgical-DeSAM to generate automatic bounding box prompts for decoupling SAM to obtain instrument segmentation in real-time robotic surgery. We utilise a commonly used detection architecture, DETR, and fine-tune it to obtain bounding box prompts for the instruments. We then employ decoupling SAM (DeSAM) by replacing the image encoder with the DETR encoder and fine-tune the prompt encoder and mask decoder to obtain instance segmentation for the surgical instruments. To improve detection performance, we adopt the Swin Transformer for better feature representation. Results: The proposed method has been validated on two publicly available datasets from the MICCAI surgical instrument segmentation challenges EndoVis 2017 and 2018. The performance of our method is also compared with SOTA instrument segmentation methods and demonstrates significant improvements, with Dice metrics of 89.62 and 90.70 for EndoVis 2017 and 2018, respectively. Conclusion: Our extensive experiments and validations demonstrate that Surgical-DeSAM enables real-time instrument segmentation without any additional prompting and outperforms other SOTA segmentation methods.
https://arxiv.org/abs/2404.14040
Automatic depression detection from conversational data has gained significant interest in recent years. The DAIC-WOZ dataset, interviews conducted by a human-controlled virtual agent, has been widely used for this task. Recent studies have reported enhanced performance when incorporating interviewer's prompts into the model. In this work, we hypothesize that this improvement might be mainly due to a bias present in these prompts, rather than the proposed architectures and methods. Through ablation experiments and qualitative analysis, we discover that models using interviewer's prompts learn to focus on a specific region of the interviews, where questions about past experiences with mental health issues are asked, and use them as discriminative shortcuts to detect depressed participants. In contrast, models using participant responses gather evidence from across the entire interview. Finally, to highlight the magnitude of this bias, we achieve a 0.90 F1 score by intentionally exploiting it, the highest result reported to date on this dataset using only textual information. Our findings underline the need for caution when incorporating interviewers' prompts into models, as they may inadvertently learn to exploit targeted prompts, rather than learning to characterize the language and behavior that are genuinely indicative of the patient's mental health condition.
https://arxiv.org/abs/2404.14463
Deep learning has shown great power in the field of fault detection. However, for incipient faults with tiny amplitude, the detection performance of current deep learning networks (DLNs) is not satisfactory. Even if prior information about the faults is utilized, DLNs cannot successfully detect faults 3, 9, and 15 in the Tennessee Eastman process (TEP). These faults are notoriously difficult to detect, lacking effective detection technologies in the field of fault detection. In this work, we propose the Autoencoder-assisted Feature Ensemble Net (AE-FENet): a deep feature ensemble framework that uses an unsupervised autoencoder to conduct the feature transformation. Compared with the principal component analysis (PCA) technique adopted in the original Feature Ensemble Net (FENet), the autoencoder can mine more precise features of incipient faults, which results in the better detection performance of AE-FENet. With the same kinds of base detectors, AE-FENet achieves a state-of-the-art average accuracy of over 96% on faults 3, 9, and 15 in TEP, which represents a significant enhancement in performance compared to other methods. Extensive experiments have been conducted to extend our framework, proving that DLNs can be utilized efficiently within this architecture.
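To make the PCA-to-autoencoder swap concrete, here is a tiny tied-weight linear autoencoder trained by gradient descent. This is a deliberately minimal stand-in, not the paper's model (which is presumably nonlinear): the point is only that features for the ensemble's base detectors become the encoded `X @ W` instead of PCA scores.

```python
import numpy as np

def train_linear_autoencoder(X, k=2, lr=0.01, epochs=2000, seed=0):
    # Minimise ||X W W^T - X||^2 over a tied encoder/decoder weight W (d x k).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, k))
    for _ in range(epochs):
        E = X @ W @ W.T - X                            # reconstruction error
        grad = 2.0 * (X.T @ E @ W + E.T @ X @ W) / n   # d/dW of the mean loss
        W -= lr * grad
    return W
```

Encoding is `Z = X @ W`. On exactly low-rank data this recovers what PCA achieves (near-zero reconstruction error); a nonlinear encoder can go further on subtle incipient-fault structure.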
https://arxiv.org/abs/2404.13941
As a preliminary work, NeRF-Det unifies the tasks of novel view synthesis and 3D perception, demonstrating that perceptual tasks can benefit from novel view synthesis methods like NeRF and significantly improving the performance of indoor multi-view 3D object detection. Using the geometry MLP of NeRF to direct the attention of the detection head to crucial parts, and incorporating a self-supervised loss from novel view rendering, contribute to the achieved improvement. To better leverage the notable advantages of the continuous representation through neural rendering in space, we introduce a novel 3D perception network structure, NeRF-DetS. The key component of NeRF-DetS is the Multi-level Sampling-Adaptive Network, which makes the sampling process adaptive from coarse to fine. Also, we propose a superior multi-view information fusion method, known as Multi-head Weighted Fusion. This fusion approach efficiently addresses the challenge of losing multi-view information when using the arithmetic mean, while keeping computational costs low. NeRF-DetS outperforms the competitive NeRF-Det on the ScanNetV2 dataset, achieving improvements of +5.02% and +5.92% in mAP@.25 and mAP@.50, respectively.
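A hedged sketch of the fusion idea (the abstract does not give the module's exact parameterisation): each head softmax-weights the views instead of averaging them, so informative views can dominate, and uniform weights recover the arithmetic mean as a special case.

```python
import numpy as np

def multi_head_weighted_fusion(views, head_logits):
    # views: (V, D) per-view features; head_logits: (H, V) learned per-head scores.
    w = np.exp(head_logits - head_logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)    # softmax over views, per head
    fused = w @ views                        # (H, D): each head's weighted view mix
    return fused.mean(axis=0)                # combine heads
```

The cost is the same order as the arithmetic mean (one weighted sum per head), which matches the abstract's claim of low computational overhead.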
https://arxiv.org/abs/2404.13921
The availability of smart devices leads to an exponential increase in multimedia content. However, the rapid advancements in deep learning have given rise to sophisticated algorithms capable of manipulating or creating multimedia fake content, known as Deepfake. Audio Deepfakes pose a significant threat by producing highly realistic voices, thus facilitating the spread of misinformation. To address this issue, numerous audio anti-spoofing detection challenges have been organized to foster the development of anti-spoofing countermeasures. This survey paper presents a comprehensive review of every component within the detection pipeline, including algorithm architectures, optimization techniques, application generalizability, evaluation metrics, performance comparisons, available datasets, and open-source availability. For each aspect, we conduct a systematic evaluation of the recent advancements, along with discussions on existing challenges. Additionally, we also explore emerging research topics on audio anti-spoofing, including partial spoofing detection, cross-dataset evaluation, and adversarial attack defence, while proposing some promising research directions for future work. This survey paper not only identifies the current state-of-the-art to establish strong baselines for future experiments but also guides future researchers on a clear path for understanding and enhancing the audio anti-spoofing detection mechanisms.
https://arxiv.org/abs/2404.13914
With recent advances in speech synthesis including text-to-speech (TTS) and voice conversion (VC) systems enabling the generation of ultra-realistic audio deepfakes, there is growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show the superior performance of the proposed RAD framework over baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the 2019 and 2021 LA sets. Further sample analysis indicates that the retriever consistently retrieves samples mostly from the same speaker with acoustic characteristics highly consistent with the query audio, thereby improving detection performance.
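A simplified stand-in for the retrieval step (the actual framework feeds retrieved samples into a multi-fusion attentive classifier rather than averaging scores): cosine-retrieve the k most similar stored embeddings and fuse their spoofing scores with the base detector's score. The `alpha` mixing weight and the embeddings are illustrative assumptions.

```python
import numpy as np

def retrieve(query_emb, store_embs, k=3):
    # Indices of the k stored embeddings most cosine-similar to the query.
    q = query_emb / np.linalg.norm(query_emb)
    s = store_embs / np.linalg.norm(store_embs, axis=1, keepdims=True)
    return np.argsort(s @ q)[::-1][:k]

def rad_score(query_emb, store_embs, store_scores, base_score, k=3, alpha=0.5):
    # Blend the single-model score with the scores of the retrieved neighbours.
    idx = retrieve(query_emb, store_embs, k)
    return alpha * base_score + (1 - alpha) * float(np.mean(store_scores[idx]))
```

This also illustrates the transparency benefit: the retrieved indices can be shown alongside the decision as supporting evidence.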
https://arxiv.org/abs/2404.13892
Sequential DeepFake detection is an emerging task that aims to predict the manipulation sequence in order. Existing methods typically formulate it as an image-to-sequence problem, employing conventional Transformer architectures for detection. However, these methods lack dedicated design and consequently result in limited performance. In this paper, we propose a novel Texture-aware and Shape-guided Transformer to enhance detection performance. Our method features four major improvements. Firstly, we describe a texture-aware branch that effectively captures subtle manipulation traces with the Diversiform Pixel Difference Attention module. Then we introduce a Bidirectional Interaction Cross-attention module that seeks deep correlations among spatial and sequential features, enabling effective modeling of complex manipulation traces. To further enhance the cross-attention, we describe a Shape-guided Gaussian mapping strategy, providing initial priors of the manipulation shape. Finally, observing that the latter manipulation in a sequence may influence traces left in the earlier one, we intriguingly invert the prediction order from forward to backward, leading to notable gains as expected. Extensive experimental results demonstrate that our method outperforms others by a large margin, highlighting the superiority of our method.
https://arxiv.org/abs/2404.13873
Generating synthetic fake faces, known as pseudo-fake faces, is an effective way to improve the generalization of DeepFake detection. Existing methods typically generate these faces by blending real or fake faces in color space. While these methods have shown promise, they overlook the simulation of the frequency distribution in pseudo-fake faces, limiting in-depth learning of generic forgery traces. To address this, this paper introduces {\em FreqBlender}, a new method that can generate pseudo-fake faces by blending frequency knowledge. Specifically, we investigate the major frequency components and propose a Frequency Parsing Network to adaptively partition the frequency components related to forgery traces. Then we blend this frequency knowledge from fake faces into real faces to generate pseudo-fake faces. Since there is no ground truth for frequency components, we describe a dedicated training strategy that leverages the inner correlations among different frequency knowledge to guide the learning process. Experimental results demonstrate the effectiveness of our method in enhancing DeepFake detection, making it a potential plug-and-play strategy for other methods.
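The blending step can be illustrated with a fixed frequency mask; the paper instead learns which components to transfer via its Frequency Parsing Network, so the mask below is only a stand-in.

```python
import numpy as np

def blend_frequency(real, fake, mask):
    # mask: same shape as the images, 1 where the frequency component should
    # come from the fake face, 0 where it should come from the real face.
    R = np.fft.fft2(real)
    F = np.fft.fft2(fake)
    return np.real(np.fft.ifft2(R * (1 - mask) + F * mask))
```

With an all-zero mask the real face is returned unchanged, and with an all-one mask the fake face is; intermediate masks transfer only the selected forgery-related bands into the real face.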
https://arxiv.org/abs/2404.13872