Lung and colon cancer are serious worldwide health challenges that require early and precise identification to reduce mortality risk. However, diagnosis depends largely on histopathologists' competence and becomes difficult and risky when such expertise is insufficient. While diagnostic methods like imaging and blood markers contribute to early detection, histopathology remains the gold standard, albeit time-consuming and vulnerable to inter-observer error. Limited access to high-end technology further restricts patients' ability to receive immediate medical care and diagnosis. Recent advances in deep learning have generated interest in its application to medical imaging analysis, specifically the use of histopathological images to diagnose lung and colon cancer. The goal of this investigation is to use and adapt existing pre-trained CNN-based models, such as Xception, DenseNet201, ResNet101, InceptionV3, DenseNet121, DenseNet169, ResNet152, and InceptionResNetV2, to enhance classification through better augmentation strategies. The results show tremendous progress, with all eight models reaching impressive accuracies ranging from 97% to 99%. Furthermore, attention visualization techniques such as Grad-CAM, Grad-CAM++, Score-CAM, Faster Score-CAM, and LayerCAM, as well as Vanilla Saliency and SmoothGrad, are used to provide insight into the models' classification decisions, thereby improving interpretability and understanding of malignant and benign image classification.
https://arxiv.org/abs/2405.04610
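As a rough illustration of how Grad-CAM (one of the visualization techniques listed above) weights feature maps, the following numpy sketch applies the published formula to synthetic activations and gradients; the shapes and names are illustrative, not taken from the paper's code.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer's activations and the
    gradients of the target class score w.r.t. those activations.

    activations, gradients: arrays of shape (K, H, W) for K channels.
    Returns an (H, W) non-negative heatmap scaled to [0, 1].
    """
    # Channel weights alpha_k: global-average-pool the gradients.
    alpha = gradients.mean(axis=(1, 2))                        # shape (K,)
    # Weighted sum of activation maps, then ReLU.
    cam = np.maximum((alpha[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize to [0, 1] for visualization (guard against all-zero maps).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Synthetic example: 4 channels of 8x8 feature maps.
rng = np.random.default_rng(0)
acts = rng.random((4, 8, 8))
grads = rng.standard_normal((4, 8, 8))
heatmap = grad_cam(acts, grads)
```

In a real pipeline the activations and gradients would come from hooks on a chosen convolutional layer; the heatmap is then upsampled to the input resolution and overlaid on the histopathology image.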
BACKGROUND: Lung cancer's high mortality rate can be mitigated by early detection, which is increasingly reliant on artificial intelligence (AI) for diagnostic imaging. However, the performance of AI models is contingent upon the datasets used for their training and validation. METHODS: This study developed and validated the DLCSD-mD and LUNA16-mD models utilizing the Duke Lung Cancer Screening Dataset (DLCSD), encompassing over 2,000 CT scans with more than 3,000 annotations. These models were rigorously evaluated against the internal DLCSD and external LUNA16 and NLST datasets, aiming to establish a benchmark for imaging-based performance. The assessment focused on creating a standardized evaluation framework to facilitate consistent comparison with widely utilized datasets, ensuring a comprehensive validation of the model's efficacy. Diagnostic accuracy was assessed using free-response receiver operating characteristic (FROC) and area under the curve (AUC) analyses. RESULTS: On the internal DLCSD set, the DLCSD-mD model achieved an AUC of 0.93 (95% CI: 0.91-0.94), demonstrating high accuracy. Its performance was sustained on the external datasets, with AUCs of 0.97 (95% CI: 0.96-0.98) on LUNA16 and 0.75 (95% CI: 0.73-0.76) on NLST. Similarly, the LUNA16-mD model recorded an AUC of 0.96 (95% CI: 0.95-0.97) on its native dataset and showed transferable diagnostic performance with AUCs of 0.91 (95% CI: 0.89-0.93) on DLCSD and 0.71 (95% CI: 0.70-0.72) on NLST. CONCLUSION: The DLCSD-mD model exhibits reliable performance across different datasets, establishing the DLCSD as a robust benchmark for lung cancer detection and diagnosis. Through the provision of our models and code to the public domain, we aim to accelerate the development of AI-based diagnostic tools and encourage reproducibility and collaborative advancements within the medical machine-learning (ML) field.
https://arxiv.org/abs/2405.04605
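The AUC-with-confidence-interval figures quoted above can be reproduced in spirit with a rank-based AUC estimate plus a percentile bootstrap; this is a generic sketch, not the authors' evaluation code.

```python
import numpy as np

def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the probability that a random
    positive outscores a random negative (ties counted as 0.5)."""
    labels = np.asarray(labels); scores = np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # O(n_pos * n_neg), fine for small n
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels); scores = np.asarray(scores)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))
        if labels[idx].min() == labels[idx].max():
            continue  # resample contained only one class; skip it
        stats.append(auc(labels[idx], scores[idx]))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Toy demo with perfectly separated scores.
labels = np.array([0] * 10 + [1] * 10)
scores = np.concatenate([np.linspace(0.0, 0.4, 10), np.linspace(0.6, 1.0, 10)])
point = auc(labels, scores)
ci_lo, ci_hi = auc_ci(labels, scores)
```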
In recent years, wide-area visual surveillance systems have been widely applied in various industrial and transportation scenarios. These systems, however, face significant challenges when implementing multi-object detection due to conflicts arising from the need for high-resolution imaging, efficient object searching, and accurate localization. To address these challenges, this paper presents a hybrid system that incorporates a wide-angle camera, a high-speed search camera, and a galvano-mirror. In this system, the wide-angle camera offers panoramic images as prior information, which helps the search camera capture detailed images of the targeted objects. This integrated approach enhances the overall efficiency and effectiveness of wide-area visual detection systems. Specifically, in this study, we introduce a wide-angle camera-based method to generate a panoramic probability map (PPM) for estimating high-probability regions of target object presence. Then, we propose a probability searching module that uses the PPM-generated prior information to dynamically adjust the sampling range and refine target coordinates based on uncertainty variance computed by the object detector. Finally, the integration of PPM and the probability searching module yields an efficient hybrid vision system capable of achieving 120 fps multi-object search and detection. Extensive experiments are conducted to verify the system's effectiveness and robustness.
https://arxiv.org/abs/2405.04589
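A minimal sketch of how a panoramic probability map (PPM) could drive search-camera sampling, assuming the PPM is simply treated as an unnormalized discrete distribution over image cells; the paper's module additionally refines target coordinates via detector uncertainty, which is omitted here.

```python
import numpy as np

def sample_search_regions(ppm, n_samples, rng=None):
    """Draw candidate search positions from a panoramic probability map:
    cells with higher scores are sampled proportionally more often.

    ppm: 2-D array of non-negative scores (need not be normalized).
    Returns an (n_samples, 2) array of (row, col) cell indices.
    """
    rng = rng or np.random.default_rng()
    p = np.asarray(ppm, dtype=float).ravel()
    p = p / p.sum()                                # normalize to a distribution
    flat = rng.choice(p.size, size=n_samples, p=p)  # weighted cell draw
    rows, cols = np.unravel_index(flat, ppm.shape)
    return np.stack([rows, cols], axis=1)

# Degenerate check: all probability mass on one cell.
ppm = np.zeros((4, 6)); ppm[2, 5] = 1.0
pts = sample_search_regions(ppm, 10, np.random.default_rng(1))
```

Each sampled cell would then be converted to galvano-mirror angles so the high-speed camera images that region.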
Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed \emph{DriveWorld}, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.
https://arxiv.org/abs/2405.04390
The malware boom poses a danger to cyberspace comparable to the effect of climate change on ecosystems. Despite significant investments in cybersecurity technologies and staff training, the global community remains locked in a perpetual war with cybersecurity threats. The many shifting forms of malware continually push the limits of the detection and mitigation approaches that cybersecurity practitioners employ to cope with the issue. Older techniques such as signature-based detection and behavioral analysis are slow to adapt to the rapid evolution of malware types. Consequently, this paper proposes using deep learning models, namely LSTM networks and GANs, to improve malware detection accuracy and speed. By leveraging raw bytestream-based data and deep learning architectures, this approach provides better accuracy and performance than traditional methods. Integrating the LSTM and GAN models enables synthetic data generation, which expands the training datasets and, as a result, improves detection accuracy. The paper uses the VirusShare dataset, which contains more than one million unique malware samples, as the training and evaluation set for the presented models. Through thorough data preparation, including tokenization and augmentation, and model training, the LSTM and GAN models outperform straightforward classifiers on these tasks. The research achieves 98% accuracy, showing that deep learning plays a decisive role in proactive cybersecurity defense. In addition, the paper studies ensemble learning and model fusion methods as ways to reduce bias and manage model complexity.
https://arxiv.org/abs/2405.04373
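The LSTM cell arithmetic underlying the proposed detector can be written out directly. The numpy sketch below runs a single cell over hypothetical byte embeddings; it illustrates the gate equations only, not the paper's architecture or the GAN component.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step.

    x: (d_in,) input, h/c: (d,) previous hidden/cell state.
    W: (4d, d_in), U: (4d, d), b: (4d,) stacked gate parameters
    in the order [input, forget, output, candidate].
    """
    d = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0*d:1*d])      # input gate
    f = sigmoid(z[1*d:2*d])      # forget gate
    o = sigmoid(z[2*d:3*d])      # output gate
    g = np.tanh(z[3*d:4*d])      # candidate cell state
    c_new = f * c + i * g        # gated cell update
    h_new = o * np.tanh(c_new)   # gated hidden output
    return h_new, c_new

# Run the cell over 5 random "byte embeddings" (illustrative sizes).
rng = np.random.default_rng(0)
d_in, d = 8, 16
W = rng.standard_normal((4*d, d_in))
U = rng.standard_normal((4*d, d))
b = np.zeros(4*d)
h = c = np.zeros(d)
for byte_embedding in rng.standard_normal((5, d_in)):
    h, c = lstm_step(byte_embedding, h, c, W, U, b)
```

The final hidden state would feed a classification head; a training pipeline would learn W, U, b by backpropagation rather than drawing them at random.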
Community structure plays a crucial role in understanding user behavior and network characteristics in social networks. Some users participate in multiple social networks at once for a variety of objectives. These users, called overlapping users, bridge different social networks. Detecting communities across multiple social networks is vital for interaction mining, information diffusion, and behavior migration analysis among networks. This paper presents a community detection method based on nonnegative matrix tri-factorization for multiple heterogeneous social networks, which formulates a common consensus matrix to represent the global fused community. Specifically, the proposed method involves creating adjacency matrices based on network structure and content similarity, followed by alignment matrices that distinguish overlapping users in different social networks. With the generated alignment matrices, the method can enhance the fusion degree of the global community by detecting overlapping user communities across networks. The effectiveness of the proposed method is evaluated with new metrics on Twitter, Instagram, and Tumblr datasets. The experimental results demonstrate its superior performance in terms of community quality and community fusion.
https://arxiv.org/abs/2405.04371
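Plain nonnegative matrix tri-factorization X ≈ F S G^T can be sketched with Lee-Seung-style multiplicative updates; the consensus- and alignment-matrix machinery of the paper is omitted, and the generic update rules below are not necessarily the ones the authors use.

```python
import numpy as np

def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Nonnegative matrix tri-factorization X ~= F @ S @ G.T using
    multiplicative updates that minimize the Frobenius reconstruction error.

    X: (m, n) nonnegative matrix; F: (m, k1), S: (k1, k2), G: (n, k2).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k1)); S = rng.random((k1, k2)); G = rng.random((n, k2))
    for _ in range(n_iter):
        # Each factor is rescaled by the ratio of the gradient's
        # positive and negative parts (eps avoids division by zero).
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
    return F, S, G

rng = np.random.default_rng(1)
X = rng.random((20, 15))                 # stand-in for an adjacency matrix
F, S, G = nmtf(X, k1=4, k2=3)
err = np.linalg.norm(X - F @ S @ G.T) / np.linalg.norm(X)
```

In a community-detection setting, rows of F (or G) are interpreted as soft community memberships after normalization.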
Recent developments in large language models (LLMs), while offering a powerful foundation for developing natural language agents, raise safety concerns about them and the autonomous agents built upon them. Deception is one potential capability of AI agents of particular concern, which we refer to as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception through straight-out lying, making objective selfish decisions, or giving false information, as seen in previous AI safety research. We target a specific category of deception achieved through obfuscation and equivocation. We broadly explain the two types of deception by analogizing them with the rabbit-out-of-hat magic trick, where (i) the rabbit either comes out of a hidden trap door or (ii) (our focus) the audience is completely distracted to see the magician bring out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework displays intrinsic deception capabilities of LLM agents in a goal-driven environment when directed to be deceptive in their natural language generations in a two-agent adversarial dialogue system built upon the legislative task of "lobbying" for a bill. Along the lines of a goal-driven environment, we show developing deceptive capacity through a reinforcement learning setup, building it around the theories of language philosophy and cognitive psychology. We find that the lobbyist agent increases its deceptive capabilities by ~ 40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception detection mechanism shows a detection capability of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans towards its programmed end-goal.
https://arxiv.org/abs/2405.04325
Aphid infestations are one of the primary causes of extensive damage to wheat and sorghum fields and are one of the most common vectors for plant viruses, resulting in significant agricultural yield losses. To address this problem, farmers often employ the inefficient use of harmful chemical pesticides that have negative health and environmental impacts. As a result, a large amount of pesticide is wasted on areas without significant pest infestation. This brings to attention the urgent need for an intelligent autonomous system that can locate and spray sufficiently large infestations selectively within the complex crop canopies. We have developed a large multi-scale dataset for aphid cluster detection and segmentation, collected from actual sorghum fields and meticulously annotated to include clusters of aphids. Our dataset comprises a total of 54,742 image patches, showcasing a variety of viewpoints, diverse lighting conditions, and multiple scales, highlighting its effectiveness for real-world applications. In this study, we trained and evaluated four real-time semantic segmentation models and three object detection models specifically for aphid cluster segmentation and detection. Considering the balance between accuracy and efficiency, Fast-SCNN delivered the most effective segmentation results, achieving 80.46% mean precision, 81.21% mean recall, and 91.66 frames per second (FPS). For object detection, RT-DETR exhibited the best overall performance with a 61.63% mean average precision (mAP), 92.6% mean recall, and 72.55 FPS on an NVIDIA V100 GPU. Our experiments further indicate that aphid cluster segmentation is more suitable for assessing aphid infestations than using detection models.
https://arxiv.org/abs/2405.04305
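The mean precision and recall figures quoted above reduce, per image, to pixel-level counts over binary masks; a minimal version:

```python
import numpy as np

def precision_recall(pred_mask, true_mask):
    """Pixel-level precision and recall for a binary segmentation mask."""
    pred = np.asarray(pred_mask, bool)
    true = np.asarray(true_mask, bool)
    tp = np.logical_and(pred, true).sum()    # correctly predicted pixels
    fp = np.logical_and(pred, ~true).sum()   # false alarms
    fn = np.logical_and(~pred, true).sum()   # missed pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy masks: prediction covers half of the true aphid-cluster region.
pred = np.zeros((4, 4), bool); pred[:2, :2] = True   # 4 predicted pixels
true = np.zeros((4, 4), bool); true[:2, :] = True    # 8 true pixels
p, r = precision_recall(pred, true)
```

Dataset-level "mean" precision/recall would average these per-image values.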
3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantifying the physical space into a grid map. The widely adopted projection-first deformable attention, efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose our learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, such as map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal the potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The codes and benchmark will be released soon.
https://arxiv.org/abs/2405.04299
High-definition maps with accurate lane-level information are crucial for autonomous driving, but the creation of these maps is a resource-intensive process. To this end, we present a cost-effective solution to create lane-level roadmaps using only the global navigation satellite system (GNSS) and a camera on customer vehicles. Our proposed solution utilizes a prior standard-definition (SD) map, GNSS measurements, visual odometry, and lane marking edge detection points to simultaneously estimate the vehicle's 6D pose, its position within the SD map, and the 3D geometry of traffic lines. This is achieved using a Bayesian simultaneous localization and multi-object tracking filter, where the estimation of traffic lines is formulated as a multiple extended object tracking problem, solved using a trajectory Poisson multi-Bernoulli mixture (TPMBM) filter. In TPMBM filtering, traffic lines are modeled using B-spline trajectories, and each trajectory is parameterized by a sequence of control points. The proposed solution has been evaluated using experimental data collected by a test vehicle driving on a highway. Preliminary results show that the traffic line estimates, overlaid on the satellite image, generally align with the lane markings up to some lateral offsets.
https://arxiv.org/abs/2405.04290
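Since traffic lines are modeled as B-spline trajectories parameterized by control points, a uniform cubic B-spline segment can be evaluated directly; this is textbook B-spline math, not the TPMBM filter itself.

```python
import numpy as np

def cubic_bspline_point(P0, P1, P2, P3, t):
    """Point on a uniform cubic B-spline segment for t in [0, 1].

    The four basis functions sum to 1, so the curve stays inside the
    convex hull of the control points (here: 2-D map coordinates).
    """
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6.0
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0
    b3 = t**3 / 6.0
    return b0 * P0 + b1 * P1 + b2 * P2 + b3 * P3

def sample_traffic_line(control_points, samples_per_segment=10):
    """Densely sample a traffic line from its B-spline control points."""
    cp = [np.asarray(p, float) for p in control_points]
    pts = []
    for i in range(len(cp) - 3):                 # one segment per 4 points
        for t in np.linspace(0, 1, samples_per_segment, endpoint=False):
            pts.append(cubic_bspline_point(cp[i], cp[i+1], cp[i+2], cp[i+3], t))
    return np.array(pts)

# Degenerate check: identical control points collapse to that point.
line = sample_traffic_line([[2.0, 3.0]] * 5)
```

In the paper's setting the control points themselves are the state estimated by the filter; this sketch only shows how a line is reconstructed from them.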
The efficacy of a detector for large language model (LLM) generated text depends substantially on the availability of sizable training data. White-box zero-shot detectors, which require no such data, are nonetheless limited by the accessibility of the source model of the LLM-generated text. In this paper, we propose a simple but effective black-box zero-shot detection approach, predicated on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts. This approach entails computing the Grammar Error Correction Score (GECScore) for the given text to distinguish between human-written and LLM-generated text. Extensive experimental results show that our method outperforms current state-of-the-art (SOTA) zero-shot and supervised methods, achieving an average AUROC of 98.7% and showing strong robustness against paraphrase and adversarial perturbation attacks.
https://arxiv.org/abs/2405.04286
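The GECScore idea, scoring a text by how much a grammar corrector changes it, can be sketched with a pluggable corrector. The toy corrector below is hypothetical and stands in for a real GEC model; the exact scoring formula in the paper may differ.

```python
import difflib

def gec_score(text, correct):
    """GECScore-style statistic: 1 minus the word-level similarity between
    a text and its grammar-corrected version. Human-written text, which
    tends to contain more errors, tends to score higher than LLM text.

    `correct` is any callable text -> corrected text (pluggable backend).
    """
    fixed = correct(text)
    sim = difflib.SequenceMatcher(None, text.split(), fixed.split()).ratio()
    return 1.0 - sim

# Hypothetical stub corrector that fixes one common agreement error.
def toy_corrector(text):
    return text.replace("has went", "has gone")

clean = gec_score("She has gone home already.", toy_corrector)
errorful = gec_score("She has went home already.", toy_corrector)
```

A detector would threshold this score (or compare distributions of scores) to label a text as human-written or LLM-generated.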
Breast cancer is a significant global health concern, particularly for women. Early detection and appropriate treatment are crucial in mitigating its impact, with histopathology examinations playing a vital role in swift diagnosis. However, these examinations often require a substantial workforce and experienced medical experts for proper recognition and cancer grading. Automated image retrieval systems have the potential to assist pathologists in identifying cancerous tissues, thereby accelerating the diagnostic process. Nevertheless, due to considerable variability among the tissue and cell patterns in histological images, proposing an accurate image retrieval model is very challenging. This work introduces a novel attention-based adversarially regularized variational graph autoencoder model for breast histological image retrieval. Additionally, we incorporated cluster-guided contrastive learning as the graph feature extractor to boost the retrieval performance. We evaluated the proposed model's performance on two publicly available datasets of breast cancer histological images and achieved superior or very competitive retrieval performance, with average mAP scores of 96.5% for the BreakHis dataset and 94.7% for the BACH dataset, and mVP scores of 91.9% and 91.3%, respectively. Our proposed retrieval model has the potential to be used in clinical settings to enhance diagnostic performance and ultimately benefit patients.
https://arxiv.org/abs/2405.04211
In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic with such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability and possibility for recourse. This second part acts as a position for future research steps in the field and a caveat to a flourishing market of fake content checkers.
https://arxiv.org/abs/2405.04181
Nowadays, information spreads at an unprecedented pace in social media, and discerning truth from misinformation and fake news has become an acute societal challenge. Machine learning (ML) models have been employed to identify fake news but are far from perfect, facing challenges such as limited accuracy, interpretability, and generalizability. In this paper, we enhance ML-based solutions with linguistic input and propose LingML, linguistic-informed ML, for fake news detection. We conducted an experimental study with a popular dataset on fake news during the pandemic. The experimental results show that our proposed solution is highly effective. With only linguistic input used in ML, there are fewer than two errors out of every ten attempts, and the knowledge is highly explainable. When linguistic input is integrated with advanced large-scale ML models for natural language processing, our solution outperforms existing ones with a 1.8% average error rate. LingML creates a new path with linguistics to push the frontier of effective and efficient fake news detection. It also sheds light on real-world multi-disciplinary applications that require both ML and domain expertise to achieve optimal performance.
https://arxiv.org/abs/2405.04165
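As a guess at what "linguistic input" might look like in feature form, the shallow statistics below are illustrative choices only, not the paper's actual feature set; such a vector would feed a downstream classifier.

```python
import re

# Illustrative (not exhaustive) set of English function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "but"}

def linguistic_features(text):
    """Shallow linguistic features of the kind a linguistics-informed
    fake-news classifier could consume."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / n_words,
        "exclamation_count": text.count("!"),
    }

feats = linguistic_features("BREAKING! The cure they hid from you. Share now!")
```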
Generative models have made significant advancements in the creation of realistic videos, raising security concerns. However, this emerging risk has not been adequately addressed due to the absence of a benchmark dataset for AI-generated videos. In this paper, we first construct a video dataset using advanced diffusion-based video generation algorithms with various semantic contents. In addition, typical lossy video operations encountered in network transmission are applied to generate degraded samples. Then, by analyzing local and global temporal defects of current AI-generated videos, a novel detection framework that adaptively learns local motion information and global appearance variation is constructed to expose fake videos. Finally, experiments are conducted to evaluate the generalization and robustness of different spatial- and temporal-domain detection methods; the results can serve as a baseline and illustrate the research challenges for future studies.
https://arxiv.org/abs/2405.04133
The emergence of contemporary deepfakes has attracted significant attention in machine learning research, as artificial intelligence (AI) generated synthetic media increases the incidence of misinterpretation and is difficult to distinguish from genuine content. Currently, machine learning techniques have been extensively studied for automatically detecting deepfakes. However, human perception has been less explored. Malicious deepfakes could ultimately cause public and social problems. Can we humans correctly perceive the authenticity of the content of the videos we watch? The answer is obviously uncertain; therefore, this paper aims to evaluate the human ability to discern deepfake videos through a subjective study. We present our findings by comparing human observers to five state-of-the-art audiovisual deepfake detection models. To this end, we used gamification concepts to provide 110 participants (55 native English speakers and 55 non-native English speakers) with a web-based platform where they could access a series of 40 videos (20 real and 20 fake) to determine their authenticity. Each participant performed the experiment twice with the same 40 videos in different random orders. The videos are manually selected from the FakeAVCeleb dataset. We found that all AI models performed better than humans when evaluated on the same 40 videos. The study also reveals that while deception is not impossible, humans tend to overestimate their detection capabilities. Our experimental results may help benchmark human versus machine performance, advance forensics analysis, and enable adaptive countermeasures.
https://arxiv.org/abs/2405.04097
State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other different speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
https://arxiv.org/abs/2405.04096
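A single-stage, numpy-only sketch of multi-head attention pooling over frame-level features; the paper's double pooling adds a second attention stage over the head outputs, which is omitted here, and the feature sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention_pooling(H, W_heads):
    """Pool a variable-length sequence of frame features into one
    fixed-length utterance vector via per-head attention weights.

    H: (T, d) frame-level features, W_heads: (n_heads, d) query vectors.
    Returns (weights, pooled) with pooled of shape (n_heads * d,).
    """
    scores = W_heads @ H.T                  # (n_heads, T) alignment scores
    weights = softmax(scores, axis=-1)      # attention over frames, per head
    pooled = (weights @ H).ravel()          # (n_heads, d) -> flat vector
    return weights, pooled

rng = np.random.default_rng(0)
H = rng.standard_normal((50, 8))            # 50 frames, 8-dim features
W_heads = rng.standard_normal((4, 8))       # 4 attention heads
w, utt_vec = multihead_attention_pooling(H, W_heads)
```

The pooled vector has a fixed size regardless of the number of frames T, which is what lets fully connected layers follow a variable-length utterance.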
Deep learning-based malware classifiers face significant challenges due to concept drift. The rapid evolution of malware, especially with new families, can depress classification accuracy to near-random levels. Previous research has primarily focused on detecting drift samples, relying on expert-led analysis and labeling for model retraining. However, these methods often lack a comprehensive understanding of malware concepts and provide limited guidance for effective drift adaptation, leading to unstable detection performance and high human labeling costs. To address these limitations, we introduce DREAM, a novel system designed to surpass the capabilities of existing drift detectors and to establish an explanatory drift adaptation process. DREAM enhances drift detection through model sensitivity and data autonomy. The detector, trained in a semi-supervised approach, proactively captures malware behavior concepts through classifier feedback. During testing, it utilizes samples generated by the detector itself, eliminating reliance on extensive training data. For drift adaptation, DREAM enlarges human intervention, enabling revisions of malware labels and concept explanations embedded within the detector's latent space. To ensure a comprehensive response to concept drift, it facilitates a coordinated update process for both the classifier and the detector. Our evaluation shows that DREAM can effectively improve the drift detection accuracy and reduce the expert analysis effort in adaptation across different malware datasets and classifiers.
https://arxiv.org/abs/2405.04095
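To make the drift-detection setting concrete, here is a generic nearest-neighbour baseline in a classifier's latent space, not DREAM's actual detector: a sample whose embedding lies far from everything seen during training is flagged as a likely drift sample. The function name and the k-nearest-neighbour scoring rule are illustrative assumptions.

```python
import numpy as np

def drift_score(train_embs, sample_emb, k=5):
    """Score how far a test embedding sits from the training distribution.

    train_embs: (N, d) latent embeddings of training samples.
    sample_emb: (d,) embedding of one test sample.
    Returns the mean distance to the k nearest training embeddings;
    larger values suggest concept drift (threshold is tuned on a
    validation set in practice).
    """
    dists = np.linalg.norm(train_embs - sample_emb, axis=1)
    return np.sort(dists)[:k].mean()
```

Detectors like this flag drift but explain nothing about it; DREAM's contribution, per the abstract, is to go further by embedding revisable concept explanations in the detector's latent space and coordinating updates of detector and classifier.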
Object detection plays a critical role in autonomous driving, where accurately and efficiently detecting objects in fast-moving scenes is crucial. Traditional frame-based cameras face challenges in balancing latency and bandwidth, necessitating innovative solutions. Event cameras have emerged as promising sensors for autonomous driving due to their low latency, high dynamic range, and low power consumption. However, effectively utilizing the asynchronous and sparse event data presents challenges, particularly in maintaining low latency and lightweight architectures for object detection. This paper provides an overview of object detection using event data in autonomous driving, showcasing the competitive benefits of event cameras.
https://arxiv.org/abs/2405.03995
As a deep learning model, Visual Mamba (VMamba) has low computational complexity and a global receptive field, and has been successfully applied to image classification and detection. To extend its applications, we apply VMamba to crowd counting and propose a novel VMambaCC (VMamba Crowd Counting) model. Naturally, VMambaCC inherits the merits of VMamba, namely global modeling of images at low computational cost. Additionally, we design a Multi-head High-level Feature (MHF) attention mechanism for VMambaCC. MHF is a new attention mechanism that leverages high-level semantic features to augment low-level semantic features, thereby sharpening the spatial feature representation. Building upon MHF, we further present a High-level Semantic Supervised Feature Pyramid Network (HS2PFN) that progressively integrates and enhances high-level semantic information with low-level semantic information. Extensive experimental results on five public datasets validate the efficacy of our approach. For example, our method achieves a mean absolute error of 51.87 and a mean squared error of 81.3 on the ShangHaiTech\_PartA dataset. Our code is coming soon.
https://arxiv.org/abs/2405.03978
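The core idea behind MHF, high-level semantics modulating low-level features per attention head, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and a hypothetical learned projection `Wq`; the paper's actual mechanism (and its integration into HS2PFN) is more elaborate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mhf_attention(low, high, Wq, heads=4):
    """Illustrative multi-head fusion: a globally pooled high-level
    semantic vector gates low-level features channel-wise, per head.

    low:  (N, C) low-level features at N spatial positions.
    high: (C,)  pooled high-level semantic vector.
    Wq:   (C, C) hypothetical learned projection.
    Returns gated low-level features of shape (N, C).
    """
    N, C = low.shape
    d = C // heads
    gate = sigmoid(Wq @ high)                    # channel attention in (0, 1)
    gate = gate.reshape(heads, d)                # one gate slice per head
    out = low.reshape(N, heads, d) * gate[None]  # modulate each head's channels
    return out.reshape(N, C)
```

Because the sigmoid gate lies strictly in (0, 1), the fusion can suppress low-level channels that the high-level semantics deem irrelevant but never amplifies or flips them, which is one simple way semantics can sharpen spatial features.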