Medical imaging plays a crucial role in modern healthcare by providing non-invasive visualisation of internal structures and abnormalities, enabling early disease detection, accurate diagnosis, and treatment planning. This study aims to explore the application of deep learning models, particularly focusing on the UNet architecture and its variants, in medical image segmentation. We seek to evaluate the performance of these models across various challenging medical image segmentation tasks, addressing issues such as image normalization, resizing, architecture choices, loss function design, and hyperparameter tuning. The findings reveal that the standard UNet, when extended with a deep network layer, is a proficient medical image segmentation model, while the Res-UNet and Attention Res-UNet architectures demonstrate smoother convergence and superior performance, particularly when handling fine image details. The study also addresses the challenge of high class imbalance through careful preprocessing and loss function definitions. We anticipate that the results of this study will provide useful insights for researchers seeking to apply these models to new medical imaging problems and offer guidance and best practices for their implementation.
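As a concrete illustration of the loss-function design mentioned above, here is a minimal sketch, assuming PyTorch, of a combined Dice + binary cross-entropy loss, a common way to counter heavy foreground/background imbalance in segmentation. The smoothing term and the 50/50 weighting are assumptions, not the exact loss used in the study.

```python
import torch
import torch.nn as nn

class DiceBCELoss(nn.Module):
    def __init__(self, smooth: float = 1.0, bce_weight: float = 0.5):
        super().__init__()
        self.smooth = smooth          # avoids division by zero on empty masks
        self.bce_weight = bce_weight  # balance between the two terms (assumed 0.5)
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(logits)
        # Flatten each sample so the Dice term is computed over all of its pixels.
        probs_flat = probs.view(probs.size(0), -1)
        target_flat = target.view(target.size(0), -1)
        intersection = (probs_flat * target_flat).sum(dim=1)
        dice = (2.0 * intersection + self.smooth) / (
            probs_flat.sum(dim=1) + target_flat.sum(dim=1) + self.smooth
        )
        dice_loss = 1.0 - dice.mean()
        return self.bce_weight * self.bce(logits, target) + (1.0 - self.bce_weight) * dice_loss

# Usage: loss = DiceBCELoss()(model(images), masks) with float masks in {0, 1}.
```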
https://arxiv.org/abs/2309.13013
Detecting and recognizing text in images and videos captured by cameras remains a highly challenging research problem. Although some advances have achieved high accuracy, current methods still require substantial improvement before they are applicable in practical scenarios. Unlike general text detection in images and videos, this paper addresses text detection within license plates by combining multiple frames captured from distinct perspectives. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints (view-1, view-2, and view-3) to identify the nearest neighboring components and restore text components belonging to the same license plate line, based on estimated similarity levels and distance metrics. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on the self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and on the publicly available Stanford Cars Dataset demonstrate the superiority of the proposed method over existing approaches.
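The grouping of text components into plate lines can be illustrated with a small sketch. The box format, thresholds, and function names below are assumptions made for illustration; the paper's actual similarity and distance criteria based on corner points and area may differ, and the resulting lines would then be passed to an OCR engine such as CnOCR.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def group_into_lines(boxes: List[Box], center_tol: float = 0.5,
                     height_tol: float = 0.4) -> List[List[Box]]:
    """Group character boxes into plate lines by vertical-center proximity and height similarity."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        h = box[3] - box[1]
        cy = (box[1] + box[3]) / 2.0
        for line in lines:
            ref = line[-1]
            ref_h = ref[3] - ref[1]
            ref_cy = (ref[1] + ref[3]) / 2.0
            # Same line if vertical centers are close and heights are similar.
            if abs(cy - ref_cy) < center_tol * ref_h and abs(h - ref_h) < height_tol * ref_h:
                line.append(box)
                break
        else:
            lines.append([box])
    # Order characters left-to-right within each line.
    return [sorted(line, key=lambda b: b[0]) for line in lines]

if __name__ == "__main__":
    chars = [(10, 5, 20, 25), (22, 6, 32, 26), (12, 30, 22, 50), (25, 31, 35, 51)]
    print(group_into_lines(chars))  # two lines of two characters each
```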
https://arxiv.org/abs/2309.12972
The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To standardize the acquisition, interpretation and usage of these complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends dividing the prostate into four zones: PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with every other zone, nor is every zone present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. Further, the representations from the different branches complement each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for the PZ, TZ, DPU and AFS zones, respectively.
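The unsupervised consistency loss described above can be sketched as follows. The shared-class indices and the mean-squared-error formulation are illustrative assumptions rather than the authors' exact definition.

```python
import torch
import torch.nn.functional as F
from typing import List

def branch_consistency_loss(logits_a: torch.Tensor, logits_b: torch.Tensor,
                            shared_classes: List[int]) -> torch.Tensor:
    """logits_*: (B, C, H, W) outputs of the two branches on the same slice.

    Penalises disagreement between the branches' probability maps for the
    classes they share (assumed MSE; the paper's exact penalty may differ).
    """
    probs_a = F.softmax(logits_a, dim=1)[:, shared_classes]
    probs_b = F.softmax(logits_b, dim=1)[:, shared_classes]
    return F.mse_loss(probs_a, probs_b)

# Example: both branches predict 5 classes; classes 1 and 2 are assumed shared.
a = torch.randn(2, 5, 64, 64)
b = torch.randn(2, 5, 64, 64)
print(branch_consistency_loss(a, b, [1, 2]).item())
```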
https://arxiv.org/abs/2309.12970
Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advances have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmarks with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 on novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at this https URL.
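The example-image-based classification idea can be illustrated with a short sketch: class prototypes are averaged embeddings of a few example images, and region features are scored against them by cosine similarity. This is a generic approximation under assumed names and dimensions, not DE-ViT's actual region-propagation or binary-classification machinery.

```python
import torch
import torch.nn.functional as F
from typing import Dict, List, Tuple

def build_prototypes(example_feats: Dict[str, torch.Tensor]) -> Tuple[List[str], torch.Tensor]:
    """example_feats maps class name -> (n_examples, d) backbone embeddings."""
    names = sorted(example_feats)
    protos = torch.stack([F.normalize(example_feats[n].mean(dim=0), dim=0) for n in names])
    return names, protos  # (num_classes, d) unit-norm prototypes

def classify_regions(region_feats: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    """region_feats: (num_regions, d). Returns cosine-similarity scores (num_regions, num_classes)."""
    return F.normalize(region_feats, dim=1) @ protos.T

# Hypothetical example with random embeddings standing in for backbone features.
names, protos = build_prototypes({"cat": torch.randn(5, 768), "dog": torch.randn(5, 768)})
scores = classify_regions(torch.randn(10, 768), protos)
print(names, scores.shape)
```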
https://arxiv.org/abs/2309.12969
Collaborative perception, which greatly enhances the sensing capability of connected and autonomous vehicles (CAVs) by incorporating data from external resources, also brings forth potential security risks. CAVs' driving decisions rely on remote untrusted data, making them susceptible to attacks carried out by malicious participants in the collaborative perception system. However, security analysis and countermeasures for such threats are absent. To understand the impact of the vulnerability, we break new ground by proposing various real-time data fabrication attacks in which the attacker delivers crafted malicious data to victims in order to perturb their perception results, leading to hard brakes or increased collision risks. Our attacks demonstrate a high success rate of over 86% in high-fidelity simulated scenarios and are realizable in real-world experiments. To mitigate the vulnerability, we present a systematic anomaly detection approach that enables benign vehicles to jointly reveal malicious fabrication. It detects 91.5% of attacks with a false positive rate of 3% in simulated scenarios and significantly mitigates attack impacts in real-world scenarios.
https://arxiv.org/abs/2309.12955
This paper introduces a novel one-stage end-to-end detector specifically designed to detect small lesions in medical images. Precise localization of small lesions presents challenges due to their appearance and the diverse contextual backgrounds in which they are found. To address this, our approach introduces a new type of pixel-based anchor that dynamically moves towards the targeted lesion for detection. We refer to this new architecture as GravityNet, and the novel anchors as gravity points since they appear to be "attracted" by the lesions. We conducted experiments on two well-established medical problems involving small lesions to evaluate the performance of the proposed approach: microcalcifications detection in digital mammograms and microaneurysms detection in digital fundus images. Our method demonstrates promising results in effectively detecting small lesions in these medical imaging tasks.
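The gravity-point idea can be sketched as a regular grid of pixel-based anchors, each regressing an offset toward the nearest lesion centre. The grid stride, target definition, and helper names below are assumptions for illustration, not GravityNet's exact design.

```python
import torch

def make_gravity_points(height: int, width: int, stride: int) -> torch.Tensor:
    """Regular grid of anchor coordinates, one point per stride x stride cell."""
    ys = torch.arange(stride // 2, height, stride, dtype=torch.float32)
    xs = torch.arange(stride // 2, width, stride, dtype=torch.float32)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx.flatten(), gy.flatten()], dim=1)  # (N, 2) (x, y) anchors

def offset_targets(points: torch.Tensor, lesion_centres: torch.Tensor) -> torch.Tensor:
    """For each gravity point, the offset that moves it onto its nearest lesion centre."""
    dists = torch.cdist(points, lesion_centres)   # (N, M) point-to-lesion distances
    nearest = dists.argmin(dim=1)                 # index of the closest lesion
    return lesion_centres[nearest] - points       # (N, 2) regression targets

points = make_gravity_points(256, 256, stride=16)
targets = offset_targets(points, torch.tensor([[40.0, 60.0], [200.0, 180.0]]))
print(points.shape, targets.shape)
```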
https://arxiv.org/abs/2309.12876
Most existing methods for unsupervised industrial anomaly detection train a separate model for each object category. This kind of approach can easily capture category-specific feature distributions, but results in high storage cost and low training efficiency. In this paper, we propose a unified mixed-attention autoencoder (MAAE) to implement multi-class anomaly detection with a single model. To alleviate the performance degradation caused by the diverse distribution patterns of different categories, we employ spatial and channel attention to effectively capture global category information and model the feature distributions of multiple classes. Furthermore, to simulate realistic noise on the features and preserve the surface semantics of objects from different categories, which are essential for detecting subtle anomalies, we propose an adaptive noise generator and a multi-scale fusion module for the pre-trained features. MAAE delivers remarkable performance on the benchmark dataset compared with state-of-the-art methods.
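A generic mixed-attention block, channel attention followed by spatial attention, is sketched below to make the attention components concrete; it is a standard construction under assumed hyperparameters and not MAAE's actual architecture.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite per-channel weights.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # Spatial attention: a 1-channel map built from pooled channel statistics.
        self.spatial_conv = nn.Sequential(nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w_c = self.channel_fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        x = x * w_c
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_conv(pooled)

print(MixedAttention(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```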
https://arxiv.org/abs/2309.12700
Enriching the robot's representation of its operational environment is a challenging task that aims at bridging the gap between low-level sensor readings and high-level semantic understanding. A rich representation often requires computationally demanding architectures and pure point-cloud-based detection systems, which struggle with the everyday objects a robot has to handle. To overcome these issues, we propose a graph-based representation that addresses this gap by providing a semantic representation of robot environments from multiple sources. To acquire information from the environment, the framework combines classical computer vision tools with modern computer vision cloud services, ensuring computational feasibility on onboard hardware. By incorporating an ontology hierarchy with over 800 object classes, the framework achieves cross-domain adaptability, eliminating the need for environment-specific tools. The proposed approach also allows us to handle small objects and integrate them into the semantic representation of the environment. The approach is implemented in the Robot Operating System (ROS) using the RViz visualizer for environment representation. This work is a first step towards the development of a general-purpose framework to facilitate intuitive interaction and navigation across different domains.
https://arxiv.org/abs/2309.12692
AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the $\rm DGM^4$ dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.
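The dual-branch cross-attention (DCA) idea can be sketched as two cross-attention layers in which each modality queries the other. The single-layer design, residual connections, and dimensions below are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # Image tokens query the text tokens, and vice versa; both streams are
        # returned for later fusion or decoupled classification heads.
        img_attended, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_attended, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        return img_tokens + img_attended, txt_tokens + txt_attended

img, txt = torch.randn(2, 49, 256), torch.randn(2, 20, 256)
out_img, out_txt = DualCrossAttention(256)(img, txt)
print(out_img.shape, out_txt.shape)
```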
https://arxiv.org/abs/2309.12657
Surface defect inspection is a very challenging task in which surface defects usually show weak appearances or exist under complex backgrounds. Most high-accuracy defect detection methods require expensive computation and storage overhead, making them less practical in some resource-constrained defect detection applications. Although some lightweight methods have achieved real-time inference speed with fewer parameters, they show poor detection accuracy in complex defect scenarios. To this end, we develop a Global Context Aggregation Network (GCANet) for lightweight saliency detection of surface defects, built on an encoder-decoder structure. First, we introduce a novel transformer encoder on the top layer of the lightweight backbone, which captures global context information through a novel Depth-wise Self-Attention (DSA) module. The proposed DSA performs element-wise similarity in the channel dimension while maintaining linear complexity. In addition, we introduce a novel Channel Reference Attention (CRA) module before each decoder block to strengthen the representation of multi-level features in the bottom-up path. The proposed CRA exploits the channel correlation between features at different layers to adaptively enhance feature representation. The experimental results on three public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency compared with 17 other state-of-the-art methods. Specifically, GCANet achieves competitive accuracy (91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, and 97.35% $E_\phi$) on SD-saliency-900 while running at 272 fps on a single GPU.
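One common way to keep self-attention linear in the number of pixels is to compute the attention map across channels rather than spatial positions; the sketch below follows that general idea as a stand-in for the paper's Depth-wise Self-Attention (DSA), whose exact definition may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        # (c x c) attention map: cost grows with channels, not with image size.
        attn = torch.softmax(F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2), dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return x + self.proj(out)

print(ChannelSelfAttention(32)(torch.randn(2, 32, 64, 64)).shape)  # torch.Size([2, 32, 64, 64])
```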
https://arxiv.org/abs/2309.12641
Surface defect inspection is of great importance for industrial manufacturing and production. Although defect inspection methods based on deep learning have made significant progress, they still face challenges such as indistinguishable weak defects and defect-like interference in the background. To address these issues, we propose a transformer network with multi-stage CNN (convolutional neural network) feature injection for surface defect segmentation, a UNet-like structure named CINFormer. CINFormer presents a simple yet effective feature integration mechanism that injects multi-level CNN features of the input image into different stages of the transformer network in the encoder. This preserves the merit of the CNN in capturing detailed features and that of the transformer in suppressing background noise, which facilitates accurate defect detection. In addition, CINFormer presents a Top-K self-attention module that focuses on tokens carrying more important information about the defects, so as to further reduce the impact of the redundant background. Extensive experiments conducted on the surface defect datasets DAGM 2007, Magnetic tile, and NEU show that the proposed CINFormer achieves state-of-the-art performance in defect detection.
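The Top-K self-attention idea can be sketched directly: for each query token, only the k largest attention logits are kept and the rest are masked out before the softmax, so background tokens contribute less. The value of k and where CINFormer applies this are not specified here.

```python
import torch
import torch.nn.functional as F

def topk_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, top_k: int) -> torch.Tensor:
    """q, k, v: (batch, tokens, dim). Returns attended values of shape (batch, tokens, dim)."""
    scale = q.size(-1) ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale            # (B, N, N) attention logits
    kth = scores.topk(top_k, dim=-1).values[..., -1:]     # k-th largest logit per query
    scores = scores.masked_fill(scores < kth, float("-inf"))  # keep only the top-k logits
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 196, 64)
print(topk_attention(q, k, v, top_k=32).shape)  # torch.Size([2, 196, 64])
```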
https://arxiv.org/abs/2309.12639
During the COVID-19 pandemic, medical imaging techniques like computed tomography (CT) scans have demonstrated effectiveness in combating the rapid spread of the virus. Therefore, it is crucial to conduct research on computerized models for the detection of COVID-19 using CT imaging. A novel processing method has been developed, utilizing radiomic features, to assist in the CT-based diagnosis of COVID-19. Given the lower specificity of traditional features in distinguishing between different causes of pulmonary diseases, the objective of this study is to develop a CT-based radiomics framework for the differentiation of COVID-19 from other lung diseases. The model is designed to focus on outlining COVID-19 lesions, as traditional features often lack specificity in this aspect. The model categorizes images into three classes: COVID-19, non-COVID-19, or normal. It employs enhancement auto-segmentation principles using an intensity dark channel prior (IDCP) and deep neural networks (ALS-IDCP-DNN) within a defined range of analysis thresholds. A publicly available dataset comprising COVID-19, normal, and non-COVID-19 classes was utilized to validate the proposed model's effectiveness. The best-performing classification model, a residual neural network with 50 layers (ResNet-50), attained an average accuracy, precision, recall, and F1-score of 98.8%, 99%, 98%, and 98%, respectively. These results demonstrate the capability of our model to accurately classify COVID-19 images, which could aid radiologists in diagnosing suspected COVID-19 patients. Furthermore, our model's performance surpasses that of more than 10 current state-of-the-art studies conducted on the same dataset.
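For reference, the classical dark channel of an image is the per-pixel minimum over colour channels followed by a local minimum filter; the sketch below computes that standard quantity and is not the paper's full intensity-dark-channel-prior (IDCP) pipeline, whose details are not given in the abstract.

```python
import torch
import torch.nn.functional as F

def dark_channel(image: torch.Tensor, patch: int = 15) -> torch.Tensor:
    """image: (3, H, W) float tensor in [0, 1]. Returns the (H, W) dark channel."""
    per_pixel_min = image.min(dim=0).values            # minimum over colour channels
    # Local minimum filter implemented as max-pooling of the negated map.
    neg = -per_pixel_min.unsqueeze(0).unsqueeze(0)
    pooled = F.max_pool2d(neg, kernel_size=patch, stride=1, padding=patch // 2)
    return -pooled.squeeze()

print(dark_channel(torch.rand(3, 128, 128)).shape)  # torch.Size([128, 128])
```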
https://arxiv.org/abs/2309.12638
You Only Look Once (YOLO)-based object detectors have shown remarkable accuracy for automated brain tumor detection. In this paper, we develop a novel BGFG-YOLO architecture by incorporating Bi-level Routing Attention (BRA), Generalized feature pyramid networks (GFPN), a fourth detection head, and a Generalized-IoU (GIoU) bounding box regression loss into YOLOv8. BGFG-YOLO contains an attention mechanism to focus more on important features, and feature pyramid networks to enrich feature representation by merging high-level semantic features with spatial details. Furthermore, we investigate the effect of different attention mechanisms, feature fusions, and detection head architectures on brain tumor detection accuracy. Experimental results show that BGFG-YOLO gives a 3.4% absolute increase in mAP50 compared to YOLOv8x, and achieves state-of-the-art results on the brain tumor detection dataset Br35H. The code is available at this https URL.
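Among the listed components, the Generalized-IoU (GIoU) regression loss has a standard definition that can be sketched compactly; boxes are assumed to be in (x1, y1, x2, y2) format, and this is a generic illustration rather than the YOLOv8 implementation.

```python
import torch

def giou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2). Returns mean (1 - GIoU)."""
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)
    # Smallest enclosing box: GIoU subtracts the fraction of "wasted" enclosing area.
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose.clamp(min=1e-7)
    return (1.0 - giou).mean()

print(giou_loss(torch.tensor([[0., 0., 10., 10.]]), torch.tensor([[2., 2., 12., 12.]])).item())
```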
https://arxiv.org/abs/2309.12585
Existing research has shown the potential of classifying Alzheimer's Disease (AD) from eye-tracking (ET) data with classifiers that rely on task-specific engineered features. In this paper, we investigate whether we can improve on existing results by using a deep-learning classifier trained end-to-end on raw ET data. This classifier (VTNet) uses a GRU and a CNN in parallel to leverage both visual (V) and temporal (T) representations of ET data and was previously used to detect user confusion while processing visual displays. A main challenge in applying VTNet to our target AD classification task is that the available ET data sequences are much longer than those used in the previous confusion detection task, pushing the limits of what is manageable by LSTM-based models. We discuss how we address this challenge and show that VTNet outperforms the state-of-the-art approaches in AD classification, providing encouraging evidence of the generality of this model for making predictions from ET data.
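The parallel visual/temporal design can be approximated with a rough sketch: a GRU consumes the raw ET sequence while a small CNN consumes an image-like rendering of the same data, and the two representations are concatenated for classification. All input sizes, layer choices, and names below are invented for illustration and do not reproduce VTNet.

```python
import torch
import torch.nn as nn

class VTSketch(nn.Module):
    def __init__(self, seq_features: int = 4, hidden: int = 64, num_classes: int = 2):
        super().__init__()
        self.gru = nn.GRU(seq_features, hidden, batch_first=True)   # temporal (T) branch
        self.cnn = nn.Sequential(                                   # visual (V) branch
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(hidden + 32, num_classes)

    def forward(self, sequence: torch.Tensor, scanpath_image: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(sequence)                  # final hidden state summarises the sequence
        visual = self.cnn(scanpath_image)          # pooled features of the rendered scanpath
        return self.head(torch.cat([h[-1], visual], dim=1))

logits = VTSketch()(torch.randn(8, 500, 4), torch.randn(8, 1, 64, 64))
print(logits.shape)  # torch.Size([8, 2])
```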
https://arxiv.org/abs/2309.12574
Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing transformer-based TS-VAD that can handle a variable number of speakers and further introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between the training and testing conditions regarding speaker profiles. Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD method on both the VoxConverse and DIHARD-I datasets.
https://arxiv.org/abs/2309.12521
The Video and Image Processing (VIP) Cup is a student competition that takes place each year at the IEEE International Conference on Image Processing. The 2022 IEEE VIP Cup asked undergraduate students to develop a system capable of distinguishing pristine images from generated ones. The interest in this topic stems from the incredible advances in AI-based generation of visual data, with tools that allow the synthesis of highly realistic images and videos. While this opens up a large number of new opportunities, it also undermines the trustworthiness of media content and fosters the spread of disinformation on the internet. Recently, strong concern has been raised about the generation of extremely realistic images by editing software that incorporates recent diffusion-model technology. In this context, there is a need to develop robust and automatic tools for synthetic image detection.
https://arxiv.org/abs/2309.12428
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of various ViT structures, ViTs have become increasingly advantageous for many vision tasks. However, the quadratic complexity of self-attention renders ViTs computationally intensive, and their lack of the inductive biases of locality and translation equivariance demands larger model sizes than CNNs to effectively learn visual features. In this paper, we propose a lightweight and efficient vision transformer model called DualToken-ViT that leverages the advantages of both CNNs and ViTs. DualToken-ViT effectively fuses a token carrying local information, obtained by a convolution-based structure, with a token carrying global information, obtained by a self-attention-based structure, to achieve an efficient attention structure. In addition, we use position-aware global tokens throughout all stages to enrich the global information, which further strengthens the effect of DualToken-ViT. The position-aware global tokens also contain the position information of the image, which makes our model better suited for vision tasks. We conducted extensive experiments on image classification, object detection and semantic segmentation tasks to demonstrate the effectiveness of DualToken-ViT. On the ImageNet-1K dataset, our models of different scales achieve accuracies of 75.4% and 79.4% with only 0.5G and 1.0G FLOPs, respectively, and our model with 1.0G FLOPs outperforms LightViT-T, which uses global tokens, by 0.7%.
https://arxiv.org/abs/2309.12424
The goal of this work is Active Speaker Detection (ASD), the task of determining whether a person is speaking or not in a series of video frames. Previous works have addressed the task by exploring network architectures, while the learning of effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is applied only to the parts of the segments in which a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
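A talk-aware contrastive objective in this spirit can be sketched as an InfoNCE loss over per-frame audio and visual embeddings, evaluated only on frames labelled as active speech. The pairing scheme and temperature below are assumptions, not TalkNCE's exact formulation.

```python
import torch
import torch.nn.functional as F

def talk_aware_nce(audio: torch.Tensor, visual: torch.Tensor,
                   speaking: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """audio, visual: (T, d) frame embeddings; speaking: (T,) bool mask of active-speech frames."""
    a = F.normalize(audio[speaking], dim=1)            # keep only frames with actual speech
    v = F.normalize(visual[speaking], dim=1)
    logits = a @ v.T / temperature                     # similarity of every audio/visual frame pair
    targets = torch.arange(a.size(0))                  # temporally matching frames are the positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

audio, visual = torch.randn(100, 128), torch.randn(100, 128)
speaking = torch.rand(100) > 0.5
print(talk_aware_nce(audio, visual, speaking).item())
```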
https://arxiv.org/abs/2309.12306
We tackle the problem of robust novelty detection, where we aim to detect novelties in terms of semantic content while being invariant to changes in other, irrelevant factors. Specifically, we operate in a setup with multiple environments, in which we determine the set of features that are associated more with the environments than with the content relevant to the task. Thus, we propose a method that starts from a pretrained embedding and a multi-environment setup and ranks the features by how environment-focused they are. First, we compute a per-feature score based on the variance of the feature distribution across environments. Next, we show that by dropping the highly scored features, we remove spurious correlations and improve overall performance by up to 6%, under both covariate and sub-population shift, on both a real and a synthetic benchmark that we introduce for this task.
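The ranking-and-dropping step can be sketched in a few lines. The variance-of-per-environment-means statistic used below is an assumption about the exact score, since the abstract specifies only a feature-distribution-variance criterion, and the 20% drop fraction is likewise illustrative.

```python
import torch

def environment_scores(features: torch.Tensor, env_ids: torch.Tensor) -> torch.Tensor:
    """features: (N, d) embeddings; env_ids: (N,) integer environment labels.

    Scores each dimension by how much its mean shifts across environments.
    """
    env_means = torch.stack([features[env_ids == e].mean(dim=0) for e in env_ids.unique()])
    return env_means.var(dim=0)          # (d,): high score = feature tracks the environment

def drop_env_features(features: torch.Tensor, scores: torch.Tensor, drop_frac: float = 0.2) -> torch.Tensor:
    """Keep the least environment-focused dimensions, dropping the top-scoring fraction."""
    keep = scores.argsort()[: int(features.size(1) * (1 - drop_frac))]
    return features[:, keep]

feats = torch.randn(1000, 512)
envs = torch.randint(0, 3, (1000,))
scores = environment_scores(feats, envs)
print(drop_env_features(feats, scores).shape)  # torch.Size([1000, 409])
```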
https://arxiv.org/abs/2309.12301
Detecting fake news requires both a delicate sense of diverse clues and a profound understanding of the real-world background, which remains challenging for detectors based on small language models (SLMs) due to their knowledge and capability limitations. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with fake news detection remains underexplored. In this paper, we investigate the potential of LLMs in fake news detection. First, we conduct an empirical study and find that a sophisticated LLM such as GPT 3.5 can generally expose fake news and provide desirable multi-perspective rationales, but still underperforms the basic SLM, a fine-tuned BERT. Our subsequent analysis attributes this gap to the LLM's inability to select and integrate rationales properly in order to reach a conclusion. Based on these findings, we propose that current LLMs may not substitute for fine-tuned SLMs in fake news detection but can be good advisors for SLMs by providing multi-perspective instructive rationales. To instantiate this proposal, we design an adaptive rationale guidance network for fake news detection (ARG), in which SLMs selectively acquire insights on news analysis from the LLMs' rationales. We further derive a rationale-free version of ARG by distillation, namely ARG-D, which serves cost-sensitive scenarios without querying LLMs. Experiments on two real-world datasets demonstrate that ARG and ARG-D outperform three types of baseline methods, including SLM-based, LLM-based, and combinations of small and large language models.
https://arxiv.org/abs/2309.12247