The complex challenge of detecting sarcasm in Arabic speech on social media is compounded by the language's diversity and the nature of sarcastic expressions. There is a significant gap in the capability of existing models to interpret sarcasm in Arabic effectively, which underscores the need for more sophisticated and precise detection methods. In this paper, we investigate the impact of a fundamental preprocessing component on sarcasm detection. While emojis play a crucial role in compensating for the absence of body language and facial expressions in modern communication, their impact on automated text analysis, particularly on sarcasm detection, remains underexplored. We investigate the impact of excluding emojis from datasets on the performance of sarcasm detection models for social media content in Arabic, a lexically rich language. This investigation includes adapting and enhancing AraBERT pre-trained models, specifically by excluding emojis, to improve sarcasm detection capabilities. We use AraBERT pre-training to refine the specified models, demonstrating that the removal of emojis can significantly boost the accuracy of sarcasm detection. This approach yields a more refined interpretation of language by eliminating the potential confusion introduced by non-textual elements. Through the focused strategy of emoji removal, the evaluated AraBERT models adeptly navigate the complexities of Arabic sarcasm. This study establishes new benchmarks in Arabic natural language processing and presents valuable insights for social media platforms.
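A minimal sketch of the emoji-exclusion preprocessing step described above; the Unicode ranges are a common approximation, not the paper's exact filter:

```python
import re

# Unicode blocks covering most emojis; an approximation, not the paper's exact list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u27BF"          # miscellaneous symbols, dingbats
    "]+",
    flags=re.UNICODE,
)

def strip_emojis(text: str) -> str:
    """Remove emoji characters from a tweet before tokenization/fine-tuning."""
    return EMOJI_PATTERN.sub("", text).strip()

print(strip_emojis("thanks a lot, great service 🙄👏"))  # -> "thanks a lot, great service"
```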
https://arxiv.org/abs/2405.02195
Generalization is a major issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which ever more accurate synthesis methods are developed, it is very important to design techniques that also work well on data they were not trained on. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework, and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.
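The speaker-verification reformulation can be illustrated with a short scoring sketch; the embedding model and decision threshold are assumptions, not the paper's exact setup:

```python
import numpy as np

def fake_score(test_emb: np.ndarray, enroll_embs: np.ndarray) -> float:
    """Score a test utterance against enrollment embeddings of the claimed identity.

    A low maximum cosine similarity indicates a voice mismatch, i.e. a likely fake.
    Embeddings are assumed to come from a general-purpose pre-trained speech model.
    """
    test = test_emb / np.linalg.norm(test_emb)
    enroll = enroll_embs / np.linalg.norm(enroll_embs, axis=1, keepdims=True)
    return 1.0 - float(np.max(enroll @ test))  # higher = more likely fake

# Usage: flag the sample if fake_score(...) exceeds a calibrated threshold.
```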
https://arxiv.org/abs/2405.02179
The accuracy and robustness of 3D human pose estimation (HPE) are limited by 2D pose detection errors and the ill-posed nature of 2D-to-3D lifting, which has drawn great attention to Multi-Hypothesis HPE (MH-HPE) research. Most existing MH-HPE methods are based on generative models, which are computationally expensive and difficult to train. In this study, we propose a Probabilistic Restoration 3D Human Pose Estimation framework (PRPose) that can be integrated with any lightweight single-hypothesis model. Specifically, PRPose employs a weakly supervised approach to fit the hidden probability distribution of the 2D-to-3D lifting process in a single-hypothesis HPE model and then reverse-maps the distribution to the 2D pose input through an adaptive noise sampling strategy to generate reasonable multi-hypothesis samples effectively. Extensive experiments on 3D HPE benchmarks (Human3.6M and MPI-INF-3DHP) highlight the effectiveness and efficiency of PRPose. Code is available at: this https URL.
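The noise-sampling idea can be sketched as follows, assuming a hypothetical `lifter` callable standing in for any single-hypothesis 2D-to-3D model; the per-joint noise scales would come from the fitted distribution:

```python
import numpy as np

def sample_hypotheses(pose_2d: np.ndarray, sigma: np.ndarray, lifter, n_hyp: int = 10):
    """Generate multi-hypothesis 3D poses by perturbing the 2D input.

    pose_2d: (J, 2) detected keypoints; sigma: (J,) per-joint noise scales.
    lifter: hypothetical single-hypothesis 2D->3D model, (J, 2) -> (J, 3).
    """
    hyps = []
    for _ in range(n_hyp):
        noisy = pose_2d + np.random.randn(*pose_2d.shape) * sigma[:, None]
        hyps.append(lifter(noisy))  # one plausible 3D pose per noisy 2D sample
    return np.stack(hyps)  # (n_hyp, J, 3)
```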
https://arxiv.org/abs/2405.02114
Motivation: Hallmarks of Alzheimer's disease include amyloid-beta deposits and brain atrophy, detectable via PET and MRI scans, respectively. PET is expensive, invasive, and exposes patients to ionizing radiation. MRI is cheaper, non-invasive, and free from ionizing radiation, but is limited to measuring brain atrophy. Goal: To develop a 3D image translation model that synthesizes amyloid-beta PET images from T1-weighted MRI, exploiting the known relationship between amyloid-beta and brain atrophy. Approach: The model was trained on 616 PET/MRI pairs and validated with 264 pairs. Results: The model synthesized amyloid-beta PET images from T1-weighted MRI with a high degree of similarity, achieving high SSIM and PSNR metrics (SSIM > 0.95, PSNR = 28). Impact: Our model proves the feasibility of synthesizing amyloid-beta PET images from structural MRI, significantly enhancing accessibility for large-cohort studies and early dementia detection, while also reducing cost, invasiveness, and radiation exposure.
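The SSIM/PSNR validation can be reproduced in a few lines with scikit-image; volume loading and registration are assumed to be handled elsewhere:

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def similarity_metrics(real_pet: np.ndarray, synth_pet: np.ndarray):
    """Compare a synthesized amyloid-beta PET volume with the acquired one."""
    rng = real_pet.max() - real_pet.min()
    ssim = structural_similarity(real_pet, synth_pet, data_range=rng)
    psnr = peak_signal_noise_ratio(real_pet, synth_pet, data_range=rng)
    return ssim, psnr  # e.g. targets of SSIM > 0.95, PSNR ~ 28 as reported above
```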
https://arxiv.org/abs/2405.02109
The capability to accurately determine code similarity is crucial in many tasks related to software development. For example, it might be essential to identify code duplicates when performing software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment that combines the strengths of multiple unsupervised similarity measures. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while the Transformer-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option in the presence of abundant training data, on specific small datasets (up to 500 samples) our ensemble achieves similar results without sacrificing the interpretability of the resulting solution, and with a much lower carbon footprint associated with training. The source code of this novel approach can be downloaded from this https URL.
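A toy version of the ensemble idea, combining two illustrative unsupervised measures (not necessarily the paper's set) with fixed weights:

```python
import difflib

def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of token sets, a simple lexical similarity measure."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def char_ratio(a: str, b: str) -> float:
    """Character-level sequence similarity from difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def ensemble_similarity(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Weighted average, so individual measures' weaknesses offset each other."""
    scores = (token_jaccard(a, b), char_ratio(a, b))
    return sum(w * s for w, s in zip(weights, scores))

print(ensemble_similarity("int add(int a, int b)", "int sum(int x, int y)"))
```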
https://arxiv.org/abs/2405.02095
With the wide application of knowledge distillation between an ImageNet pre-trained teacher model and a learnable student model, industrial anomaly detection has witnessed significant achievements in the past few years. The success of knowledge distillation mainly relies on how to maintain the feature discrepancy between the teacher and student models, under the assumptions that: (1) the teacher model can jointly represent two different distributions for the normal and abnormal patterns, while (2) the student model can only reconstruct the normal distribution. However, maintaining these ideal assumptions in practice remains challenging. In this paper, we propose a simple yet effective two-stage industrial anomaly detection framework, termed AAND, which sequentially performs Anomaly Amplification and Normality Distillation to obtain a robust feature discrepancy. In the first, anomaly amplification, stage, we propose a novel Residual Anomaly Amplification (RAA) module to advance the pre-trained teacher encoder. With exposure to synthetic anomalies, it amplifies anomalies via residual generation while maintaining the integrity of the pre-trained model. It mainly comprises a Matching-guided Residual Gate and an Attribute-scaling Residual Generator, which determine the residuals' proportion and characteristics, respectively. In the second, normality distillation, stage, we further employ a reverse distillation paradigm to train a student decoder, in which a novel Hard Knowledge Distillation (HKD) loss is built to better facilitate the reconstruction of normal patterns. Comprehensive experiments on the MvTecAD, VisA, and MvTec3D-RGB datasets show that our method achieves state-of-the-art performance.
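As a hedged illustration of the underlying teacher-student discrepancy principle (not AAND's RAA/HKD specifics), a per-pixel anomaly map can be computed from cosine dissimilarity between teacher and student features:

```python
import torch
import torch.nn.functional as F

def anomaly_map(teacher_feat: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
    """Per-pixel anomaly score from teacher/student feature discrepancy.

    teacher_feat, student_feat: (B, C, H, W). Normal regions are well reconstructed
    by the student (high cosine similarity with the teacher); anomalies are not.
    """
    cos = F.cosine_similarity(teacher_feat, student_feat, dim=1)  # (B, H, W)
    return 1.0 - cos  # higher values mark more anomalous locations
```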
https://arxiv.org/abs/2405.02068
Social bots play a significant role in many online social networks (OSNs), as they imitate human behavior. This fact raises difficult questions about their capabilities and potential risks. Given the recent advances in Generative AI (GenAI), social bots are capable of producing highly realistic and complex content that mimics human creativity. As malicious social bots emerge to deceive people with their unrealistic content, identifying them and distinguishing the content they produce has become a real challenge for numerous social platforms. Several approaches to this problem have already been proposed in the literature, but the proposed solutions have not been widely evaluated. To address this issue, we evaluate the behavior of a text-based bot detector in a competitive environment under several proposed scenarios. \textit{First}, the tug-of-war between a bot and a bot detector is examined. It is interesting to analyze which party is more likely to prevail and which circumstances influence these expectations. In this regard, we model the problem as a synthetic adversarial game in which a conversational bot and a bot detector are engaged in strategic online interactions. \textit{Second}, the bot detection model is evaluated under attack examples generated by a social bot; to this end, we poison the dataset with attack examples and evaluate the model performance under this condition. \textit{Finally}, to investigate the impact of the dataset, a cross-domain analysis is performed. Through our comprehensive evaluation of different categories of social bots using two benchmark datasets, we were able to demonstrate some achievements that could be utilized in future work.
https://arxiv.org/abs/2405.02016
Early detection of cancer can help improve patient prognosis through early intervention. Head and neck cancer is diagnosed in specialist centres after a surgical biopsy; however, there is a potential for these to be missed, leading to delayed diagnosis. To overcome these challenges, we present an attention-based pipeline that identifies suspected lesions, segments them, and classifies them as non-dysplastic, dysplastic, or cancerous. We propose (a) a vision-transformer-based Mask R-CNN network for lesion detection and segmentation of clinical images, and (b) a Multiple Instance Learning (MIL)-based scheme for classification. Current results show that the segmentation model produces segmentation masks and bounding boxes with up to an 82% overlap accuracy score on unseen external test data, surpassing reviewed segmentation benchmarks, and that classification achieves an F1-score of 85% on the internal cohort test set. An app has been developed to perform lesion segmentation on images taken via a smart device. Future work involves employing endoscopic video data for precise early detection and prognosis.
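The MIL classification stage can be sketched with a standard attention-pooling head; this follows the common attention-based MIL formulation rather than the paper's exact scheme:

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Attention pooling over instance embeddings for bag-level classification.

    An illustrative MIL head (standard attention-MIL style), not the paper's exact model.
    """
    def __init__(self, dim: int, hidden: int = 128, n_classes: int = 3):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.cls = nn.Linear(dim, n_classes)  # non-dysplastic / dysplastic / cancerous

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        # instances: (N, dim) patch embeddings from one lesion image (a "bag")
        w = torch.softmax(self.attn(instances), dim=0)  # (N, 1) attention weights
        bag = (w * instances).sum(dim=0)                # weighted bag embedding
        return self.cls(bag)                            # bag-level class logits
```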
https://arxiv.org/abs/2405.01937
Detecting changes in heterogeneous remote sensing images is vital, especially in response to emergencies such as earthquakes and floods. Current homogeneous-transformation-based change detection (CD) methods often suffer from high computation and memory costs, which are not friendly to edge-computing devices such as onboard CD devices on satellites. To address this issue, this paper proposes a new lightweight CD method for heterogeneous remote sensing images that employs an online all-integer pruning (OAIP) training strategy to efficiently fine-tune the CD network using the current test data. The proposed CD network consists of two visual geometry group (VGG) subnetworks as the backbone architecture. In the OAIP-based training process, all weights, gradients, and intermediate data are quantized to integers to speed up training and reduce memory usage, where a per-layer block exponentiation scaling scheme is utilized to reduce the computation errors in network parameters caused by quantization. Second, an adaptive filter-level pruning method based on the L1-norm criterion is employed to further lighten the fine-tuning process of the CD network. Experimental results show that the proposed OAIP-based method attains detection performance similar to that of state-of-the-art CD methods, but with significantly reduced computational complexity and memory usage.
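The L1-norm filter-level pruning criterion can be illustrated briefly; OAIP's adaptive schedule and integer quantization are not reproduced here:

```python
import torch
import torch.nn as nn

def l1_filter_mask(conv: nn.Conv2d, keep_ratio: float = 0.7) -> torch.Tensor:
    """Rank convolution filters by the L1 norm of their weights and keep the strongest.

    A minimal sketch of L1-norm filter-level pruning: filters with small L1 norms
    contribute little and are candidates for removal during fine-tuning.
    """
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one L1 norm per filter
    k = max(1, int(keep_ratio * norms.numel()))
    keep = torch.zeros_like(norms, dtype=torch.bool)
    keep[norms.topk(k).indices] = True
    return keep  # boolean mask over output channels

print(l1_filter_mask(nn.Conv2d(3, 8, 3), keep_ratio=0.5))
```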
https://arxiv.org/abs/2405.01920
The detection of traversable regions on staircases and the physical modeling of staircases constitute pivotal aspects of legged robot mobility. This paper presents an onboard framework tailored to the detection of traversable regions and the modeling of the physical attributes of staircases from point cloud data. To mitigate the influence of illumination variations and overfitting due to limited dataset diversity, a series of data augmentations is introduced to enhance the training of the fundamental network. A curvature suppression cross-entropy (CSCE) loss is proposed to reduce the ambiguity of predictions at the boundary between traversable and non-traversable regions. Moreover, a measurement correction based on the pose estimation of stairs is introduced to calibrate the output of the raw modeling, which is influenced by tilted perspectives. Lastly, we collect a dataset pertaining to staircases and introduce new evaluation criteria. Through a series of rigorous experiments conducted on this dataset, we substantiate the superior accuracy and generalization capabilities of our proposed method. Codes, models, and datasets will be available at this https URL.
https://arxiv.org/abs/2405.01918
To address the limitations of current hate speech detection models, we introduce \textsf{SGHateCheck}, a novel framework designed for the linguistic and cultural context of Singapore and Southeast Asia. It extends the functional testing approach of HateCheck and MHC, employing large language models for translation and paraphrasing into Singapore's main languages, and refining these with native annotators. \textsf{SGHateCheck} reveals critical flaws in state-of-the-art models, highlighting their inadequacy in sensitive content moderation. This work aims to foster the development of more effective hate speech detection tools for diverse linguistic environments, particularly for Singaporean and Southeast Asian contexts.
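The functional-testing idea can be illustrated with a toy test case and runner; the field names and the example case are assumptions, not SGHateCheck's released schema:

```python
# Illustrative shape of a functional test case in the HateCheck template style.
test_cases = [
    {"functionality": "negated hateful statement", "language": "Singlish",
     "text": "I never said anything bad about them lah", "gold_label": "non-hateful"},
]

def run_suite(classify, cases):
    """Return the cases a classifier gets wrong, grouped inspection per functionality."""
    return [c for c in cases if classify(c["text"]) != c["gold_label"]]

# Usage: failures = run_suite(my_model_predict, test_cases)
```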
https://arxiv.org/abs/2405.01842
Facial Expression Recognition (FER) plays a pivotal role in understanding human emotional cues. However, traditional FER methods based on visual information have some limitations, such as preprocessing, feature extraction, and multi-stage classification procedures. These not only increase computational complexity but also require a significant amount of computing resources. Considering that Convolutional Neural Network (CNN)-based FER schemes frequently prove inadequate at identifying the deep, long-distance dependencies embedded within facial expression images, and that Transformers carry an inherent quadratic computational complexity, this paper presents the FER-YOLO-Mamba model, which integrates the principles of Mamba and YOLO technologies to facilitate efficient coordination in facial expression image recognition and localization. Within the FER-YOLO-Mamba model, we further devise a FER-YOLO-VSS dual-branch module, which combines the inherent strengths of convolutional layers in local feature extraction with the exceptional capability of State Space Models (SSMs) in revealing long-distance dependencies. To the best of our knowledge, this is the first Vision Mamba model designed for facial expression detection and classification. To evaluate the performance of the proposed FER-YOLO-Mamba model, we conducted experiments on two benchmark datasets, RAF-DB and SFEW. The experimental results indicate that the FER-YOLO-Mamba model achieved better results compared to other models. The code is available from this https URL.
https://arxiv.org/abs/2405.01828
We introduce an innovative, simple, and effective segmentation-free approach for outcome prediction in head \& neck cancer (HNC) patients. By harnessing deep learning-based feature extraction techniques and multi-angle maximum intensity projections (MA-MIPs) applied to Fluorodeoxyglucose Positron Emission Tomography (FDG-PET) volumes, our proposed method eliminates the need for manual segmentation of regions of interest (ROIs) such as primary tumors and involved lymph nodes. Instead, a state-of-the-art object detection model is trained to automatically crop the head and neck region on the PET volumes. A pre-trained deep convolutional neural network backbone is then utilized to extract deep features from MA-MIPs obtained from 72 multi-angle axial rotations of the cropped PET volumes. These deep features, extracted from multiple projection views of the PET volumes, are then aggregated and fused, and employed to perform recurrence-free survival analysis on a cohort of 489 HNC patients. The proposed approach outperforms the best-performing method on the target dataset for the task of recurrence-free survival analysis. By circumventing the manual delineation of malignancies on the FDG PET-CT images, our approach eliminates the dependency on subjective interpretations and greatly enhances the reproducibility of the proposed survival analysis method.
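The MA-MIP computation can be sketched with NumPy/SciPy: rotate the cropped volume about the axial axis (72 steps of 5 degrees) and take a maximum intensity projection at each angle. Axis conventions here are an assumption:

```python
import numpy as np
from scipy.ndimage import rotate

def ma_mips(volume: np.ndarray, n_angles: int = 72) -> np.ndarray:
    """Multi-angle maximum intensity projections of a cropped PET volume.

    volume: (z, y, x) array. The volume is rotated in the axial (y, x) plane and
    a max projection is taken along one in-plane axis at each angle.
    """
    step = 360.0 / n_angles
    mips = []
    for i in range(n_angles):
        rot = rotate(volume, angle=i * step, axes=(1, 2), reshape=False, order=1)
        mips.append(rot.max(axis=1))  # maximum intensity projection -> (z, x)
    return np.stack(mips)  # (n_angles, z, x), fed to the CNN feature extractor
```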
https://arxiv.org/abs/2405.01756
In the context of Intelligent Transportation Systems (ITS), efficient data compression is crucial for managing large-scale point cloud data acquired by roadside LiDAR sensors. The demand for efficient storage, streaming, and real-time object detection capabilities for point cloud data is substantial. This work introduces PointCompress3D, a novel point cloud compression framework tailored specifically for roadside LiDARs. Our framework addresses the challenges of compressing high-resolution point clouds while maintaining accuracy and compatibility with roadside LiDAR sensors. We adapt, extend, integrate, and evaluate three cutting-edge compression methods using our real-world-based TUMTraf dataset family. We achieve a frame rate of 10 FPS while keeping compression sizes below 105 Kb, a 50-fold reduction, and maintaining object detection performance on par with the original data. In extensive experiments and ablation studies, we achieved a PSNR-D2 of 94.46 and a BPP of 6.54 on our dataset. Future work includes deployment on the live system. The code is available on our project website: this https URL.
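For context, bits per point (BPP) normalizes the compressed size by the point count; a minimal sketch in which the per-frame point count is a hypothetical figure:

```python
def bits_per_point(compressed_bytes: int, num_points: int) -> float:
    """Bits per point: total compressed size in bits over the number of input points."""
    return compressed_bytes * 8 / num_points

# With ~105 KB per frame (as reported) and a hypothetical 128k points per frame:
print(bits_per_point(105_000, 128_000))  # ~6.56 bits per point
```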
https://arxiv.org/abs/2405.01750
Diabetic Retinopathy (DR), a prevalent complication in diabetes patients, can lead to vision impairment due to lesions formed on the retina. Detecting DR only at an advanced stage often results in irreversible blindness. The traditional process of ophthalmologists diagnosing DR from retina fundus images is not only time-intensive but also expensive. While classical transfer learning models have been widely adopted for computer-aided detection of DR, their high maintenance costs can hinder their detection efficiency. In contrast, Quantum Transfer Learning offers a more effective solution to this challenge. This approach is notably advantageous because it operates on heuristic principles, making it highly optimized for the task. Our proposed methodology leverages this hybrid quantum transfer learning technique to detect DR. To construct our model, we utilize the APTOS 2019 Blindness Detection dataset, available on Kaggle. We employ the pre-trained classical neural networks ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152, and Inception V3 for the initial feature extraction. For the classification stage, we use a Variational Quantum Classifier. Our hybrid quantum model has shown remarkable results, achieving an accuracy of 97% for ResNet-18. This demonstrates that quantum computing, when integrated with quantum machine learning, can perform tasks with a level of power and efficiency unattainable by classical computers alone. By harnessing these advanced technologies, we can significantly improve the detection and diagnosis of Diabetic Retinopathy, potentially saving many from the risk of blindness. Keywords: Diabetic Retinopathy, Quantum Transfer Learning, Deep Learning
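A hedged sketch of a variational quantum classifier head over classical features, using PennyLane as one possible library; the embedding and ansatz choices are assumptions, not the paper's exact circuit:

```python
import pennylane as qml
from pennylane import numpy as np  # autograd-enabled numpy wrapper

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc(features, weights):
    # Angle-embed (reduced) classical ResNet features, then trainable entangling layers.
    qml.AngleEmbedding(features, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))  # expectation used as the classification score

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
weights = np.random.random(shape)  # trainable by default with pennylane.numpy
print(vqc(np.array([0.1, 0.5, 0.9, 0.2]), weights))
```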
https://arxiv.org/abs/2405.01734
Deep learning has made significant progress in computer vision, specifically in image classification, object detection, and semantic segmentation. The skip connection has played an essential role in the architecture of deep neural networks, enabling easier optimization through residual learning during the training stage and improving accuracy during testing. Many neural networks have inherited the idea of residual learning with skip connections for various tasks, and it has been the standard choice for designing neural networks. This survey provides a comprehensive summary and outlook on the development of skip connections in deep neural networks. The short history of skip connections is outlined, and the development of residual learning in deep neural networks is surveyed. The effectiveness of skip connections in the training and testing stages is summarized, and future directions for using skip connections in residual learning are discussed. Finally, we summarize seminal papers, source code, models, and datasets that utilize skip connections in computer vision, including image classification, object detection, semantic segmentation, and image reconstruction. We hope this survey can inspire peer researchers in the community to further develop skip connections in various forms and tasks, as well as the theory of residual learning in deep neural networks. The project page can be found at this https URL
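The canonical residual block that the survey revolves around, as a minimal PyTorch sketch: the block learns a residual F(x) and the skip connection adds the identity back, F(x) + x.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), with an identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + x)  # the skip eases gradient flow/optimization

print(ResidualBlock(16)(torch.randn(1, 16, 8, 8)).shape)
```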
https://arxiv.org/abs/2405.01725
Cell image segmentation is usually implemented using fully supervised deep learning methods, which heavily rely on extensive annotated training data. Yet, due to the complexity of cell morphology and the requirement for specialized knowledge, pixel-level annotation of cell images has become a highly labor-intensive task. To address the above problems, we propose an active learning framework for cell segmentation using bounding-box annotations, which greatly reduces the data annotation cost of cell segmentation algorithms. First, we devise a box-supervised learning method (denoted YOLO-SAM) by combining the YOLOv8 detector with the Segment Anything Model (SAM), which effectively reduces the complexity of data annotation. Furthermore, it is integrated into an active learning framework that employs the MC DropBlock method to train the segmentation model with fewer box-annotated samples. Extensive experiments demonstrate that our model saves more than ninety percent of data annotation time compared to mask-supervised deep learning methods.
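A minimal sketch of the YOLO-to-SAM combination, assuming the public ultralytics and segment-anything APIs; the checkpoints and file paths are placeholders:

```python
import cv2
from ultralytics import YOLO
from segment_anything import SamPredictor, sam_model_registry

detector = YOLO("yolov8n.pt")                                  # box detector (placeholder weights)
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("cells.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
for box in detector(image)[0].boxes.xyxy.cpu().numpy():
    # Each detected cell box prompts SAM for a mask, avoiding pixel-level annotation.
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    # masks[0] serves as the box-derived segmentation supervision
```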
https://arxiv.org/abs/2405.01701
Small object detection in aerial imagery presents significant challenges in computer vision due to the minimal data inherent in small-sized objects and their propensity to be obscured by larger objects and background noise. Traditional methods using transformer-based models often face limitations stemming from the lack of specialized databases, which adversely affects their performance with objects of varying orientations and scales. This underscores the need for more adaptable, lightweight models. In response, this paper introduces two innovative approaches that significantly enhance detection and segmentation capabilities for small aerial objects. Firstly, we explore the use of the SAHI framework on the newly introduced lightweight YOLO v9 architecture, which utilizes Programmable Gradient Information (PGI) to reduce the substantial information loss typically encountered in sequential feature extraction processes. Secondly, we employ the Vision Mamba model, which incorporates position embeddings to facilitate precise location-aware visual understanding, combined with a novel bidirectional State Space Model (SSM) for effective visual context modeling. This State Space Model adeptly harnesses the linear complexity of CNNs and the global receptive field of Transformers, making it particularly effective in remote sensing image classification. Our experimental results demonstrate substantial improvements in detection accuracy and processing efficiency, validating the applicability of these approaches for real-time small object detection across diverse aerial scenarios. This paper also discusses how these methodologies could serve as foundational models for future advancements in aerial object recognition technologies. The source code will be made accessible here.
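Sliced inference with SAHI can be sketched as follows; the model type, weights path, and slice sizes are placeholders rather than the paper's configuration:

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap a YOLO-family detector for sliced inference (model/weights are placeholders).
model = AutoDetectionModel.from_pretrained(
    model_type="yolov8", model_path="weights.pt", confidence_threshold=0.3
)

# Tile the large aerial image into overlapping slices, detect per slice, merge results.
result = get_sliced_prediction(
    "aerial.jpg", model,
    slice_height=640, slice_width=640,
    overlap_height_ratio=0.2, overlap_width_ratio=0.2,
)
print(len(result.object_prediction_list))  # detections merged across all slices
```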
https://arxiv.org/abs/2405.01699
Out-of-distribution (OOD) detection is essential in autonomous driving to determine when learning-based components encounter unexpected inputs. Traditional detectors typically use encoder models with fixed settings, thus lacking effective human interaction capabilities. With the rise of large foundation models, multimodal inputs offer the possibility of taking human language as a latent representation, thus enabling language-defined OOD detection. In this paper, we use the cosine similarity of image and text representations encoded by the multimodal model CLIP as a new representation to improve the transparency and controllability of latent encodings used for visual anomaly detection. We compare our approach with existing pre-trained encoders that can only produce latent representations that are meaningless from the user's standpoint. Our experiments on realistic driving data show that the language-based latent representation performs better than the traditional representation of the vision encoder and helps improve detection performance when combined with standard representations.
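A brief sketch of the CLIP-based representation: cosine similarities between an image embedding and a set of text prompts, using OpenAI's CLIP package; the prompt set here is an illustrative assumption:

```python
import clip
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Language-defined axes of the latent representation (illustrative prompts).
prompts = clip.tokenize(["a photo of a normal driving scene",
                         "a photo of an unexpected road obstacle"]).to(device)

image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(prompts)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    cos = (img @ txt.T).squeeze(0)  # one interpretable cosine score per prompt
print(cos.tolist())
```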
https://arxiv.org/abs/2405.01691
Out-of-distribution (OOD) detection, crucial for reliable pattern classification, discerns whether a sample originates outside the training distribution. This paper concentrates on the high-dimensional features output by the final convolutional layer, which contain rich image features. Our key idea is to project these high-dimensional features into two specific feature subspaces, leveraging the dimensionality reduction capacity of the network's linear layers, trained with Predefined Evenly-Distributed Class Centroids (PEDCC)-Loss. This involves calculating the cosines of three projection angles and the norm values of features, thereby identifying distinctive information for in-distribution (ID) and OOD data, which assists in OOD detection. Building upon this, we have modified the batch normalization (BN) and ReLU layers preceding the fully connected layer, diminishing their impact on the output feature distributions and thereby widening the distribution gap between ID and OOD data features. Our method requires only the training of the classification network model, eschewing any need for input pre-processing or specific OOD data pre-tuning. Extensive experiments on several benchmark datasets demonstrate that our approach delivers state-of-the-art performance. Our code is available at this https URL.
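A hedged sketch of scoring against predefined class centroids; this shows a single cosine-plus-norm score per sample rather than the paper's full three-angle decomposition:

```python
import torch
import torch.nn.functional as F

def centroid_scores(features: torch.Tensor, centroids: torch.Tensor):
    """Cosines between penultimate features and predefined class centroids (PEDCC-style).

    features: (N, D) penultimate-layer features; centroids: (K, D) fixed class centroids.
    ID samples align tightly with some centroid (high max cosine) and have typical norms;
    OOD samples tend not to, so both quantities carry detection signal.
    """
    cos = F.cosine_similarity(features.unsqueeze(1), centroids.unsqueeze(0), dim=-1)
    return cos.max(dim=1).values, features.norm(dim=-1)  # (max cosine, norm) per sample
```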
https://arxiv.org/abs/2405.01662