Objectives: This work aims to explore the impact of multicenter data heterogeneity on deep learning brain metastases (BM) autosegmentation performance, and to assess the efficacy of an incremental transfer learning technique, namely learning without forgetting (LWF), for improving model generalizability without sharing raw data. Materials and methods: A total of six BM datasets from University Hospital Erlangen (UKER), University Hospital Zurich (USZ), Stanford, UCSF, NYU, and the BraTS Challenge 2023 on BM segmentation were used for this evaluation. First, the multicenter performance of a convolutional neural network (DeepMedic) for BM autosegmentation was established for exclusive single-center training and for training on pooled data, respectively. Subsequently, bilateral collaboration was evaluated, in which a UKER-pretrained model was shared with another center for further training using transfer learning (TL), either with or without LWF. Results: For single-center training, average F1 scores for BM detection range from 0.625 (NYU) to 0.876 (UKER) on the respective single-center test data. Mixed multicenter training notably improves F1 scores at Stanford and NYU, with negligible improvement at the other centers. When the UKER-pretrained model is applied to USZ, LWF achieves a higher average F1 score (0.839) than naive TL (0.570) and single-center training (0.688) on combined UKER and USZ test data. Naive TL improves sensitivity and contouring accuracy but compromises precision, whereas LWF demonstrates commendable sensitivity, precision, and contouring accuracy. When applied to Stanford, similar behavior is observed. Conclusion: Data heterogeneity results in varying BM autosegmentation performance, posing challenges to model generalizability. LWF is a promising approach to peer-to-peer, privacy-preserving model training.
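A minimal sketch of the learning-without-forgetting idea referenced above: when fine-tuning the shared model at the receiving center, a distillation term keeps its predictions close to those of the frozen source-center model. The generic PyTorch segmentation model, cross-entropy/KL losses, and weighting factor are illustrative assumptions; the paper's DeepMedic-specific setup may differ.

```python
import torch
import torch.nn.functional as F

def lwf_step(model, old_model, optimizer, images, labels, lambda_lwf=1.0):
    """One fine-tuning step at the receiving center: task loss on local labels
    plus a distillation term toward the frozen source-center model."""
    model.train()
    old_model.eval()
    logits = model(images)                      # (B, C, ...) segmentation logits
    with torch.no_grad():
        old_logits = old_model(images)          # "memory" of the source center
    task_loss = F.cross_entropy(logits, labels)
    distill_loss = F.kl_div(F.log_softmax(logits, dim=1),
                            F.softmax(old_logits, dim=1),
                            reduction="batchmean")
    loss = task_loss + lambda_lwf * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Before fine-tuning, old_model is kept as a frozen deep copy of the pretrained model.
```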
https://arxiv.org/abs/2405.10870
Convolutional neural networks (CNNs) are among the most widely used machine learning models for computer vision tasks such as image classification. To improve the efficiency of CNNs, many CNN compression approaches have been developed. Low-rank methods approximate the original convolutional kernel with a sequence of smaller convolutional kernels, which leads to reduced storage and time complexity. In this study, we propose a novel low-rank CNN compression method based on reduced storage direct tensor ring decomposition (RSDTR). The proposed method offers higher circular mode permutation flexibility and is characterized by large parameter and FLOPS compression rates while preserving good classification accuracy of the compressed network. The experiments, performed on the CIFAR-10 and ImageNet datasets, clearly demonstrate the efficiency of RSDTR in comparison to other state-of-the-art CNN compression approaches.
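To make the storage argument concrete, here is a generic low-rank replacement of a convolution by a sequence of smaller convolutions (a 1x1 compression, a small-rank spatial convolution, and a 1x1 expansion). This is only the general bottleneck idea, not RSDTR's tensor ring factorization; the layer sizes and rank are illustrative.

```python
import torch.nn as nn

def low_rank_conv(in_ch, out_ch, kernel_size, rank, padding=None):
    """Approximate a full conv with a sequence of smaller convs."""
    if padding is None:
        padding = kernel_size // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, rank, kernel_size=1, bias=False),                    # compress channels
        nn.Conv2d(rank, rank, kernel_size, padding=padding, bias=False),      # spatial conv at low rank
        nn.Conv2d(rank, out_ch, kernel_size=1, bias=False),                   # expand channels
    )

full = nn.Conv2d(256, 256, 3, padding=1, bias=False)
lowr = low_rank_conv(256, 256, 3, rank=32)
n_full = sum(p.numel() for p in full.parameters())
n_low = sum(p.numel() for p in lowr.parameters())
print(f"full: {n_full} params, low-rank: {n_low} params ({n_full / n_low:.1f}x smaller)")
```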
https://arxiv.org/abs/2405.10802
Place recognition is a fundamental task for robotic applications, allowing robots to perform loop closure detection within simultaneous localization and mapping (SLAM) and to achieve relocalization on prior maps. Current range image-based networks use single-column convolution to maintain feature invariance to shifts in image columns caused by LiDAR viewpoint changes. However, this raises issues such as "restricted receptive fields" and "excessive focus on local regions", degrading network performance. To address these issues, we propose a lightweight circular convolutional Transformer network, denoted CCTNet, which boosts performance by capturing structural information in point clouds and facilitating cross-dimensional interaction of spatial and channel information. First, a Circular Convolution Module (CCM) is introduced, expanding the network's perceptual field while maintaining feature consistency across varying LiDAR perspectives. Then, a Range Transformer Module (RTM) is proposed, which enhances place recognition accuracy in scenarios with movable objects by employing a combination of channel and spatial attention mechanisms. Furthermore, we propose an overlap-based loss function, transforming the place recognition task from binary loop-closure classification into a regression problem linked to the overlap between LiDAR frames. Through extensive experiments on the KITTI and Ford Campus datasets, CCTNet surpasses comparable methods, achieving Recall@1 of 0.924 and 0.965 and Recall@1% of 0.990 and 0.993 on the test sets, showcasing superior performance. Results on a self-collected dataset further demonstrate the proposed method's potential for practical deployment in complex scenarios with movable objects, showing improved generalization across datasets.
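The circular-convolution idea can be sketched as an ordinary convolution with circular padding along the azimuth (column) axis of the range image, so that features wrap around under viewpoint-induced column shifts. The layer sizes below are illustrative, not the CCM as implemented in CCTNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircularConv2d(nn.Module):
    """Conv over a range image with circular padding in width, zero padding in height."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pad = kernel_size // 2
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=0)  # padding done manually

    def forward(self, x):                                             # x: (B, C, H, W)
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")     # wrap azimuth columns
        x = F.pad(x, (0, 0, self.pad, self.pad), mode="constant")     # zero-pad beam rows
        return self.conv(x)

layer = CircularConv2d(1, 16)
rng = torch.rand(2, 1, 64, 900)    # e.g. a 64-beam LiDAR range image with 900 columns
print(layer(rng).shape)            # torch.Size([2, 16, 64, 900])
```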
https://arxiv.org/abs/2405.10793
Modern diffusion MRI sequences commonly acquire a large number of volumes with diffusion sensitization gradients of differing strengths or directions. Such sequences rely on echo-planar imaging (EPI) to achieve reasonable scan durations. However, EPI is vulnerable to off-resonance effects, leading to tissue-susceptibility- and eddy-current-induced distortions. The latter is particularly problematic because it causes misalignment between volumes, disrupting downstream modelling and analysis. The essential correction of eddy distortions is typically done post-acquisition, with image registration. However, this is non-trivial because correspondence between volumes can be severely disrupted by the volume-specific signal attenuation induced by the varying directions and strengths of the applied gradients. This challenge has been successfully addressed by the popular FSL Eddy tool, but at considerable computational cost. We propose an alternative approach, leveraging recent advances in image processing enabled by deep learning (DL). It consists of two convolutional neural networks: 1) an image translator to restore correspondence between images; 2) a registration model to align the translated images. Results demonstrate distortion estimates comparable to FSL Eddy, while requiring only modest training sample sizes. This work, to the best of our knowledge, is the first to tackle this problem with deep learning. Together with recently developed DL-based susceptibility correction techniques, it paves the way for real-time preprocessing of diffusion MRI, facilitating its wider uptake in the clinic.
https://arxiv.org/abs/2405.10723
In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features a T2-refine fusion decoder for quantitative analysis, leveraging global features from the Transformer, and a segmentation decoder with multiple local region supervision for enhanced accuracy. A tight coupling module aligns and fuses CNN and Transformer branch features, enabling SQNet to focus on myocardium regions. Evaluation on healthy controls (HC) and acute myocardial infarction patients (AMI) demonstrates superior segmentation dice scores (89.3/89.2) compared to state-of-the-art methods (87.7/87.9). T2 quantification yields strong linear correlations (Pearson coefficients: 0.84/0.93) with label values for HC/AMI, indicating accurate mapping. Radiologist evaluations confirm SQNet's superior image quality scores (4.60/4.58 for segmentation, 4.32/4.42 for T2 quantification) over state-of-the-art methods (4.50/4.44 for segmentation, 3.59/4.37 for T2 quantification). SQNet thus offers accurate simultaneous segmentation and quantification, enhancing cardiac disease diagnosis, such as AMI.
https://arxiv.org/abs/2405.10570
Locating an object in a sequence of frames, given its appearance in the first frame of the sequence, is a hard problem that involves many stages. Usually, state-of-the-art methods focus on bringing novel ideas in the visual encoding or relational modelling phases. However, in this work, we show that bounding box regression from learned joint search and template features is of high importance as well. While previous methods relied heavily on well-learned features representing interactions between search and template, we hypothesize that the receptive field of the input convolutional bounding box network plays an important role in accurately determining the object location. To this end, we introduce two novel bounding box regression networks: inception and deformable. Experiments and ablation studies show that our inception module installed on the recent ODTrack outperforms the latter on three benchmarks: the GOT-10k, the UAV123 and the OTB2015.
https://arxiv.org/abs/2405.10444
We demonstrate the capabilities of an attention-based end-to-end approach for high-speed quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional vision-based navigation via independent mapping, planning, and control modules breaks down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end planning and control networks have been shown to be effective for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer models for depth-based end-to-end control, in a photorealistic, high-physics-fidelity simulator as well as in hardware, and observe that the attention-based models are more effective as quadrotor speeds increase, while recurrent models with many layers provide smoother commands at lower speeds. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.
https://arxiv.org/abs/2405.10391
Deep learning models, particularly Convolutional Neural Networks (CNNs), have demonstrated exceptional performance in diagnosing skin diseases, often outperforming dermatologists. However, they have also unveiled biases linked to specific demographic traits, notably concerning diverse skin tones or gender, prompting concerns regarding fairness and limiting their widespread deployment. Researchers are actively working to ensure fairness in AI-based solutions, but existing methods incur an accuracy loss when striving for fairness. To solve this issue, we propose an approach based on 'two biased teachers' (i.e., teachers biased on different sensitive attributes) to transfer fair knowledge into the student network. Our approach mitigates biases present in the student network without harming its predictive accuracy; in fact, in most cases, our approach improves the accuracy of the baseline model. To achieve this goal, we developed a weighted loss function comprising biasing and debiasing loss terms. We surpass available state-of-the-art approaches in fairness while also improving accuracy. The proposed approach has been evaluated and validated on two dermatology datasets using standard accuracy and fairness evaluation measures. We will make the source code publicly available to foster reproducibility and future research.
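A minimal sketch of the loss structure suggested above: a task loss plus one distillation term per biased teacher, combined with weights. The temperature, weights, class count, and the exact form of the paper's biasing/debiasing terms are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def two_teacher_loss(student_logits, t1_logits, t2_logits, labels,
                     w_task=1.0, w_t1=0.5, w_t2=0.5, temperature=2.0):
    """Weighted combination of a task loss and two teacher-distillation terms."""
    task = F.cross_entropy(student_logits, labels)
    def kd(teacher_logits):
        return F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                        F.softmax(teacher_logits / temperature, dim=1),
                        reduction="batchmean") * temperature ** 2
    return w_task * task + w_t1 * kd(t1_logits) + w_t2 * kd(t2_logits)

# Random tensors stand in for model outputs on a batch of 8 images, 7 classes (illustrative).
s = torch.randn(8, 7, requires_grad=True)
t1, t2 = torch.randn(8, 7), torch.randn(8, 7)
y = torch.randint(0, 7, (8,))
print(two_teacher_loss(s, t1, t2, y))
```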
https://arxiv.org/abs/2405.10256
Deformable image registration (alignment) is highly sought after in numerous clinical applications, such as computer-aided diagnosis and disease progression analysis. Deep Convolutional Neural Network (DCNN)-based image registration methods have demonstrated advantages in terms of registration accuracy and computational speed. However, while most methods excel at global alignment, they often perform worse in aligning local regions. To address this challenge, this paper proposes a mask-guided encoder-decoder DCNN-based image registration method, named MrRegNet. This approach employs a multi-resolution encoder for feature extraction and subsequently estimates multi-resolution displacement fields in the decoder to handle substantial image deformations. Furthermore, segmentation masks are employed to direct the model's attention toward aligning local regions. The results show that the proposed method outperforms traditional methods like Demons and a well-known deep learning method, VoxelMorph, on a public 3D brain MRI dataset (OASIS) and a local 2D brain MRI dataset with large deformations. Importantly, image alignment accuracy is significantly improved in local regions guided by the segmentation masks. GitHub link: this https URL.
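One plausible way to direct attention toward aligning local regions is to up-weight the image-similarity loss inside the segmentation mask; the sketch below shows that weighting only and does not reproduce MrRegNet's multi-resolution decoder or its actual loss formulation.

```python
import torch

def mask_guided_mse(warped, fixed, mask, region_weight=5.0):
    """warped, fixed: (B,1,H,W) images; mask: (B,1,H,W) in {0,1} marking the region of interest."""
    weights = 1.0 + (region_weight - 1.0) * mask   # background weight 1, masked region weight region_weight
    return (weights * (warped - fixed) ** 2).mean()

warped = torch.rand(2, 1, 64, 64)
fixed = torch.rand(2, 1, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.8).float()
print(mask_guided_mse(warped, fixed, mask))
```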
https://arxiv.org/abs/2405.10068
Monaural speech enhancement on drones is challenging because the ego-noise from the rotating motors and propellers leads to extremely low signal-to-noise ratios at onboard microphones. Although recent masking-based deep neural network methods excel in monaural speech enhancement, they struggle in the challenging drone-noise scenario. Furthermore, existing drone noise datasets are limited, causing models to overfit. Considering the harmonic nature of drone noise, this paper proposes a frequency-domain bottleneck adapter to enable transfer learning. Specifically, the adapter's parameters are trained on drone noise while the parameters of the pre-trained Frequency Recurrent Convolutional Recurrent Network (FRCRN) are kept fixed. Evaluation results demonstrate that the proposed method can effectively enhance speech quality. Moreover, it is a more efficient alternative to fine-tuning models for various drone types, which typically requires substantial computational resources.
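A minimal bottleneck-adapter sketch: the pretrained block is frozen and only a small down-project/up-project residual adapter is trained on drone-noise data. The feature dimension and the placement of the adapter inside FRCRN are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual adapter: project down to a small bottleneck, nonlinearity, project back up."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):                               # x: (..., dim) frequency features
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    def __init__(self, pretrained_block, dim):
        super().__init__()
        self.block = pretrained_block
        for p in self.block.parameters():               # keep pretrained weights fixed
            p.requires_grad = False
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

block = AdaptedBlock(nn.Linear(257, 257), dim=257)      # 257 frequency bins (assumed); Linear stands in for a pretrained block
trainable = [n for n, p in block.named_parameters() if p.requires_grad]
print(trainable)                                         # only adapter.* parameters remain trainable
```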
https://arxiv.org/abs/2405.10022
With the development of deep neural network generative models in recent years, significant progress has been made in the research of depth estimation in lane scenes. However, current research achievements are mainly focused on clear daytime scenarios. In complex rainy environments, the influence of rain streaks and local fog effects often leads to erroneous increases in the overall depth estimation values in images. Moreover, these natural factors can introduce disturbances to the accurate prediction of depth boundaries in images. In this paper, we investigate lane depth estimation in complex rainy environments. Based on the concept of convolutional kernel prediction, we propose a dual-layer pixel-wise convolutional kernel prediction network trained on offline data. By predicting two sets of independent convolutional kernels for the target image, we restore the depth information loss caused by complex environmental factors and address the issue of rain streak artifacts generated by a single convolutional kernel set. Furthermore, considering the lack of real rainy lane data currently available, we introduce an image synthesis algorithm, RCFLane, which comprehensively considers the darkening of the environment due to rainfall and local fog effects. We create a synthetic dataset containing 820 experimental images, which we refer to as RainKITTI, on the commonly used depth estimation dataset KITTI. Extensive experiments demonstrate that our proposed depth estimation framework achieves favorable results in highly complex lane rainy environments.
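The basic operation behind pixel-wise convolutional kernel prediction, applying a predicted k x k filter at every pixel, can be sketched with an unfold-and-weighted-sum; how the two independent kernel sets are predicted and combined in this paper is not reproduced here.

```python
import torch
import torch.nn.functional as F

def apply_pixelwise_kernels(img, kernels, k=3):
    """img: (B,1,H,W); kernels: (B, k*k, H, W), one k x k filter per pixel."""
    patches = F.unfold(img, kernel_size=k, padding=k // 2)   # (B, k*k, H*W) local neighborhoods
    b, _, h, w = img.shape
    weights = kernels.reshape(b, k * k, h * w)
    out = (patches * weights).sum(dim=1)                     # per-pixel weighted sum
    return out.reshape(b, 1, h, w)

img = torch.rand(2, 1, 32, 32)
kernels = torch.softmax(torch.randn(2, 9, 32, 32), dim=1)    # normalized per-pixel filters
print(apply_pixelwise_kernels(img, kernels).shape)           # torch.Size([2, 1, 32, 32])
```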
https://arxiv.org/abs/2405.09964
The maturity classification of specialty crops such as strawberries and tomatoes is an essential agricultural downstream activity for selective harvesting and quality control (QC) at production and packaging sites. Recent advancements in Deep Learning (DL) have produced encouraging results on color images for maturity classification applications. However, hyperspectral imaging (HSI) outperforms methods based on color vision. Multivariate analysis methods and Convolutional Neural Networks (CNN) deliver promising results; however, the large amount of input data and the associated preprocessing requirements hinder practical application. Conventionally, the reflectance intensity in a given electromagnetic spectrum is employed in estimating fruit maturity. We present a feature extraction method and empirically demonstrate that the peak reflectance within the 500-670 nm pigment band together with the wavelength of that peak, and conversely the trough reflectance and its corresponding wavelength within the 671-790 nm chlorophyll band, are convenient to compute yet distinctive features for maturity classification. The proposed feature selection method is beneficial because preprocessing, such as dimensionality reduction, is avoided before every prediction. The feature set is designed to capture these traits. The best SOTA methods, among 3D-CNN, 1D-CNN, and SVM, achieve at most 90.0% accuracy for strawberries and 92.0% for tomatoes on our dataset. Results show that the proposed method outperforms the SOTA, as it yields an accuracy above 98.0% in strawberry and 96.0% in tomato classification. A comparative analysis of the time efficiency of these methods shows that the proposed method performs prediction at 13 Frames Per Second (FPS), compared to the maximum of 1.16 FPS attained by the full-spectrum SVM classifier.
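A sketch of the four spectral features described above: peak reflectance and its wavelength in the 500-670 nm pigment band, and trough reflectance and its wavelength in the 671-790 nm chlorophyll band. The synthetic spectrum is for illustration only.

```python
import numpy as np

def maturity_features(wavelengths, reflectance):
    """wavelengths, reflectance: 1-D arrays for one pixel/region spectrum."""
    pigment = (wavelengths >= 500) & (wavelengths <= 670)
    chloro = (wavelengths > 670) & (wavelengths <= 790)
    i_peak = np.argmax(reflectance[pigment])        # peak within the pigment band
    i_trough = np.argmin(reflectance[chloro])       # trough within the chlorophyll band
    return np.array([
        reflectance[pigment][i_peak],  wavelengths[pigment][i_peak],
        reflectance[chloro][i_trough], wavelengths[chloro][i_trough],
    ])

wl = np.arange(400, 1000, 5.0)                      # synthetic spectrum, not real data
spectrum = (0.3 + 0.2 * np.exp(-((wl - 600) ** 2) / 2e3)
                - 0.15 * np.exp(-((wl - 720) ** 2) / 2e3))
print(maturity_features(wl, spectrum))              # [peak_R, peak_wl, trough_R, trough_wl]
```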
https://arxiv.org/abs/2405.09955
Previous unsupervised anomaly detection (UAD) methods often struggle with significant intra-class diversity; i.e., a class in a dataset contains multiple subclasses, which we categorize as Feature-Rich Anomaly Detection Datasets (FRADs). This is evident in applications such as unified setting and unmanned supermarket scenarios. To address this challenge, we developed MiniMaxAD: a lightweight autoencoder designed to efficiently compress and memorize extensive information from normal images. Our model utilizes a large kernel convolutional network equipped with a Global Response Normalization (GRN) unit and employs a multi-scale feature reconstruction strategy. The GRN unit significantly increases the upper limit of the network's capacity, while the large kernel convolution facilitates the extraction of highly abstract patterns, leading to compact normal feature modeling. Additionally, we introduce an Adaptive Contraction Loss (ADCLoss), tailored to FRADs to overcome the limitations of global cosine distance loss. MiniMaxAD was comprehensively tested across six challenging UAD benchmarks, achieving state-of-the-art results in four and highly competitive outcomes in the remaining two. Notably, our model achieved a detection AUROC of up to 97.0% in ViSA under the unified setting. Moreover, it not only achieved state-of-the-art performance in unmanned supermarket tasks but also exhibited an inference speed 37 times faster than the previous best method, demonstrating its effectiveness in complex UAD tasks.
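For reference, a Global Response Normalization unit in the style of ConvNeXt V2, written for channels-first tensors; whether MiniMaxAD uses exactly this formulation is an assumption.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization: per-channel spatial energy, divisive normalization
    across channels, then a learnable gate with a residual connection."""
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x):                                        # x: (B, C, H, W)
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)        # per-channel spatial L2 energy
        nx = gx / (gx.mean(dim=1, keepdim=True) + self.eps)      # normalize across channels
        return self.gamma * (x * nx) + self.beta + x

print(GRN(64)(torch.randn(2, 64, 32, 32)).shape)                 # torch.Size([2, 64, 32, 32])
```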
https://arxiv.org/abs/2405.09933
Multi-line LiDAR is widely used in autonomous vehicles, so point cloud-based 3D detectors are essential for autonomous driving. Extracting rich multi-scale features is crucial for point cloud-based 3D detectors in autonomous driving due to significant differences in the size of different types of objects. However, due to the real-time requirements, large-size convolution kernels are rarely used to extract large-scale features in the backbone. Current 3D detectors commonly use feature pyramid networks to obtain large-scale features; however, some objects containing fewer point clouds are further lost during downsampling, resulting in degraded performance. Since pillar-based schemes require much less computation than voxel-based schemes, they are more suitable for constructing real-time 3D detectors. Hence, we propose PillarNeXt, a pillar-based scheme. We redesigned the feature encoding, the backbone, and the neck of the 3D detector. We propose Voxel2Pillar feature encoding, which uses a sparse convolution constructor to construct pillars with richer point cloud features, especially height features. Moreover, additional learnable parameters are added, which enables the initial pillar to achieve higher performance capabilities. We extract multi-scale and large-scale features in the proposed fully sparse backbone, which does not utilize large-size convolutional kernels; the backbone consists of the proposed multi-scale feature extraction module. The neck consists of the proposed sparse ConvNeXt, whose simple structure significantly improves the performance. The effectiveness of the proposed PillarNeXt is validated on the Waymo Open Dataset, and object detection accuracy for vehicles, pedestrians, and cyclists is improved; we also verify the effectiveness of each proposed module in detail.
https://arxiv.org/abs/2405.09828
Spoken language interaction is at the heart of interpersonal communication, and people flexibly adapt their speech to different individuals and environments. It is surprising that robots, and by extension other digital devices, are not equipped to adapt their speech and instead rely on fixed speech parameters, which often hinder comprehension by the user. We conducted a speech comprehension study involving 39 participants who were exposed to different environmental and contextual conditions. During the experiment, the robot articulated words using different vocal parameters, and the participants were tasked with both recognising the spoken words and rating their subjective impression of the robot's speech. The experiment's primary outcome shows that spaces with good acoustic quality positively correlate with intelligibility and user experience. However, increasing the distance between the user and the robot degraded the user experience, while distracting background sounds significantly reduced speech recognition accuracy and user satisfaction. We next built an adaptive voice for the robot. For this, the robot needs to know how difficult it is for a user to understand spoken language in a particular setting. We present a prediction model that rates how annoying the ambient acoustic environment is and, consequently, how hard it is to understand someone in this setting. Then, we develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces, while taking into account the influence of ambient acoustics on intelligibility. Finally, we present an evaluation with 27 users, demonstrating superior intelligibility and user experience with adaptive voice parameters compared to a fixed voice.
https://arxiv.org/abs/2405.09708
We develop a theory of neural synaptic balance and how it can emerge or be enforced in neural networks. For a given additive cost function $R$ (regularizer), a neuron is said to be in balance if the total cost of its input weights is equal to the total cost of its output weights. The basic example is provided by feedforward networks of ReLU units trained with $L_2$ regularizers, which exhibit balance after proper training. The theory explains this phenomenon and extends it in several directions. The first direction is the extension to bilinear and other activation functions. The second direction is the extension to more general regularizers, including all $L_p$ ($p>0$) regularizers. The third direction is the extension to non-layered architectures, recurrent architectures, convolutional architectures, as well as architectures with mixed activation functions. The theory is based on two local neuronal operations: scaling which is commutative, and balancing which is not commutative. Finally, and most importantly, given any initial set of weights, when local balancing operations are applied to each neuron in a stochastic manner, global order always emerges through the convergence of the stochastic balancing algorithm to the same unique set of balanced weights. The reason for this convergence is the existence of an underlying strictly convex optimization problem where the relevant variables are constrained to a linear, only architecture-dependent, manifold. The theory is corroborated through various simulations carried out on benchmark data sets. Scaling and balancing operations are entirely local and thus physically plausible in biological and neuromorphic networks.
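The local balancing operation described above can be made concrete for a single ReLU unit under an L_p cost: rescale incoming weights by lambda and outgoing weights by 1/lambda (which leaves the network function unchanged for positively homogeneous activations), with lambda chosen so the two costs become equal. Bias handling and the stochastic scheduling of these operations are omitted.

```python
import numpy as np

def balance_neuron(w_in, w_out, p=2.0):
    """Return rescaled incoming/outgoing weight vectors with equal L_p costs."""
    c_in = np.sum(np.abs(w_in) ** p)
    c_out = np.sum(np.abs(w_out) ** p)
    lam = (c_out / c_in) ** (1.0 / (2.0 * p))   # equalizes lam**p * c_in and c_out / lam**p
    return w_in * lam, w_out / lam

w_in, w_out = np.random.randn(128), np.random.randn(64)
b_in, b_out = balance_neuron(w_in, w_out)
print(np.sum(b_in ** 2), np.sum(b_out ** 2))    # equal L2 costs after balancing
```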
https://arxiv.org/abs/2405.09688
The multi-scale receptive field and large kernel attention (LKA) module have been shown to significantly improve performance in the lightweight image super-resolution task. However, existing lightweight super-resolution (SR) methods seldom pay attention to designing efficient building blocks with multi-scale receptive fields for local modeling, and their LKA modules face a quadratic increase in computational and memory footprints as the convolutional kernel size increases. To address the first issue, we propose multi-scale blueprint separable convolutions (MBSConv) as a highly efficient building block with a multi-scale receptive field; it focuses on learning multi-scale information, which is a vital component of discriminative representation. As for the second issue, we revisit the key properties of LKA and find that the adjacent direct interaction of local information and long-distance dependencies is crucial for remarkable performance. Taking this into account, and in order to mitigate the complexity of LKA, we propose a large coordinate kernel attention (LCKA) module which decomposes the 2D convolutional kernels of the depth-wise convolutional layers in LKA into horizontal and vertical 1-D kernels. LCKA enables the adjacent direct interaction of local information and long-distance dependencies not only in the horizontal direction but also in the vertical. Besides, LCKA allows the direct use of extremely large kernels in the depth-wise convolutional layers to capture more contextual information, which helps to significantly improve reconstruction performance, while incurring lower computational complexity and memory footprints. Integrating MBSConv and LCKA, we propose a large coordinate kernel attention network (LCAN).
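The kernel decomposition at the heart of LCKA, replacing a large 2-D depth-wise convolution with horizontal and vertical 1-D depth-wise convolutions, can be sketched as below; the surrounding attention structure (point-wise convolutions, dilation, gating) is omitted and the sizes are illustrative.

```python
import torch
import torch.nn as nn

def coordinate_dw_conv(channels, k):
    """Horizontal then vertical 1-D depth-wise convs approximating a k x k depth-wise kernel."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
        nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
    )

full_dw = nn.Conv2d(64, 64, 31, padding=15, groups=64)
decomposed = coordinate_dw_conv(64, 31)
print(sum(p.numel() for p in full_dw.parameters()),      # 64*31*31 weights + 64 biases
      sum(p.numel() for p in decomposed.parameters()))   # 2*(64*31 weights + 64 biases)
x = torch.randn(1, 64, 48, 48)
print(full_dw(x).shape, decomposed(x).shape)              # same spatial size either way
```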
https://arxiv.org/abs/2405.09353
AI-based analysis of histopathology whole slide images (WSIs) is central in computational pathology. However, image quality can impact model performance. Here, we investigate to what extent unsharp areas of WSIs impact deep convolutional neural network classification performance. We propose a multi-model approach, i.e. DeepBlurMM, to alleviate the impact of unsharp image areas and improve the model performance. DeepBlurMM uses the sigma cut-offs to determine the most suitable model for predicting tiles with various levels of blurring within a single WSI, where sigma is the standard deviation of the Gaussian distribution. Specifically, the cut-offs categorise the tiles into sharp or slight blur, moderate blur, and high blur. Each blur level has a corresponding model to be selected for tile-level predictions. Throughout the simulation study, we demonstrated the application of DeepBlurMM in a binary classification task for breast cancer Nottingham Histological Grade 1 vs 3. Performance, evaluated over 5-fold cross-validation, showed that DeepBlurMM outperformed the base model under moderate blur and mixed blur conditions. Unsharp image tiles (local blurriness) at prediction time reduced model performance. The proposed multi-model approach improved performance under some conditions, with the potential to improve quality in both research and clinical applications.
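The per-tile routing can be sketched as a simple cut-off rule on an estimated blur sigma; the cut-off values and the choice of sigma estimator below are illustrative assumptions, not the values used in the paper.

```python
def select_model(sigma, models, cutoffs=(0.5, 1.5)):
    """Pick the model for a tile from its estimated Gaussian blur sigma.
    models: dict with keys 'sharp', 'moderate', 'high'."""
    if sigma <= cutoffs[0]:
        return models["sharp"]        # sharp or only slightly blurred tiles
    if sigma <= cutoffs[1]:
        return models["moderate"]
    return models["high"]

models = {"sharp": "model_sharp", "moderate": "model_moderate", "high": "model_high"}
for s in (0.2, 1.0, 2.3):
    print(s, "->", select_model(s, models))
```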
https://arxiv.org/abs/2405.09298
We propose a new graph convolutional block, called MusGConv, specifically designed for the efficient processing of musical score data and motivated by general perceptual principles. It focuses on two fundamental dimensions of music, pitch and rhythm, and considers both relative and absolute representations of these components. We evaluate our approach on four different musical understanding problems: monophonic voice separation, harmonic analysis, cadence detection, and composer identification, which, in abstract terms, translate to different graph learning problems, namely node classification, link prediction, and graph classification. Our experiments demonstrate that MusGConv improves performance on three of the aforementioned tasks while being conceptually very simple and efficient. We interpret this as evidence that it is beneficial to include perception-informed processing of fundamental musical concepts when developing graph network applications on musical score data.
https://arxiv.org/abs/2405.09224
Due to the increasing need for effective security measures and the integration of cameras in commercial products, a huge amount of visual data is created today. Law enforcement agencies (LEAs) are inspecting images and videos to find radicalization, propaganda for terrorist organizations and illegal products on darknet markets. This is time consuming. Instead of an undirected search, LEAs would like to adapt to new crimes and threats, and focus only on data from specific locations, persons or objects, which requires flexible interpretation of image content. Visual concept detection with deep convolutional neural networks (CNNs) is a crucial component to understand the image content. This paper has five contributions. The first contribution allows image-based geo-localization to estimate the origin of an image. CNNs and geotagged images are used to create a model that determines the location of an image by its pixel values. The second contribution enables analysis of fine-grained concepts to distinguish sub-categories in a generic concept. The proposed method encompasses data acquisition and cleaning and concept hierarchies. The third contribution is the recognition of person attributes (e.g., glasses or moustache) to enable query by textual description for a person. The person-attribute problem is treated as a specific sub-task of concept classification. The fourth contribution is an intuitive image annotation tool based on active learning. Active learning allows users to define novel concepts flexibly and train CNNs with minimal annotation effort. The fifth contribution increases the flexibility for LEAs in the query definition by using query expansion. Query expansion maps user queries to known and detectable concepts. Therefore, no prior knowledge of the detectable concepts is required for the users. The methods are validated on data with varying locations (popular and non-touristic locations), varying person attributes (CelebA dataset), and varying number of annotations.
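A toy sketch of the query-expansion step: free-text query terms are mapped to the closest known, detectable concepts. A real system would use semantic embeddings rather than string similarity, and the concept list here is purely illustrative.

```python
import difflib

DETECTABLE_CONCEPTS = ["sunglasses", "moustache", "beach", "mountain", "weapon", "truck"]

def expand_query(query, cutoff=0.6):
    """Map each query term to similar detectable concepts (toy string-similarity version)."""
    expanded = set()
    for term in query.lower().split():
        expanded.update(difflib.get_close_matches(term, DETECTABLE_CONCEPTS, n=3, cutoff=cutoff))
    return sorted(expanded)

print(expand_query("man with mustache near mountains"))   # e.g. ['mountain', 'moustache']
```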
https://arxiv.org/abs/2405.09194