Federated learning (FL) represents a pivotal shift in machine learning (ML) as it enables collaborative training of local ML models coordinated by a central aggregator, all without the need to exchange local data. However, its application on edge devices is hindered by limited computational capabilities and data communication challenges, compounded by the inherent complexity of Deep Learning (DL) models. Model pruning is identified as a key technique for compressing DL models on devices with limited resources. Nonetheless, conventional pruning techniques typically rely on manually crafted heuristics and demand human expertise to achieve a balance between model size, speed, and accuracy, often resulting in sub-optimal solutions. In this study, we introduce an automated federated learning approach utilizing informed pruning, called AutoFLIP, which dynamically prunes and compresses DL models within both the local clients and the global server. It leverages a federated loss exploration phase to investigate model gradient behavior across diverse datasets and losses, providing insights into parameter significance. Our experiments showcase notable enhancements in scenarios with strong non-IID data, underscoring AutoFLIP's capacity to tackle computational constraints and achieve superior global convergence.
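The abstract does not detail AutoFLIP's scoring rule; purely as a minimal sketch, assuming parameter significance is derived from gradient magnitudes averaged across clients during the federated loss-exploration phase (all names below are illustrative, not the paper's API):

```python
def explore_importance(client_gradients):
    """Loss-exploration sketch: average per-parameter gradient magnitudes
    reported by the clients into an importance score (a larger score means
    the parameter matters more across datasets and losses)."""
    n_clients = len(client_gradients)
    scores = [0.0] * len(client_gradients[0])
    for grads in client_gradients:
        for i, g in enumerate(grads):
            scores[i] += abs(g)
    return [s / n_clients for s in scores]

def prune_mask(scores, prune_ratio):
    """Keep-mask that zeroes out the fraction of parameters with the
    lowest importance scores."""
    k = int(len(scores) * prune_ratio)
    ordered = sorted(scores)
    cutoff = ordered[k] if k < len(ordered) else ordered[-1] + 1.0
    return [1 if s >= cutoff else 0 for s in scores]
```

A server could apply such a mask to both the global model and the clients' local copies before each communication round.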
https://arxiv.org/abs/2405.10271
Deep learning models, particularly Convolutional Neural Networks (CNNs), have demonstrated exceptional performance in diagnosing skin diseases, often outperforming dermatologists. However, they have also unveiled biases linked to specific demographic traits, notably concerning diverse skin tones or gender, prompting concerns regarding fairness and limiting their widespread deployment. Researchers are actively working to ensure fairness in AI-based solutions, but existing methods incur an accuracy loss when striving for fairness. To solve this issue, we propose an approach based on "two biased teachers" (i.e., teachers biased on different sensitive attributes) to transfer fair knowledge into the student network. Our approach mitigates biases present in the student network without harming its predictive accuracy. In fact, in most cases, our approach improves the accuracy of the baseline model. To achieve this goal, we developed a weighted loss function comprising biasing and debiasing loss terms. Our method surpasses available state-of-the-art approaches in fairness while also improving accuracy. The proposed approach has been evaluated and validated on two dermatology datasets using standard accuracy and fairness evaluation measures. We will make the source code publicly available to foster reproducibility and future research.
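The exact form of the weighted loss is not given in the abstract; a minimal sketch, assuming the biasing and debiasing terms are KL-divergence distillation terms toward the two teachers' output distributions (the weights and function names are hypothetical):

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def two_teacher_loss(student_probs, label, teacher_a_probs, teacher_b_probs,
                     w_ce=1.0, w_a=0.5, w_b=0.5):
    """Weighted loss sketch: task cross-entropy plus two distillation terms
    from teachers biased on different sensitive attributes. The weights
    w_ce, w_a, w_b are illustrative placeholders, not the paper's values."""
    ce = -math.log(student_probs[label] + 1e-12)
    return (w_ce * ce
            + w_a * kl_div(teacher_a_probs, student_probs)
            + w_b * kl_div(teacher_b_probs, student_probs))
```

In practice the distillation terms would act on softened logits of a real network; this scalar version only shows how the terms combine.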
https://arxiv.org/abs/2405.10256
The Learning-to-match (LTM) framework proves to be an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, necessitating the use of the entire dataset each time the parameters of the ground metric are updated. To adapt LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems, which leverages mini-batch subsampling and a Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant using partial optimal transport to mitigate the harm of misaligned data pairs. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning a rich and expressive joint embedding space, which achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap between audio and text embeddings, surpassing both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in the AudioCaps training data. Our code is available at this https URL
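As an illustration of the Mahalanobis-enhanced ground metric, a pairwise mini-batch cost matrix could be assembled as follows (a plain-Python sketch; `M` stands for a learnable positive semi-definite matrix, and the batch and function names are ours, not the paper's):

```python
def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis ground metric d_M(x, y)^2 = (x - y)^T M (x - y),
    where M is a (learnable) positive semi-definite matrix."""
    d = [xi - yi for xi, yi in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return sum(di * mdi for di, mdi in zip(d, Md))

def cost_matrix(audio_batch, text_batch, M):
    """Mini-batch pairwise ground-cost matrix of the kind an (entropic or
    partial) optimal transport solver would consume in the m-LTM objective."""
    return [[mahalanobis_sq(a, t, M) for t in text_batch] for a in audio_batch]
```

With `M` set to the identity this reduces to squared Euclidean cost; learning `M` is what makes the metric "Mahalanobis-enhanced".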
https://arxiv.org/abs/2405.10084
Deformable image registration (alignment) is highly sought after in numerous clinical applications, such as computer-aided diagnosis and disease progression analysis. Deep Convolutional Neural Network (DCNN)-based image registration methods have demonstrated advantages in terms of registration accuracy and computational speed. However, while most methods excel at global alignment, they often perform worse in aligning local regions. To address this challenge, this paper proposes a mask-guided encoder-decoder DCNN-based image registration method, named MrRegNet. This approach employs a multi-resolution encoder for feature extraction and subsequently estimates multi-resolution displacement fields in the decoder to handle substantial image deformations. Furthermore, segmentation masks are employed to direct the model's attention toward aligning local regions. The results show that the proposed method outperforms traditional methods such as Demons and a well-known deep learning method, VoxelMorph, on a public 3D brain MRI dataset (OASIS) and a local 2D brain MRI dataset with large deformations. Importantly, the image alignment accuracy is significantly improved at local regions guided by segmentation masks. GitHub link: this https URL.
https://arxiv.org/abs/2405.10068
The accelerated progress of artificial intelligence (AI) has popularized deep learning models across domains, yet their inherent opacity poses challenges, notably in critical fields like healthcare, medicine, and the geosciences. Explainable AI (XAI) has emerged to shed light on these "black box" models, helping decipher their decision-making process. Nevertheless, different XAI methods yield highly different explanations. This inter-method variability increases uncertainty and lowers trust in deep networks' predictions. In this study, for the first time, we propose a novel framework designed to enhance the explainability of deep networks by maximizing both the accuracy and the comprehensibility of the explanations. Our framework integrates various explanations from established XAI methods and employs a non-linear "explanation optimizer" to construct a unique and optimal explanation. Through experiments on multi-class and binary classification tasks in 2D object and 3D neuroscience imaging, we validate the efficacy of our approach. Our explanation optimizer achieved superior faithfulness scores, averaging 155% and 63% higher than the best-performing XAI method in the 3D and 2D applications, respectively. Additionally, our approach yielded lower complexity, increasing comprehensibility. Our results suggest that optimal explanations based on specific criteria are derivable and address the issue of inter-method variability in the current XAI literature.
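To make the idea of an explanation optimizer concrete, here is a deliberately tiny sketch that searches convex weights over two candidate attribution maps, trading a faithfulness proxy against a complexity penalty; the actual framework's non-linear optimizer and evaluation metrics are more sophisticated, and every name here is ours:

```python
def combine(maps, weights):
    """Weighted combination of per-pixel attribution maps."""
    return [sum(w * m[i] for w, m in zip(weights, maps))
            for i in range(len(maps[0]))]

def objective(combined, faithfulness_fn, complexity_weight=0.1):
    """Score = faithfulness of the combined map minus a sparsity-style
    complexity penalty (both terms are simplified proxies)."""
    complexity = sum(abs(v) for v in combined) / len(combined)
    return faithfulness_fn(combined) - complexity_weight * complexity

def optimize_weights(maps, faithfulness_fn, steps=21):
    """Toy 1-D grid search over convex weights for two candidate explanations."""
    best_w, best_score = None, float("-inf")
    for k in range(steps):
        w = k / (steps - 1)
        score = objective(combine(maps, [w, 1.0 - w]), faithfulness_fn)
        if score > best_score:
            best_w, best_score = [w, 1.0 - w], score
    return best_w, best_score
```

In a real setting `faithfulness_fn` would measure how much the model's prediction degrades when high-attribution pixels are perturbed.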
https://arxiv.org/abs/2405.10008
Automated medical image analysis systems often require large amounts of training data with high-quality labels, which are difficult and time-consuming to generate. This paper introduces Radiology Objects in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, adding 35,705 new images that have appeared in PMC since 2018. It further provides manually curated concepts for imaging modalities, with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using the Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models and for the evaluation of deep learning models on multi-task learning.
https://arxiv.org/abs/2405.10004
The maturity classification of specialty crops such as strawberries and tomatoes is an essential agricultural downstream activity for selective harvesting and quality control (QC) at production and packaging sites. Recent advancements in Deep Learning (DL) have produced encouraging results for maturity classification on color images. However, hyperspectral imaging (HSI) outperforms methods based on color vision. Multivariate analysis methods and Convolutional Neural Networks (CNN) deliver promising results; however, the large amount of input data and the associated preprocessing requirements hinder practical application. Conventionally, the reflectance intensity in a given electromagnetic spectrum is employed to estimate fruit maturity. We present a feature extraction method and empirically demonstrate that the peak reflectance within the 500-670 nm subband (pigment band) and the wavelength at which that peak occurs, and conversely, the trough reflectance and its corresponding wavelength within the 671-790 nm subband (chlorophyll band), are convenient to compute yet distinctive features for maturity classification. The proposed feature selection method is beneficial because preprocessing, such as dimensionality reduction, is avoided before every prediction. The feature set is designed to capture these traits. The best SOTA methods among 3D-CNN, 1D-CNN, and SVM achieve at most 90.0% accuracy for strawberries and 92.0% for tomatoes on our dataset. Results show that the proposed method outperforms the SOTA, yielding accuracies above 98.0% for strawberry and 96.0% for tomato classification. A comparative analysis of time efficiency is also conducted: the proposed method performs prediction at 13 frames per second (FPS), compared to the maximum of 1.16 FPS attained by the full-spectrum SVM classifier.
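The four proposed features follow directly from the band description above; a minimal extraction sketch for a single spectrum (band limits are taken from the abstract, function names are ours):

```python
def band_slice(wavelengths, reflectance, lo, hi):
    """Keep only the (wavelength, reflectance) samples inside [lo, hi] nm."""
    return [(w, r) for w, r in zip(wavelengths, reflectance) if lo <= w <= hi]

def maturity_features(wavelengths, reflectance):
    """Four features per spectrum: peak reflectance and its wavelength in
    the 500-670 nm pigment band, and trough reflectance and its wavelength
    in the 671-790 nm chlorophyll band."""
    pigment = band_slice(wavelengths, reflectance, 500, 670)
    chloro = band_slice(wavelengths, reflectance, 671, 790)
    peak_w, peak_r = max(pigment, key=lambda p: p[1])
    trough_w, trough_r = min(chloro, key=lambda p: p[1])
    return [peak_r, peak_w, trough_r, trough_w]
```

Because the features are simple extrema of two fixed subbands, no dimensionality reduction is needed before prediction, which is the efficiency advantage the abstract claims.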
https://arxiv.org/abs/2405.09955
3D face registration is an important process in which a 3D face model is aligned and mapped to a template face. However, the task of 3D face registration becomes particularly challenging when dealing with partial face data, where only limited facial information is available. To address this challenge, this paper presents a novel deep learning-based approach that combines quasi-conformal geometry with deep neural networks for partial face registration. The proposed framework begins with a Landmark Detection Network that utilizes curvature information to detect the presence of facial features and estimate their corresponding coordinates. These facial landmark features serve as essential guidance for the registration process. To establish a dense correspondence between the partial face and the template surface, a registration network based on quasiconformal theories is employed. The registration network establishes a bijective quasiconformal surface mapping aligning corresponding partial faces based on detected landmarks and curvature values. It consists of the Coefficients Prediction Network, which outputs the optimal Beltrami coefficient representing the surface mapping. The Beltrami coefficient quantifies the local geometric distortion of the mapping. By controlling the magnitude of the Beltrami coefficient through a suitable activation function, the bijectivity and geometric distortion of the mapping can be controlled. The Beltrami coefficient is then fed into the Beltrami solver network to reconstruct the corresponding mapping. The surface registration enables the acquisition of corresponding regions and the establishment of point-wise correspondence between different partial faces, facilitating precise shape comparison through the evaluation of point-wise geometric differences at these corresponding regions. Experimental results demonstrate the effectiveness of the proposed method.
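The magnitude control described above can be illustrated directly: a quasiconformal map stays bijective when the supremum norm of its Beltrami coefficient is below 1, so a squashing activation can enforce that bound while preserving the coefficient's argument. The tanh choice below is one plausible activation, not necessarily the paper's:

```python
import math

def constrain_beltrami(mu_raw, bound=0.999):
    """Rescale a raw complex Beltrami coefficient so that |mu| < bound < 1,
    which keeps the associated quasiconformal map bijective. The phase
    (direction of local distortion) is preserved; only the magnitude is
    squashed. Illustrative sketch, not the paper's exact activation."""
    r = abs(mu_raw)
    if r == 0.0:
        return 0j
    return bound * math.tanh(r) * (mu_raw / r)
```

A network head could emit `mu_raw` freely and pass it through this function before the Beltrami solver reconstructs the mapping.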
https://arxiv.org/abs/2405.09880
The study of astronomical phenomena through ground-based observations is always challenged by the distorting effects of Earth's atmosphere. Traditional methods of post-facto image correction, essential for removing these distortions, often rely on simplifying assumptions that limit their effectiveness, particularly in the presence of spatially variant atmospheric turbulence. Such cases are often solved by partitioning the field of view into small patches, deconvolving each patch independently, and merging all patches together. This approach is often inefficient and can produce artifacts. Recent advancements in computational techniques and the advent of deep learning offer new pathways to address these limitations. This paper introduces a novel framework leveraging a deep neural network to emulate spatially variant convolutions, offering a breakthrough in the efficiency and accuracy of astronomical image deconvolution. By training on a dataset of images convolved with spatially invariant point spread functions and validating its generalizability to spatially variant conditions, this approach presents a significant advancement over traditional methods. The convolution emulator is used as a forward model in a multi-object multi-frame blind deconvolution algorithm for solar images. The emulator enables the deconvolution of solar observations across large fields of view without resorting to patch-wise mosaicking, thus avoiding the artifacts associated with such techniques. This method represents a significant computational advantage, reducing processing times by orders of magnitude.
https://arxiv.org/abs/2405.09864
Box-free model watermarking is an emerging technique to safeguard the intellectual property of deep learning models, particularly those for low-level image processing tasks. Existing works have verified and improved its effectiveness in several aspects. However, in this paper, we reveal that box-free model watermarking is prone to removal attacks, even under a real-world threat model in which both the protected model and the watermark extractor are black boxes. Under this setting, we carry out three studies. 1) We develop an extractor-gradient-guided (EGG) remover and show its effectiveness when the extractor uses ReLU activations only. 2) More generally, for an unknown extractor, we leverage adversarial attacks and design the EGG remover based on estimated gradients. 3) Under the most stringent condition that the extractor is inaccessible, we design a transferable remover based on a set of private proxy models. In all cases, the proposed removers can successfully remove embedded watermarks while preserving the quality of the processed images, and we also demonstrate that the EGG remover can even replace the watermarks. Extensive experimental results verify the effectiveness and generalizability of the proposed attacks, revealing the vulnerabilities of existing box-free methods and calling for further research.
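For study 2), gradients of a black-box extractor must be estimated rather than computed analytically; one standard way to do this (a generic sketch, not necessarily the paper's estimator) is a central finite-difference approximation of the gradient of a scalar query function:

```python
def estimate_gradient(f, x, eps=1e-4):
    """Black-box gradient estimate of a scalar function f at point x via
    central differences: g_i ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).
    This is the kind of surrogate a gradient-guided remover can descend
    when the watermark extractor exposes only query access."""
    grad = []
    for i in range(len(x)):
        xp, xm = x[:], x[:]
        xp[i] += eps
        xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad
```

In a removal attack, `f` would score watermark presence in the extractor's output, and the image would be nudged against the estimated gradient while a fidelity constraint keeps visual quality intact.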
https://arxiv.org/abs/2405.09863
While neural approaches using deep learning are the state-of-the-art for natural language processing (NLP) today, pre-neural algorithms and approaches still find a place in NLP textbooks and courses of recent years. In this paper, we compare two introductory NLP courses taught in Australia and India, and examine how Transformer and pre-neural approaches are balanced within the lecture plan and assessments of the courses. We also draw parallels with the objects-first and objects-later debate in CS1 education. We observe that pre-neural approaches add value to student learning by building an intuitive understanding of NLP problems, potential solutions and even Transformer-based models themselves. Despite pre-neural approaches not being state-of-the-art, the paper makes a case for their inclusion in NLP courses today.
https://arxiv.org/abs/2405.09854
In this work, we present Semantic Gesticulator, a novel framework designed to synthesize realistic gestures accompanying speech with strong semantic correspondence. Semantically meaningful gestures are crucial for effective non-verbal communication, but such gestures often fall within the long tail of the distribution of natural human motion. The sparsity of these movements makes it challenging for deep learning-based systems, trained on moderately sized datasets, to capture the relationship between the movements and the corresponding speech semantics. To address this challenge, we develop a generative retrieval framework based on a large language model. This framework efficiently retrieves suitable semantic gesture candidates from a motion library in response to the input speech. To construct this motion library, we summarize a comprehensive list of commonly used semantic gestures based on findings in linguistics, and we collect a high-quality motion dataset encompassing both body and hand movements. We also design a novel GPT-based model with strong generalization capabilities to audio, capable of generating high-quality gestures that match the rhythm of speech. Furthermore, we propose a semantic alignment mechanism to efficiently align the retrieved semantic gestures with the GPT's output, ensuring the naturalness of the final animation. Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit, as evidenced by a comprehensive collection of examples. User studies confirm the quality and human-likeness of our results, and show that our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
https://arxiv.org/abs/2405.09814
In the era of space exploration, coronal holes on the sun play a significant role due to their impact on satellites and aircraft through their open magnetic fields and increased solar wind emissions. This study employs computer vision techniques to detect coronal hole regions and estimate their sizes using imagery from the Solar Dynamics Observatory (SDO). Additionally, we utilize deep learning methods, specifically Long Short-Term Memory (LSTM) networks, to analyze trends in the area of coronal holes and predict their areas across various solar regions over a span of seven days. By examining time series data, we aim to identify patterns in coronal hole behavior and understand their potential effects on space weather. This research enhances our ability to anticipate and prepare for space weather events that could affect Earth's technological systems.
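For the seven-day forecasting setup, the daily area series must be cut into supervised (input, target) windows before an LSTM can be trained; a minimal sketch (the 7-day horizon comes from the abstract, while the 14-day input length is our illustrative assumption):

```python
def make_windows(series, input_len=14, horizon=7):
    """Split a daily coronal-hole-area series into (input, target) pairs:
    `input_len` past days are used to predict the next `horizon` days."""
    samples = []
    for i in range(len(series) - input_len - horizon + 1):
        x = series[i:i + input_len]
        y = series[i + input_len:i + input_len + horizon]
        samples.append((x, y))
    return samples
```

Each pair would then be fed to the LSTM as one training example, with one series per solar region.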
https://arxiv.org/abs/2405.09802
Correspondence-based statistical shape modeling (SSM) stands as a powerful technology for morphometric analysis in clinical research. SSM facilitates population-level characterization and quantification of anatomical shapes such as bones and organs, aiding in pathology and disease diagnostics and treatment planning. Despite its potential, SSM remains under-utilized in medical research due to the significant overhead associated with automatic construction methods, which demand complete, aligned shape surface representations. Additionally, optimization-based techniques rely on bias-inducing assumptions or templates and have prolonged inference times as the entire cohort is simultaneously optimized. To overcome these challenges, we introduce Point2SSM++, a principled, self-supervised deep learning approach that directly learns correspondence points from point cloud representations of anatomical shapes. Point2SSM++ is robust to misaligned and inconsistent input, providing SSM that accurately samples individual shape surfaces while effectively capturing population-level statistics. Additionally, we present principled extensions of Point2SSM++ to adapt it for dynamic spatiotemporal and multi-anatomy use cases, demonstrating the broad versatility of the Point2SSM++ framework. Through extensive validation across diverse anatomies, evaluation metrics, and clinically relevant downstream tasks, we demonstrate Point2SSM++'s superiority over existing state-of-the-art deep learning models and traditional approaches. Point2SSM++ substantially enhances the feasibility of SSM generation and significantly broadens its array of potential clinical applications.
https://arxiv.org/abs/2405.09707
Anatomical shape analysis plays a pivotal role in clinical research and hypothesis testing, where the relationship between form and function is paramount. Correspondence-based statistical shape modeling (SSM) facilitates population-level morphometrics but requires a cumbersome, potentially bias-inducing construction pipeline. Recent advancements in deep learning have streamlined this process in inference by providing SSM prediction directly from unsegmented medical images. However, the proposed approaches are fully supervised and require utilizing a traditional SSM construction pipeline to create training data, thus inheriting the associated burdens and limitations. To address these challenges, we introduce a weakly supervised deep learning approach to predict SSM from images using point cloud supervision. Specifically, we propose reducing the supervision associated with the state-of-the-art fully Bayesian variational information bottleneck DeepSSM (BVIB-DeepSSM) model. BVIB-DeepSSM is an effective, principled framework for predicting probabilistic anatomical shapes from images with quantification of both aleatoric and epistemic uncertainties. Whereas the original BVIB-DeepSSM method requires strong supervision in the form of ground truth correspondence points, the proposed approach utilizes weak supervision via point cloud surface representations, which are more readily obtainable. Furthermore, the proposed approach learns correspondence in a completely data-driven manner without prior assumptions about the expected variability in shape cohort. Our experiments demonstrate that this approach yields similar accuracy and uncertainty estimation to the fully supervised scenario while substantially enhancing the feasibility of model training for SSM construction.
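Weak supervision via point cloud surface representations is typically enforced with a set-to-set loss such as the Chamfer distance; the sketch below shows that standard loss on tiny 2D clouds (the paper's exact objective may differ, since BVIB-DeepSSM also carries probabilistic terms):

```python
def chamfer_distance(pc_a, pc_b):
    """Symmetric Chamfer distance between two point clouds: for each point,
    the squared distance to its nearest neighbor in the other cloud,
    averaged per cloud and summed. A common surface-level supervision
    signal when ground-truth correspondences are unavailable."""
    def sq(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    a_to_b = sum(min(sq(p, q) for q in pc_b) for p in pc_a) / len(pc_a)
    b_to_a = sum(min(sq(p, q) for q in pc_a) for p in pc_b) / len(pc_b)
    return a_to_b + b_to_a
```

Because the loss needs only unordered surface points, it removes the requirement for the strong correspondence-point supervision of the original BVIB-DeepSSM pipeline.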
https://arxiv.org/abs/2405.09697
Deep learning has achieved remarkable success in recent years. Central to its success is its ability to learn representations that preserve task-relevant structure. However, massive energy, compute, and data costs are required to learn general representations. This paper explores Hyperdimensional Computing (HDC), a computationally and data-efficient brain-inspired alternative. HDC acts as a bridge between connectionist and symbolic approaches to artificial intelligence (AI), allowing explicit specification of representational structure as in symbolic approaches while retaining the flexibility of connectionist approaches. However, HDC's simplicity poses challenges for encoding complex compositional structures, especially in its binding operation. To address this, we propose Generalized Holographic Reduced Representations (GHRR), an extension of Fourier Holographic Reduced Representations (FHRR), a specific HDC implementation. GHRR introduces a flexible, non-commutative binding operation, enabling improved encoding of complex data structures while preserving HDC's desirable properties of robustness and transparency. In this work, we introduce the GHRR framework, prove its theoretical properties and its adherence to HDC properties, explore its kernel and binding characteristics, and perform empirical experiments showcasing its flexible non-commutativity, enhanced decoding accuracy for compositional structures, and improved memorization capacity compared to FHRR.
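The contrast between FHRR's commutative binding and a non-commutative alternative can be shown on toy phasor hypervectors; note the shift-based binding below is only an illustration of non-commutativity, not GHRR's actual matrix-valued construction:

```python
import cmath

def fhrr_bind(a, b):
    """FHRR binding: element-wise product of unit-magnitude complex
    (phasor) hypervectors; commutative by construction."""
    return [x * y for x, y in zip(a, b)]

def noncommutative_bind(a, b):
    """Toy non-commutative binding in the spirit of GHRR: cyclically shift
    the second operand before the element-wise product, so in general
    bind(a, b) != bind(b, a). GHRR itself achieves this with a richer,
    principled matrix-valued generalization."""
    shifted = b[1:] + b[:1]
    return [x * y for x, y in zip(a, shifted)]
```

Non-commutativity matters for encoding ordered compositional structures (e.g., sequences), where `bind(subject, object)` must differ from `bind(object, subject)`.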
https://arxiv.org/abs/2405.09689
This study introduces a groundbreaking optical coherence tomography (OCT) imaging system dedicated to high-throughput screening applications using ex vivo tissue culture. Leveraging OCT's non-invasive, high-resolution capabilities, the system is equipped with a custom-designed motorized platform and tissue detection ability for automated, successive imaging across samples. Transformer-based deep learning segmentation algorithms further ensure robust, consistent, and efficient readouts, meeting the standards for screening assays. Validated using retinal explant cultures from a mouse model of retinal degeneration, the system provides robust, rapid, reliable, unbiased, and comprehensive readouts of tissue response to treatments. This fully automated OCT-based system marks a significant advancement in tissue screening, promising to transform drug discovery as well as other relevant research fields.
https://arxiv.org/abs/2405.09601
Synthetic aperture radar (SAR) is an essential active sensing modality for Earth observation. SAR Automatic Target Recognition (ATR) focuses on detecting and classifying various target categories under different image conditions. Current deep learning-based SAR ATR methods are typically designed for specific datasets and applications. Varying target characteristics, scene background information, and sensor parameters across ATR datasets challenge the generalization of these methods. This paper aims to achieve general SAR ATR based on a foundation model with Self-Supervised Learning (SSL). Our motivation is to break free of dataset- and condition-specific limitations and obtain universal perceptual capabilities across targets, scenes, and sensors. A foundation model named SARATR-X is proposed along four aspects: pre-training dataset, model backbone, SSL, and evaluation tasks. First, we integrated 14 datasets with various target categories and imaging conditions into a pre-training dataset. Second, different model backbones were examined to find the most suitable approaches for remote-sensing images. Third, we applied two-stage training and SAR gradient features to ensure the diversity and scalability of SARATR-X. Finally, SARATR-X achieved competitive and superior performance on 5 datasets with 8 task settings, which shows that a foundation model can achieve universal SAR ATR. We believe it is time to embrace foundation models for SAR image interpretation in the era of big data.
https://arxiv.org/abs/2405.09365
Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as device-related difficulties such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections and the weights of the trained YOLOv7 model are available at: this https URL.
https://arxiv.org/abs/2405.09355
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Further, these adversarial examples are found to be transferable from the source network in which they are crafted to a black-box target network. As the trend of using deep learning on embedded devices grows, it becomes relevant to study the transferability properties of adversarial examples among compressed networks. In this paper, we consider quantization as a network compression technique and evaluate the performance of transfer-based attacks when the source and target networks are quantized at different bitwidths. We explore how algorithm-specific properties affect transferability by considering various adversarial example generation algorithms. Furthermore, we examine transferability in a more realistic scenario where the source and target networks may differ in bitwidth and other model-related properties such as capacity and architecture. We find that although quantization reduces transferability, certain attack types demonstrate an ability to enhance it. Additionally, the average transferability of adversarial examples among quantized versions of a network can be used to estimate the transferability to quantized target networks with varying capacity and architecture.
https://arxiv.org/abs/2405.09598
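The transfer scenario studied in the quantization abstract above can be sketched on a deliberately tiny, deterministic model, not the paper's experimental setup: an FGSM-style adversarial example is crafted against a full-precision linear "source" classifier, and we then check whether it also fools a weight-quantized "target" copy. The weights, input, epsilon, and bitwidth below are all illustrative choices.

```python
import numpy as np

# Toy linear classifier: prediction = sign(w @ x), true label y = +1.
w_src = np.array([0.9, -0.5, 0.3, -0.8])  # full-precision source weights
x = np.array([1.0, -1.0, 1.0, -1.0])      # clean input
y = 1.0

# FGSM step: perturb the input against the label along the sign of the
# input gradient; for a linear score y * (w @ x), that direction is sign(y * w).
eps = 2.0
x_adv = x - eps * np.sign(y * w_src)

def quantize(w, bits):
    # Uniform symmetric weight quantization to the given bitwidth.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

w_tgt = quantize(w_src, bits=3)  # 3-bit quantized "target" network

src_clean_ok = np.sign(w_src @ x) == y    # source is right on the clean input
tgt_clean_ok = np.sign(w_tgt @ x) == y    # so is the quantized target
src_fooled = np.sign(w_src @ x_adv) != y  # FGSM flips the source prediction
tgt_fooled = np.sign(w_tgt @ x_adv) != y  # ...and the attack transfers
print(src_clean_ok, tgt_clean_ok, src_fooled, tgt_fooled)
```

Because uniform quantization preserves the signs of the weights here, the gradient-sign perturbation stays aligned with the target's decision boundary and the attack transfers; the paper's finding that quantization generally reduces (but does not eliminate) transferability concerns far less forgiving deep models.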