As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: this https URL.
https://arxiv.org/abs/2405.10255
Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20-49. Although the generated speech is intelligible, it is still of lower quality than the model trained on studio-grade recordings. This is due to the insufficient data preprocessing methods applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations, as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can improve by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation determined by subjectively listening and comparing their voice recordings. In addition to trimming out silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to filter recordings with an estimated MOS over 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single-speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
https://arxiv.org/abs/2405.10211
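The MOS-based filtering step described in the abstract above can be sketched as follows. This is a toy illustration only: `filter_by_mos` and `estimate_mos` are hypothetical names standing in for the authors' pre-trained, non-intrusive MOS estimation model, not their actual code.

```python
def filter_by_mos(recordings, estimate_mos, threshold=3.5):
    """Keep only recordings whose estimated MOS exceeds the threshold,
    indicating high perceived quality."""
    return [r for r in recordings if estimate_mos(r) > threshold]

# Toy usage: a fake estimator that looks up precomputed scores.
scores = {"clip_a.wav": 3.9, "clip_b.wav": 2.8, "clip_c.wav": 3.6}
kept = filter_by_mos(list(scores), lambda name: scores[name])
print(kept)  # ['clip_a.wav', 'clip_c.wav']
```

In the paper this filter is applied after silence trimming and speech enhancement, so only the cleanest Common Voice recordings reach training.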
Epigenetic cell memory, the inheritance of gene expression patterns across subsequent cell divisions, is a critical property of multi-cellular organisms. In recent work [10], a subset of the authors observed in a simulation study how the stochastic dynamics and time-scale differences between establishment and erasure processes in chromatin modifications (such as histone modifications and DNA methylation) can have a critical effect on epigenetic cell memory. In this paper, we provide a mathematical framework to rigorously validate and extend beyond these computational findings. Viewing our stochastic model of a chromatin modification circuit as a singularly perturbed, finite state, continuous time Markov chain, we extend beyond existing theory in order to characterize the leading coefficients in the series expansions of stationary distributions and mean first passage times. In particular, we characterize the limiting stationary distribution in terms of a reduced Markov chain, provide an algorithm to determine the orders of the poles of mean first passage times, and determine how changing erasure rates affects system behavior. The theoretical tools developed in this paper not only allow us to set a rigorous mathematical basis for the computational findings of our prior work, highlighting the effect of chromatin modification dynamics on epigenetic cell memory, but they can also be applied to other singularly perturbed Markov chains beyond the applications in this paper, especially those associated with chemical reaction networks.
https://arxiv.org/abs/2405.10184
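The singular-perturbation effect described above can be illustrated numerically on a toy two-state chain (not the paper's chromatin circuit): establishment occurs at rate 1 and erasure at a slow rate ε. As ε → 0 the stationary distribution concentrates on the modified state, a minimal picture of epigenetic memory.

```python
import numpy as np

def stationary(Q):
    """Stationary distribution of a CTMC generator Q:
    solve pi @ Q = 0 with sum(pi) = 1 via least squares."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def generator(eps):
    # Toy unmodified <-> modified switch: establishment at rate 1,
    # erasure at the slow rate eps (the perturbation parameter).
    return np.array([[-1.0, 1.0],
                     [eps, -eps]])

for eps in (1.0, 0.1, 0.01):
    pi = stationary(generator(eps))
    print(eps, pi.round(3))  # pi -> (0, 1) as eps shrinks
```

The exact stationary distribution here is (ε/(1+ε), 1/(1+ε)), so the limiting distribution as ε → 0 is supported on the modified state, mirroring the paper's characterization of limiting stationary distributions via a reduced chain.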
We introduce RoScenes, the largest multi-view roadside perception dataset, which aims to shed light on the development of vision-centric Bird's Eye View (BEV) approaches for more challenging traffic scenes. The highlights of RoScenes include a significantly large perception area, full scene coverage, and crowded traffic. More specifically, our dataset achieves an impressive 21.13M 3D annotations within 64,000 $m^2$. To relieve the expensive costs of roadside 3D labeling, we present a novel BEV-to-3D joint annotation pipeline to efficiently collect such a large volume of data. After that, we organize a comprehensive study of current BEV methods on RoScenes in terms of effectiveness and efficiency. Tested methods suffer from the vast perception area and the variation of sensor layout across scenes, resulting in performance levels falling below expectations. To this end, we propose RoBEV, which incorporates feature-guided position embedding for effective 2D-3D feature assignment. With its help, our method outperforms the state of the art by a large margin on the validation set without extra computational overhead. Our dataset and devkit will be made available at \url{this https URL}.
https://arxiv.org/abs/2405.09883
This paper introduces Fusion Intelligence (FI), a bio-inspired intelligent system in which the innate sensing, intelligence, and unique actuation abilities of biological organisms such as bees and ants are integrated with the computational power of Artificial Intelligence (AI). This interdisciplinary field seeks to create systems that are not only smart but also adaptive and responsive in ways that mimic nature. As FI evolves, it holds the promise of revolutionizing the way we approach complex problems, leveraging the best of both the biological and digital worlds to create solutions that are more effective, sustainable, and harmonious with the environment. We demonstrate FI's potential to enhance agricultural IoT system performance through a simulated case study on improving insect pollination efficacy (entomophily).
https://arxiv.org/abs/2405.09763
Correspondence-based statistical shape modeling (SSM) stands as a powerful technology for morphometric analysis in clinical research. SSM facilitates population-level characterization and quantification of anatomical shapes such as bones and organs, aiding in pathology and disease diagnostics and treatment planning. Despite its potential, SSM remains under-utilized in medical research due to the significant overhead associated with automatic construction methods, which demand complete, aligned shape surface representations. Additionally, optimization-based techniques rely on bias-inducing assumptions or templates and have prolonged inference times as the entire cohort is simultaneously optimized. To overcome these challenges, we introduce Point2SSM++, a principled, self-supervised deep learning approach that directly learns correspondence points from point cloud representations of anatomical shapes. Point2SSM++ is robust to misaligned and inconsistent input, providing SSM that accurately samples individual shape surfaces while effectively capturing population-level statistics. Additionally, we present principled extensions of Point2SSM++ tailored for dynamic spatiotemporal and multi-anatomy use cases, demonstrating the broad versatility of the framework. Through extensive validation across diverse anatomies, evaluation metrics, and clinically relevant downstream tasks, we demonstrate Point2SSM++'s superiority over existing state-of-the-art deep learning models and traditional approaches. Point2SSM++ substantially enhances the feasibility of SSM generation and significantly broadens its array of potential clinical applications.
https://arxiv.org/abs/2405.09707
Objective: Federated Learning (FL) enables collaborative model training while keeping data local. Currently, most FL studies in radiology are conducted in simulated environments due to numerous hurdles impeding its translation into practice. The few existing real-world FL initiatives rarely communicate the specific measures taken to overcome these hurdles, leaving behind a significant knowledge gap. Moreover, among efforts to implement real-world FL, there is a notable lack of comprehensive assessments comparing FL to less complex alternatives. Materials & Methods: We extensively reviewed the FL literature, categorizing insights along with our own findings according to their nature and the phase of establishing an FL initiative, and summarized them into a comprehensive guide. We developed our own FL infrastructure within the German Radiological Cooperative Network (RACOON) and demonstrated its functionality by training FL models on lung pathology segmentation tasks across six university hospitals. We extensively evaluated FL against less complex alternatives in three distinct evaluation scenarios. Results: The proposed guide outlines essential steps, identified hurdles, and proposed solutions for establishing successful FL initiatives conducting real-world experiments. Our experimental results show that FL outperforms less complex alternatives in all evaluation scenarios, justifying the effort required to translate FL into real-world applications. Discussion & Conclusion: Our proposed guide aims to aid future FL researchers in circumventing pitfalls and accelerating the translation of FL into radiological applications. Our results underscore the value of the effort needed to translate FL into real-world applications by demonstrating advantageous performance over alternatives, and emphasize the importance of strategic organization and robust management of distributed data and infrastructure in real-world settings.
https://arxiv.org/abs/2405.09409
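The core aggregation step of collaborative training as described above can be sketched with standard federated averaging (FedAvg). This is a minimal, generic sketch of the aggregation idea, not the RACOON infrastructure, which additionally involves orchestration, security, and data management.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated averaging: combine client model weights by a
    data-size-weighted mean, so clients with more local samples
    contribute proportionally more to the global model."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two toy clients holding 1 and 3 samples respectively.
w_a, w_b = np.array([1.0]), np.array([3.0])
global_w = fedavg([w_a, w_b], [1, 3])
print(global_w)  # [2.5]: weighted mean of the client weights
```

Each round, clients train locally, send updated weights, and receive the aggregated global model back, so raw imaging data never leaves the hospital.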
While content-based image retrieval (CBIR) has been extensively studied for natural images, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from models pre-trained with supervision on medical images against embeddings derived from models pre-trained without supervision on non-medical images, for 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late interaction re-ranking method inspired by text matching for image retrieval and compare it against the original method proposed for volume and region retrieval, achieving a retrieval recall of 1.0 for diverse anatomical regions with a wide size range. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.
https://arxiv.org/abs/2405.09334
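The late interaction re-ranking idea borrowed from text matching can be sketched with a MaxSim-style score: each query embedding is matched to its best counterpart among the candidate's embeddings, and the maxima are summed. The vectors below are toy stand-ins for region/volume embeddings, not the paper's actual features.

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """MaxSim late interaction: for each query vector take its best
    match among the candidate's vectors, then sum over query vectors."""
    sims = query_vecs @ doc_vecs.T        # (n_query, n_doc) similarities
    return float(sims.max(axis=1).sum())  # best match per query vector

q = np.array([[1.0, 0.0], [0.0, 1.0]])
d_good = np.array([[0.9, 0.1], [0.1, 0.9]])  # covers both query directions
d_bad = np.array([[0.5, 0.5]])               # covers neither well
print(late_interaction_score(q, d_good) > late_interaction_score(q, d_bad))  # True
```

Fine-grained matching of this kind lets a retriever re-rank candidates by local correspondence rather than a single pooled embedding.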
Due to the increasing need for effective security measures and the integration of cameras in commercial products, a huge amount of visual data is created today. Law enforcement agencies (LEAs) are inspecting images and videos to find radicalization, propaganda for terrorist organizations and illegal products on darknet markets. This is time consuming. Instead of an undirected search, LEAs would like to adapt to new crimes and threats, and focus only on data from specific locations, persons or objects, which requires flexible interpretation of image content. Visual concept detection with deep convolutional neural networks (CNNs) is a crucial component to understand the image content. This paper has five contributions. The first contribution allows image-based geo-localization to estimate the origin of an image. CNNs and geotagged images are used to create a model that determines the location of an image by its pixel values. The second contribution enables analysis of fine-grained concepts to distinguish sub-categories in a generic concept. The proposed method encompasses data acquisition and cleaning and concept hierarchies. The third contribution is the recognition of person attributes (e.g., glasses or moustache) to enable query by textual description for a person. The person-attribute problem is treated as a specific sub-task of concept classification. The fourth contribution is an intuitive image annotation tool based on active learning. Active learning allows users to define novel concepts flexibly and train CNNs with minimal annotation effort. The fifth contribution increases the flexibility for LEAs in the query definition by using query expansion. Query expansion maps user queries to known and detectable concepts. Therefore, no prior knowledge of the detectable concepts is required for the users. The methods are validated on data with varying locations (popular and non-touristic locations), varying person attributes (CelebA dataset), and varying number of annotations.
https://arxiv.org/abs/2405.09194
Modeling visual saliency in graphical user interfaces (GUIs) makes it possible to understand how people perceive GUI designs and which elements attract their attention. One aspect that is often overlooked is the fact that computational models depend on a series of design parameters that are not straightforward to decide. We systematically analyze how different design parameters affect scanpath evaluation metrics using a state-of-the-art computational model (DeepGaze++). We particularly focus on three design parameters: input image size, inhibition-of-return decay, and masking radius. We show that even small variations of these design parameters have a noticeable impact on standard evaluation metrics such as DTW or Eyenalysis. These effects also occur in other scanpath models, such as UMSS and ScanGAN, and in other datasets such as MASSVIS. Taken together, our results highlight the impact of design decisions on predicting users' viewing behavior on GUIs.
https://arxiv.org/abs/2405.08981
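The DTW metric mentioned above compares two scanpaths by optimally aligning their fixation sequences. Here is a textbook dynamic-time-warping implementation over (x, y) fixation points, shown only to illustrate the metric, not the paper's evaluation code.

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two scanpaths,
    each a sequence of (x, y) fixation points, using Euclidean
    point-to-point cost and the standard recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

path_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
path_b = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(dtw(path_a, path_b))  # 0.0 for identical scanpaths
```

Because the cost depends on raw pixel coordinates, design choices such as input image size directly rescale DTW values, which is one way such parameters can shift evaluation results.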
This paper addresses the problem of pathological lung segmentation, a significant challenge in medical image analysis that is particularly pronounced in cases of peripheral opacities (severe fibrosis and consolidation) because of the textural similarity between lung tissue and surrounding areas. To overcome these challenges, this paper emphasizes the use of CycleGAN for unpaired image-to-image translation, in order to provide an augmentation method able to generate fake pathological images matching an existing ground truth. Although previous studies have employed CycleGAN, they often neglect the challenge of shape deformation, which is crucial for accurate medical image segmentation. Our work introduces an innovative strategy that incorporates additional loss functions. Specifically, it proposes an L1 loss based on the lung surrounding, whose shape is constrained to remain unchanged at the transition from the healthy to the pathological domain. The lung surrounding is derived from the ground truth lung masks available in the healthy domain. Furthermore, preprocessing steps, such as cropping based on rib/vertebra locations, are applied to refine the input to the CycleGAN, ensuring that the network focuses on the lung region. This is essential to avoid extraneous biases, such as the zoom-effect bias, which can divert attention from the main task. The method is applied to enhance, in a semi-supervised manner, the lung segmentation process by employing a U-Net model trained with on-the-fly data augmentation incorporating synthetic pathological tissues generated by the CycleGAN model. Preliminary results from this research demonstrate significant qualitative and quantitative improvements, setting a new benchmark in the field of pathological lung segmentation. Our code is available at this https URL.
https://arxiv.org/abs/2405.08556
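The surrounding-constrained L1 loss described above can be sketched as an L1 penalty computed only outside the lung mask, so the translation may alter lung texture but is penalized for deforming the surroundings. Arrays stand in for image tensors here; the paper's actual implementation may differ in normalization and weighting.

```python
import numpy as np

def surrounding_l1(real, fake, lung_mask):
    """Mean absolute difference restricted to pixels outside the
    lung mask: the 'surrounding' must stay unchanged under the
    healthy-to-pathological translation."""
    outside = 1.0 - lung_mask
    return float(np.abs((real - fake) * outside).sum() / max(outside.sum(), 1.0))

real = np.ones((4, 4))
lung = np.zeros((4, 4))
lung[1:3, 1:3] = 1.0  # toy lung mask from the healthy domain

fake_inside = real.copy()
fake_inside[1:3, 1:3] += 5.0   # synthetic pathology inside the lung only
fake_outside = real.copy()
fake_outside[0, 0] += 5.0      # deformation leaking into the surrounding

print(surrounding_l1(real, fake_inside, lung))   # 0.0: surrounding unchanged
print(surrounding_l1(real, fake_outside, lung))  # > 0: penalized
```

Zero loss for changes confined to the lung and positive loss for changes outside it is exactly the shape-preservation behavior the added constraint is meant to enforce.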
In biological evolution, complex neural structures grow from a handful of cellular ingredients. As genomes in nature are bounded in size, this complexity is achieved by a growth process in which cells communicate locally to decide whether to differentiate, proliferate, and connect with other cells. This self-organisation is hypothesized to play an important part in the generalisation and robustness of biological neural networks. Artificial neural networks (ANNs), on the other hand, are traditionally optimized in the space of weights. Thus, the benefits and challenges of growing artificial neural networks remain understudied. Building on the previously introduced Neural Developmental Programs (NDP), in this work we present an algorithm for growing ANNs that solve reinforcement learning tasks. We identify a key challenge: ensuring phenotypic complexity requires maintaining neuronal diversity, but this diversity comes at the cost of optimization stability. To address this, we introduce two mechanisms: (a) equipping neurons with an intrinsic state inherited upon neurogenesis; (b) lateral inhibition, a mechanism inspired by biological growth, which controls the pace of growth, helping diversity persist. We show that both mechanisms contribute to neuronal diversity and that, equipped with them, NDPs achieve comparable results to existing direct and developmental encodings in complex locomotion tasks.
https://arxiv.org/abs/2405.08510
In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: What digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define computational methods that alter semantic face attributes beyond human discrimination thresholds as sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (i.e., real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors as the test set and consistently improves their generalizability as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.
https://arxiv.org/abs/2405.08487
Perivascular spaces (PVSs) form a central component of the brain's waste clearance system, the glymphatic system. These structures are visible on MRI images, and their morphology is associated with aging and neurological disease. Manual quantification of PVSs is time consuming and subjective. Numerous deep learning methods for PVS segmentation have been developed; however, the majority have been developed and evaluated on homogeneous datasets and high-resolution scans, perhaps limiting their applicability for the wide range of image qualities acquired in clinical and research settings. In this work we train nnUNet, a top-performing biomedical image segmentation algorithm, on a heterogeneous training sample of manually segmented MRI images of a range of different qualities and resolutions from 6 different datasets. These are compared to publicly available deep learning methods for 3D segmentation of PVSs. The resulting model, PINGU (Perivascular space Identification Nnunet for Generalised Usage), achieved voxel- and cluster-level Dice scores of 0.50 (SD=0.15) and 0.63 (0.17) in the white matter (WM), and 0.54 (0.11) and 0.66 (0.17) in the basal ganglia (BG). Performance on data from unseen sites was substantially lower for both PINGU (0.20-0.38 (WM, voxel), 0.29-0.58 (WM, cluster), 0.22-0.36 (BG, voxel), 0.46-0.60 (BG, cluster)) and the publicly available algorithms (0.18-0.30 (WM, voxel), 0.29-0.38 (WM, cluster), 0.10-0.20 (BG, voxel), 0.15-0.37 (BG, cluster)), but PINGU strongly outperformed the publicly available algorithms, particularly in the BG. Finally, training PINGU on manual segmentations from a single site with homogeneous scan properties gave marginally lower performance on internal cross-validation, but in some cases higher performance on external validation. PINGU stands out as a broad-use PVS segmentation tool, with particular strength in the BG, an area of PVS related to vascular disease and pathology.
https://arxiv.org/abs/2405.08337
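The voxel-level Dice scores reported above measure overlap between predicted and ground-truth binary masks. Below is a generic Dice implementation for illustration; it is not the paper's evaluation code.

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice overlap between binary masks:
    2 * |pred AND target| / (|pred| + |target|).
    1.0 means perfect agreement, 0.0 means no overlap."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float(2.0 * inter / (pred.sum() + target.sum() + eps))

a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 0, 0])
print(round(dice(a, b), 3))  # 2*1 / (2+1) -> 0.667
```

Cluster-level Dice, also reported in the abstract, applies the same formula after grouping voxels into connected components, rewarding detection of whole PVS clusters rather than individual voxels.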
Grasp generation for dexterous hands often requires a large number of grasping annotations, especially for functional grasps, which require the grasp pose to be convenient for the subsequent use of the object. However, annotating high-DoF dexterous hand poses is rather challenging. This prompts us to explore how people achieve manipulations of new objects based on past grasp experiences. We find that people are adept at discovering and leveraging various similarities between objects when grasping new items, including shape, layout, and grasp type. In light of this, we analyze and collect grasp-related similarity relationships among 51 common tool-like object categories and annotate semantic grasp representations for 1768 objects. These data are organized into the form of a knowledge graph, which helps infer our proposed cross-category functional grasp synthesis. Through extensive experiments, we demonstrate that the grasp-related knowledge indeed contributes to achieving functional grasp transfer across unknown or entirely new categories of objects. We will publicly release the dataset and code to facilitate future research.
https://arxiv.org/abs/2405.08310
Histology slide digitization is becoming essential for telepathology (remote consultation), knowledge sharing (education), and using state-of-the-art artificial intelligence algorithms (augmented/automated end-to-end clinical workflows). However, the cumulative costs of digital multi-slide high-speed brightfield scanners, cloud/on-premises storage, and personnel (IT and technicians) make current slide digitization workflows out of reach for limited-resource settings, further widening the health equity gap; even single-slide manual scanning commercial solutions are costly due to hardware requirements (high-resolution cameras, high-spec PCs/workstations, and support for only high-end microscopes). In this work, we present a new cloud slide digitization workflow for creating scanner-quality whole-slide images (WSIs) from uploaded low-quality videos acquired from inexpensive microscopes with built-in cameras. Specifically, we present a pipeline to create stitched WSIs while automatically deblurring out-of-focus regions, upsampling input 10X images to 40X resolution, and reducing brightness/contrast and light-source illumination variations. We demonstrate the WSI creation efficacy of our workflow on the World Health Organization-declared neglected tropical disease Cutaneous Leishmaniasis (prevalent only in the poorest regions of the world and diagnosed only by sub-specialist dermatopathologists, who are rare in poor countries), as well as on other common pathologies on core biopsies of breast, liver, duodenum, stomach and lymph node. The code and pretrained models will be accessible via our GitHub (this https URL), and the cloud platform will be available at this https URL for uploading microscope videos and downloading/viewing WSIs with shareable links (no sign-in required) for telepathology and knowledge sharing.
Histology slide digitization is becoming increasingly important for telepathology (remote consultation), knowledge sharing (education), and the use of state-of-the-art artificial intelligence algorithms (augmented/automated end-to-end clinical workflows). However, the cumulative costs of digital multi-slide high-speed brightfield scanners, cloud/on-premises storage, and personnel (IT and technicians) put current slide digitization workflows out of reach for limited-resource settings, further widening the health equity gap; even single-slide manual scanning commercial solutions are costly due to hardware requirements (high-resolution cameras, high-spec PCs/workstations, and support for only high-end microscopes). In this work, we present a new cloud slide digitization workflow for creating scanner-quality whole-slide images (WSIs) from uploaded low-quality videos acquired from inexpensive microscopes with built-in cameras. Specifically, we present a pipeline that creates stitched WSIs while automatically deblurring out-of-focus regions, upsampling input 10X images to 40X resolution, and reducing brightness/contrast and light-source illumination variations. We demonstrate the efficacy of WSI creation with our workflow on the World Health Organization-declared neglected tropical disease Cutaneous Leishmaniasis (prevalent only in the poorest regions of the world and diagnosed only by sub-specialist dermatopathologists, who are rare in poor countries), as well as on other common pathologies on core biopsies of breast, liver, duodenum, stomach, and lymph node. The code and pretrained models will be accessible via our GitHub (this https URL), and the cloud platform will be available at this https URL for uploading microscope videos and downloading/viewing WSIs with shareable links (no sign-in required) for telepathology and knowledge sharing.
https://arxiv.org/abs/2405.08169
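One ingredient of such a pipeline, reducing brightness/contrast variation across video frames before stitching, can be sketched very simply. This is an illustrative numpy sketch, not the paper's method: it rescales each grayscale frame to a common mean and standard deviation (the hypothetical `target_mean`/`target_std` values are arbitrary), so frames from differently illuminated passes blend more smoothly.

```python
import numpy as np

def normalize_frame(frame, target_mean=128.0, target_std=40.0):
    """Illustrative brightness/contrast normalization: standardize a
    grayscale frame, then rescale to a shared mean/std and clip to [0, 255]."""
    frame = frame.astype(np.float64)
    normed = (frame - frame.mean()) / (frame.std() + 1e-8)
    return np.clip(normed * target_std + target_mean, 0.0, 255.0)

rng = np.random.default_rng(0)
dark = rng.normal(60, 10, (32, 32))     # stand-in for an under-exposed frame
bright = rng.normal(190, 30, (32, 32))  # stand-in for an over-exposed frame
print(normalize_frame(dark).mean(), normalize_frame(bright).mean())
```

A real workflow would instead use learned enhancement and illumination-correction models, but the goal is the same: frames entering the stitcher should share a common intensity profile.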
Synthesizing high-quality photorealistic images with textual descriptions as a condition is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between image and text descriptions and insufficient richness in synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GAN to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT and a recurrent neural network (RAT) to ensure that different layers can access global information. We then introduce shuffle attention between RAT to mitigate the characteristic of information forgetting in recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model, Clip, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is this https URL.
Synthesizing high-quality photorealistic images conditioned on textual descriptions is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between images and text descriptions and insufficient richness in the synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GANs to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, leaving global textual information unavailable to other layers. To address this issue, we first model CAT with a recurrent neural network (RAT) to ensure that different layers can access global information. We then introduce shuffle attention between RAT blocks to mitigate the information forgetting characteristic of recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model CLIP, which has been extensively employed to establish associations between text and images by learning multimodal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments on the CUB, Oxford, and CelebA-tiny datasets demonstrate the superiority of the proposed model over current state-of-the-art models. The code is available at this https URL.
https://arxiv.org/abs/2405.08114
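The core conditioning mechanism, a conditional affine transformation, is easy to sketch. This is a minimal numpy illustration of the general CBN/CIN idea, not the paper's implementation: linear maps (standing in for the MLP) predict a per-channel scale `gamma` and shift `beta` from the text embedding, which then modulate an instance-normalized feature map.

```python
import numpy as np

def conditional_affine(features, text_emb, W_gamma, W_beta):
    """Conditional affine transformation: per-channel scale/shift predicted
    from a text embedding modulate a normalized feature map (C, H, W)."""
    # instance-norm style: normalize each channel over its spatial dims
    mu = features.mean(axis=(1, 2), keepdims=True)
    sigma = features.std(axis=(1, 2), keepdims=True) + 1e-5
    normed = (features - mu) / sigma
    gamma = W_gamma @ text_emb   # (C,) predicted scale
    beta = W_beta @ text_emb     # (C,) predicted shift
    return gamma[:, None, None] * normed + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 16         # channels, spatial dims, text-embedding dim
features = rng.standard_normal((C, H, W))
text_emb = rng.standard_normal(D)
out = conditional_affine(features, text_emb,
                         rng.standard_normal((C, D)),
                         rng.standard_normal((C, D)))
print(out.shape)  # (8, 4, 4)
```

The paper's point is that when each layer predicts `gamma`/`beta` independently, no layer sees what the others did with the text; chaining the predictions through a recurrent structure (RAT) shares that global information across layers.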
Purpose: To introduce a deep learning model capable of multi-organ segmentation in MRI scans, offering a solution to the current limitations in MRI analysis due to challenges in resolution, standardized intensity values, and variability in sequences. Materials and Methods: The model was trained on 1,200 manually annotated MRI scans from the UK Biobank, 221 in-house MRI scans, and 1,228 CT scans, leveraging cross-modality transfer learning from CT segmentation models. A human-in-the-loop annotation workflow was employed to efficiently create high-quality segmentations. The model's performance was evaluated on the NAKO and AMOS22 datasets, containing 600 and 60 MRI examinations, respectively. The Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) were used to assess segmentation accuracy. The model will be open sourced. Results: The model showcased high accuracy in segmenting well-defined organs, achieving Dice Similarity Coefficient (DSC) scores of 0.97 for the right and left lungs, and 0.95 for the heart. It also demonstrated robustness in organs like the liver (DSC: 0.96) and kidneys (DSC: 0.95 left, 0.95 right), which present more variability. However, segmentation of smaller and complex structures such as the portal and splenic veins (DSC: 0.54) and adrenal glands (DSC: 0.65 left, 0.61 right) revealed the need for further model optimization. Conclusion: The proposed model is a robust tool for accurate segmentation of 40 anatomical structures in MRI and CT images. By leveraging cross-modality learning and interactive annotation, the model achieves strong performance and generalizability across diverse datasets, making it a valuable resource for researchers and clinicians. It is open source and can be downloaded from this https URL.
Purpose: To introduce a deep learning model capable of multi-organ segmentation in MRI scans, addressing the current limitations of MRI analysis caused by challenges in resolution, standardized intensity values, and sequence variability. Materials and Methods: The model was trained on 1,200 manually annotated MRI scans from the UK Biobank, 221 in-house MRI scans, and 1,228 CT scans, leveraging cross-modality transfer learning from CT segmentation models. A human-in-the-loop annotation workflow was employed to efficiently create high-quality segmentations. The model's performance was evaluated on the NAKO and AMOS22 datasets, containing 600 and 60 MRI examinations, respectively. The Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) were used to assess segmentation accuracy. The model will be open sourced. Results: The model performed well at segmenting well-defined organs, achieving DSC scores of 0.97 for the right and left lungs and 0.95 for the heart. It also demonstrated robustness on organs with greater variability, such as the liver (DSC: 0.96) and kidneys (DSC: 0.95 left, 0.95 right). However, segmentation of smaller and more complex structures such as the portal and splenic veins (DSC: 0.54) and adrenal glands (DSC: 0.65 left, 0.61 right) revealed the need for further model optimization. Conclusion: The proposed model is a robust tool for accurate segmentation of 40 anatomical structures in MRI and CT images. By leveraging cross-modality learning and interactive annotation, the model achieves strong performance and generalizability across diverse datasets, making it a valuable resource for researchers and clinicians. It is open source and can be downloaded from this https URL.
https://arxiv.org/abs/2405.06463
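The DSC metric used throughout those results has a simple closed form, 2|A∩B| / (|A| + |B|) for two binary masks. A minimal numpy sketch (the empty-mask convention of returning 1.0 is one common choice, not necessarily the paper's):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary segmentation masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / denom

# toy 1-D "masks": 2 overlapping voxels, sizes 3 and 2
a = np.array([1, 1, 1, 0, 0])
b = np.array([1, 1, 0, 0, 0])
print(dice_coefficient(a, b))  # 2*2 / (3+2) = 0.8
```

DSC rewards overlap relative to combined mask size, which is why small structures like the adrenal glands score far lower than large organs even at similar boundary error; the Hausdorff Distance complements it by measuring worst-case boundary deviation.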
In recent years, diffusion models (DMs) have become a popular method for generating synthetic data. By achieving samples of higher quality, they quickly became superior to generative adversarial networks (GANs) and are now the state-of-the-art method in generative modeling. However, their potential has not yet been exploited in radar, where the lack of available training data is a long-standing problem. In this work, a specific type of DM, namely the denoising diffusion probabilistic model (DDPM), is adapted to the SAR domain. We investigate the network choice and specific diffusion parameters for conditional and unconditional SAR image generation. In our experiments, we show that DDPM qualitatively and quantitatively outperforms state-of-the-art GAN-based methods for SAR image generation. Finally, we show that DDPM profits from pretraining on large-scale clutter data, generating SAR images of even higher quality.
In recent years, diffusion models (DMs) have become a popular method for generating synthetic data. By producing higher-quality samples, they quickly surpassed generative adversarial networks (GANs) to become the current state of the art in generative modeling. However, their potential has not yet been exploited in radar, where the lack of available training data is a long-standing problem. In this work, a specific type of DM, the denoising diffusion probabilistic model (DDPM), is adapted to the SAR domain. We investigate the network choice and specific diffusion parameters for conditional and unconditional SAR image generation. In our experiments, we show that DDPM qualitatively and quantitatively outperforms state-of-the-art GAN-based methods for SAR image generation. Finally, we show that DDPM profits from pretraining on large-scale clutter data, generating SAR images of even higher quality.
https://arxiv.org/abs/2405.07776
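Among the "specific diffusion parameters" tuned in such work is the noise schedule. The DDPM forward process has a closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε with ᾱ_t the cumulative product of (1−β_t); a minimal numpy sketch using the linear schedule from the original DDPM paper (the 64×64 array is just a stand-in for a normalized SAR image, not data from this work):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

betas = np.linspace(1e-4, 0.02, 1000)  # linear schedule from the DDPM paper
rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))     # stand-in for a normalized SAR image
xT = forward_diffuse(x0, 999, betas, rng)
print(np.std(xT))  # near 1: at t = T the sample is almost pure Gaussian noise
```

The reverse (generative) network is then trained to predict ε from x_t; pretraining that network on large-scale clutter data, as the paper does, amounts to learning the denoiser on abundant unlabeled radar imagery first.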
Sign Language Production (SLP) is a challenging task, given the limited resources available and the inherent diversity within sign data. As a result, previous works have suffered from the problem of regression to the mean, leading to under-articulated and incomprehensible signing. In this paper, we propose using dictionary examples and a learnt codebook of facial expressions to create expressive sign language sequences. However, simply concatenating signs and adding the face creates robotic and unnatural sequences. To address this we present a 7-step approach to effectively stitch sequences together. First, by normalizing each sign into a canonical pose, cropping, and stitching we create a continuous sequence. Then, by applying filtering in the frequency domain and resampling each sign, we create cohesive natural sequences that mimic the prosody found in the original data. We leverage a SignGAN model to map the output to a photo-realistic signer and present a complete Text-to-Sign (T2S) SLP pipeline. Our evaluation demonstrates the effectiveness of the approach, showcasing state-of-the-art performance across all datasets. Finally, a user evaluation shows our approach outperforms the baseline model and is capable of producing realistic sign language sequences.
Sign Language Production (SLP) is a challenging task, given the limited resources available and the inherent diversity of sign data. As a result, previous works have suffered from regression to the mean, leading to under-articulated and incomprehensible signing. In this paper, we propose using dictionary examples and a learnt codebook of facial expressions to create expressive sign language sequences. However, simply concatenating signs and adding the face creates robotic and unnatural sequences. To address this, we present a 7-step approach to effectively stitch sequences together. First, by normalizing each sign into a canonical pose, cropping, and stitching, we create a continuous sequence. Then, by applying filtering in the frequency domain and resampling each sign, we create cohesive natural sequences that mimic the prosody found in the original data. We leverage a SignGAN model to map the output to a photo-realistic signer and present a complete Text-to-Sign (T2S) SLP pipeline. Our evaluation demonstrates the effectiveness of the approach, showing state-of-the-art performance across all datasets. Finally, a user evaluation shows that our approach outperforms the baseline model and is capable of producing realistic sign language sequences.
https://arxiv.org/abs/2405.07663
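The frequency-domain filtering and resampling steps can be sketched on a single pose coordinate over time. This is an illustrative numpy sketch, not the paper's exact filter: a low-pass cut in the real FFT removes high-frequency jitter, and linear interpolation (one simple choice of resampler) retimes the sign to a target frame count.

```python
import numpy as np

def lowpass_filter(signal, keep_frequencies):
    """Zero out high-frequency rFFT coefficients of a 1-D trajectory,
    smoothing jitter while preserving the overall motion."""
    spectrum = np.fft.rfft(signal)
    spectrum[keep_frequencies:] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def resample(signal, new_length):
    """Linearly resample a trajectory to a new number of frames."""
    old_t = np.linspace(0.0, 1.0, len(signal))
    new_t = np.linspace(0.0, 1.0, new_length)
    return np.interp(new_t, old_t, signal)

t = np.linspace(0.0, 1.0, 100)
# slow motion arc plus high-frequency jitter, standing in for a joint coordinate
trajectory = np.sin(2 * np.pi * t) + 0.1 * np.sin(40 * np.pi * t)
smooth = lowpass_filter(trajectory, keep_frequencies=5)
slowed = resample(smooth, 150)  # stretch the sign to 150 frames
print(len(slowed))  # 150
```

Applied per joint and per sign, this kind of smoothing and retiming is what lets concatenated dictionary signs mimic the prosody (speed and rhythm) of continuous signing instead of snapping between clips.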