The roto-translation group SE2 has been of active interest in image analysis due to methods that lift image data to multi-orientation representations defined on this Lie group. This has led to impactful applications of crossing-preserving flows for image denoising, geodesic tracking, and roto-translation equivariant deep learning. In this paper, we develop a computational framework for optimal transport over Lie groups, with a special focus on SE2. We make several theoretical contributions (generalizable to matrix Lie groups), such as the non-optimality of group actions as transport maps, invariance and equivariance properties of optimal transport, and the quality of the entropic-regularized optimal transport plan under geodesic distance approximations. We develop a Sinkhorn-like algorithm that can be implemented efficiently using fast and accurate distance approximations on the Lie group and GPU-friendly group convolutions. We report valuable advancements in experiments on 1) image barycenters, 2) interpolation of planar orientation fields, and 3) Wasserstein gradient flows on SE2. We observe that our framework of lifting images to SE2 and applying optimal transport with left-invariant anisotropic metrics leads to equivariant transport along dominant contours and salient line structures in the image. This yields sharper and more meaningful interpolations than their counterparts on $\mathbb{R}^2$.
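The entropic-regularized transport plan at the heart of a Sinkhorn-like algorithm can be illustrated with a minimal generic sketch. The SE2-specific geodesic distance approximations and group convolutions from the paper are not reproduced here; instead, an explicit cost matrix on a toy 1D grid stands in for the Lie-group distance.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, n_iter=200):
    """Entropic-regularized OT between histograms mu, nu with cost matrix C."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    v = np.ones_like(nu)
    for _ in range(n_iter):
        u = mu / (K @ v)                  # scale rows to match mu
        v = nu / (K.T @ u)                # scale columns to match nu
    P = u[:, None] * K * v[None, :]       # transport plan
    return P

# toy example: move mass between two histograms on a line
x = np.linspace(0, 1, 5)
C = (x[:, None] - x[None, :]) ** 2        # squared-distance cost
mu = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 0.0, 0.5, 0.5])
P = sinkhorn(mu, nu, C)
```

After convergence, the row and column sums of `P` recover the two input histograms, which is exactly the marginal constraint the Sinkhorn scaling enforces.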
https://arxiv.org/abs/2402.15322
The integration of medical imaging, computational analysis, and robotic technology has brought about a significant transformation in minimally invasive surgical procedures, particularly in the realm of laparoscopic rectal surgery (LRS). This specialized surgical technique, aimed at addressing rectal cancer, requires an in-depth comprehension of the spatial dynamics within the narrow space of the pelvis. Leveraging Magnetic Resonance Imaging (MRI) scans as a foundational dataset, this study incorporates them into Computer-Aided Design (CAD) software to generate precise three-dimensional (3D) reconstructions of the patient's anatomy. At the core of this research is the analysis of the surgical workspace, a critical aspect in the optimization of robotic interventions. Sophisticated computational algorithms process MRI data within the CAD environment, meticulously calculating the dimensions and contours of the pelvic internal regions. The outcome is a nuanced understanding of both viable and restricted zones during LRS, taking into account factors such as curvature, diameter variations, and potential obstacles. This paper delves deeply into the complexities of workspace analysis for robotic LRS, illustrating the seamless collaboration between medical imaging, CAD software, and surgical robotics. Through this interdisciplinary approach, the study aims to surpass traditional surgical methodologies, offering novel insights for a paradigm shift in optimizing robotic interventions within the complex environment of the pelvis.
https://arxiv.org/abs/2402.14386
Recent diffusion-based generative models show promise in their ability to generate text images, but limitations in specifying the styles of the generated texts render them insufficient in the realm of typographic design. This paper proposes a typographic text generation system to add and modify text on typographic designs while specifying font styles, colors, and text effects. The proposed system is a novel combination of two off-the-shelf methods for diffusion models, ControlNet and Blended Latent Diffusion. The former functions to generate text images under the guidance of edge conditions specifying stroke contours. The latter blends latent noise in Latent Diffusion Models (LDM) to add typographic text naturally onto an existing background. We first show that given appropriate text edges, ControlNet can generate texts in specified fonts while incorporating effects described by prompts. We further introduce text edge manipulation as an intuitive and customizable way to produce texts with complex effects such as ``shadows'' and ``reflections''. Finally, with the proposed system, we successfully add and modify texts on a predefined background while preserving its overall coherence.
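The blending operation contributed by Blended Latent Diffusion can be sketched in isolation: at each denoising step, edited latents are kept inside the text-region mask while the noised background latents are kept outside it. This is only a schematic sketch with random arrays standing in for actual LDM latents.

```python
import numpy as np

def blend_latents(z_edit, z_bg, mask):
    """Keep the edited latents where mask == 1, the background latents elsewhere."""
    return mask * z_edit + (1 - mask) * z_bg

rng = np.random.default_rng(0)
z_edit = rng.normal(size=(4, 8, 8))   # latents being steered toward the new text
z_bg = rng.normal(size=(4, 8, 8))     # latents of the noised original background
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0               # region where text is to be added
z = blend_latents(z_edit, z_bg, mask)
```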
https://arxiv.org/abs/2402.14314
Lipreading involves using visual data to recognize spoken words by analyzing the movements of the lips and surrounding area. It is a hot research topic with many potential applications, such as human-machine interaction and enhancing audio speech recognition. Recent deep-learning-based works aim to integrate visual features extracted from the mouth region with landmark points on the lip contours. However, employing a simple combination method such as concatenation may not be the most effective way to obtain the optimal feature vector. To address this challenge, we first propose a cross-attention fusion-based approach for large-lexicon Arabic vocabulary to predict spoken words in videos. Our method leverages the power of cross-attention networks to efficiently integrate visual and geometric features computed on the mouth region. Second, we introduce the first large-scale Lip Reading in the Wild for Arabic (LRW-AR) dataset, containing 20,000 videos across 100 word classes, uttered by 36 speakers. The experimental results obtained on the LRW-AR and ArabicVisual databases show the effectiveness and robustness of the proposed approach in recognizing Arabic words. Our work provides insights into the feasibility and effectiveness of applying lipreading techniques to the Arabic language, opening doors for further research in this field. Link to the project page: this https URL
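The fusion step can be sketched as plain scaled dot-product cross-attention, with visual features as queries and geometric (landmark-based) features as keys and values. Learned Q/K/V projections, multiple heads, and the rest of the recognition pipeline are omitted, so this is a schematic sketch rather than the paper's architecture.

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Scaled dot-product attention of one feature set over another."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)     # query-key similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)             # softmax over key positions
    return w @ keys_values, w

rng = np.random.default_rng(0)
visual = rng.normal(size=(10, 16))      # e.g. 10 visual tokens from the mouth region
geometric = rng.normal(size=(6, 16))    # e.g. 6 landmark-derived tokens
fused, attn = cross_attention(visual, geometric)
```

Each row of `attn` is a probability distribution over the geometric tokens, so every visual feature is replaced by a data-dependent mixture of landmark features instead of a fixed concatenation.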
https://arxiv.org/abs/2402.11520
Lung mask creation lacks well-defined criteria and standardized guidelines, leading to a high degree of subjectivity between annotators. In this study, we assess the underestimation of lung regions on chest X-ray segmentation masks created according to the current state-of-the-art method, by comparison with the total lung volume evaluated on computed tomography (CT). We show that lung X-ray masks created by following the contours of the heart, mediastinum, and diaphragm significantly underestimate lung regions and exclude substantial portions of the lungs from further assessment, which may result in numerous clinical errors.
https://arxiv.org/abs/2402.11510
This article introduces Lester, a novel method to automatically synthesise retro-style 2D animations from videos. The method approaches the challenge mainly as an object segmentation and tracking problem. Video frames are processed with the Segment Anything Model (SAM), and the resulting masks are tracked through subsequent frames with DeAOT, a method of hierarchical propagation for semi-supervised video object segmentation. The geometry of the masks' contours is simplified with the Douglas-Peucker algorithm. Finally, facial traits, pixelation and a basic shadow effect can optionally be added. The results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances, dynamic shots, partial shots and diverse backgrounds. The proposed method provides a simpler and more deterministic approach than diffusion-model-based video-to-video translation pipelines, which suffer from temporal consistency problems and do not cope well with pixelated and schematic outputs. The method is also much more practical than techniques based on 3D human pose estimation, which require custom handcrafted 3D models and are very limited with respect to the type of scenes they can process.
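The contour-simplification stage is the most self-contained step of the pipeline. A minimal NumPy version of the Douglas-Peucker algorithm it relies on (recursive form, for 2D polylines) might look like:

```python
import numpy as np

def douglas_peucker(points, tol):
    """Simplify a polyline: keep endpoints, recurse on the farthest outlier."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    d = end - start
    norm = np.linalg.norm(d)
    if norm == 0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # perpendicular distance of each point to the start-end chord
        dists = np.abs(d[0] * (points[:, 1] - start[1])
                       - d[1] * (points[:, 0] - start[0])) / norm
    i = int(np.argmax(dists))
    if dists[i] <= tol:
        return np.array([start, end])          # all points within tolerance
    left = douglas_peucker(points[: i + 1], tol)
    right = douglas_peucker(points[i:], tol)
    return np.vstack([left[:-1], right])       # drop duplicated split point

# a noisy L-shaped contour collapses to its three corner points
line = [(0, 0), (1, 0.05), (2, -0.03), (3, 0), (3, 1), (3, 2)]
simplified = douglas_peucker(line, tol=0.1)
```

The tolerance trades fidelity for sparsity: larger values yield the flatter, more schematic outlines characteristic of retro-style animation.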
https://arxiv.org/abs/2402.09883
Glass-like objects can be seen everywhere in our daily life, yet they are very hard for existing methods to segment. Their transparency poses great challenges for detecting them against chaotic backgrounds, and their vague separation boundaries further impede the acquisition of their exact contours. Moving machines that ignore glass run a great risk of crashing into transparent barriers, and have difficulty analysing objects reflected in mirrors; it is therefore of substantial significance to accurately locate glass-like objects and completely recover their contours. In this paper, inspired by the scale-integration strategy and the refinement method, we propose a brand-new network, named MGNet, which consists of a Fine-Rescaling and Merging module (FRM) to improve the ability to extract spatial relationships and a Primary Prediction Guiding module (PPG) to better mine the leftover semantics from the fused features. Moreover, we supervise the model with a novel uncertainty-aware loss function to produce high-confidence segmentation maps. Unlike existing glass segmentation models that must be trained with different settings for each dataset, our model is trained under consistent settings and has achieved superior performance on three popular public datasets. Code is available at
https://arxiv.org/abs/2402.08571
A method is proposed for point cloud-based registration and image fusion between cardiac single photon emission computed tomography (SPECT) myocardial perfusion images (MPI) and cardiac computed tomography angiograms (CTA). Firstly, the left ventricle (LV) epicardial regions (LVERs) in SPECT and CTA images were segmented by different U-Net neural networks trained to generate the point clouds of the LV epicardial contours (LVECs). Secondly, according to the characteristics of cardiac anatomy, the special points of the anterior and posterior interventricular grooves (APIGs) were manually marked in both SPECT and CTA image volumes. Thirdly, we developed an in-house program for coarsely registering the special points of the APIGs to ensure a correct cardiac orientation alignment between SPECT and CTA images. Fourthly, we employed the ICP, SICP, or CPD algorithm to achieve a fine registration of the point clouds (together with the special points of the APIGs) of the LV epicardial surfaces in SPECT and CTA images. Finally, the image fusion between SPECT and CTA was realized after the fine registration. The experimental results showed that the cardiac orientation was aligned well and that the mean distance error of the optimal registration method (CPD with affine transform) was consistently less than 3 mm. The proposed method can effectively fuse structures from cardiac CTA and SPECT functional images, and demonstrates potential for assisting in the accurate diagnosis of cardiac diseases by combining the complementary advantages of the two imaging modalities.
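The fine registration relies on standard point-cloud algorithms (ICP, SICP, CPD). The closed-form rigid alignment that sits inside each ICP iteration, for point sets whose correspondences are already known, is the Kabsch/SVD solution, sketched here on synthetic data (not the paper's implementation):

```python
import numpy as np

def rigid_align(P, Q):
    """Best-fit rotation R and translation t mapping points P onto Q (Kabsch)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                  # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    D = np.diag([1.0] * (P.shape[1] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

# recover a known rotation and translation of a toy 3D point cloud
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])
R, t = rigid_align(P, Q)
```

A full ICP loop would alternate this closed-form solve with a nearest-neighbour correspondence step until the alignment converges; CPD with an affine transform, as used in the paper, generalizes the motion model beyond rigid.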
https://arxiv.org/abs/2402.06841
In order to optimize radiotherapy delivery for cancer treatment, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI), accurate contouring of the Planning Target Volume (PTV) is crucial. Unfortunately, relying on manual contouring for such treatments is time-consuming and prone to errors. In this paper, we investigate the application of Deep Learning (DL) to automate the segmentation of the PTV in TMLI treatment, building upon previous work that introduced a solution to this problem based on a 2D U-Net model. We extend the previous research (i) by employing the nnU-Net framework to develop both 2D and 3D U-Net models and (ii) by evaluating the trained models on the PTV with the exclusion of bones, which consists mainly of lymph nodes and represents the most challenging region of the target volume to segment. Our results show that the introduction of the nnU-Net framework led to a statistically significant improvement in segmentation performance. In addition, the analysis of the PTV after the exclusion of bones showed that the models are quite robust even on the most challenging areas of the target volume. Overall, our study is a significant step forward in the application of DL to a complex radiotherapy treatment such as TMLI, offering a viable and scalable solution to increase the number of patients who can benefit from this treatment.
https://arxiv.org/abs/2402.06494
A transversal study of the pitch variability of parkinsonian voices in read speech is presented. 30 patients suffering from Parkinson's disease (PD) and 32 healthy speakers were recorded while reading a text without voiceless phonemes. The fundamental frequency contours were calculated from the recordings, and the following measures were used for describing them: mean, minimum, maximum, and standard deviation of the estimated fundamental frequencies. Results based on these measures indicate that the influence of PD on some aspects of intonation can be masked by the effects of aging, especially for male voices. However, some parameters such as the relative fundamental frequency range exhibit lower correlations with age than with PD stage, as evaluated using the Hoehn and Yahr scale. These correlations between relative fundamental frequency range and PD stage reach moderate-to-high values in the case of women. Additionally, three parameters describing the form of the fundamental frequency modulation spectrum were investigated for correlation with age and PD stage. The study of this modulation spectrum provides some insight into the ability of the speakers to plan the intonation of full phrases. For both male and female populations, significant correlations were found between parameters obtained from the modulation spectrum of fundamental frequency and the PD stage. Nevertheless, the quantitative assessment of the performance of regression models built from these modulation parameters and fundamental frequency range suggests that such measures are likely to be of limited value in the early diagnosis of PD due to inter-speaker variability.
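The descriptive measures above are simple to reproduce once a fundamental frequency contour is available. A sketch, assuming unvoiced frames are marked with zeros by an external pitch tracker (the paper's own F0 estimation and relative-range definition may differ in detail):

```python
import numpy as np

def f0_statistics(f0):
    """Descriptive pitch measures from an F0 contour in Hz; 0 marks unvoiced frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]                    # discard unvoiced frames
    stats = {
        "mean": voiced.mean(),
        "min": voiced.min(),
        "max": voiced.max(),
        "std": voiced.std(),
    }
    # relative F0 range: a speaker-normalized measure of pitch variability
    stats["relative_range"] = (stats["max"] - stats["min"]) / stats["mean"]
    return stats

# hypothetical frame-wise F0 contour for a short utterance
contour = [0, 0, 110, 115, 120, 118, 0, 95, 100, 0]
s = f0_statistics(contour)
```

Normalizing the range by the mean, as in the last line, is one way to make the measure comparable across speakers with different baseline pitch, which matters when separating ageing effects from PD effects.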
https://arxiv.org/abs/2402.06387
The newly released Segment Anything Model (SAM) is a popular tool used in image processing due to its superior segmentation accuracy, variety of input prompts, training capabilities, and efficient model design. However, its current model is trained on a diverse dataset not tailored to medical images, particularly ultrasound images. Ultrasound images tend to have a lot of noise, making it difficult to segment out important structures. In this project, we developed ClickSAM, which fine-tunes the Segment Anything Model using click prompts for ultrasound images. ClickSAM has two stages of training: the first stage is trained on single-click prompts centered in the ground-truth contours, and the second stage focuses on improving model performance through additional positive and negative click prompts. By comparing the first-stage predictions to the ground-truth masks, true positive, false positive, and false negative segments are calculated. Positive clicks are generated from the true positive and false negative segments, and negative clicks are generated from the false positive segments. The Centroidal Voronoi Tessellation algorithm is then employed to collect positive and negative click prompts in each segment, which are used to enhance model performance during the second stage of training. With this click-based training method, ClickSAM exhibits superior performance compared to other existing models for ultrasound image segmentation.
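Comparing first-stage predictions against ground truth reduces to elementwise set operations on binary masks. A sketch of computing the three regions from which clicks are drawn (the centroidal Voronoi tessellation sampling itself is not reproduced here):

```python
import numpy as np

def click_regions(pred, gt):
    """Split a binary prediction vs. ground truth into TP/FP/FN masks.

    Positive clicks are sampled from the TP and FN regions, negative
    clicks from the FP region.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = pred & gt       # correctly segmented: reinforce with positive clicks
    fn = ~pred & gt      # missed structure: positive clicks
    fp = pred & ~gt      # over-segmentation: negative clicks
    return tp, fp, fn

pred = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 0]])
gt   = np.array([[1, 0, 0],
                 [1, 1, 0],
                 [0, 0, 0]])
tp, fp, fn = click_regions(pred, gt)
```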
https://arxiv.org/abs/2402.05902
In this paper, a new system based on the combination of a shape descriptor and a contour descriptor is proposed for classifying inserts in milling processes according to their wear level, following a computer vision-based approach. To describe the shape of the wear region, we propose a new descriptor called ShapeFeat, and its contour is characterized using the method BORCHIZ, which, to the best of our knowledge, achieves the best performance for tool wear monitoring following a computer vision-based approach. Results show that the combination of BORCHIZ with ShapeFeat using a late-fusion method improves the classification performance significantly, obtaining an accuracy of 91.44% in the binary classification (i.e. classifying the wear as high or low) and 82.90% using three target classes (i.e. classifying the wear as high, medium or low). These results outperform those obtained by either descriptor used on its own, which achieve accuracies of 88.70% and 80.67% for two and three classes, respectively, using ShapeFeat, and 87.06% and 80.24% with BORCHIZ. This study yields encouraging results for the manufacturing community towards automatically classifying milling inserts in terms of their wear.
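The abstract does not specify the exact late-fusion rule, so purely as an illustration, here is one common choice: weighted averaging of the per-class scores produced by the two descriptor-based classifiers. All numbers are hypothetical.

```python
import numpy as np

def late_fusion(p_shape, p_contour, w=0.5):
    """Combine per-class probabilities from two classifiers by weighted averaging."""
    p = w * np.asarray(p_shape) + (1 - w) * np.asarray(p_contour)
    return p / p.sum(axis=-1, keepdims=True)  # renormalize to a distribution

# hypothetical per-class scores for one insert (high / medium / low wear)
p_shape   = np.array([0.70, 0.20, 0.10])   # ShapeFeat-based classifier
p_contour = np.array([0.40, 0.45, 0.15])   # contour-based classifier
fused = late_fusion(p_shape, p_contour)
pred_class = int(np.argmax(fused))
```

Late fusion of this kind keeps the two classifiers independent and combines them only at decision level, which is why each descriptor can also be evaluated on its own, as reported above.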
https://arxiv.org/abs/2402.05978
While free-hand sketching has long served as an efficient representation to convey the characteristics of an object, sketches are often subjective, deviating significantly from realistic representations. Moreover, sketches are not consistent across arbitrary viewpoints, making it hard to capture 3D shapes. We propose 3Doodle, which generates descriptive and view-consistent sketch images given multi-view images of the target object. Our method is based on the idea that a set of 3D strokes can efficiently represent 3D structural information and render view-consistent 2D sketches. We express 2D sketches as a union of view-independent and view-dependent components. 3D cubic Bézier curves indicate view-independent 3D feature lines, while contours of superquadrics express a smooth outline of the volume across varying viewpoints. Our pipeline directly optimizes the parameters of 3D stroke primitives to minimize perceptual losses in a fully differentiable manner. The resulting sparse set of 3D strokes can be rendered as abstract sketches containing the essential 3D characteristic shapes of various objects. We demonstrate that 3Doodle can faithfully express concepts of the original images compared with recent sketch generation approaches.
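The view-independent strokes are cubic Bézier curves, which are straightforward to evaluate in Bernstein form. A sketch sampling one 3D stroke from four control points (the differentiable rendering and perceptual-loss optimization are of course not reproduced):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve at parameters t in [0, 1] (Bernstein form)."""
    t = np.asarray(t, dtype=float)[:, None]
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

# one hypothetical 3D stroke defined by four control points, sampled 5 times
p0, p1, p2, p3 = map(np.array, ([0, 0, 0], [1, 2, 0], [2, -1, 1], [3, 0, 1]))
pts = cubic_bezier(p0, p1, p2, p3, np.linspace(0, 1, 5))
```

Since the curve is a polynomial in its control points, gradients of a rendering loss flow back to them directly, which is what makes the fully differentiable stroke optimization possible.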
https://arxiv.org/abs/2402.03690
Recently, there have been significant advancements in image restoration based on CNNs and transformers. However, the inherent characteristics of the image restoration task are often overlooked in many works. These works often focus on basic block design and stack numerous basic blocks into the model, leading to redundant parameters and unnecessary computations and hindering the efficiency of the image restoration. In this paper, we propose a Lightweight Image Restoration network called LIR to efficiently remove degradations (blur, rain, noise, haze, etc.). A key component in LIR is the Efficient Adaptive Attention (EAA) Block, which is mainly composed of Adaptive Filters and Attention Blocks. It is capable of adaptively sharpening contours, removing degradations, and capturing global information across various image restoration scenes in an efficient and computation-friendly manner. In addition, through a simple structural design, LIR addresses the degradations existing in the local and global residual connections that are ignored by modern networks. Extensive experiments demonstrate that our LIR achieves performance comparable to state-of-the-art networks on most benchmarks with fewer parameters and computations. It is worth noting that our LIR produces visual results that are more in line with human aesthetics than those of state-of-the-art networks.
https://arxiv.org/abs/2402.01368
Generating medical images from human-drawn free-hand sketches holds promise for various important medical imaging applications. Due to the extreme difficulty of collecting free-hand sketch data in the medical domain, most deep learning-based methods have been proposed to generate medical images from synthesized sketches (e.g., edge maps or contours of segmentation masks from real images). However, these models often fail to generalize to free-hand sketches, leading to unsatisfactory results. In this paper, we propose a practical free-hand sketch-to-image generation model called Sketch2MedI that learns to represent sketches in StyleGAN's latent space and generate medical images from it. Thanks to the ability to encode sketches into this meaningful representation space, Sketch2MedI only requires synthesized sketches for training, enabling a cost-effective learning process. Our Sketch2MedI demonstrates robust generalization to free-hand sketches, resulting in high-quality and realistic medical image generation. Comparative evaluations of Sketch2MedI against the pix2pix, CycleGAN, UNIT, and U-GAT-IT models show superior performance in generating pharyngeal images, both quantitatively and qualitatively, across various metrics.
https://arxiv.org/abs/2402.00353
Monitoring the distribution and size structure of long-living shrubs, such as Juniperus communis, can be used to estimate the long-term effects of climate change on high-mountain and high-latitude ecosystems. Historical very-high-resolution aerial imagery offers a retrospective tool to monitor shrub growth and distribution at high precision. Currently, deep learning models provide impressive results for detecting and delineating the contours of objects with well-defined shapes. However, adapting these models to detect natural objects that express complex growth patterns, such as junipers, is still a challenging task. This research presents a novel approach that leverages remotely sensed RGB imagery in conjunction with Mask R-CNN-based instance segmentation models to individually delineate Juniperus shrubs above the treeline in Sierra Nevada (Spain). In this study, we propose a new data construction design that consists of using photo-interpreted (PI) and field work (FW) data to respectively develop and externally validate the model. We also propose a new shrub-tailored evaluation algorithm based on a new metric called Multiple Intersections over Ground Truth Area (MIoGTA) to assess and optimize the model's shrub delineation performance. Finally, we deploy the developed model for the first time to generate a wall-to-wall map of Juniperus individuals. The experimental results demonstrate the efficiency of our dual data construction approach in overcoming the limitations associated with traditional field survey methods. They also highlight the robustness of the MIoGTA metric in evaluating instance segmentation models on species with complex growth patterns, showing more resilience against data annotation uncertainty. Furthermore, they show the effectiveness of employing Mask R-CNN with a ResNet101-C4 backbone in delineating PI and FW shrubs, achieving F1-scores of 87.87% and 76.86%, respectively.
https://arxiv.org/abs/2401.17985
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech by improving its intelligibility and naturalness. This is a challenging task, especially for patients with severe dysarthria speaking in complex, noisy acoustic environments. To address these challenges, we propose a novel multi-modal framework that utilizes visual information, e.g., lip movements, in DSR as extra clues for reconstructing highly abnormal pronunciations. The multi-modal framework consists of: (i) a multi-modal encoder to extract robust phoneme embeddings from dysarthric speech with auxiliary visual features; (ii) a variance adaptor to infer the normal phoneme duration and pitch contour from the extracted phoneme embeddings; (iii) a speaker encoder to encode the speaker's voice characteristics; and (iv) a mel-decoder to generate the reconstructed mel-spectrogram based on the extracted phoneme embeddings, prosodic features, and speaker embeddings. Both objective and subjective evaluations conducted on the commonly used UASpeech corpus show that our proposed approach achieves significant improvements over baseline systems in terms of speech intelligibility and naturalness, especially for speakers with more severe symptoms. Compared with the original dysarthric speech, the reconstructed speech achieves a 42.1% absolute word error rate reduction for patients with more severe dysarthria levels.
https://arxiv.org/abs/2401.17796
Diffusion-based text-to-image personalization has achieved great success in generating user-specified subjects in various contexts. However, existing finetuning-based methods still suffer from model overfitting, which greatly harms generative diversity, especially when the given subject images are few. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. For the former, we construct an appearance palette with visual features from the reference image, from which we pick local patterns for generating the specified subject with consistent identity. For layout drawing, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit its strong image prior to synthesize diverse contexts according to different text conditions. The proposed approach can be applied to any personalized diffusion model and requires as few as a single reference image. Qualitative and quantitative experiments show that Pick-and-Draw consistently improves identity consistency and generative diversity, pushing the trade-off between subject fidelity and image-text fidelity to a new Pareto frontier.
https://arxiv.org/abs/2401.16762
Augmented reality for laparoscopic liver resection is a visualisation mode that allows a surgeon to localise tumours and vessels embedded within the liver by projecting them on top of a laparoscopic image. Preoperative 3D models extracted from CT or MRI data are registered to the intraoperative laparoscopic images during this process. In terms of 3D-2D fusion, most of the algorithms make use of anatomical landmarks to guide registration. These landmarks include the liver's inferior ridge, the falciform ligament, and the occluding contours. They are usually marked by hand in both the laparoscopic image and the 3D model, which is time-consuming and may contain errors if done by a non-experienced user. Therefore, there is a need to automate this process so that augmented reality can be used effectively in the operating room. We present the Preoperative-to-Intraoperative Laparoscopic Fusion Challenge (P2ILF), held during the Medical Imaging and Computer Assisted Interventions (MICCAI 2022) conference, which investigates the possibilities of detecting these landmarks automatically and using them in registration. The challenge was divided into two tasks: 1) A 2D and 3D landmark detection task and 2) a 3D-2D registration task. The teams were provided with training data consisting of 167 laparoscopic images and 9 preoperative 3D models from 9 patients, with the corresponding 2D and 3D landmark annotations. A total of 6 teams from 4 countries participated, whose proposed methods were evaluated on 16 images and two preoperative 3D models from two patients. All the teams proposed deep learning-based methods for the 2D and 3D landmark segmentation tasks and differentiable rendering-based methods for the registration task. Based on the experimental outcomes, we propose three key hypotheses that determine current limitations and future directions for research in this domain.
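A reduced version of the registration task can be sketched as follows. This is a toy stand-in, not any team's method: the differentiable-rendering pipelines are replaced by plain gradient descent on a 2D rigid transform that aligns projected model landmarks to detected image landmarks; all function names and hyperparameters are illustrative.

```python
import numpy as np

def register_2d(model_pts, target_pts, steps=500, lr=0.1):
    """Toy 2D rigid registration: fit rotation angle theta and
    translation t minimizing mean squared landmark distance.
    model_pts, target_pts: (N, 2) corresponding landmark sets."""
    theta, t = 0.0, np.zeros(2)
    for _ in range(steps):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        pred = model_pts @ R.T + t
        diff = pred - target_pts            # per-landmark residuals
        # analytic gradients of the mean squared landmark error
        dR = np.array([[-s, -c], [c, -s]])  # dR/dtheta
        g_theta = 2 * np.mean(np.sum(diff * (model_pts @ dR.T), axis=1))
        g_t = 2 * diff.mean(axis=0)
        theta -= lr * g_theta
        t -= lr * g_t
    return theta, t
```

The real 3D-2D problem additionally involves camera projection and deformation, but the structure is the same: a differentiable landmark error driven to a minimum over the transform parameters.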
https://arxiv.org/abs/2401.15753
The projected belief network (PBN) is a generative stochastic network with a tractable likelihood function based on a feed-forward neural network (FFNN). The generative function operates by "backing up" through the FFNN. The PBN is two networks in one: an FFNN that operates in the forward direction, and a generative network that operates in the backward direction. Both networks co-exist on the same parameter set, have their own cost functions, and can be trained separately or jointly. The PBN therefore has the potential to possess the best qualities of both discriminative and generative classifiers. To realize this potential, a separate PBN is trained on each class, maximizing the generative likelihood function for the given class while minimizing the discriminative cost of the FFNN against "all other classes". This technique, called discriminative alignment (PBN-DA), aligns the contours of the likelihood function to the decision boundaries and attains vastly improved classification performance, rivaling that of state-of-the-art discriminative networks. The method may be further improved by using a hidden Markov model (HMM) as a component of the PBN, called PBN-DA-HMM. This paper provides a comprehensive treatment of PBN, PBN-DA, and PBN-DA-HMM. In addition, the results of two new classification experiments are provided. The first experiment uses air-acoustic events, and the second uses underwater acoustic data consisting of marine mammal calls. In both experiments, PBN-DA-HMM attains performance comparable to or better than a state-of-the-art CNN, and attains a factor-of-two error reduction when combined with the CNN.
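The discriminative-alignment objective described above combines two terms per class. The sketch below shows that combination in numpy under stated assumptions: the PBN's generative log-likelihood and the FFNN's class scores are taken as given inputs, and the weighting `lam` and function name are hypothetical.

```python
import numpy as np

def pbn_da_objective(log_lik_own, ffnn_scores, own_class, lam=1.0):
    """Discriminative alignment (PBN-DA) per-class objective:
    maximize the generative likelihood of the class's PBN while
    minimizing the FFNN's discriminative cost against all other classes.
    log_lik_own: (N,) generative log-likelihoods of class samples
    ffnn_scores: (N, C) forward-direction class logits."""
    # numerically stable log-softmax over the C classes
    z = ffnn_scores - ffnn_scores.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    disc_cost = -log_p[:, own_class].mean()  # cross-entropy vs. other classes
    gen_cost = -log_lik_own.mean()           # negative generative log-likelihood
    return gen_cost + lam * disc_cost
```

Minimizing this joint cost is what pulls the likelihood contours toward the decision boundaries; with `lam=0` it reduces to ordinary per-class generative training.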
https://arxiv.org/abs/2401.11199