Ultrasound video-based breast lesion segmentation provides valuable assistance in early breast lesion detection and treatment. However, existing works mainly focus on lesion segmentation from ultrasound breast images, and these methods usually cannot be adapted to obtain desirable results on ultrasound videos. The main challenge in ultrasound video-based breast lesion segmentation is how to exploit intra-frame and inter-frame lesion cues simultaneously. To address this problem, we propose a novel Spatial-Temporal Progressive Fusion Network (STPFNet) for video-based breast lesion segmentation. The main aspects of the proposed STPFNet are threefold. First, we adopt a unified network architecture to capture both spatial dependencies within each ultrasound frame and temporal correlations between different frames for ultrasound data representation. Second, we propose a new fusion module, termed Multi-Scale Feature Fusion (MSFF), to fuse spatial and temporal cues for lesion detection. MSFF helps determine the boundary contour of the lesion region and thus overcomes the issue of lesion boundary blurring. Third, we exploit the segmentation result of the previous frame as prior knowledge to suppress the noisy background and learn a more robust representation. In addition, we introduce a new publicly available ultrasound video breast lesion segmentation dataset, termed UVBLS200, which is specifically dedicated to breast lesion segmentation. It contains 200 videos: 80 videos of benign lesions and 120 videos of malignant lesions. Experiments on the proposed dataset demonstrate that STPFNet achieves better breast lesion detection performance than state-of-the-art methods.
https://arxiv.org/abs/2403.11699
Recently, circle representation has been introduced for medical imaging, designed specifically to enhance the detection of instance objects that are spherically shaped (e.g., cells, glomeruli, and nuclei). Given its outstanding effectiveness in instance detection, it is compelling to consider the application of circle representation for segmenting instance medical objects. In this study, we introduce CircleSnake, a simple end-to-end segmentation approach that utilizes circle contour deformation for segmenting ball-shaped medical objects at the instance level. The innovation of CircleSnake lies in these three areas: (1) It substitutes the complex bounding box-to-octagon contour transformation with a more consistent and rotation-invariant bounding circle-to-circle contour adaptation. This adaptation specifically targets ball-shaped medical objects. (2) The circle representation employed in CircleSnake significantly reduces the degrees of freedom to two, compared to eight in the octagon representation. This reduction enhances both the robustness of the segmentation performance and the rotational consistency of the method. (3) CircleSnake is the first end-to-end deep instance segmentation pipeline to incorporate circle representation, encompassing consistent circle detection, circle contour proposal, and circular convolution in a unified framework. This integration is achieved through the novel application of circular graph convolution within the context of circle detection and instance segmentation. In practical applications, such as the detection of glomeruli, nuclei, and eosinophils in pathological images, CircleSnake has demonstrated superior performance and greater rotation invariance when compared to benchmarks. The code has been made publicly available: this https URL.
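As a rough illustration of the circle representation described above: a circle contour proposal is fully specified by a detected center and a radius, which is what makes it rotation-invariant as a point set. The sketch below is illustrative only; the sampling resolution and function names are assumptions, not from the CircleSnake paper.

```python
import math

def circle_contour(cx, cy, r, n_points=32):
    """Sample n_points along a circle contour from a center (cx, cy) and
    radius r. Only the radius (plus the detected center) parameterizes the
    contour, versus eight extreme-point offsets for an octagon proposal."""
    return [(cx + r * math.cos(2 * math.pi * k / n_points),
             cy + r * math.sin(2 * math.pi * k / n_points))
            for k in range(n_points)]

# A circle proposal is invariant to rotation about its center: rotating the
# underlying image leaves the proposed contour (as a point set) unchanged.
pts = circle_contour(10.0, 10.0, 5.0)
```

In the full pipeline, such an initial contour would then be deformed by the learned circular convolution offsets.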
https://arxiv.org/abs/2403.11507
Accurately translating medical images across different modalities (e.g., CT to MRI) has numerous downstream clinical and machine learning applications. While several methods have been proposed to achieve this, they often prioritize perceptual quality with respect to output domain features over preserving anatomical fidelity. However, maintaining anatomy during translation is essential for many tasks, e.g., when leveraging masks from the input domain to develop a segmentation model with images translated to the output domain. To address these challenges, we propose ContourDiff, a novel framework that leverages domain-invariant anatomical contour representations of images. These representations are simple to extract from images, yet form precise spatial constraints on their anatomical content. We introduce a diffusion model that converts contour representations of images from arbitrary input domains into images in the output domain of interest. By applying the contour as a constraint at every diffusion sampling step, we ensure the preservation of anatomical content. We evaluate our method by training a segmentation model on images translated from CT to MRI with their original CT masks and testing its performance on real MRIs. Our method outperforms other unpaired image translation methods by a significant margin, without requiring access to any input-domain information during training.
https://arxiv.org/abs/2403.10786
Remote photoplethysmography (rPPG) extracts blood volume pulse (BVP) signals from subtle pixel changes in video frames. This study introduces rFaceNet, an advanced rPPG method that enhances the extraction of facial BVP signals with a focus on facial contours. rFaceNet integrates identity-specific facial contour information and eliminates redundant data. It efficiently extracts facial contours from temporally normalized frame inputs through a Temporal Compressor Unit (TCU) and steers the model's focus to relevant facial regions by using the Cross-Task Feature Combiner (CTFC). Through elaborate training, the quality and interpretability of the facial physiological signals extracted by rFaceNet are greatly improved compared to previous methods. Moreover, our novel approach demonstrates superior performance to SOTA methods on various heart rate estimation benchmarks.
https://arxiv.org/abs/2403.09034
Due to the influence of imaging equipment and complex imaging environments, most images in daily life exhibit intensity inhomogeneity and noise. Therefore, many scholars have designed image segmentation algorithms to address these issues. Among them, the active contour model is one of the most effective. This paper proposes an active contour model driven by a hybrid signed pressure function that combines global and local information. First, a new global region-based signed pressure function is introduced by combining the average intensities of the regions inside and outside the curve with the median intensity of the region inside the evolution curve. Then, the energy differences between the regions inside and outside the curve in a local neighborhood are used to design the signed pressure function of the local term. The two SPF functions are combined to obtain a new signed pressure function, from which the evolution equation of the new model is derived. Finally, experiments and numerical analysis show that the model has excellent segmentation performance on both intensity-inhomogeneous and noisy images.
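A minimal sketch of a global region-based SPF in the spirit described above. The classic SBGFRLS construction uses the two region means; here the inside-region median is blended in with equal weight, which is an assumption, since the abstract does not give the exact formula.

```python
import numpy as np

def global_spf(image, inside_mask):
    """Global region-based signed pressure function (SPF), sketched.

    The classic SBGFRLS SPF compares each pixel against the average of the
    mean intensities inside (c1) and outside (c2) the evolving curve; here
    the median of the inside region is blended in as well, since the
    abstract combines all three cues (the paper's exact weighting is not
    stated, so an even blend is assumed).
    """
    c1 = image[inside_mask].mean()            # mean intensity inside the curve
    c2 = image[~inside_mask].mean()           # mean intensity outside the curve
    m1 = np.median(image[inside_mask])        # median intensity inside the curve
    ref = (c1 + c2 + m1) / 3.0                # assumed blend of the three cues
    spf = image - ref
    return spf / (np.abs(spf).max() + 1e-12)  # normalize to [-1, 1]
```

In a full model, the sign of this function drives the level-set curve to expand inside the object and shrink outside it.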
https://arxiv.org/abs/2403.07570
In clinical practice, medical image segmentation provides useful information on the contours and dimensions of target organs or tissues, facilitating improved diagnosis, analysis, and treatment. In the past few years, convolutional neural networks (CNNs) and Transformers have dominated this area, but they still suffer from either limited receptive fields or costly long-range modeling. Mamba, a State Space Sequence Model (SSM), recently emerged as a promising paradigm for long-range dependency modeling with linear complexity. In this paper, we introduce a Large Window-based Mamba U-shape Network, or LMa-UNet, for 2D and 3D medical image segmentation. A distinguishing feature of LMa-UNet is its utilization of large windows, excelling in local spatial modeling compared to small-kernel-based CNNs and small-window-based Transformers, while maintaining superior efficiency in global modeling compared to self-attention with quadratic complexity. Additionally, we design a novel hierarchical and bidirectional Mamba block to further enhance the global and neighborhood spatial modeling capability of Mamba. Comprehensive experiments demonstrate the effectiveness and efficiency of our method and the feasibility of using large window sizes to achieve large receptive fields. Codes are available at this https URL.
https://arxiv.org/abs/2403.07332
In this paper, we propose a new framework for improving Content-Based Image Retrieval (CBIR) for texture images. This is achieved by using a new image representation based on the RCT-Plus transform, a novel variant of the Redundant Contourlet transform that extracts richer directional information from the image. Moreover, the image search process is improved through a learning-based approach in which the database images are classified using a similarity metric adapted to the statistical modeling of the RCT-Plus transform. A query is first classified to select the best texture class, after which the retained class images are ranked to select the top matches. With this scheme, we achieve significant improvements in retrieval rates compared to previous CBIR schemes.
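The classify-then-rank search described above can be sketched as follows. The feature vectors stand in for RCT-Plus statistical descriptors, and cosine similarity stands in for the paper's adapted metric; both are assumptions for illustration only.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def retrieve(query_feat, database, class_centroids, top_k=3):
    """Two-stage texture search: (1) assign the query to its best texture
    class, then (2) rank only that class's images and return the top-k.
    `database` maps image name -> (class label, feature vector)."""
    best_class = max(class_centroids,
                     key=lambda c: cosine(query_feat, class_centroids[c]))
    candidates = [(name, cosine(query_feat, feat))
                  for name, (cls, feat) in database.items() if cls == best_class]
    ranked = [name for name, _ in sorted(candidates, key=lambda t: -t[1])]
    return best_class, ranked[:top_k]
```

Restricting the ranking stage to one class is what cuts the search cost relative to ranking the whole database.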
https://arxiv.org/abs/2403.06048
Acoustic-to-articulatory inversion (AAI) converts audio into articulator movements, such as ultrasound tongue imaging (UTI) data. An issue with existing AAI methods is that they use only personalized acoustic information to derive general patterns of tongue motion, so the quality of the generated UTI data is limited. To address this issue, this paper proposes an audio-textual diffusion model for the UTI data generation task. In this model, the inherent acoustic characteristics of individuals, which relate to tongue motion details, are encoded with wav2vec 2.0, while the ASR transcriptions, which relate to the universality of tongue motions, are encoded with BERT. UTI data are then generated by a diffusion module. Experimental results showed that the proposed diffusion model can generate high-quality UTI data with clear tongue contours, which is crucial for linguistic analysis and clinical assessment. The project can be found on the website\footnote{this https URL}.
https://arxiv.org/abs/2403.05820
ControlNet excels at creating content that closely matches the precise contours in user-provided masks. However, when these masks contain noise, a frequent occurrence with non-expert users, the output includes unwanted artifacts. This paper first highlights, through in-depth analysis, the crucial role of controlling the impact of these inexplicit masks across diverse deterioration levels. Subsequently, to enhance controllability with inexplicit masks, an advanced Shape-aware ControlNet consisting of a deterioration estimator and a shape-prior modulation block is devised. The deterioration estimator assesses the deterioration factor of the provided masks. This factor is then utilized in the modulation block to adaptively modulate the model's contour-following ability, which helps it dismiss the noisy parts of the inexplicit masks. Extensive experiments prove its effectiveness in encouraging ControlNet to interpret inaccurate spatial conditions robustly rather than blindly following the given contours. We showcase application scenarios such as modifying shape priors and composable shape-controllable generation. Code will be available soon.
https://arxiv.org/abs/2403.00467
The development and progression of arthritis is strongly associated with osteophytes, which are small and elusive bone growths. This paper presents one of the first efforts towards automated spinal osteophyte detection in spinal X-rays. A novel automated patch extraction process, called SegPatch, is proposed based on deep-learning-driven vertebra segmentation and the enlargement of mask contours. A final patch classification accuracy of 84.5% is secured, surpassing a baseline tiling-based patch generation technique by 9.5%. This demonstrates that even with limited annotations, SegPatch can deliver superior performance for the detection of tiny structures such as osteophytes. The proposed approach has the potential to assist clinicians in expediting the process of manually identifying osteophytes in spinal X-rays.
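The mask-enlargement step described above can be sketched minimally: grow the segmentation mask by a margin so that bone contours, where osteophytes sit, stay inside the extracted patch. The brute-force dilation and the margin value are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def enlarge_mask(mask, margin):
    """Enlarge a binary mask by `margin` pixels via brute-force dilation,
    a stand-in for the contour-enlargement step. Note np.roll wraps around
    image borders, so this sketch assumes the mask sits away from the edge."""
    out = mask.copy()
    for dy in range(-margin, margin + 1):
        for dx in range(-margin, margin + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def extract_patch(image, mask, margin=2):
    """Crop the image to the bounding box of the enlarged mask, so bone
    contours (where osteophytes appear) remain inside the patch."""
    grown = enlarge_mask(mask, margin)
    ys, xs = np.nonzero(grown)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```

Unlike blind tiling, such mask-guided patches are anchored to the vertebra boundary, which is the intuition behind the reported accuracy gain.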
https://arxiv.org/abs/2402.19263
Event cameras can record scene dynamics with high temporal resolution, providing rich scene details for monocular depth estimation (MDE) even under low illumination. Therefore, existing complementary learning approaches for MDE fuse intensity information from images and scene details from event data for better scene understanding. However, most methods directly fuse the two modalities at the pixel level, ignoring that the attractive complementarity mainly impacts high-level patterns that occupy only a few pixels. For example, event data is likely to complement the contours of scene objects. In this paper, we discretize the scene into a set of high-level patterns to explore this complementarity and propose a Pattern-based Complementary learning architecture for monocular Depth estimation (PCDepth). Concretely, PCDepth comprises two primary components: a complementary visual representation learning module that discretizes the scene into high-level patterns and integrates complementary patterns across modalities, and a refined depth estimator aimed at scene reconstruction and depth prediction while maintaining an efficiency-accuracy balance. Through pattern-based complementary learning, PCDepth fully exploits the two modalities and achieves more accurate predictions than existing methods, especially in challenging nighttime scenarios. Extensive experiments on the MVSEC and DSEC datasets verify the effectiveness and superiority of PCDepth. Remarkably, compared with the state of the art, PCDepth achieves a 37.9% accuracy improvement in MVSEC nighttime scenarios.
https://arxiv.org/abs/2402.18925
The F1TENTH autonomous racing platform, consisting of 1:10 scale RC cars, has evolved into a leading research platform. The many publications and real-world competitions span many domains, from classical path planning to novel learning-based algorithms. Consequently, the field is wide and disjointed, hindering direct comparison of methods and making it difficult to assess the state-of-the-art. Therefore, we aim to unify the field by surveying current approaches, describing common methods and providing benchmark results to facilitate clear comparison and establish a baseline for future work. We survey current work in F1TENTH racing in the classical and learning categories, explaining the different solution approaches. We describe particle filter localisation, trajectory optimisation and tracking, model predictive contouring control (MPCC), follow-the-gap and end-to-end reinforcement learning. We provide an open-source evaluation of benchmark methods and investigate overlooked factors of control frequency and localisation accuracy for classical methods and reward signal and training map for learning methods. The evaluation shows that the optimisation and tracking method achieves the fastest lap times, followed by the MPCC planner. Finally, our work identifies and outlines the relevant research aspects to help motivate future work in the F1TENTH domain.
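Of the classical methods surveyed above, follow-the-gap is the most compact to sketch: mask a safety bubble around the closest lidar return, find the widest run of far-enough beams, and steer toward its center. Parameter values below are illustrative, not tuned for a real F1TENTH car.

```python
def follow_the_gap(ranges, angles, safe_dist=1.5, bubble=2):
    """Minimal follow-the-gap sketch over a 1D lidar scan.
    `ranges` are beam distances, `angles` the matching steering angles."""
    r = list(ranges)
    i_min = min(range(len(r)), key=lambda i: r[i])
    for i in range(max(0, i_min - bubble), min(len(r), i_min + bubble + 1)):
        r[i] = 0.0  # carve out the safety bubble around the closest obstacle
    # scan for the longest contiguous gap of far-enough beams
    best_len, best_span, cur_len, cur_start = 0, (0, 0), 0, 0
    for i, d in enumerate(r + [0.0]):  # trailing sentinel closes a final gap
        if d > safe_dist:
            if cur_len == 0:
                cur_start = i
            cur_len += 1
            if cur_len > best_len:
                best_len, best_span = cur_len, (cur_start, i)
        else:
            cur_len = 0
    lo, hi = best_span
    return angles[(lo + hi) // 2]  # steering angle toward the gap center
```

Its appeal as a baseline is that it is reactive and map-free, in contrast to the optimisation-and-tracking pipeline that achieves the fastest laps in the survey's evaluation.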
https://arxiv.org/abs/2402.18558
Natural image matting aims to estimate the alpha matte of the foreground from a given image. Various approaches have been explored to address this problem, such as interactive matting methods that use guidance such as click or trimap, and automatic matting methods tailored to specific objects. However, existing matting methods are designed for specific objects or guidance, neglecting the common requirement of aggregating global and local contexts in image matting. As a result, these methods often encounter challenges in accurately identifying the foreground and generating precise boundaries, which limits their effectiveness in unforeseen scenarios. In this paper, we propose a simple and universal matting framework, named Dual-Context Aggregation Matting (DCAM), which enables robust image matting with arbitrary guidance or without guidance. Specifically, DCAM first adopts a semantic backbone network to extract low-level features and context features from the input image and guidance. Then, we introduce a dual-context aggregation network that incorporates global object aggregators and local appearance aggregators to iteratively refine the extracted context features. By performing both global contour segmentation and local boundary refinement, DCAM exhibits robustness to diverse types of guidance and objects. Finally, we adopt a matting decoder network to fuse the low-level features and the refined context features for alpha matte estimation. Experimental results on five matting datasets demonstrate that the proposed DCAM outperforms state-of-the-art matting methods in both automatic matting and interactive matting tasks, which highlights the strong universality and high performance of DCAM. The source code is available at \url{this https URL}.
https://arxiv.org/abs/2402.18109
Personalization techniques for large text-to-image (T2I) models allow users to incorporate new concepts from reference images. However, existing methods primarily rely on textual descriptions, leading to limited control over customized images and failing to support fine-grained and local editing (e.g., shape, pose, and details). In this paper, we identify sketches as an intuitive and versatile representation that can facilitate such control, e.g., contour lines capturing shape information and flow lines representing texture. This motivates us to explore a novel task of sketch concept extraction: given one or more sketch-image pairs, we aim to extract a special sketch concept that bridges the correspondence between the images and sketches, thus enabling sketch-based image synthesis and editing at a fine-grained level. To accomplish this, we introduce CustomSketching, a two-stage framework for extracting novel sketch concepts. Considering that an object can often be depicted by a contour for general shapes and additional strokes for internal details, we introduce a dual-sketch representation to reduce the inherent ambiguity in sketch depiction. We employ a shape loss and a regularization loss to balance fidelity and editability during optimization. Through extensive experiments, a user study, and several applications, we show our method is effective and superior to the adapted baselines.
https://arxiv.org/abs/2402.17624
The roto-translation group SE2 has been of active interest in image analysis due to methods that lift image data to multi-orientation representations defined on this Lie group. This has led to impactful applications of crossing-preserving flows for image de-noising, geodesic tracking, and roto-translation equivariant deep learning. In this paper, we develop a computational framework for optimal transportation over Lie groups, with a special focus on SE2. We make several theoretical contributions (generalizable to matrix Lie groups), such as the non-optimality of group actions as transport maps, the invariance and equivariance of optimal transport, and the quality of the entropic-regularized optimal transport plan under geodesic distance approximations. We develop a Sinkhorn-like algorithm that can be efficiently implemented using fast and accurate distance approximations of the Lie group and GPU-friendly group convolutions. We report valuable advancements in experiments on 1) image barycenters, 2) interpolation of planar orientation fields, and 3) Wasserstein gradient flows on SE2. We observe that our framework of lifting images to SE2 and optimal transport with left-invariant anisotropic metrics leads to equivariant transport along dominant contours and salient line structures in the image. This yields sharper and more meaningful interpolations compared to their counterparts on $\mathbb{R}^2$.
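The Sinkhorn scheme underlying the algorithm above can be sketched in a few lines: alternate diagonal scalings of a Gibbs kernel until both marginals of the transport plan match. In the paper the costs come from fast geodesic-distance approximations on the Lie group; here `C` is an arbitrary cost matrix, an assumption for illustration.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, n_iter=200):
    """Entropic-regularized optimal transport between histograms mu and nu
    with cost matrix C, via Sinkhorn's alternating scaling iterations."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)               # match column marginals
        u = mu / (K @ v)                 # match row marginals
    return u[:, None] * K * v[None, :]   # transport plan P = diag(u) K diag(v)
```

The updates are all matrix-vector products, which is what makes the GPU-friendly group-convolution implementation mentioned above possible.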
https://arxiv.org/abs/2402.15322
The integration of medical imaging, computational analysis, and robotic technology has brought about a significant transformation in minimally invasive surgical procedures, particularly in the realm of laparoscopic rectal surgery (LRS). This specialized surgical technique, aimed at addressing rectal cancer, requires an in-depth comprehension of the spatial dynamics within the narrow space of the pelvis. Leveraging Magnetic Resonance Imaging (MRI) scans as a foundational dataset, this study incorporates them into Computer-Aided Design (CAD) software to generate precise three-dimensional (3D) reconstructions of the patient's anatomy. At the core of this research is the analysis of the surgical workspace, a critical aspect in the optimization of robotic interventions. Sophisticated computational algorithms process MRI data within the CAD environment, meticulously calculating the dimensions and contours of the pelvic internal regions. The outcome is a nuanced understanding of both viable and restricted zones during LRS, taking into account factors such as curvature, diameter variations, and potential obstacles. This paper delves deeply into the complexities of workspace analysis for robotic LRS, illustrating the seamless collaboration between medical imaging, CAD software, and surgical robotics. Through this interdisciplinary approach, the study aims to surpass traditional surgical methodologies, offering novel insights for a paradigm shift in optimizing robotic interventions within the complex environment of the pelvis.
https://arxiv.org/abs/2402.14386
Recent diffusion-based generative models show promise in their ability to generate text images, but limitations in specifying the styles of the generated texts render them insufficient in the realm of typographic design. This paper proposes a typographic text generation system to add and modify text on typographic designs while specifying font styles, colors, and text effects. The proposed system is a novel combination of two off-the-shelf methods for diffusion models, ControlNet and Blended Latent Diffusion. The former functions to generate text images under the guidance of edge conditions specifying stroke contours. The latter blends latent noise in Latent Diffusion Models (LDM) to add typographic text naturally onto an existing background. We first show that given appropriate text edges, ControlNet can generate texts in specified fonts while incorporating effects described by prompts. We further introduce text edge manipulation as an intuitive and customizable way to produce texts with complex effects such as "shadows" and "reflections". Finally, with the proposed system, we successfully add and modify texts on a predefined background while preserving its overall coherence.
https://arxiv.org/abs/2402.14314
Lipreading involves using visual data to recognize spoken words by analyzing the movements of the lips and surrounding area. It is a hot research topic with many potential applications, such as human-machine interaction and enhanced audio speech recognition. Recent deep-learning-based works aim to integrate visual features extracted from the mouth region with landmark points on the lip contours. However, a simple combination method such as concatenation may not be the most effective way to obtain the optimal feature vector. To address this challenge, we first propose a cross-attention fusion-based approach for a large-lexicon Arabic vocabulary to predict spoken words in videos. Our method leverages cross-attention networks to efficiently integrate the visual and geometric features computed on the mouth region. Second, we introduce the first large-scale Lip Reading in the Wild for Arabic (LRW-AR) dataset, containing 20,000 videos for 100 word classes, uttered by 36 speakers. The experimental results obtained on the LRW-AR and ArabicVisual databases show the effectiveness and robustness of the proposed approach in recognizing Arabic words. Our work provides insights into the feasibility and effectiveness of applying lipreading techniques to the Arabic language, opening doors for further research in this field. Link to the project page: this https URL
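The cross-attention fusion pattern contrasted with concatenation above can be sketched in a single head: mouth-region visual features act as queries, lip-contour landmark features as keys and values. Learned projection matrices are omitted for brevity, so this only illustrates the fusion pattern, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, geometric):
    """Single-head cross-attention: each visual query attends over the
    geometric (landmark) features and returns their weighted combination."""
    d_k = geometric.shape[-1]
    scores = visual @ geometric.T / np.sqrt(d_k)  # (n_visual, n_landmark)
    return softmax(scores, axis=-1) @ geometric   # fused features per query
```

Unlike concatenation, the attention weights let each visual feature select which landmark cues are relevant, rather than receiving all of them with fixed positions.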
https://arxiv.org/abs/2402.11520
Lung mask creation lacks well-defined criteria and standardized guidelines, leading to a high degree of subjectivity between annotators. In this study, we assess the underestimation of lung regions on chest X-ray segmentation masks created according to the current state-of-the-art method, by comparison with the total lung volume evaluated on computed tomography (CT). We show that lung X-ray masks created by following the contours of the heart, mediastinum, and diaphragm significantly underestimate lung regions and exclude substantial portions of the lungs from further assessment, which may result in numerous clinical errors.
https://arxiv.org/abs/2402.11510
This article introduces Lester, a novel method to automatically synthesize retro-style 2D animations from videos. The method approaches the challenge mainly as an object segmentation and tracking problem. Video frames are processed with the Segment Anything Model (SAM), and the resulting masks are tracked through subsequent frames with DeAOT, a hierarchical propagation method for semi-supervised video object segmentation. The geometry of the mask contours is simplified with the Douglas-Peucker algorithm. Finally, facial traits, pixelation, and a basic shadow effect can optionally be added. The results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances, dynamic shots, partial shots, and diverse backgrounds. The proposed method provides a simpler and more deterministic approach than diffusion-model-based video-to-video translation pipelines, which suffer from temporal consistency problems and do not cope well with pixelated and schematic outputs. The method is also far more practical than techniques based on 3D human pose estimation, which require custom handcrafted 3D models and are very limited with respect to the types of scenes they can process.
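The contour-simplification step named above is the standard Douglas-Peucker algorithm, which can be sketched directly (a generic implementation, not Lester's code):

```python
def douglas_peucker(points, eps):
    """Recursive Douglas-Peucker simplification of a polyline: keep the
    endpoints, and recurse on the point farthest from the chord whenever
    its perpendicular distance exceeds the tolerance eps."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    norm = ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5 or 1e-12

    def dist(p):  # perpendicular distance from p to the chord
        return abs((x2 - x1) * (y1 - p[1]) - (x1 - p[0]) * (y2 - y1)) / norm

    i_max = max(range(1, len(points) - 1), key=lambda i: dist(points[i]))
    if dist(points[i_max]) > eps:
        left = douglas_peucker(points[:i_max + 1], eps)
        right = douglas_peucker(points[i_max:], eps)
        return left[:-1] + right  # drop the duplicated split point
    return [points[0], points[-1]]
```

A larger tolerance yields coarser, more schematic contours, which is what gives the output its retro 2D look.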
https://arxiv.org/abs/2402.09883