Facial Aesthetics Enhancement (FAE) aims to improve facial attractiveness by adjusting the structure and appearance of a facial image while preserving its identity as much as possible. Most existing methods adopt deep feature-based or score-based guidance for generation models to conduct FAE. Although these methods achieve promising results, they can produce excessively beautified outputs with lower identity consistency, or insufficiently improved facial attractiveness. To enhance facial aesthetics with less loss of identity, we propose Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), a diffusion-based FAE method that beautifies a 2D facial image under 3D structure guidance. Specifically, we propose to extract FAE guidance from a nearest-neighbor reference face. To limit changes to facial structure during FAE, a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, so that depth and contour guidance can be extracted from the 3D face model. These depth and contour cues then provide effective guidance to Stable Diffusion with ControlNet for FAE. Extensive experiments demonstrate that our method is superior to previous relevant methods in enhancing facial aesthetics while preserving facial identity.
https://arxiv.org/abs/2503.14402
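A minimal sketch of the final guidance step described above, using the Hugging Face diffusers API: a depth map (which NNSG-Diffusion would render from the recovered 3D face model) conditions Stable Diffusion through a depth ControlNet. Model IDs, file names, and parameters here are illustrative assumptions, not the paper's released code.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Depth-conditioned ControlNet; a contour (edge) ControlNet could be chained the same way.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical input: a depth map rendered from the recovered 3D face model.
depth_map = load_image("face_depth.png")

result = pipe(
    prompt="a high-quality portrait photo",
    image=depth_map,                        # structural guidance from the 3D model
    num_inference_steps=30,
    controlnet_conditioning_scale=0.8,      # how strongly the structure is enforced
).images[0]
result.save("beautified_face.png")
```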
Salient Object Detection (SOD) is crucial in computer vision, yet RGB-based methods face limitations in challenging scenes, such as small objects and similar color features. Hyperspectral images offer a promising route to more accurate Hyperspectral Salient Object Detection (HSOD) through their abundant spectral information, but HSOD methods are hindered by the lack of extensive, readily available datasets. In this context, we introduce HSOD-BIT-V2, the largest and most challenging HSOD benchmark dataset to date. Five distinct challenges focusing on small objects and foreground-background similarity are designed to emphasize spectral advantages and real-world complexity. To tackle these challenges, we propose Hyper-HRNet, a high-resolution HSOD network. Hyper-HRNet effectively extracts, integrates, and preserves effective spectral information while reducing dimensionality by capturing self-similar spectral features. Additionally, it conveys fine details and precisely locates object contours by incorporating comprehensive global information and detailed object saliency representations. Experimental analysis demonstrates that Hyper-HRNet outperforms existing models, especially in challenging scenarios.
https://arxiv.org/abs/2503.13906
Recently, deep learning-based 3D face reconstruction methods have demonstrated promising advancements in terms of quality and efficiency. Nevertheless, these techniques face challenges in effectively handling occluded scenes and fail to capture intricate geometric facial details. Inspired by the principles of GANs and bump mapping, we have successfully addressed these issues. Our approach aims to deliver comprehensive 3D facial reconstructions, even in the presence of occlusions. While maintaining the overall shape's robustness, we introduce a mid-level shape refinement to the fundamental structure. Furthermore, we illustrate how our method adeptly extends to generate plausible details for occluded facial regions. We offer numerous examples that showcase the effectiveness of our framework in producing realistic results where traditional methods often struggle. To substantiate the superior adaptability of our approach, we have conducted extensive experiments on general 3D face reconstruction tasks, serving as concrete evidence of its advantages over manual occlusion-removal methods.
https://arxiv.org/abs/2503.12494
Rotation-invariant recognition of shapes is a common challenge in computer vision. Recent approaches have significantly improved the accuracy of rotation-invariant recognition by encoding the rotational invariance of shapes as hand-crafted image features and introducing deep neural networks. However, pixel-based methods carry too much redundant information, and critical geometric information tends to be lost early, resulting in weak rotation-invariant recognition of fine-grained shapes. In this paper, we reconsider the shape recognition problem from the perspective of contour points rather than pixels. We propose an anti-noise rotation-invariant convolution module based on contour geometric awareness for fine-grained shape recognition. The module divides the shape contour into multiple local geometric regions (LGAs), where we implement finer-grained rotation-invariant coding in terms of point topological relations. We provide a deep network composed of five such cascaded modules for classification and retrieval experiments. The results show that our method exhibits excellent performance in rotation-invariant recognition of fine-grained shapes. In addition, we demonstrate that our method is robust to contour noise and to the choice of rotation center. The source code is available at this https URL.
https://arxiv.org/abs/2503.10992
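The rotation invariance claimed for the point-topology coding can be illustrated with a toy descriptor: edge lengths and signed turning angles along a contour are unchanged by any rotation of the shape. This is a generic sketch of the principle, not the paper's LGA module.

```python
import numpy as np

def contour_features(points: np.ndarray) -> np.ndarray:
    """points: (N, 2) vertices of a closed contour.
    Returns (N, 2) rows of [edge length, signed turning angle]."""
    v_in = points - np.roll(points, 1, axis=0)    # incoming edge at each vertex
    v_out = np.roll(points, -1, axis=0) - points  # outgoing edge
    lengths = np.linalg.norm(v_out, axis=1)
    cross = v_in[:, 0] * v_out[:, 1] - v_in[:, 1] * v_out[:, 0]
    dot = np.einsum("ij,ij->i", v_in, v_out)
    angles = np.arctan2(cross, dot)               # invariant under rotation
    return np.stack([lengths, angles], axis=1)

contour = np.array([[0, 0], [2, 0], [2, 1], [0, 1]], dtype=float)
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# identical descriptors before and after rotating the shape
assert np.allclose(contour_features(contour), contour_features(contour @ R.T))
```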
Flow-based transformer models for image generation have achieved state-of-the-art performance with larger parameter counts, but their inference deployment cost remains high. To enhance inference performance while maintaining generation quality, we propose progressive rectified flow transformers. We divide the rectified flow into stages according to resolution, using fewer transformer layers at the low-resolution stages to generate image layouts and concept contours, and progressively adding more layers as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) we introduce progressive rectified flow transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of the Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 40% when generating a 1024-resolution image; (3) we propose the NAMI-1K benchmark to evaluate human-preference performance, aiming to mitigate distributional bias and prevent data leakage from open-source benchmarks. The results show that our model is competitive with state-of-the-art models.
https://arxiv.org/abs/2503.09242
Previous studies on echocardiogram segmentation have focused on the left ventricle in parasternal long-axis views. In this study, deep-learning models were evaluated on the segmentation of the ventricles in parasternal short-axis echocardiograms (PSAX-echo). Segmentation of the ventricles in complementary echocardiogram views will allow the computation of important metrics with the potential to aid in diagnosing cardio-pulmonary diseases and other cardiomyopathies. Evaluating state-of-the-art models on small datasets can reveal whether they improve performance with limited data. PSAX-echo scans were performed on 33 volunteer women. An experienced cardiologist identified end-diastole and end-systole frames from 387 scans, and expert observers manually traced the contours of the cardiac structures. Traced frames were pre-processed and used to create labels to train two specific-domain models (Unet-ResNet101 and Unet-ResNet50) and four general-domain models (three Segment Anything (SAM) variants and Detectron2). The performance of the models was evaluated using the Dice similarity coefficient (DSC), Hausdorff distance (HD), and difference in cross-sectional area (DCSA). The Unet-ResNet101 model provided superior performance in the segmentation of the ventricles, with 0.83, 4.93 pixels, and 106 pixel² on average for DSC, HD, and DCSA respectively. A fine-tuned MedSAM model achieved 0.82, 6.66 pixels, and 1252 pixel², while the Detectron2 model achieved 0.78, 2.12 pixels, and 116 pixel² on the same metrics. Deep-learning models are suitable for the segmentation of the left and right ventricles in PSAX-echo. This study demonstrated that specific-domain trained models such as Unet-ResNet provide higher accuracy for echo segmentation than general-domain segmentation models when working with small, locally acquired datasets.
https://arxiv.org/abs/2503.08970
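For reference, the three reported metrics can be computed from binary masks as below. This is a hedged sketch on synthetic masks, not the study's evaluation code (which may, for instance, restrict the Hausdorff distance to boundary pixels).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    a, b = np.argwhere(pred), np.argwhere(gt)   # all mask pixels as point sets
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

def dcsa(pred: np.ndarray, gt: np.ndarray) -> int:
    return abs(int(pred.sum()) - int(gt.sum()))  # area difference in pixel^2

pred = np.zeros((64, 64), bool); pred[16:48, 16:48] = True
gt = np.zeros((64, 64), bool); gt[18:50, 14:46] = True
print(dice(pred, gt), hausdorff(pred, gt), dcsa(pred, gt))
```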
Low-level texture features/knowledge are also of vital importance for characterizing local structural patterns and global statistical properties, such as boundary, smoothness, regularity, and color contrast, which may not be well addressed by high-level deep features. In this paper, we aim to re-emphasize low-level texture information in deep networks for semantic segmentation and related knowledge distillation tasks. To this end, we take full advantage of both structural and statistical texture knowledge and propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, a Contourlet Decomposition Module (CDM) is introduced to decompose low-level features with an iterative Laplacian pyramid and directional filter bank to mine structural texture knowledge, and a Texture Intensity Equalization Module (TIEM) is designed to extract and enhance statistical texture knowledge with the corresponding Quantization Congruence Loss (QDL). Moreover, we propose the Co-occurrence TIEM (C-TIEM) and generic segmentation frameworks, namely STLNet++ and U-SSNet, to enable existing segmentation networks to harvest structural and statistical texture information more effectively. Extensive experimental results on three segmentation tasks demonstrate the effectiveness of the proposed methods and their state-of-the-art performance on seven popular benchmark datasets.
https://arxiv.org/abs/2503.08043
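The Laplacian-pyramid half of the Contourlet Decomposition Module can be sketched in a few lines: each level isolates a band-pass detail layer carrying low-level texture. The directional filter bank that completes a contourlet transform is omitted; this illustrates the decomposition idea, not the paper's CDM.

```python
import cv2
import numpy as np

def laplacian_pyramid(img: np.ndarray, levels: int = 3):
    pyramid, current = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyramid.append(current - up)   # band-pass texture detail at this scale
        current = down
    pyramid.append(current)            # low-pass residual
    return pyramid

img = np.random.rand(128, 128).astype(np.float32)   # stand-in for a feature map
bands = laplacian_pyramid(img)
print([b.shape for b in bands])        # [(128,128), (64,64), (32,32), (16,16)]
```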
Infrared small target detection (IRSTD) is currently a hot and challenging task in computer vision. Existing methods usually focus on mining visual features of targets, an approach that struggles to cope with complex and diverse detection scenarios. The main reason is that infrared small targets carry limited image information on their own, so relying only on visual features fails to discriminate targets from interference, leading to lower detection performance. To address this issue, we introduce a novel approach that leverages semantic text to guide infrared small target detection, called Text-IRSTD. It innovatively expands classical IRSTD to text-guided IRSTD, providing a new research direction. On the one hand, we devise a novel fuzzy semantic text prompt to accommodate ambiguous target categories. On the other hand, we propose a progressive cross-modal semantic interaction decoder (PCSID) to facilitate information fusion between texts and images. In addition, we construct a new benchmark, called FZDT, consisting of 2,755 infrared images of different scenarios with fuzzy semantic textual annotations. Extensive experimental results demonstrate that our method achieves better detection performance and target contour recovery than the state-of-the-art methods. Moreover, the proposed Text-IRSTD shows strong generalization and wide application prospects in unseen detection scenarios. The dataset and code will be publicly released after acceptance of this paper.
https://arxiv.org/abs/2503.07249
Lesion synthesis methods have made significant progress in generating large-scale synthetic datasets. However, existing approaches predominantly focus on texture synthesis and often fail to accurately model masks for anatomically complex lesions. Additionally, these methods typically lack precise control over the synthesis process. For example, perirectal lymph nodes, which range in diameter from 1 mm to 10 mm, exhibit irregular and intricate contours that are challenging for current techniques to replicate faithfully. To address these limitations, we introduce CAFusion, a novel approach for synthesizing perirectal lymph nodes. By leveraging Signed Distance Functions (SDF), CAFusion generates highly realistic 3D anatomical structures. Furthermore, it offers flexible control over both anatomical and textural features by decoupling the generation of morphological attributes (such as shape, size, and position) from textural characteristics, including signal intensity. Experimental results demonstrate that our synthetic data substantially improve segmentation performance, achieving a 6.45% increase in the Dice coefficient. In the visual Turing test, experienced radiologists found it challenging to distinguish between synthetic and real lesions, highlighting the high degree of realism and anatomical accuracy achieved by our approach. These findings validate the effectiveness of our method in generating high-quality synthetic lesions for advancing medical image processing applications.
https://arxiv.org/abs/2503.06919
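The SDF-based generation idea can be illustrated compactly: define an implicit shape as a signed distance field, perturb it to obtain irregular contours, and extract the mesh at the zero level set. Shape parameters and noise below are toy assumptions, not CAFusion's learned morphology model.

```python
import numpy as np
from skimage.measure import marching_cubes

grid = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")

# Ellipsoidal signed distance field (approximate), axes chosen arbitrarily.
radii = np.array([0.6, 0.4, 0.5])
sdf = np.sqrt((x / radii[0])**2 + (y / radii[1])**2 + (z / radii[2])**2) - 1.0

# Low-frequency perturbation to produce an irregular, intricate contour.
rng = np.random.default_rng(0)
sdf += 0.08 * np.sin(4 * x + rng.normal()) * np.sin(4 * y)

# Zero level set = the synthesized 3D surface.
verts, faces, normals, _ = marching_cubes(sdf, level=0.0)
print(f"{len(verts)} vertices, {len(faces)} faces")
```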
This paper proposes Separability Membrane, a robust 3D active contour for extracting a surface from a 3D point cloud object. Our approach defines the surface of a 3D object as the boundary that maximizes the separability of point features, such as intensity, color, or local density, between its inner and outer regions, based on Fisher's ratio. Separability Membrane identifies the exact surface of a 3D object by maximizing class separability while controlling the rigidity of the 3D surface model with an adaptive B-spline surface that adjusts its properties according to local and global separability. A key advantage of our method is its ability to accurately reconstruct surface boundaries even when they are ambiguous due to noise or outliers, without requiring any training data or conversion to a volumetric representation. Evaluations on a synthetic 3D point cloud dataset and the 3DNet dataset demonstrate the membrane's effectiveness and robustness under diverse conditions.
https://arxiv.org/abs/2503.05217
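The separability criterion itself is compact: Fisher's ratio between the feature statistics of points inside and outside a candidate boundary. The sketch below scores candidate sphere radii on a synthetic cloud; the actual method optimizes an adaptive B-spline surface rather than a single radius.

```python
import numpy as np

def fisher_ratio(inner: np.ndarray, outer: np.ndarray) -> float:
    mu_i, mu_o = inner.mean(), outer.mean()
    return (mu_i - mu_o) ** 2 / (inner.var() + outer.var() + 1e-12)

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(5000, 3))
r = np.linalg.norm(pts, axis=1)
# Noisy per-point feature (e.g. intensity): object occupies r < 0.5.
feat = np.where(r < 0.5, 1.0, 0.0) + rng.normal(0, 0.3, len(pts))

for radius in (0.3, 0.5, 0.7):               # candidate boundaries
    inside, outside = feat[r < radius], feat[r >= radius]
    print(radius, fisher_ratio(inside, outside))
# the true boundary (0.5) yields the highest separability
```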
Rice grain quality can be determined from grain size and chalkiness. The traditional approach to measuring rice grain size involves manual inspection, which is inefficient and leads to inconsistent results. To address this issue, an image processing-based approach is proposed and developed in this research. The approach takes an image of rice grains as input and outputs the number of rice grains and the size of each grain. The steps involved in the proposed approach, such as extraction of the region of interest, segmentation of rice grains, and sub-contour removal, are discussed. The approach was tested on rice grain images captured from different heights using a mobile phone camera. The obtained results show that the proposed approach successfully detected 95% of the rice grains and achieved 90% accuracy for length and width measurement.
https://arxiv.org/abs/2503.03214
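A plausible OpenCV rendering of the described pipeline: threshold-based segmentation, sub-contour removal via the contour hierarchy, and per-grain size from a rotated bounding box. The file name and area threshold are illustrative assumptions.

```python
import cv2

img = cv2.imread("rice.jpg")                  # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

contours, hierarchy = cv2.findContours(binary, cv2.RETR_CCOMP,
                                       cv2.CHAIN_APPROX_SIMPLE)
# Keep top-level contours only (h[3] == -1), dropping nested sub-contours.
grains = [c for c, h in zip(contours, hierarchy[0])
          if h[3] == -1 and cv2.contourArea(c) > 50]
print(f"{len(grains)} grains detected")

for c in grains:
    (cx, cy), (w, h), angle = cv2.minAreaRect(c)   # rotated bounding box
    length, width = max(w, h), min(w, h)           # pixel units; calibrate to mm
    print(f"length={length:.1f}px  width={width:.1f}px")
```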
In this paper, we propose a novel approach for air drawing that uses image processing techniques to draw on the screen by moving fingers in the air. This approach benefits a wide range of applications such as sign language, in-air drawing, and 'writing' in the air as a new way of input. The approach starts by preparing an ROI (Region of Interest) background image from a running average over the initial camera frames, which is later subtracted from the live camera frames to get a binary mask image. We take the pointer's position to be the topmost point of the contour in the binary image; drawing a circle on the canvas at that position simulates the stroke. Furthermore, we incorporate the pre-trained Tesseract model for OCR. To address false contours, we perform hand detection based on a Haar cascade before performing the background subtraction. In an experimental setup, we achieved a latency of only 100 ms in air drawing. The code used in this research is available on GitHub at this https URL.
https://arxiv.org/abs/2503.01497
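A hedged sketch of the described loop: accumulate a running-average background, subtract it to obtain a binary mask, take the topmost contour point as the pointer, and stamp circles onto a persistent canvas. The Haar-cascade hand check and Tesseract OCR stages are omitted; thresholds are illustrative.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)
background, canvas = None, None
for i in range(10_000):
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (7, 7), 0)
    if background is None:
        background = gray.astype("float")
        canvas = np.zeros_like(frame)
    if i < 30:                                   # warm-up: learn the background
        cv2.accumulateWeighted(gray, background, 0.5)
        continue
    diff = cv2.absdiff(gray, cv2.convertScaleAbs(background))
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        c = max(contours, key=cv2.contourArea)
        x, y = c[c[:, :, 1].argmin()][0]         # topmost point = fingertip
        cv2.circle(canvas, (int(x), int(y)), 4, (0, 255, 0), -1)
    cv2.imshow("air drawing", cv2.add(frame, canvas))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```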
Using quadrics as the object representation has the benefits of both generality and closed-form projection derivation between image and world spaces. Although numerous constraints have been proposed for dual quadric reconstruction, we found that many of them are imprecise and provide minimal improvements to accuracy. After scrutinizing the existing constraints, we introduce a concise yet more precise convex hull-based algebraic constraint for object landmarks, which is applied to object reconstruction, frontend pose estimation, and backend bundle adjustment. This constraint is designed to fully leverage precise semantic segmentation, effectively mitigating mismatches between complex-shaped object contours and dual quadrics. Experiments on public datasets demonstrate that our approach is applicable to both monocular and RGB-D SLAM and achieves improved object mapping and localization over existing quadric SLAM methods. The implementation of our method is available at this https URL.
https://arxiv.org/abs/2503.01254
3D face reconstruction from a single sketch is a critical yet underexplored task with significant practical applications. The primary challenges stem from the substantial modality gap between 2D sketches and 3D facial structures, including: (1) accurately extracting facial keypoints from 2D sketches; (2) preserving diverse facial expressions and fine-grained texture details; and (3) training a high-performing model with limited data. In this paper, we propose Sketch-1-to-3, a novel framework for realistic 3D face reconstruction from a single sketch, to address these challenges. Specifically, we first introduce the Geometric Contour and Texture Detail (GCTD) module, which enhances the extraction of geometric contours and texture details from facial sketches. Additionally, we design a deep learning architecture with a domain adaptation module and a tailored loss function to align sketches with the 3D facial space, enabling high-fidelity expression and texture reconstruction. To facilitate evaluation and further research, we construct SketchFaces, a real hand-drawn facial sketch dataset, and Syn-SketchFaces, a synthetic facial sketch dataset. Extensive experiments demonstrate that Sketch-1-to-3 achieves state-of-the-art performance in sketch-based 3D face reconstruction.
https://arxiv.org/abs/2502.17852
Background and Purpose: Functional assessment of the left ventricle using gated myocardial perfusion (MPS) single-photon emission computed tomography relies on the precise extraction of left ventricular contours while simultaneously ensuring the security of patient data. Methods: In this paper, we introduce the integration of Federated Domain Adaptation with TimeSformer, named 'FedDA-TSformer', for left ventricle segmentation using MPS. FedDA-TSformer captures spatial and temporal features in gated MPS images, leveraging spatial attention, temporal attention, and federated learning for improved domain adaptation while ensuring patient data security. In detail, we employed a Divide-Space-Time-Attention mechanism to extract spatio-temporal correlations from the multi-centered MPS datasets, ensuring that predictions are spatio-temporally consistent. To achieve domain adaptation, we align the model outputs on MPS from three different centers using a local maximum mean discrepancy (LMMD) loss. This approach effectively addresses the dual requirements of federated learning and domain adaptation, enhancing the model's performance during training with multi-site datasets while ensuring the protection of data from different hospitals. Results: FedDA-TSformer was trained and evaluated using MPS datasets collected from three hospitals, comprising a total of 150 subjects. Each subject's cardiac cycle was divided into eight gates. The model achieved Dice Similarity Coefficients (DSC) of 0.842 and 0.907 for left ventricular (LV) endocardium and epicardium segmentation, respectively. Conclusion: Our proposed FedDA-TSformer model addresses the challenge of multi-center generalization, ensures patient data privacy, and demonstrates effectiveness in LV segmentation.
https://arxiv.org/abs/2502.16709
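The alignment term can be illustrated with a plain RBF-kernel MMD between feature batches from two centers; the LMMD variant used in the paper additionally weights kernel entries by (pseudo-)label membership, which this sketch omits.

```python
import torch

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """x: (n, d), y: (m, d) feature batches from two hospitals/domains."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

feats_site_a = torch.randn(32, 128)
feats_site_b = torch.randn(32, 128) + 0.5   # shifted domain
loss = rbf_mmd(feats_site_a, feats_site_b)
print(loss.item())                           # added to the segmentation loss
```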
This paper presents a motion-coupled mapping algorithm for contour mapping of hybrid rice canopies, specifically designed for Agricultural Unmanned Ground Vehicles (Agri-UGV) navigating complex and unknown rice fields. Precise canopy mapping is essential for Agri-UGVs to plan efficient routes and avoid protected zones. The motion control of Agri-UGVs, tasked with impurity removal and other operations, depends heavily on accurate estimation of rice canopy height and structure. To achieve this, the proposed algorithm integrates real-time RGB-D sensor data with kinematic and inertial measurements, enabling efficient mapping and proprioceptive localization. The algorithm produces grid-based elevation maps that reflect the probabilistic distribution of canopy contours, accounting for motion-induced uncertainties. It is implemented on a high-clearance Agri-UGV platform and tested in various environments, including both controlled and dynamic rice field settings. This approach significantly enhances the mapping accuracy and operational reliability of Agri-UGVs, contributing to more efficient autonomous agricultural operations.
https://arxiv.org/abs/2502.16134
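One way to realize grid-based elevation maps that "reflect the probabilistic distribution of canopy contours" is a per-cell Kalman update whose measurement noise grows with ego-motion uncertainty. This is a generic sketch under assumed cell size and noise values, not the paper's implementation.

```python
import numpy as np

H, W, CELL = 100, 100, 0.05                 # 5 m x 5 m map at 5 cm resolution
height = np.zeros((H, W))
var = np.full((H, W), 1e6)                  # unobserved cells: huge variance

def fuse(points_xyz: np.ndarray, motion_sigma: float):
    meas_var = 0.01 + motion_sigma**2       # sensor noise + ego-motion uncertainty
    for x, y, z in points_xyz:
        i, j = int(x / CELL), int(y / CELL)
        if 0 <= i < H and 0 <= j < W:
            k = var[i, j] / (var[i, j] + meas_var)   # Kalman gain
            height[i, j] += k * (z - height[i, j])
            var[i, j] *= (1 - k)

rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(0, 5, 1000), rng.uniform(0, 5, 1000),
                       1.0 + rng.normal(0, 0.05, 1000)])   # ~1 m canopy
fuse(pts, motion_sigma=0.02)
print(np.mean(height[var < 1.0]))           # mean estimated canopy height
```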
Despite recent advances in medical image generation, existing methods struggle to produce anatomically plausible 3D structures. In synthetic brain magnetic resonance images (MRIs), characteristic fissures are often missing, and reconstructed cortical surfaces appear scattered rather than densely convoluted. To address this issue, we introduce Cor2Vox, the first diffusion model-based method that translates continuous cortical shape priors to synthetic brain MRIs. To achieve this, we leverage a Brownian bridge process which allows for direct structured mapping between shape contours and medical images. Specifically, we adapt the concept of the Brownian bridge diffusion model to 3D and extend it to embrace various complementary shape representations. Our experiments demonstrate significant improvements in the geometric accuracy of reconstructed structures compared to previous voxel-based approaches. Moreover, Cor2Vox excels in image quality and diversity, yielding high variation in non-target structures like the skull. Finally, we highlight the capability of our approach to simulate cortical atrophy at the sub-voxel level. Our code is available at this https URL.
https://arxiv.org/abs/2502.12742
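The Brownian bridge that enables "direct structured mapping between shape contours and medical images" pins both endpoints: noise vanishes at t = 0 (the shape prior) and at t = T (the image). A generic sketch following the standard Brownian-bridge diffusion schedule; Cor2Vox's exact parameterization may differ.

```python
import numpy as np

def bridge_sample(x0: np.ndarray, y: np.ndarray, t: int, T: int, s: float = 1.0):
    m_t = t / T
    delta_t = 2.0 * s**2 * (m_t - m_t**2)        # zero at both endpoints
    eps = np.random.default_rng(t).normal(size=x0.shape)
    return (1 - m_t) * x0 + m_t * y + np.sqrt(delta_t) * eps

x0 = np.zeros((8, 8, 8))                         # e.g. voxelized shape prior
y = np.ones((8, 8, 8))                           # e.g. target image volume
for t in (0, 50, 100):
    xt = bridge_sample(x0, y, t, T=100)
    print(t, xt.mean().round(3), xt.std().round(3))
# at t=0 the sample is exactly x0; at t=T it is exactly y
```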
Echocardiography plays a fundamental role in the extraction of important clinical parameters (e.g. left ventricular volume and ejection fraction) required to determine the presence and severity of heart-related conditions. When deploying automated techniques for computing these parameters, uncertainty estimation is crucial for assessing their utility. Since clinical parameters are usually derived from segmentation maps, there is no clear path for converting pixel-wise uncertainty values into uncertainty estimates in the downstream clinical metric calculation. In this work, we propose a novel uncertainty estimation method based on contouring rather than segmentation. Our method explicitly predicts contour location uncertainty from which contour samples can be drawn. Finally, the sampled contours can be used to propagate uncertainty to clinical metrics. Our proposed method not only provides accurate uncertainty estimations for the task of contouring but also for the downstream clinical metrics on two cardiac ultrasound datasets. Code is available at: this https URL.
https://arxiv.org/abs/2502.12713
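The propagation step is straightforward once contour-location uncertainty is available: sample contours, evaluate the clinical quantity per sample, and summarize the spread. Below, a synthetic per-point standard deviation and an enclosed-area metric (shoelace formula) stand in for the method's learned predictions.

```python
import numpy as np

def shoelace_area(c: np.ndarray) -> float:
    x, y = c[:, 0], c[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

rng = np.random.default_rng(0)
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
mean_contour = np.stack([30 * np.cos(theta), 20 * np.sin(theta)], axis=1)
sigma = np.full((64, 1), 0.8)               # predicted per-point std (pixels)

areas = []
for _ in range(200):                         # Monte Carlo over contour samples
    sample = mean_contour + rng.normal(0, 1, mean_contour.shape) * sigma
    areas.append(shoelace_area(sample))
print(f"area = {np.mean(areas):.1f} ± {np.std(areas):.1f} px^2")
```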
This paper presents the challenges agricultural robotic harvesters face in detecting and localising fruits under various environmental disturbances. In controlled laboratory settings, both the traditional HSV (Hue Saturation Value) transformation and the YOLOv8 (You Only Look Once) deep learning model were employed. However, only YOLOv8 was utilised in outdoor experiments, as the HSV transformation was not capable of accurately drawing fruit contours. Experiments include ten distinct fruit patterns with six apples and six oranges. A grid structure for homography (perspective) transformation was employed to convert detected midpoints into 3D world coordinates. The experiments evaluated detection and localisation under varying lighting and background disturbances, revealing accurate performance indoors, but significant challenges outdoors. Our results show that indoor experiments using YOLOv8 achieved 100% detection accuracy, while outdoor conditions decreased performance, with an average accuracy of 69.15% for YOLOv8 under direct sunlight. The study demonstrates that real-world applications reveal significant limitations due to changing lighting, background disturbances, and colour and shape variability. These findings underscore the need for further refinement of algorithms and sensors to enhance the robustness of robotic harvesters for agricultural use.
https://arxiv.org/abs/2502.12403
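The grid-based homography step admits a compact sketch: four known grid corners define a perspective transform from pixels to the grid plane, and detected fruit midpoints are mapped through it. Corner coordinates below are invented for illustration.

```python
import cv2
import numpy as np

# Four grid corners in the image (pixels) and on the grid plane (metres).
img_corners = np.float32([[102, 84], [538, 91], [551, 412], [95, 405]])
world_corners = np.float32([[0, 0], [0.60, 0], [0.60, 0.45], [0, 0.45]])

H = cv2.getPerspectiveTransform(img_corners, world_corners)

midpoint_px = np.float32([[[320, 240]]])     # detected fruit centre (YOLOv8 box)
world_xy = cv2.perspectiveTransform(midpoint_px, H)[0, 0]
print(f"fruit at x={world_xy[0]:.3f} m, y={world_xy[1]:.3f} m on the grid plane")
```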
Augmented reality assembly guidance is essential for intelligent manufacturing and medical applications, requiring continuous measurement of the 6DoF poses of manipulated objects. Although current tracking methods have made significant advancements in accuracy and efficiency, they still face challenges in robustness when dealing with cluttered backgrounds, rotationally symmetric objects, and noisy sequences. In this paper, we first propose a robust contour-based pose tracking method that addresses error-prone contour correspondences and improves noise tolerance. It utilizes a fan-shaped search strategy to refine correspondences and models local contour shape and noise uncertainty as mixed probability distribution, resulting in a highly robust contour energy function. Secondly, we introduce a CPU-only strategy to better track rotationally symmetric objects and assist the contour-based method in overcoming local minima by exploring sparse interior correspondences. This is achieved by pre-sampling interior points from sparse viewpoint templates offline and using the DIS optical flow algorithm to compute their correspondences during tracking. Finally, we formulate a unified energy function to fuse contour and interior information, which is solvable using a re-weighted least squares algorithm. Experiments on public datasets and real scenarios demonstrate that our method significantly outperforms state-of-the-art monocular tracking methods and can achieve more than 100 FPS using only a CPU.
https://arxiv.org/abs/2502.11971
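The re-weighted least squares solver named at the end is standard IRLS: alternately compute robust weights from residuals and re-solve the weighted normal equations. The sketch below uses a Huber kernel on a synthetic linear system standing in for the stacked contour/interior terms.

```python
import numpy as np

def irls(A: np.ndarray, b: np.ndarray, delta: float = 1.0, iters: int = 20):
    """Iteratively re-weighted least squares with Huber weights."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        r = A @ x - b
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.abs(r))  # Huber weights
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)  # weighted normal equations
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 6))               # stacked contour/interior Jacobian
x_true = rng.normal(size=6)                 # 6-DoF pose increment
b = A @ x_true
b[::10] += 5.0                              # outlier correspondences
print(np.abs(irls(A, b) - x_true).max())    # robust fit despite outliers
```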