State-of-the-art Style Transfer methods often leverage pre-trained encoders optimized for discriminative tasks, which may not be ideal for image synthesis. This can result in significant artifacts and loss of photorealism. Motivated by the ability of multiscale geometric image representations to capture fine-grained details and global structure, we propose GIST: Geometric-based Image Style Transfer, a novel Style Transfer technique that exploits the geometric properties of content and style images. GIST replaces the standard Neural Style Transfer autoencoding framework with a multiscale image expansion, preserving scene details without the need for post-processing or training. Our method matches multiresolution and multidirectional representations such as Wavelets and Contourlets by solving an optimal transport problem, leading to efficient texture transfer. Experiments show that GIST is on par with or outperforms recent photorealistic Style Transfer approaches while significantly reducing the processing time with no model training.
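As an illustration of the subband-matching idea (not the authors' implementation), the following sketch matches the wavelet detail subbands of a content image to those of a style image via exact 1-D optimal transport (monotone rearrangement); the wavelet choice and level count are assumptions, and PyWavelets stands in for whatever multiscale transform GIST actually uses:

```python
import numpy as np
import pywt

def ot_match_1d(content, style):
    """Exact 1-D optimal transport: map sorted content values onto sorted style values."""
    flat = content.ravel()
    order = np.argsort(flat)
    target = np.sort(style.ravel())
    # Resample the style distribution to the content subband's size.
    target = np.interp(np.linspace(0, 1, flat.size),
                       np.linspace(0, 1, target.size), target)
    out = np.empty_like(flat)
    out[order] = target
    return out.reshape(content.shape)

def transfer_texture(content_img, style_img, wavelet="db2", levels=3):
    """Match each detail subband of the content to the style; keep the content approximation."""
    c_coeffs = pywt.wavedec2(content_img, wavelet, level=levels)
    s_coeffs = pywt.wavedec2(style_img, wavelet, level=levels)
    matched = [c_coeffs[0]]  # approximation band carries the global structure
    for (cH, cV, cD), (sH, sV, sD) in zip(c_coeffs[1:], s_coeffs[1:]):
        matched.append((ot_match_1d(cH, sH), ot_match_1d(cV, sV), ot_match_1d(cD, sD)))
    return pywt.waverec2(matched, wavelet)
```

In 1-D, optimal transport between empirical distributions reduces to matching sorted samples, which is why the sort-and-assign step above is exact.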
https://arxiv.org/abs/2412.02214
Prohibited item detection is crucial for ensuring public safety, yet current X-ray image-based detection methods often lack comprehensive data-driven exploration. This paper introduces a novel data augmentation approach tailored for prohibited item detection, leveraging unique characteristics inherent to X-ray imagery. Our method is motivated by two physical properties: 1) X-ray transmission imagery: unlike reflected-light images, transmitted X-ray pixels represent composite information from multiple materials along the imaging path. 2) Material-based pseudo-coloring: pseudo-color rendering in X-ray images correlates directly with material properties, aiding material distinction. Building on this physical perspective, we propose a simple yet effective X-ray image augmentation technique, Background Mixup (BGM), for prohibited item detection in security screening contexts. The essence is to simulate rich X-ray backgrounds, inducing the model to pay more attention to the foreground. The approach introduces 1) contour information of baggage and 2) variation in material information into the original image via patch-level Mixup. Background Mixup is plug-and-play, parameter-free, and highly generalizable, and provides an effective solution to the limitations of classical visual augmentations in non-reflected-light imagery. When implemented with different high-performance detectors, our augmentation method consistently boosts performance across diverse X-ray datasets from various devices and environments. Extensive experimental results demonstrate that our approach surpasses strong baselines while maintaining similar training resources.
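A hedged sketch of patch-level Background Mixup under stated assumptions: the patch size, mixing coefficient, and the existence of a bank of item-free background images are illustrative choices, not values from the paper:

```python
import numpy as np

def background_mixup(image, bg_bank, patch=64, lam=0.5, rng=None):
    """Blend random background patches into an X-ray image at patch granularity.

    bg_bank: list of background images, each at least patch x patch and sharing
    the input image's channel layout.
    """
    rng = rng or np.random.default_rng()
    out = image.astype(np.float32).copy()
    h, w = image.shape[:2]
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            bg = bg_bank[rng.integers(len(bg_bank))]
            by = rng.integers(bg.shape[0] - patch + 1)
            bx = rng.integers(bg.shape[1] - patch + 1)
            out[y:y+patch, x:x+patch] = (
                lam * out[y:y+patch, x:x+patch]
                + (1 - lam) * bg[by:by+patch, bx:bx+patch])
    return out.astype(image.dtype)
```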
https://arxiv.org/abs/2412.00460
While 3D Gaussian Splatting enables high-quality real-time rendering, existing Gaussian-based frameworks for 3D semantic segmentation still face significant challenges in boundary recognition accuracy. To address this, we propose a novel 3DGS-based framework named GradiSeg, incorporating Identity Encoding to construct a deeper semantic understanding of scenes. Our approach introduces two key modules: Identity Gradient Guided Densification (IGD) and Local Adaptive K-Nearest Neighbors (LA-KNN). The IGD module supervises gradients of Identity Encoding to refine Gaussian distributions along object boundaries, aligning them closely with boundary contours. Meanwhile, the LA-KNN module employs position gradients to adaptively establish locality-aware propagation of Identity Encodings, preventing irregular Gaussian spreads near boundaries. We validate the effectiveness of our method through comprehensive experiments. Results show that GradiSeg effectively addresses boundary-related issues, significantly improving segmentation accuracy without compromising scene reconstruction quality. Furthermore, our method's robust segmentation capability and decoupled Identity Encoding representation make it highly suitable for various downstream scene-editing tasks, including 3D object removal and swapping.
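The LA-KNN module is described only at a high level above; as a rough illustration of locality-aware propagation, the sketch below smooths per-Gaussian identity labels by majority vote over spatial neighbors, excluding neighbors beyond a local radius. The choice of k, the radius rule, and majority voting are all assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

def la_knn_propagate(positions, identities, k=8, radius_scale=2.0):
    """positions: (N, 3) Gaussian centers; identities: (N,) integer labels."""
    tree = cKDTree(positions)
    dists, idx = tree.query(positions, k=k + 1)      # first hit is the point itself
    local_radius = radius_scale * np.median(dists[:, 1])
    smoothed = identities.copy()
    for i in range(len(positions)):
        nbrs = idx[i, 1:][dists[i, 1:] < local_radius]
        if nbrs.size:
            vals, counts = np.unique(identities[nbrs], return_counts=True)
            smoothed[i] = vals[np.argmax(counts)]    # majority identity of local neighbors
    return smoothed
```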
https://arxiv.org/abs/2412.00392
Deep learning-based automated contouring and treatment planning have been shown to improve the efficiency and accuracy of radiotherapy. However, the conventional radiotherapy planning process treats automated contouring and treatment planning as separate tasks, and in deep learning (DL) pipelines the contouring and dose prediction tasks for automated treatment planning are likewise performed independently. In this study, we applied a multi-task learning (MTL) approach to seamlessly integrate automated contouring and voxel-based dose prediction, as MTL can leverage information shared between the two tasks and increase the efficiency of the automated pipeline. We developed our MTL framework using two datasets: an in-house prostate cancer dataset and the publicly available head and neck cancer dataset, OpenKBP. Compared to sequential DL contouring and treatment planning, our proposed MTL method improved the mean absolute difference of dose volume histogram metrics for the prostate and head and neck sites by 19.82% and 16.33%, respectively. Our MTL model demonstrated enhanced dose prediction performance while maintaining, and sometimes even improving, contouring accuracy. Compared to the baseline automated contouring model, with Dice similarity coefficients of 0.818 for the prostate and 0.674 for the head and neck datasets, our MTL approach achieved average scores of 0.824 and 0.716, respectively. Our study highlights the potential of the proposed MTL-based automated contouring and planning to support the development of efficient and accurate automated treatment planning for radiotherapy.
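A minimal PyTorch sketch of the shared-encoder multi-task idea: one encoder feeding a contouring head and a dose-prediction head trained with a joint loss. The layer sizes, loss choices, and weighting are illustrative, not the study's configuration:

```python
import torch
import torch.nn as nn

class MTLPlanner(nn.Module):
    def __init__(self, n_structures, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(                # shared features for both tasks
            nn.Conv3d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU())
        self.contour_head = nn.Conv3d(ch, n_structures, 1)  # voxel-wise structure labels
        self.dose_head = nn.Conv3d(ch, 1, 1)                # voxel-wise dose map

    def forward(self, ct):
        feat = self.encoder(ct)
        return self.contour_head(feat), self.dose_head(feat)

def mtl_loss(seg_logits, seg_gt, dose_pred, dose_gt, w=1.0):
    seg = nn.functional.cross_entropy(seg_logits, seg_gt)
    dose = nn.functional.l1_loss(dose_pred, dose_gt)
    return seg + w * dose  # shared features let each task regularize the other
```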
https://arxiv.org/abs/2411.18767
Deep-learning-based MR-to-CT synthesis can estimate the electron density of tissues, thereby facilitating PET attenuation correction in whole-body PET/MR imaging. However, whole-body MR-to-CT synthesis faces several challenges, including spatial misalignment and the complexity of intensity mapping, primarily due to the variety of tissues and organs throughout the whole body. Here we propose a novel whole-body MR-to-CT synthesis framework, which consists of three novel modules to tackle these challenges: (1) a Structure-Guided Synthesis module leverages structure-guided attention gates to enhance synthetic image quality by diminishing unnecessary contours of soft tissues; (2) a Spatial Alignment module yields precise registration between paired MR and CT images by taking into account the impacts of tissue volumes and respiratory movements, thus providing well-aligned ground-truth CT images during training; (3) a Semantic Alignment module utilizes contrastive learning to constrain organ-related semantic information, thereby ensuring the semantic authenticity of synthetic CT images. We conduct extensive experiments to demonstrate that the proposed whole-body MR-to-CT framework can produce visually plausible and semantically realistic CT images, and validate its utility in PET attenuation correction.
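As a sketch of the contrastive constraint behind the Semantic Alignment module, assuming per-organ embeddings are available for the synthetic and ground-truth CT, an InfoNCE-style loss could pull matching organ features together and push mismatched ones apart; the temperature value and this specific loss form are assumptions:

```python
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(synth_feats, real_feats, tau=0.07):
    """synth_feats, real_feats: (N, D) organ embeddings; row i of each matches row i."""
    s = F.normalize(synth_feats, dim=1)
    r = F.normalize(real_feats, dim=1)
    logits = s @ r.t() / tau                 # pairwise similarities
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)   # diagonal pairs are the positives
```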
https://arxiv.org/abs/2411.17488
Terraced fields are a significant engineering practice for soil and water conservation (SWC), and their extraction from remotely sensed imagery is the foundation for monitoring and evaluating SWC. This study is the first to propose a novel dual-modal Ω-like super-resolution Transformer network for intelligent terraced field vector extraction (TFVE), offering the following advantages: (1) reducing edge segmentation error from the conventional multi-scale downsampling encoder by fusing original high-resolution features with downsampled features at each encoder step and leveraging a multi-head attention mechanism; (2) improving TFVE accuracy through an Ω-like network structure that fully integrates rich high-level features from both spectral and terrain data to form cross-scale super-resolution features; (3) validating an optimal fusion scheme for cross-modal and cross-scale (i.e., inconsistent spatial resolution between remotely sensed imagery and DEM) super-resolution feature extraction; (4) mitigating uncertainty between segmentation edge pixels via a coarse-to-fine and spatial topological semantic relationship optimization (STSRO) segmentation strategy; (5) leveraging a contour vibration neural network to continuously optimize parameters and iteratively vectorize terraced fields from the semantic segmentation results. Moreover, a DMRVD dataset for deep-learning-based TFVE was created for the first time, covering nine study areas in four provinces of China with a total area of 22,441 square kilometers. To assess the performance of ΩSFormer, classic and SOTA networks were compared: the mIOU of ΩSFormer improved by 0.165, 0.297 and 0.128 over the best single-modal remotely sensed imagery, single-modal DEM, and dual-modal results, respectively.
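A hedged PyTorch sketch of the fusion idea in advantage (1): downsampled encoder tokens cross-attend to the original high-resolution tokens to recover edge detail. The token layout, dimensions, and residual connection are illustrative assumptions, not the ΩSFormer architecture:

```python
import torch
import torch.nn as nn

class HighResFusion(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, down_tokens, highres_tokens):
        # down_tokens: (B, Nq, dim) coarse features; highres_tokens: (B, Nk, dim).
        fused, _ = self.attn(down_tokens, highres_tokens, highres_tokens)
        return down_tokens + fused            # residual keeps the coarse context
```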
https://arxiv.org/abs/2411.17088
Purpose: Deformable image registration (DIR) is critical in adaptive radiation therapy (ART) to account for anatomical changes. Conventional intensity-based DIR methods often fail when image intensities differ. This study evaluates a hybrid similarity metric combining intensity and structural information, leveraging CycleGAN-based intensity correction and auto-segmentation across three DIR workflows. Methods: A hybrid similarity metric combining a point-to-distance (PD) score and intensity similarity was implemented. Synthetic CT (sCT) images were generated using a 2D CycleGAN model trained on unpaired CT and CBCT images to enhance soft-tissue contrast. DIR workflows compared included: (1) traditional intensity-based (No PD), (2) auto-segmented contours on sCT (CycleGAN PD), and (3) expert manual contours (Expert PD). A 3D U-Net model trained on 56 images and validated on 14 cases segmented the prostate, bladder, and rectum. DIR accuracy was assessed using Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD), and fiducial separation. Results: The hybrid metric improved DIR accuracy. For the prostate, DSC increased from 0.61 ± 0.18 (No PD) to 0.82 ± 0.13 (CycleGAN PD) and 0.89 ± 0.05 (Expert PD), with reductions in 95% HD from 11.75 mm to 4.86 mm and 3.27 mm, respectively. Fiducial separation decreased from 8.95 mm to 4.07 mm (CycleGAN PD) and 4.11 mm (Expert PD) (p < 0.05). Improvements were also observed for the bladder and rectum. Conclusion: This study demonstrates that a hybrid similarity metric using CycleGAN-based auto-segmentation improves DIR accuracy, particularly for low-contrast CBCT images. These findings highlight the potential for integrating AI-based image correction and segmentation into ART workflows to enhance precision and streamline clinical processes.
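A minimal sketch of a hybrid similarity of this kind, combining an intensity term with a point-to-distance (PD) term over paired contour points; the weighting alpha and the specific terms are assumptions, not the study's tuned metric:

```python
import numpy as np

def hybrid_similarity(fixed, moving, fixed_pts, moving_pts, alpha=0.5):
    """fixed/moving: image arrays; fixed_pts/moving_pts: (K, 3) paired contour points."""
    # Intensity term: negative mean squared difference (higher is better).
    intensity = -np.mean((fixed.astype(np.float32) - moving.astype(np.float32)) ** 2)
    # Structural (PD) term: negative mean distance between paired contour points.
    pd = -np.mean(np.linalg.norm(fixed_pts - moving_pts, axis=1))
    return alpha * intensity + (1 - alpha) * pd
```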
https://arxiv.org/abs/2411.16992
Diffusion models have shown impressive potential for talking head generation. While these methods achieve plausible appearance and talking effects, they still suffer from temporal, 3D, or expression inconsistency due to error accumulation and the inherent limitations of single-image generation. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly applying multi-modal conditions to the diffusion process, our method first learns to model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency features and contours that vary significantly along the time axis. Using a temporally consistent diffusion module, we learn to align the TSD of the initial result to that of the ground-truth video frame. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, a rough head normal map, and an emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking heads. Further, its reliable guidance complements the inaccuracy of the other conditions, suppressing accumulated error while improving consistency in various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms state-of-the-art methods on generated appearance, 3D, expression, and temporal consistency. Project page: this https URL
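The exact construction of the TSD map is not specified here; as an illustrative stand-in, the sketch below extracts a per-frame high-frequency residual and weights it by the change between adjacent frames, so that detail varying along the time axis dominates. The filters and weighting are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def tsd_map(prev_frame, frame, sigma=2.0):
    """prev_frame, frame: float grayscale frames of the same shape."""
    high_freq = frame - gaussian_filter(frame, sigma)   # contours and fine detail
    motion = np.abs(frame - prev_frame)                 # temporal variation
    return high_freq * (motion / (motion.max() + 1e-8)) # emphasize changing detail
```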
https://arxiv.org/abs/2411.15436
Deep learning-based segmentation methods are widely utilized for detecting lesions in ultrasound images. Throughout the imaging procedure, the attenuation and scattering of ultrasound waves cause contour blurring and the formation of artifacts, limiting the clarity of the acquired images. To overcome this challenge, we propose CP-UNet, a contour-based probabilistic segmentation model that guides the segmentation network to enhance its focus on contours during decoding. We design a novel down-sampling module that enables the contour probability distribution modeling and encoding stages to acquire global-local features. Furthermore, a Gaussian Mixture Model uses the optimized features to model the contour distribution, capturing the uncertainty of lesion boundaries. Extensive experiments against several state-of-the-art deep learning segmentation methods on three ultrasound image datasets show that our method performs better on breast and thyroid lesion segmentation.
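A minimal sketch of the Gaussian-mixture step, assuming contour pixels (or their features) have already been extracted; fitting a GMM and reading off the negative log-density gives a per-pixel boundary-uncertainty score. The component count is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def contour_uncertainty(contour_xy, n_components=4):
    """contour_xy: (N, 2) coordinates of boundary pixels."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(contour_xy)
    log_density = gmm.score_samples(contour_xy)
    return gmm, -log_density   # low density under the mixture = uncertain boundary
```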
https://arxiv.org/abs/2411.14250
Blind face restoration has made great progress in producing high-quality and lifelike images, yet preserving ID information remains challenging, especially when the degradation is heavy. Current reference-guided face restoration approaches either require face alignment or personalized test-time tuning, making them unfaithful or time-consuming. In this paper, we propose a tuning-free method named RestorerID that incorporates ID preservation during face restoration. RestorerID is a diffusion model-based method that restores low-quality images with varying levels of degradation by using a single reference image. To achieve this, we propose a unified framework that combines ID injection with a base blind face restoration model. In addition, we design a novel Face ID Rebalancing Adapter (FIR-Adapter) to tackle the content inconsistency and contour misalignment caused by information conflicts between the low-quality input and the reference image. Furthermore, by employing an Adaptive ID-Scale Adjusting strategy, RestorerID can produce superior restored images across various levels of degradation. Experimental results on the Celeb-Ref dataset and in real-world scenarios demonstrate that RestorerID effectively delivers high-quality face restoration with ID preservation, achieving superior performance compared to test-tuning approaches and other reference-guided ones. The code of RestorerID is available at \url{this https URL}.
https://arxiv.org/abs/2411.14125
This paper presents a study of participants interacting with and using GaMaDHaNi, a novel hierarchical generative model for Hindustani vocal contours. To explore possible use cases in human-AI interaction, we conducted a user study with three participants, each engaging with the model through three predefined interaction modes. Although this study was conducted "in the wild", with the model unadapted for the shift from training data to real-world interaction, we use it as a pilot to better understand the expectations, reactions, and preferences of practicing musicians when engaging with such a model. We note their challenges as (1) the lack of restrictions on model output, and (2) the incoherence of model output. We situate these challenges in the context of Hindustani music and suggest future directions for model design to address these gaps.
https://arxiv.org/abs/2411.13846
Microscopy structure segmentation, such as detecting cells or nuclei, generally requires a human to draw a ground-truth contour around each instance. Weakly supervised approaches (e.g., using only single point labels) have the potential to reduce this workload significantly. Our approach uses individual point labels for an entropy estimation to approximate the underlying distribution of cell pixels. We infer full cell masks from this distribution and use Mask-RCNN to produce an instance segmentation output. We compare this point-annotated approach with training on full ground-truth masks and show that our method achieves a comparatively good level of performance despite a 95% reduction in pixel labels.
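The paper approximates the cell-pixel distribution via entropy estimation from point labels; as a simple stand-in, the sketch below fits a kernel density estimate to the click locations and thresholds it into pseudo-masks. The bandwidth (SciPy's default) and threshold are assumptions:

```python
import numpy as np
from scipy.stats import gaussian_kde

def pseudo_masks_from_points(points_xy, shape, threshold=0.5):
    """points_xy: (N, 2) click locations as (x, y); shape: (H, W) of the image."""
    kde = gaussian_kde(points_xy.T)          # density over the image plane
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(shape)
    density /= density.max()
    return density > threshold               # binary pseudo-mask for training
```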
https://arxiv.org/abs/2411.13528
Automatic Cobb angle measurement from X-ray images is crucial for scoliosis screening and diagnosis. However, most existing regression-based and segmentation-based methods struggle with inaccurate spine representations or mask connectivity/fragmentation issues, while landmark-based methods suffer from insufficient training data and annotations. To address these challenges, we propose SG-LRA, a novel framework comprising a Self-Generation pipeline and a Low-Rank Approximation representation for automatic Cobb angle measurement. Specifically, we propose a parameterized spine contour representation based on LRA, which enables eigen-spine decomposition and spine contour reconstruction. We can directly obtain the spine contour from regressed LRA coefficients alone, which form a more accurate spine representation than rectangular boxes. We also combine LRA coefficient regression with anchor box classification to solve inaccurate predictions and mask connectivity issues. Moreover, we develop a data engine that performs automatic annotation and automatic selection in an iterative manner, trained on a private Spinal2023 dataset. With this data engine, we generate the largest scoliosis X-ray dataset, named Spinal-AI2024, largely free of privacy leaks. Extensive experiments on the public AASCE2019, private Spinal2023, and generated Spinal-AI2024 datasets demonstrate that our method achieves state-of-the-art Cobb angle measurement performance. Our code and the Spinal-AI2024 dataset are available at this https URL and this https URL, respectively.
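A minimal sketch of the LRA idea, assuming a set of training contours sampled at corresponding points: an SVD yields an eigen-spine basis, and any contour is then represented by a few regressed coefficients. The basis rank is an illustrative assumption:

```python
import numpy as np

def fit_eigen_spine(contours, rank=12):
    """contours: (N, 2K) flattened training contours of K (x, y) points each."""
    mean = contours.mean(axis=0)
    _, _, vt = np.linalg.svd(contours - mean, full_matrices=False)
    return mean, vt[:rank]                   # mean shape and top-rank eigen-spines

def encode(contour, mean, basis):
    return (contour - mean) @ basis.T        # low-dimensional coefficients

def decode(coeffs, mean, basis):
    return mean + coeffs @ basis             # reconstructed spine contour
```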
https://arxiv.org/abs/2411.12604
Image super-resolution (SR) is a classical yet still active low-level vision problem that aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts, serving as a key technique for image enhancement. Current approaches to SR, such as transformer-based and diffusion-based methods, either focus on extracting RGB image features or assume similar degradation patterns, neglecting the inherent modal disparities between infrared and visible images. When directly applied to infrared image SR, these methods inevitably distort the infrared spectral distribution, compromising machine perception in downstream tasks. In this work, we emphasize infrared spectral distribution fidelity and propose a Contourlet refinement gate framework that restores infrared modal-specific features while preserving this fidelity. Our approach captures high-pass subbands from a multi-scale and multi-directional infrared spectral decomposition to recover infrared-degraded information through a gate architecture. The proposed Spectral Fidelity Loss regularizes the spectral frequency distribution during reconstruction, ensuring the preservation of both high- and low-frequency components and maintaining the fidelity of infrared-specific features. We further propose a two-stage prompt-learning optimization to guide the model in learning infrared HR characteristics from LR degradation. Extensive experiments demonstrate that our approach outperforms existing image SR models in both visual and perceptual tasks while notably enhancing machine perception in downstream tasks. Our code is available at this https URL.
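A hedged PyTorch sketch of a spectral-fidelity regularizer in this spirit: penalize the gap between the Fourier magnitudes of the restored and reference images, which constrains both low- and high-frequency content. This is an illustrative stand-in for the paper's Spectral Fidelity Loss, not its exact form:

```python
import torch

def spectral_fidelity_loss(pred, target):
    """pred, target: real-valued image tensors of shape (..., H, W)."""
    pred_mag = torch.abs(torch.fft.rfft2(pred))      # magnitude spectrum covers both
    target_mag = torch.abs(torch.fft.rfft2(target))  # low- and high-frequency bands
    return torch.mean(torch.abs(pred_mag - target_mag))
```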
https://arxiv.org/abs/2411.12530
Deploying robots in open-world environments involves complex tasks characterized by long sequences and rich interactions, necessitating efficient transfer of robotic skills across diverse and complex scenarios. To address this challenge, we propose a skill library framework based on knowledge graphs, which endows robots with high-level skill awareness and spatial semantic understanding. The framework hierarchically organizes operational knowledge by constructing a "task graph" and a "scene graph" to represent task and scene semantic information, respectively. We introduce a "state graph" to facilitate interaction between high-level task planning and low-level scene information. Furthermore, we propose a hierarchical transfer framework for operational skills. At the task level, the framework integrates contextual learning and chain-of-thought prompting within a four-stage prompt paradigm, leveraging large language models' (LLMs) reasoning and generalization capabilities to achieve task-level subtask sequence transfer. At the motion level, an adaptive trajectory transfer method is developed using the A* algorithm and the skill library, enabling motion-level adaptive trajectory transfer. At the physical level, we introduce an adaptive contour extraction and posture perception method based on tactile perception. This method dynamically obtains high-precision contour and posture information from visual-tactile texture data and adjusts transferred skills, such as contact positions and postures, to ensure effectiveness in new environments. Experimental results validate the effectiveness of the proposed methods. Project website: this https URL
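A minimal grid-based A* sketch for the motion-level step; the occupancy grid, unit step costs, and Manhattan heuristic are illustrative assumptions, and the actual method adapts trajectories using the skill library rather than planning from scratch:

```python
import heapq
import itertools

def astar(grid, start, goal):
    """grid: 2-D list, 0 = free, 1 = obstacle; start/goal: (row, col) tuples."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    tie = itertools.count()                   # breaks heap ties without comparing nodes
    frontier = [(h(start), 0, next(tie), start, None)]
    came, best = {}, {start: 0}
    while frontier:
        _, g, _, node, parent = heapq.heappop(frontier)
        if node in came:
            continue
        came[node] = parent
        if node == goal:                      # walk parent links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = came[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and g + 1 < best.get(nxt, 1e9)):
                best[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, next(tie), nxt, node))
    return None                               # goal unreachable
```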
https://arxiv.org/abs/2411.11714
The discovery of the Dead Sea Scrolls over 60 years ago is widely regarded as one of the greatest archaeological breakthroughs in modern history. Recent study of the scrolls presents ongoing computational challenges, including determining the provenance of fragments, clustering fragments by degree of similarity, and pairing fragments that originate from the same manuscript, all tasks that require focusing on individual letter and fragment shapes. This paper presents a computational method for segmenting ink and parchment regions in multispectral images of Dead Sea Scroll fragments. Using the newly developed Qumran Segmentation Dataset (QSD), consisting of 20 fragments, we apply multispectral thresholding to isolate ink and parchment regions based on their unique spectral signatures. To refine segmentation accuracy, we introduce an energy minimization technique that leverages ink contours, which are more distinguishable from the background and less noisy than inner ink regions. Experimental results demonstrate that this Multispectral Thresholding and Energy Minimization (MTEM) method achieves significant improvements over traditional binarization approaches such as Otsu and Sauvola in parchment segmentation, and successfully delineates ink borders, distinguishing them from holes and background regions.
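A minimal sketch of the multispectral-thresholding step, classifying pixels as ink, parchment, or background from per-band responses; the band index and thresholds are illustrative assumptions, not the paper's calibrated values, and the energy-minimization refinement is omitted:

```python
import numpy as np

def segment_multispectral(cube, ink_band=3, ink_thresh=0.25, parch_thresh=0.55):
    """cube: (H, W, B) multispectral stack normalized to [0, 1]."""
    band = cube[..., ink_band]
    ink = band < ink_thresh                   # ink absorbs strongly in this band
    parchment = (band >= ink_thresh) & (cube.mean(axis=-1) > parch_thresh)
    background = ~(ink | parchment)           # holes and backing fall through here
    return ink, parchment, background
```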
https://arxiv.org/abs/2411.10668
Probability theory is rarely used to achieve image segmentation with snake models. In this article, we present an active contour algorithm based on a probabilistic approach inspired by the work of A. Blake and the research of P. Réfrégier's team in France. Our algorithm, both very fast and highly accurate in its contour description, is easily adaptable to any specific application.
https://arxiv.org/abs/2411.09137
In recent years, algorithms that decompose an image into its structure and texture components have emerged. In this paper, we present an application of this type of decomposition to the problem of road network detection in aerial or satellite imagery. The algorithmic procedure involves the image decomposition (using a unique property), an alignment detection step based on Gestalt theory, and a refinement step using statistical active contours.
https://arxiv.org/abs/2411.08293
Inverse Lithography Technology (ILT) has emerged as a promising solution for photomask design and optimization. Relying on multi-beam mask writers, ILT enables the creation of free-form curvilinear mask shapes that enhance printed wafer image quality and process window. However, a major challenge in implementing curvilinear ILT for large-scale production is mask rule checking, an area currently under development by foundries and EDA vendors. Although recent research has incorporated mask complexity into the optimization process, much of it focuses on reducing e-beam shots, which does not align with the goals of curvilinear ILT. In this paper, we introduce a GPU-accelerated ILT algorithm that improves not only contour quality and process window but also the precision of curvilinear mask shapes. Our experiments on open benchmarks demonstrate a significant advantage of our algorithm over leading academic ILT engines.
https://arxiv.org/abs/2411.07311
In this paper, we propose to improve image decomposition algorithms in the case of noisy images. In \cite{gilles1,aujoluvw}, the authors propose to separate the structures, textures, and noise of an image. Unfortunately, the use of separable wavelets produces some artifacts. Here we propose to replace the wavelet transform by the contourlet transform, which better approximates the geometry in images. To that end, we define contourlet spaces and their associated norms, and obtain an iterative algorithm that we test on two noisy textured images.
https://arxiv.org/abs/2411.06696