The first violins appeared in late 16th-century Italy. Over the next 200 years, they spread across Europe, and luthiers of various royal courts, eager to experiment with new techniques, created a highly diverse family of instruments. Around 1750, size standards were introduced to unify violin making for orchestras and conservatories. Instruments that fell between two standards were then reduced to a smaller size by luthiers. These reductions affect several characteristics of violins, in particular the contour lines, i.e. lines of constant altitude, which look more like a U for non-reduced instruments and a V for reduced ones. While such differences are observed by experts, they have not been studied quantitatively. This paper presents a method for classifying violins as reduced or non-reduced based on their contour lines. We study a corpus of 25 instruments whose 3D geometric meshes were acquired via photogrammetry. For each instrument, we extract 10-20 contour lines regularly spaced every millimetre. Each line is fitted with a parabola-like curve (with an equation of the type y = alpha*abs(x)**beta) depending on two parameters, describing how open (beta) and how vertically stretched (alpha) the curve is. We compute additional features from those parameters, using regressions and counting how many values fall under some threshold. We also deal with outliers and unequal numbers of levels, and eventually obtain a numerical profile for each instrument. We then apply classification methods to assess whether geometry alone can predict size reduction. We find that distinguishing between reduced and non-reduced instruments is feasible to some degree, taking into account that a whole spectrum of more or less transformed violins exists, for which it is more difficult to quantify the reduction. We also find the opening parameter beta to be the most predictive.
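The curve-fitting step is simple to reproduce; below is a minimal sketch using scipy's curve_fit, assuming each extracted contour line is available as (x, y) samples with x centred on the instrument's axis of symmetry. Function and parameter names are illustrative, not taken from the paper's code.

```python
import numpy as np
from scipy.optimize import curve_fit

def arch_model(x, alpha, beta):
    # Parabola-like profile y = alpha * |x|**beta describing one contour line.
    return alpha * np.abs(x) ** beta

def fit_contour(x, y):
    """Fit one contour line and return (alpha, beta).

    beta near 2 corresponds to a U-shaped (parabolic) arch, beta near 1 to a V-shaped one.
    """
    (alpha, beta), _ = curve_fit(arch_model, x, y, p0=(1.0, 2.0), maxfev=10000)
    return alpha, beta

# Toy example: a synthetic U-shaped contour with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 0.8 * np.abs(x) ** 2.1 + rng.normal(scale=0.01, size=x.size)
print(fit_contour(x, y))  # roughly (0.8, 2.1)
```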
https://arxiv.org/abs/2507.07743
Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR95 down to 0.07) and component-level (F1-score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.
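A rough sketch of the score-calibration and boundary-refinement ideas follows, assuming an anomaly score map and a list of binary instance masks (e.g. from SAM) are already available; the exact operators and parameters used in Objectomaly may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def objectness_aware_calibration(score_map, instance_masks):
    """Replace per-pixel scores inside each instance mask by one object-level score."""
    calibrated = score_map.copy()
    for mask in instance_masks:            # mask: boolean array for one object instance
        calibrated[mask] = score_map[mask].mean()
    return calibrated

def refine_boundaries(score_map, sigma=1.5, edge_weight=0.5):
    """Sharpen contours: Gaussian smoothing plus a Laplacian edge term (illustrative weights)."""
    smoothed = gaussian_filter(score_map, sigma=sigma)
    edges = np.abs(laplace(smoothed))
    return smoothed + edge_weight * edges
```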
https://arxiv.org/abs/2507.07460
Background: Accurate deformable image registration (DIR) is required for contour propagation and dose accumulation in MR-guided adaptive radiotherapy (MRgART). This study trained and evaluated a deep learning DIR method for domain-invariant MR-MR registration. Methods: A progressively refined registration and segmentation (ProRSeg) method was trained with 262 pairs of 3T MR simulation scans from prostate cancer patients using a weighted segmentation consistency loss. ProRSeg was tested on same- (58 pairs), cross- (72 1.5T MR Linac pairs), and mixed-domain (42 MRSim-MRL pairs) datasets for contour propagation accuracy of the clinical target volume (CTV), bladder, and rectum. Dose accumulation was performed for 42 patients undergoing 5-fraction MRgART. Results: ProRSeg demonstrated generalization for the bladder, with similar Dice Similarity Coefficients across domains (0.88, 0.87, 0.86). For the rectum and CTV, performance was domain-dependent, with higher accuracy on the cross-domain MRL dataset (DSC 0.89) than on same-domain data. The model's strong cross-domain performance prompted us to study the feasibility of using it for dose accumulation. Dose accumulation showed that 83.3% of patients met CTV coverage (D95 >= 40.0 Gy) and bladder sparing (D50 <= 20.0 Gy) constraints. All patients achieved the minimum mean target dose (>40.4 Gy), but only 9.5% remained under the upper limit (<42.0 Gy). Conclusions: ProRSeg showed reasonable multi-domain MR-MR registration performance for prostate cancer patients, with preliminary feasibility for evaluating treatment compliance with clinical constraints.
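The dose-accumulation compliance check quoted in the results can be expressed in a few lines of numpy, assuming the accumulated dose inside each structure is available as a flat array of per-voxel doses; the thresholds are those stated above, while the helper names are illustrative.

```python
import numpy as np

def dose_at_volume(doses_gy, volume_fraction):
    """Dx: minimum dose received by the hottest `volume_fraction` of the structure.
    D95 is the dose exceeded by 95% of voxels, i.e. the 5th percentile of dose values."""
    return np.percentile(doses_gy, 100.0 * (1.0 - volume_fraction))

def check_constraints(ctv_doses_gy, bladder_doses_gy):
    d95_ctv = dose_at_volume(ctv_doses_gy, 0.95)           # CTV coverage
    d50_bladder = dose_at_volume(bladder_doses_gy, 0.50)   # bladder sparing (median dose)
    mean_ctv = ctv_doses_gy.mean()
    return {
        "CTV D95 >= 40.0 Gy": d95_ctv >= 40.0,
        "Bladder D50 <= 20.0 Gy": d50_bladder <= 20.0,
        "40.4 Gy < mean CTV dose < 42.0 Gy": 40.4 < mean_ctv < 42.0,
    }
```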
https://arxiv.org/abs/2507.06966
Diffusion models have demonstrated exceptional performance across various domains due to their ability to model and generate complicated data distributions. However, when applied to PolSAR data, traditional real-valued diffusion models face challenges in capturing complex-valued phase information. Moreover, these models often struggle to preserve fine structural details. To address these limitations, we leverage the Contourlet transform, which provides rich multiscale and multidirectional representations well-suited for PolSAR imagery. We propose a structural knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain. Specifically, the complex Contourlet transform is first applied to decompose the data into low- and high-frequency subbands, enabling the extraction of statistical and boundary features. A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components. During the process, structural information from high-frequency coefficients is utilized to guide the diffusion process, improving edge preservation. Furthermore, multiscale and multidirectional high-frequency features are jointly learned to further boost classification accuracy. Experimental results on three real-world PolSAR datasets demonstrate that our approach surpasses state-of-the-art methods, particularly in preserving edge details and maintaining region homogeneity in complex terrain.
https://arxiv.org/abs/2507.05666
Precise control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. The ability to manipulate parameters like pitch and syllable rate is an important element for effective identity conversion, but can also be used independently for voice transformation, achieving goals that were historically addressed by vocoder-based methods. In this work, we explore a convolutional neural network-based approach that aims to provide means for modifying fundamental frequency (F0), phoneme sequences, intensity, and speaker identity. Rather than relying on disentanglement techniques, our model is explicitly conditioned on these factors to generate mel spectrograms, which are then converted into waveforms using a universal neural vocoder. Accordingly, during inference, F0 contours, phoneme sequences, and speaker embeddings can be freely adjusted, allowing for intuitively controlled voice transformations. We evaluate our approach on speaker conversion and expressive speech tasks using both perceptual and objective metrics. The results suggest that the proposed method offers substantial flexibility, while maintaining high intelligibility and speaker similarity.
https://arxiv.org/abs/2507.04817
Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM performs corner-point-by-corner-point regression of building contours directly, mimicking human annotators' labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to improve spatial understanding capability. Through comprehensive exploration of training strategies including pretraining, supervised fine-tuning, and preference optimization across the WHU, WHU-Mix, and CrowdAI datasets, VectorLLM significantly outperforms the previous SOTA methods by 5.6 AP, 7.1 AP, and 13.6 AP on the three datasets, respectively. Remarkably, VectorLLM exhibits strong zero-shot performance on unseen objects including aircraft, water bodies, and oil tanks, highlighting its potential for unified modeling of diverse remote sensing object contour extraction tasks. Overall, this work establishes a new paradigm for vector extraction in remote sensing, leveraging the topological reasoning capabilities of LLMs to achieve both high accuracy and exceptional generalization. All the code and weights will be published to promote community development.
https://arxiv.org/abs/2507.04664
In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, LLaVa, DINO) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we report the following findings: (a) In all datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improves on over-segmented face images compared to tightly cropped faces, suggesting the importance of contextual cues. For example, at a False Match Rate (FMR) of 0.01%, the True Match Rate (TMR) of OpenCLIP improved from 64.97% to 81.73% on the LFW dataset as the face crop increased from 112x112 to 250x250, while the TMR of the domain-specific AdaFace dropped from 99.09% to 77.31%. (c) A simple score-level fusion of a foundation model with a domain-specific FR model improved the accuracy at low FMRs. For example, the TMR of AdaFace when fused with BLIP improved from 72.64% to 83.31% at an FMR of 0.0001% on the IJB-B dataset and from 73.17% to 85.81% on the IJB-C dataset. (d) Foundation models, such as ChatGPT, can be used to impart explainability to the FR pipeline (e.g., "Despite minor lighting and head tilt differences, the two left-profile images show high consistency in forehead slope, nose shape, chin contour..."). In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace (e.g., "Although AdaFace assigns a low similarity score of 0.21, both images exhibit visual similarity...and the pair is likely of the same person"), thereby reiterating the importance of combining domain-specific FR models with generic foundation models in a judicious manner.
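The score-level fusion in finding (c) amounts to normalizing and combining the two matchers' similarity scores. A minimal sketch, assuming each model yields one similarity score per image pair; the min-max normalization and equal weighting are illustrative choices, not necessarily the scheme used in the paper.

```python
import numpy as np

def min_max_normalize(scores):
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def fuse_scores(fr_scores, foundation_scores, w=0.5):
    """Simple sum-rule fusion of a domain-specific FR matcher with a foundation model."""
    return w * min_max_normalize(fr_scores) + (1 - w) * min_max_normalize(foundation_scores)

# Example: similarity scores for a small batch of pairs from both models.
adaface = [0.91, 0.21, 0.75, 0.05]
blip = [0.66, 0.58, 0.70, 0.12]
print(fuse_scores(adaface, blip))
```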
https://arxiv.org/abs/2507.03541
Background and objective: Medical image segmentation is a core task in various clinical applications. However, acquiring large-scale, fully annotated medical image datasets is both time-consuming and costly. Scribble annotations, as a form of sparse labeling, provide an efficient and cost-effective alternative for medical image segmentation. However, the sparsity of scribble annotations limits feature learning of the target region and provides insufficient boundary supervision, which poses significant challenges for training segmentation networks. Methods: We propose TAB Net, a novel weakly-supervised medical image segmentation framework, consisting of two key components: the triplet augmentation self-recovery (TAS) module and the boundary-aware pseudo-label supervision (BAP) module. The TAS module enhances feature learning through three complementary augmentation strategies: intensity transformation improves the model's sensitivity to texture and contrast variations, cutout forces the network to capture local anatomical structures by masking key regions, and jigsaw augmentation strengthens the modeling of global anatomical layout by disrupting spatial continuity. By guiding the network to recover complete masks from diverse augmented inputs, TAS promotes a deeper semantic understanding of medical images under sparse supervision. The BAP module enhances pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions into a loss-weighted pseudo-label and introducing a boundary-aware loss for fine-grained contour refinement. Results: Experimental evaluations on two public datasets, ACDC and MSCMRseg, demonstrate that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation. Moreover, it achieves performance comparable to that of fully supervised methods.
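The three TAS augmentations are standard image operations; below is a hedged numpy sketch for a single-channel image, with hole sizes and grid counts chosen for illustration rather than taken from the paper (the jigsaw helper assumes the image dimensions are divisible by the grid size).

```python
import numpy as np

rng = np.random.default_rng(0)

def intensity_transform(img, gamma_range=(0.7, 1.4)):
    """Random gamma change to vary texture and contrast."""
    gamma = rng.uniform(*gamma_range)
    img = (img - img.min()) / (np.ptp(img) + 1e-8)
    return img ** gamma

def cutout(img, n_holes=3, size=32):
    """Mask random square regions so the network must rely on local anatomy elsewhere."""
    out = img.copy()
    h, w = out.shape
    for _ in range(n_holes):
        y, x = rng.integers(0, h - size), rng.integers(0, w - size)
        out[y:y + size, x:x + size] = 0.0
    return out

def jigsaw(img, grid=4):
    """Shuffle a grid of tiles to break spatial continuity (global layout cue)."""
    h, w = img.shape
    th, tw = h // grid, w // grid
    tiles = [img[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
             for i in range(grid) for j in range(grid)]
    order = rng.permutation(len(tiles))
    rows = [np.concatenate([tiles[order[i * grid + j]] for j in range(grid)], axis=1)
            for i in range(grid)]
    return np.concatenate(rows, axis=0)
```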
https://arxiv.org/abs/2507.02399
The shape of objects is an important source of visual information in a wide range of applications. One of the core challenges of shape quantification is to ensure that the extracted measurements remain invariant to transformations that preserve an object's intrinsic geometry, such as changing its size, orientation, and position in the image. In this work, we introduce ShapeEmbed, a self-supervised representation learning framework designed to encode the contour of objects in 2D images, represented as a Euclidean distance matrix, into a shape descriptor that is invariant to translation, scaling, rotation, reflection, and point indexing. Our approach overcomes the limitations of traditional shape descriptors while improving upon existing state-of-the-art autoencoder-based approaches. We demonstrate that the descriptors learned by our framework outperform their competitors in shape classification tasks on natural and biological images. We envision our approach to be of particular relevance to biological imaging applications.
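The input representation is straightforward to construct: the pairwise Euclidean distance matrix of sampled contour points is already invariant to translation, rotation, and reflection, and can be normalized to remove scale. A small sketch follows; the invariance to point indexing is learned by the network and is not shown here.

```python
import numpy as np

def distance_matrix(contour_xy, normalize_scale=True):
    """contour_xy: (N, 2) array of 2D contour points sampled along the outline."""
    diff = contour_xy[:, None, :] - contour_xy[None, :, :]
    dmat = np.linalg.norm(diff, axis=-1)       # invariant to translation/rotation/reflection
    if normalize_scale:
        dmat = dmat / (dmat.max() + 1e-12)     # remove dependence on object size
    return dmat

# Check invariance: rotate and translate the same contour and compare matrices.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
contour = np.random.default_rng(0).normal(size=(64, 2))
moved = contour @ R.T + np.array([5.0, -3.0])
print(np.allclose(distance_matrix(contour), distance_matrix(moved)))  # True
```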
https://arxiv.org/abs/2507.01009
Image logs are crucial in capturing high-quality geological information about subsurface formations. Among the various geological features that can be gleaned from a Formation Micro Imager log, vugs are essential for reservoir evaluation. This paper introduces an automated Vug Detection Model, leveraging advanced computer vision techniques to streamline the vug identification process. Manual and semi-automated methods are limited by individual bias, labour intensity, and inflexibility in parameter fine-tuning. Our methodology also introduces statistical analysis of vug characteristics. Pre-processing steps, including logical file extraction and normalization, ensured standardized and usable data. The six-step vug identification methodology encompasses top-k mode extraction, adaptive thresholding, contour identification, aggregation, advanced filtering, and optional filtering for low vuggy regions. The model's adaptability is evidenced by its ability to identify vugs missed during manual picking undertaken by experts. Results demonstrate the model's accuracy through validation against expert picks. Detailed metrics, such as count, mean, and standard deviation of vug areas within zones, were introduced, showcasing the model's capabilities compared to manual picking. The vug area distribution plot enhances understanding of vug types in the reservoir. This research focuses on the identification and characterization of vugs, which in turn aids in the better understanding of reservoirs.
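A compressed OpenCV sketch of the adaptive-thresholding, contour-identification, and area-statistics steps follows; the top-k mode extraction and the paper's advanced filters are omitted, and all parameter values are illustrative.

```python
import cv2
import numpy as np

def detect_vugs(image_gray, min_area=20.0, max_area=5000.0):
    """Return per-vug areas (in pixels) from a grayscale image-log section."""
    norm = cv2.normalize(image_gray, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    binary = cv2.adaptiveThreshold(norm, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, blockSize=31, C=5)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    areas = np.array([cv2.contourArea(c) for c in contours])
    return areas[(areas >= min_area) & (areas <= max_area)]   # basic size filtering

def zone_statistics(areas):
    """Count, mean, and standard deviation of vug areas within a zone."""
    return {"count": int(areas.size),
            "mean_area": float(areas.mean()) if areas.size else 0.0,
            "std_area": float(areas.std()) if areas.size else 0.0}
```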
https://arxiv.org/abs/2507.02988
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at this https URL.
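A hedged PyTorch sketch of the vision-centric idea: vision-derived queries attend first to audio features and then to visual features. The module below is illustrative only; it is not the PPQG module or the exact block used in VCT.

```python
import torch
import torch.nn as nn

class VisionCentricAttention(nn.Module):
    """Vision-derived queries fetch matching audio and visual information (illustrative)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vision_queries, audio_feats, visual_feats):
        # vision_queries: (B, Nq, C); audio_feats: (B, Ta, C); visual_feats: (B, HW, C)
        q = vision_queries
        a, _ = self.audio_attn(q, audio_feats, audio_feats)      # pick the matching sound
        q = self.norm1(q + a)
        v, _ = self.visual_attn(q, visual_feats, visual_feats)   # recover spatial detail
        return self.norm2(q + v)

# One refinement iteration on random tensors.
layer = VisionCentricAttention()
out = layer(torch.randn(2, 16, 256), torch.randn(2, 10, 256), torch.randn(2, 1024, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```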
https://arxiv.org/abs/2506.23623
Fever screening based on infrared thermographs (IRTs) is a viable mass screening approach during infectious disease pandemics, such as Ebola and SARS, for temperature monitoring in public places like hospitals and airports. IRTs have been found to be a powerful, quick, and non-invasive means of detecting elevated temperatures. Moreover, regions medially adjacent to the inner canthi (called the canthi regions in this paper) are preferred sites for fever screening. Accurate localization of the canthi regions can be achieved through multi-modal registration of infrared (IR) and white-light images. We propose a registration method based on a coarse-to-fine registration strategy that uses different registration models relying on landmarks and edge detection on eye contours. We evaluated the registration accuracy to be within 2.7 mm, which enables accurate localization of the canthi regions.
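The coarse, landmark-based stage can be sketched with OpenCV: a similarity transform estimated from a few corresponding landmarks (e.g. eye corners) maps the IR image onto the white-light frame. The contour-based fine stage is omitted, and the function below is an illustrative assumption rather than the paper's implementation.

```python
import cv2
import numpy as np

def coarse_register(ir_image, ir_landmarks, wl_landmarks, wl_shape):
    """Warp the IR image onto the white-light frame using matched landmark pairs.

    ir_landmarks, wl_landmarks: (N, 2) float arrays of corresponding points (N >= 2),
    e.g. inner/outer eye corners detected in each modality.
    """
    M, _ = cv2.estimateAffinePartial2D(np.asarray(ir_landmarks, np.float32),
                                       np.asarray(wl_landmarks, np.float32))
    h, w = wl_shape[:2]
    return cv2.warpAffine(ir_image, M, (w, h)), M
```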
https://arxiv.org/abs/2507.02955
Singing voice synthesis (SVS) aims to generate expressive and high-quality vocals from musical scores, requiring precise modeling of pitch, duration, and articulation. While diffusion-based models have achieved remarkable success in image and video generation, their application to SVS remains challenging due to the complex acoustic and musical characteristics of singing, often resulting in artifacts that degrade naturalness. In this work, we propose SmoothSinger, a conditional diffusion model designed to synthesize high-quality and natural singing voices. Unlike prior methods that depend on vocoders as a final stage and often introduce distortion, SmoothSinger refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines. The model adopts a reference-guided dual-branch architecture, using low-quality audio from any baseline system as a reference to guide the denoising process, enabling more expressive and context-aware synthesis. Furthermore, it enhances the conventional U-Net with a parallel low-frequency upsampling path, allowing the model to better capture pitch contours and long-term spectral dependencies. To improve alignment during training, we replace the reference audio with degraded ground-truth audio, addressing the temporal mismatch between reference and target signals. Experiments on the Opencpop dataset, a large-scale Chinese singing corpus, demonstrate that SmoothSinger achieves state-of-the-art results in both objective and subjective evaluations. Extensive ablation studies confirm its effectiveness in reducing artifacts and improving the naturalness of synthesized voices.
https://arxiv.org/abs/2506.21478
This paper presents a comprehensive derivation and implementation of the Chan-Vese active contour model for image segmentation. The model, derived from the Mumford-Shah variational framework, evolves contours based on regional intensity differences rather than image gradients, making it highly effective for segmenting noisy images or images with weak boundaries. We provide a rigorous mathematical derivation of the level set formulation, including detailed treatment of each energy term using the divergence theorem and curve evolution theory. The resulting algorithm is implemented in Python using finite difference methods with special care to numerical stability, including an upwind entropy scheme and curvature-based regularization. Experimental results on medical and synthetic images demonstrate accurate segmentation, robustness to noise, and superior performance compared to classical edge-based methods. This study confirms the suitability of the Chan-Vese model for complex segmentation tasks and highlights its potential for use in real-world imaging applications.
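For reference, scikit-image ships a Chan-Vese implementation built on the same energy terms; a minimal usage sketch on a noisy synthetic image follows (this is not the paper's own implementation, and the parameter values are illustrative).

```python
import numpy as np
from skimage.segmentation import chan_vese

# Noisy synthetic image: a bright disk with weak boundaries.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[:128, :128]
image = ((xx - 64) ** 2 + (yy - 64) ** 2 < 30 ** 2).astype(float)
image += rng.normal(scale=0.4, size=image.shape)

# mu weights the contour-length penalty; lambda1/lambda2 weight inside/outside fidelity.
segmentation = chan_vese(image, mu=0.25, lambda1=1.0, lambda2=1.0, tol=1e-3)
print(segmentation.shape, segmentation.dtype)   # (128, 128) bool
```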
https://arxiv.org/abs/2506.19344
The Frequency Following Response (FFR) reflects the brain's neural encoding of auditory stimuli including speech. Because the fundamental frequency (F0), a physical correlate of pitch, is one of the essential features of speech, there has been particular interest in characterizing the FFR at F0, especially when F0 varies over time. The standard method for extracting F0 in FFRs has been the Autocorrelation Function (ACF). This paper investigates harmonic-structure-based F0 estimation algorithms, originally developed for speech and music, and resolves their poor performance when applied to FFRs in two steps. Firstly, given that unlike in speech or music, stimulus F0 of FFRs is already known, we introduce a stimulus-aware filterbank that selectively aggregates amplitudes at F0 and its harmonics while suppressing noise at non-harmonic frequencies. This method, called Harmonic Amplitude Summation (HAS), evaluates F0 candidates only within a range centered around the stimulus F0. Secondly, unlike other pitch tracking methods that select the highest peak, our method chooses the most prominent one, as it better reflects the underlying periodicity of FFRs. To the best of our knowledge, this is the first study to propose an F0 estimation algorithm for FFRs that relies on harmonic structure. Analyzing recorded FFRs from 16 normal hearing subjects to 4 natural speech stimuli with a wide F0 variation from 89 Hz to 452 Hz showed that this method outperformed ACF by reducing the average Root-Mean-Square-Error (RMSE) within each response and stimulus F0 contour pair by 8.8% to 47.4%, depending on the stimulus.
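The core of Harmonic Amplitude Summation is compact: for F0 candidates in a window around the known stimulus F0, sum the spectral magnitudes at the candidate and its harmonics, then pick the most prominent candidate. The sketch below is illustrative; the window width, number of harmonics, and prominence criterion are assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import find_peaks

def has_f0(frame, fs, stimulus_f0, n_harmonics=5, search_semitones=2.0):
    """Estimate the F0 of one FFR frame by summing spectral magnitudes at harmonics of each candidate."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    # Only evaluate candidates in a window centred on the known stimulus F0.
    lo = stimulus_f0 * 2 ** (-search_semitones / 12)
    hi = stimulus_f0 * 2 ** (search_semitones / 12)
    candidates = np.linspace(lo, hi, 200)

    scores = np.zeros_like(candidates)
    for i, f0 in enumerate(candidates):
        harmonics = f0 * np.arange(1, n_harmonics + 1)
        idx = np.clip(np.searchsorted(freqs, harmonics), 0, len(spectrum) - 1)
        scores[i] = spectrum[idx].sum()

    # Prefer the most prominent peak of the score curve rather than simply the highest value.
    peaks, props = find_peaks(scores, prominence=0.0)
    if len(peaks) == 0:
        return candidates[np.argmax(scores)]
    return candidates[peaks[np.argmax(props["prominences"])]]
```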
https://arxiv.org/abs/2506.19253
We introduce MARL-MambaContour, the first contour-based medical image segmentation framework based on Multi-Agent Reinforcement Learning (MARL). Our approach reframes segmentation as a multi-agent cooperation task focused on generating topologically consistent object-level contours, addressing the limitations of traditional pixel-based methods, which can lack topological constraints and holistic structural awareness of anatomical regions. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, enabling adaptation to blurred edges and intricate morphologies common in medical images. This iterative adjustment process is optimized by a contour-specific Soft Actor-Critic (SAC) algorithm, further enhanced with the Entropy Regularization Adjustment Mechanism (ERAM), which dynamically balances agent exploration with contour smoothness. Furthermore, the framework incorporates a Mamba-based policy network featuring a novel Bidirectional Cross-attention Hidden-state Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion limitations associated with long-range modeling in state space models, thereby facilitating more accurate inter-agent information exchange and informed decision-making. Extensive experiments on five diverse medical imaging datasets demonstrate the state-of-the-art performance of MARL-MambaContour, highlighting its potential as an accurate and robust tool for clinical application.
https://arxiv.org/abs/2506.18679
Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.
https://arxiv.org/abs/2506.15560
Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performance. CNN is relatively sensitive to high-frequency facial features, such as component contours and facial outlines. Meanwhile, Mamba excels at capturing low-frequency features like facial color and fine-grained texture, and does so with lower complexity than Transformers. Motivated by these observations, we propose FADPNet, a Frequency-Aware Dual-Path Network that decomposes facial features into low- and high-frequency components and processes them via dedicated branches. For low-frequency regions, we introduce a Mamba-based Low-Frequency Enhancement Block (LFEB), which combines state-space attention with squeeze-and-excitation operations to extract low-frequency global interactions and emphasize informative channels. For high-frequency regions, we design a CNN-based Deep Position-Aware Attention (DPA) module to enhance spatially-dependent structural details, complemented by a lightweight High-Frequency Refinement (HFR) module that further refines frequency-specific representations. Through the above designs, our method achieves an excellent balance between FSR quality and model efficiency, outperforming existing approaches.
https://arxiv.org/abs/2506.14121
Overlapping object perception aims to decouple the randomly overlapping foreground-background features, extracting foreground features while suppressing background features, which holds significant application value in fields such as security screening and medical auxiliary diagnosis. Despite some research efforts to tackle the challenge of overlapping object perception, most solutions are confined to the spatial domain. Through frequency domain analysis, we observe that the degradation of contours and textures due to the overlapping phenomenon can be intuitively reflected in the magnitude spectrum. Based on this observation, we propose a general Frequency-Optimized Anti-Overlapping Framework (FOAM) to assist the model in extracting more texture and contour information, thereby enhancing the ability for anti-overlapping object perception. Specifically, we design the Frequency Spatial Transformer Block (FSTB), which can simultaneously extract features from both the frequency and spatial domains, helping the network capture more texture features from the foreground. In addition, we introduce the Hierarchical De-Corrupting (HDC) mechanism, which aligns adjacent features in the separately constructed base branch and corruption branch using a specially designed consistent loss during the training phase. This mechanism suppresses the response to irrelevant background features of FSTBs, thereby improving the perception of foreground contour. We conduct extensive experiments to validate the effectiveness and generalization of the proposed FOAM, which further improves the accuracy of state-of-the-art models on four datasets, specifically for the three overlapping object perception tasks: Prohibited Item Detection, Prohibited Item Segmentation, and Pneumonia Detection. The code will be open source once the paper is accepted.
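The frequency-domain observation motivating FOAM is easy to reproduce: the 2-D log-magnitude spectrum of a patch shows how overlap (here simulated by simple alpha-blending) attenuates the foreground's spectral content. The snippet below only illustrates that observation; the FSTB and HDC modules themselves are not reproduced.

```python
import numpy as np

def log_magnitude_spectrum(image):
    """Centered log-magnitude spectrum of a 2-D image patch."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.log1p(np.abs(spectrum))

# Overlap simulated by alpha-blending a textured foreground with a flat background:
# non-DC magnitudes of the blended patch are roughly halved relative to the pure foreground.
rng = np.random.default_rng(0)
foreground = rng.normal(size=(64, 64))
background = np.ones((64, 64))
blended = 0.5 * foreground + 0.5 * background
print(log_magnitude_spectrum(foreground).mean(), log_magnitude_spectrum(blended).mean())
```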
https://arxiv.org/abs/2506.13501
Purpose: Accurate intraoperative X-ray/CT registration is essential for surgical navigation in orthopedic procedures. However, existing methods struggle to consistently achieve sub-millimeter accuracy, lack robustness under broad initial pose estimates, or require manual key-point annotations. This work aims to address these challenges by proposing a novel multi-view X-ray/CT registration method for intraoperative bone registration. Methods: The proposed registration method consists of a multi-view, contour-based iterative closest point (ICP) optimization. Unlike previous methods, which attempt to match bone contours across the entire silhouette in both imaging modalities, we focus on matching specific subcategories of contours corresponding to bone substructures. This reduces ambiguity in the ICP matches, resulting in a more robust and accurate registration solution. The approach requires only two X-ray images and operates fully automatically. Additionally, we contribute a dataset of 5 cadaveric specimens, including real X-ray images, X-ray image poses, and the corresponding CT scans. Results: The proposed registration method is evaluated on real X-ray images using the mean reprojection error (mRPD). The method consistently achieves sub-millimeter accuracy, with an mRPD of 0.67 mm compared to 5.35 mm for a commercial solution requiring manual intervention. Furthermore, the method offers improved practical applicability, being fully automatic. Conclusion: Our method offers a practical, accurate, and efficient solution for multi-view X-ray/CT registration in orthopedic surgeries, which can be easily combined with tracking systems. By improving registration accuracy and minimizing manual intervention, it enhances intraoperative navigation, contributing to more accurate and effective surgical outcomes in computer-assisted surgery (CAS).
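The contour-matching core is a standard point-to-point ICP; below is a minimal 2-D sketch in which correspondences are restricted to contour points sharing the same substructure label, the idea used here to reduce ambiguity. The rigid 2-D setting and the data layout are illustrative simplifications of the multi-view 3-D problem.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(source_pts, source_labels, target_pts, target_labels):
    """One ICP iteration matching only contour points of the same substructure label.

    source_pts, target_pts: (N, 2) arrays; *_labels: integer substructure ids per point.
    Returns rotation R (2x2) and translation t (2,) aligning source onto target.
    """
    matched_src, matched_tgt = [], []
    for label in np.unique(source_labels):
        src = source_pts[source_labels == label]
        tgt = target_pts[target_labels == label]
        if len(src) == 0 or len(tgt) == 0:
            continue
        _, idx = cKDTree(tgt).query(src)          # nearest neighbours within the same label
        matched_src.append(src)
        matched_tgt.append(tgt[idx])
    src = np.concatenate(matched_src)
    tgt = np.concatenate(matched_tgt)

    # Closed-form rigid alignment (Kabsch / Procrustes) of the matched pairs.
    src_c, tgt_c = src - src.mean(0), tgt - tgt.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ tgt_c)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = tgt.mean(0) - R @ src.mean(0)
    return R, t
```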
https://arxiv.org/abs/2506.13292