Breast cancer is one of the most common causes of death among women worldwide, with millions of fatalities annually. Magnetic Resonance Imaging (MRI) provides various sequences for characterizing tumor morphology and internal patterns, and has become an effective tool for the detection and diagnosis of breast tumors. However, previous deep-learning-based tumor segmentation methods have limitations in accurately locating tumor contours due to the low contrast between cancerous and normal areas and blurred boundaries. Leveraging text prompt information holds promise for improving tumor segmentation by delineating segmentation regions. Inspired by this, we propose a text-guided Breast Tumor Segmentation model (TextBCS) with stage-divided vision-language interaction and evidential learning. Specifically, the proposed stage-divided vision-language interaction facilitates mutual information exchange between visual and text features at each down-sampling stage, further exploiting the advantages of text prompts to assist in locating lesion areas in low-contrast scenarios. Moreover, evidential learning is adopted to quantify the model's segmentation uncertainty for blurred boundaries. It utilizes a variational Dirichlet distribution to characterize the distribution of segmentation probabilities, addressing the segmentation uncertainty at boundaries. Extensive experiments validate the superiority of our TextBCS over other segmentation networks, showcasing the best breast tumor segmentation performance on publicly available datasets.
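The Dirichlet-based uncertainty idea can be illustrated with a minimal numpy sketch of the standard subjective-logic parameterization used in evidential segmentation; the abstract does not give TextBCS's exact head or loss, so this is a generic illustration, not the paper's implementation:

```python
import numpy as np

def dirichlet_segmentation(evidence):
    """Turn per-class non-negative evidence (K, H, W) into Dirichlet
    parameters, expected class probabilities, and per-pixel uncertainty.
    Low total evidence (e.g. at blurred boundaries) yields high vacuity."""
    K = evidence.shape[0]
    alpha = evidence + 1.0                  # Dirichlet concentration
    S = alpha.sum(axis=0, keepdims=True)    # Dirichlet strength per pixel
    prob = alpha / S                        # expected segmentation probability
    uncertainty = K / S[0]                  # vacuity mass in [0, 1]
    return prob, uncertainty
```

With zero evidence the prediction is uniform and uncertainty is maximal; strong single-class evidence drives uncertainty toward zero, which is how boundary pixels can be flagged as unreliable.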
https://arxiv.org/abs/2603.11206
Delineating the clinical target volume (CTV) in radiotherapy involves complex margins constrained by tumor location and anatomical barriers. While deep learning models automate this process, their rigid reliance on expert-annotated data requires costly retraining whenever clinical guidelines are updated. To overcome this limitation, we introduce OncoAgent, a novel guideline-aware AI agent framework that seamlessly converts textual clinical guidelines into three-dimensional target contours in a training-free manner. Evaluated on esophageal cancer cases, the agent achieves a zero-shot Dice similarity coefficient of 0.842 for the CTV and 0.880 for the planning target volume, demonstrating performance highly comparable to a fully supervised nnU-Net baseline. Notably, in a blinded clinical evaluation, physicians strongly preferred OncoAgent over the supervised baseline, rating it higher in guideline compliance, modification effort, and clinical acceptability. Furthermore, the framework generalizes zero-shot to alternative esophageal guidelines and other anatomical sites (e.g., prostate) without any retraining. Beyond mere volumetric overlap, our agent-based paradigm offers near-instantaneous adaptability to alternative guidelines, providing a scalable and transparent pathway toward interpretability in radiotherapy treatment planning.
https://arxiv.org/abs/2603.09448
Medical image segmentation is crucial for computer-aided diagnosis, which requires understanding both coarse morphological and semantic structures and carving fine boundaries. The morphological and semantic structures in medical images are beneficial and stable clues for target understanding. In contrast, the fine boundaries of medical targets (such as tumors and lesions) are usually ambiguous and noisy due to lesion overlap, annotation uncertainty, and other factors, making them unreliable as early supervision. Nevertheless, existing methods learn coarse structures and fine boundaries simultaneously throughout the training process. In this paper, we propose a structure- and progress-aware diffusion (SPAD) for medical image segmentation, which consists of a semantic-concentrated diffusion (ScD) and a boundary-centralized diffusion (BcD) modulated by a progress-aware scheduler (PaS). Specifically, the semantic-concentrated diffusion introduces anchor-preserved target perturbation, which perturbs pixels within a medical target but preserves unaltered areas as semantic anchors, encouraging the model to infer noisy target areas from the surrounding semantic context. The boundary-centralized diffusion introduces progress-aware boundary noise, which blurs unreliable and ambiguous boundaries, thus compelling the model to focus on coarse but stable anatomical morphology and global semantics. Furthermore, the progress-aware scheduler gradually modulates the noise intensity of the ScD and BcD, forming a coarse-to-fine diffusion paradigm that encourages focusing on coarse morphological and semantic structures during early target-understanding stages and gradually shifting to fine target boundaries during later contour-adjustment stages.
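The abstract does not give the scheduler's functional form; as one plausible instantiation of a progress-aware schedule, boundary-noise intensity could decay from strong to weak with a cosine as training progress advances (an assumption for illustration only):

```python
import math

def progress_aware_noise(t, sigma_max=1.0, sigma_min=0.0):
    """Toy progress-aware schedule: boundary-noise intensity decays with
    training progress t in [0, 1], so early training sees heavily blurred
    boundaries (coarse-structure focus) and late training sees nearly
    clean boundaries (fine contour adjustment)."""
    assert 0.0 <= t <= 1.0
    w = 0.5 * (1.0 + math.cos(math.pi * t))  # 1 at t=0, 0 at t=1
    return sigma_min + (sigma_max - sigma_min) * w
```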
https://arxiv.org/abs/2603.07889
Scene understanding plays a critical role in enabling intelligence and autonomy in robotic systems. Traditional approaches often face challenges, including occlusions, ambiguous boundaries, and the inability to adapt attention based on task-specific requirements and sample variations. To address these limitations, this paper presents an efficient RGB-D scene understanding model that performs a range of tasks, including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. The proposed model incorporates an enhanced fusion encoder, which effectively leverages redundant information from both RGB and depth inputs. For semantic segmentation, we introduce normalized focus channel layers and a context feature interaction layer, designed to mitigate issues such as shallow feature misguidance and insufficient local-global feature representation. The instance segmentation task benefits from a non-bottleneck 1D structure, which achieves superior contour representation with fewer parameters. Additionally, we propose a multi-task adaptive loss function that dynamically adjusts the learning strategy for different tasks based on scene variations. Extensive experiments on the NYUv2, SUN RGB-D, and Cityscapes datasets demonstrate that our approach outperforms existing methods in both segmentation accuracy and processing speed.
https://arxiv.org/abs/2603.07570
Automatic detection of Parkinson's disease (PD) from speech is a promising non-invasive diagnostic tool, but it raises significant privacy concerns. Speaker anonymization mitigates these risks, but it may suppress the pathological information necessary for PD detection. We assess the trade-off between privacy and PD detection for two anonymizers (STT-TTS and kNN-VC) using two Spanish datasets. STT-TTS provides better privacy but severely degrades PD detection by eradicating prosodic information. kNN-VC preserves macro-prosodic features such as duration and F0 contours, achieving F1 scores only 3-7\% lower than original baselines, demonstrating that privacy-preserving PD detection is viable when using appropriate anonymization. Finally, an acoustic distortion analysis characterizes specific weaknesses in kNN-VC, offering insights for designing anonymizers that better preserve PD information.
https://arxiv.org/abs/2603.07544
Artificial intelligence-based radiation therapy (RT) planning has the potential to reduce planning time and inter-planner variability, improving efficiency and consistency in clinical workflows. Most existing automated approaches rely on multiple dose evaluations and corrections, resulting in plan generation times of several minutes. We introduce AIRT (Artificial Intelligence-based Radiotherapy), an end-to-end deep-learning framework that directly infers deliverable treatment plans from CT images and structure contours. AIRT generates single-arc VMAT prostate plans, from imaging and anatomical inputs to leaf sequencing, in under one second on a single Nvidia A100 GPU. The framework includes differentiable dose feedback, adversarial fluence-map shaping, and plan-generation augmentation to improve plan quality and robustness. The model was trained on more than 10,000 intact prostate cases. Non-inferiority to RapidPlan Eclipse was demonstrated across target coverage and OAR sparing metrics. Target homogeneity (HI = 0.10 $\pm$ 0.01) and OAR sparing were similar to reference plans when evaluated using AcurosXB. These results represent a significant step toward ultra-fast standardized RT planning and a streamlined clinical workflow.
https://arxiv.org/abs/2603.06338
With the increasing deployment of intelligent sensing technologies in highly sensitive environments such as restrooms and locker rooms, visual surveillance systems face a profound privacy-security paradox. Existing privacy-preserving approaches, including physical desensitization, encryption, and obfuscation, often compromise semantic understanding or fail to ensure mathematically provable irreversibility. Although Privacy Camera 1.0 eliminated visual data at the source to prevent leakage, it provided only textual judgments, leading to evidentiary blind spots in disputes. To address these limitations, this paper proposes a novel privacy-preserving perception framework based on the AI Flow paradigm and a collaborative edge-cloud architecture. By deploying a visual desensitizer at the edge, raw images are transformed in real time into abstract feature vectors through nonlinear mapping and stochastic noise injection under the Information Bottleneck principle, ensuring identity-sensitive information is stripped and original images are mathematically unreconstructable. The abstract representations are transmitted to the cloud for behavior recognition and semantic reconstruction via a "dynamic contour" visual language, achieving a critical balance between perception and privacy while enabling illustrative visual reference without exposing raw images.
https://arxiv.org/abs/2603.04775
Agile quadrotor flight pushes the limits of control, actuation, and onboard perception. While time-optimal trajectory planning has been extensively studied, existing approaches typically neglect the tight coupling between vehicle dynamics, environmental geometry, and the visual requirements of onboard state estimation. As a result, trajectories that are dynamically feasible may fail in closed-loop execution due to degraded visual quality. This paper introduces a unified time-optimal trajectory optimization framework for vision-based quadrotors that explicitly incorporates perception constraints alongside full nonlinear dynamics, rotor actuation limits, aerodynamic effects, camera field-of-view constraints, and convex geometric gate representations. The proposed formulation solves minimum-time lap trajectories for arbitrary racetracks with diverse gate shapes and orientations, while remaining numerically robust and computationally efficient. We derive an information-theoretic position uncertainty metric to quantify visual state-estimation quality and integrate it into the planner through three perception objectives: position uncertainty minimization, sequential field-of-view constraints, and look-ahead alignment. This enables systematic exploration of the trade-offs between speed and perceptual reliability. To accurately track the resulting perception-aware trajectories, we develop a model predictive contouring tracking controller that separates lateral and progress errors. Experiments demonstrate real-world flight speeds up to 9.8 m/s with 0.07 m average tracking error, and closed-loop success rates improved from 55% to 100% on a challenging Split-S course. The proposed system provides a scalable benchmark for studying the fundamental limits of perception-aware, time-optimal autonomous flight.
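As a toy illustration of the camera field-of-view constraints mentioned above (the paper's actual constraint formulation inside the optimizer is not given here), a gate can be checked for visibility against a conical FOV around the camera's forward axis:

```python
import numpy as np

def in_fov(cam_pos, cam_forward, target, half_fov_rad):
    """Field-of-view visibility test: True if the bearing from the camera
    to the target makes an angle <= half_fov_rad with the forward axis."""
    v = np.asarray(target, dtype=float) - np.asarray(cam_pos, dtype=float)
    f = np.asarray(cam_forward, dtype=float)
    cos_a = (v @ f) / (np.linalg.norm(v) * np.linalg.norm(f))
    return float(np.arccos(np.clip(cos_a, -1.0, 1.0))) <= half_fov_rad
```

In a planner, a differentiable version of this angle (rather than the boolean) would enter the cost or constraint set.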
https://arxiv.org/abs/2603.04305
Image reconstruction in the presence of severe degradation remains a challenging inverse problem, particularly in beam diagnostics for high-energy physics accelerators. As modern facilities demand precise detection of beam halo structures to control losses, traditional analysis tools have reached their performance limits. This work reviews existing image-processing techniques for data cleaning, contour extraction, and emittance reconstruction, and introduces a novel approach based on convolutional filtering and neural networks with optimized early-stopping strategies in order to control overfitting. Despite the absence of training datasets, the proposed unsupervised framework achieves robust denoising and high-fidelity reconstruction of beam emittance images under low signal-to-noise conditions. The method extends measurable amplitudes beyond seven standard deviations, enabling unprecedented halo resolution.
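The abstract mentions optimized early stopping to control overfitting; the specific strategy is not detailed, but a generic patience-based stopper on a monitored loss looks like this:

```python
class EarlyStopper:
    """Patience-based early stopping: signal a stop when the monitored
    loss has not improved by at least min_delta for `patience` steps."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float('inf')
        self.bad_steps = 0

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best, self.bad_steps = loss, 0  # improvement: reset counter
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience   # True -> stop training
```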
https://arxiv.org/abs/2603.06689
Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.
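Label-free sketch extraction via edge-based detectors can be approximated with a plain Sobel filter; this numpy sketch is an assumption-level stand-in for whatever detector the paper actually uses:

```python
import numpy as np

def sketch_edges(gray, thresh=0.2):
    """Label-free structural 'sketch': Sobel gradient magnitude of a
    grayscale image, normalized and thresholded to keep strong edges."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(np.asarray(gray, dtype=float), 1, mode='edge')
    h, w = gray.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):          # correlate with the 3x3 Sobel kernels
        for j in range(3):
            patch = pad[i:i + h, j:j + w]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    mag = np.hypot(gx, gy)
    return mag / max(mag.max(), 1e-12) > thresh
```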
https://arxiv.org/abs/2603.05537
Face recognition remains vulnerable to presentation attacks, calling for robust Face Anti-Spoofing (FAS) solutions. Recent MLLM-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves SOTA performance while providing fine-grained visual investigation for trustworthy spoof detection.
https://arxiv.org/abs/2603.01038
We present Neural Image-Space Tessellation (NIST), a lightweight screen-space post-processing approach that produces the visual effect of tessellated geometry while rendering only the original low-polygon meshes. Inspired by an observation from Phong tessellation, NIST leverages the discrepancy between geometric normals and shading normals as a minimal, view-dependent cue for silhouette refinement. At its core, NIST performs multi-scale neural tessellation by progressively deforming image-space contours with convolutional operators, while jointly reassigning appearance information through an implicit warping mechanism to preserve texture coherence and visual fidelity. Experiments demonstrate that our approach produces smooth, visually coherent silhouettes comparable to geometric tessellation, while incurring a constant per-frame cost that is fully decoupled from geometric complexity, making it well-suited for large-scale real-time rendering scenarios. To the best of our knowledge, NIST is the first work to reformulate tessellation as a post-processing operation, shifting it from a pre-rendering geometry pipeline to a screen-space neural post-processing stage.
https://arxiv.org/abs/2602.23754
The binary segmentation of whiteboard strokes is hindered by extreme class imbalance: stroke pixels constitute only $1.79\%$ of the image on average, and the thin-stroke subset averages only $1.14\% \pm 0.41\%$ foreground. Standard region metrics (F1, IoU) can mask thin-stroke failures because the vast majority of the background dominates the score. In contrast, adding boundary-aware metrics and a thin-subset equity analysis changes how loss functions rank and exposes hidden trade-offs. We contribute an evaluation protocol that jointly examines region metrics, boundary metrics (BF1, B-IoU), a core/thin-subset equity analysis, and per-image robustness statistics (median, IQR, worst-case) under seeded, multi-run training with non-parametric significance testing. Five losses -- cross-entropy, focal, Dice, Dice+focal, and Tversky -- are each trained three times on a DeepLabV3-MobileNetV3 model and evaluated on 12 held-out images split into core and thin subsets. Overlap-based losses improve F1 by more than 20 points over cross-entropy ($0.663$ vs $0.438$, $p < 0.001$). In addition, the boundary metrics confirm that the gain extends to contour precision. Adaptive thresholding and Sauvola binarization at native resolution achieve a higher mean F1 ($0.787$ for Sauvola) but with substantially worse worst-case performance (F1 $= 0.452$ vs $0.565$ for Tversky), exposing a consistency-accuracy trade-off: classical baselines lead on mean F1, while the learned model delivers higher worst-case reliability. Doubling the training resolution further increases F1 by 12.7 points.
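A minimal numpy version of the Tversky loss evaluated above clarifies the imbalance handling: $\alpha = \beta = 0.5$ recovers the Dice loss, and raising $\beta$ penalizes missed (thin-stroke) foreground pixels more heavily:

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.7, beta=0.3, eps=1e-7):
    """Tversky loss for binary segmentation probabilities.
    alpha weights false positives, beta weights false negatives;
    alpha = beta = 0.5 reduces to the Dice loss."""
    pred = pred.ravel().astype(float)
    target = target.ravel().astype(float)
    tp = (pred * target).sum()
    fp = (pred * (1.0 - target)).sum()
    fn = ((1.0 - pred) * target).sum()
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - tversky
```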
https://arxiv.org/abs/2603.00163
Automating garment assembly from sewing patterns remains a significant challenge due to the lack of standardized annotation protocols and the frequent absence of semantic cues. Existing methods often rely on panel labels or handcrafted heuristics, which limit their applicability to real-world, non-conforming patterns. We present AutoSew, a fully automatic, geometry-based approach for predicting stitch correspondences directly from 2D pattern contours. AutoSew formulates the problem as a graph matching task, leveraging a Graph Neural Network to capture local and global geometric context, and employing a differentiable optimal transport solver to infer stitching relationships, including multi-edge connections. To support this task, we update the GarmentCodeData dataset, modifying over 18k patterns with realistic multi-edge annotations that reflect industrial assembly scenarios. AutoSew achieves a 96% F1-score and successfully assembles 73.3% of test garments without error, outperforming existing methods while relying solely on geometric input. Our results demonstrate that geometry alone can robustly guide stitching prediction, enabling scalable garment assembly without manual input.
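Differentiable optimal-transport solvers for matching are commonly Sinkhorn-style; the abstract does not name the exact solver, so this is an illustrative sketch of how pairwise edge affinities become a soft matching:

```python
import numpy as np

def sinkhorn(scores, n_iters=100, tau=1.0):
    """Sinkhorn normalization of a pairwise affinity matrix: alternately
    rescale rows and columns of exp(scores / tau) until the result is
    (approximately) doubly stochastic, giving a soft edge matching."""
    P = np.exp(scores / tau)
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)  # row marginals -> 1
        P = P / P.sum(axis=0, keepdims=True)  # column marginals -> 1
    return P
```

Because every step is a smooth elementwise operation, gradients flow through the solver into the upstream Graph Neural Network.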
https://arxiv.org/abs/2602.22052
We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process requiring artists to carefully place vertices along visual boundaries, a task that typically takes 15-60 minutes per sprite. Our method addresses this through a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an IoU of 0.87, providing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061) while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement - the same sprite can be meshed validly in many different ways. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, representing a speedup of 300x-1200x over manual creation. We release our trained model to the game development community.
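Douglas-Peucker simplification, used above for the exterior contour vertices, can be sketched in a few lines (the adaptive arc subdivision and interior-vertex placement are omitted here):

```python
import math

def rdp(points, eps):
    """Douglas-Peucker polyline simplification: keep an interior point
    only if it deviates from the chord between the current endpoints by
    more than eps; otherwise collapse the run to its endpoints."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1e-12
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        d = abs(dy * (px - x1) - dx * (py - y1)) / norm  # chord distance
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        return [points[0], points[-1]]
    left = rdp(points[: idx + 1], eps)   # recurse on both halves
    right = rdp(points[idx:], eps)
    return left[:-1] + right
```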
https://arxiv.org/abs/2602.21153
We present SAM-H and WOFTSAM, novel planar trackers that combine the robust long-term segmentation tracking provided by SAM 2 with 8-degrees-of-freedom homography pose estimation. SAM-H estimates homographies from segmentation mask contours and is thus highly robust to target appearance changes. WOFTSAM significantly improves the current state-of-the-art planar tracker WOFT by exploiting the lost-target re-detection provided by SAM-H. The proposed methods are evaluated on the POT-210 and PlanarTrack tracking benchmarks, setting new state-of-the-art performance on both. On the latter, they outperform the second best by a large margin, +12.4 and +15.2pp on the p@15 metric. We also present improved ground-truth annotations of initial PlanarTrack poses, enabling more accurate benchmarking in the high-precision p@5 metric. The code and the re-annotations are available at this https URL
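Estimating a homography from mask-contour correspondences reduces to the direct linear transform; a minimal unnormalized DLT sketch (SAM-H's exact estimator, e.g. whether it adds RANSAC or point normalization, is not specified here):

```python
import numpy as np

def homography_dlt(src, dst):
    """Direct linear transform: estimate the 3x3 homography H (up to
    scale) mapping src -> dst from N >= 4 point correspondences, as the
    null vector of the stacked constraint matrix via SVD."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                     # fix the scale ambiguity
```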
https://arxiv.org/abs/2602.19624
Augmented reality can improve tumor localization in laparoscopic liver surgery. Existing registration pipelines typically depend on organ contours; deformable (non-rigid) alignment is often handled with finite-element (FE) models coupled to dimensionality-reduction or machine-learning components. We integrate laparoscopic depth maps with a foundation pose estimator for camera-liver pose estimation and replace FE-based deformation with non-rigid iterative closest point (NICP) to lower engineering/modeling complexity and expertise requirements. On real patient data, the depth-augmented foundation pose approach achieved 9.91 mm mean registration error in 3 cases. Combined rigid-NICP registration outperformed rigid-only registration, demonstrating NICP as an efficient substitute for finite-element deformable models. This pipeline achieves clinically relevant accuracy while offering a lightweight, engineering-friendly alternative to FE-based deformation.
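The rigid stage that precedes NICP is typically a Kabsch/Procrustes alignment over corresponding points; the paper's exact rigid solver is not specified, so this is a standard sketch:

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid alignment (Kabsch): rotation R and translation
    t minimizing sum ||R @ P_i + t - Q_i||^2 over (N, 3) point sets."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)               # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t
```

NICP then relaxes this single rigid transform into locally varying transforms to absorb liver deformation.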
https://arxiv.org/abs/2602.17517
Eyeframe lens tracing is an important process in the optical industry that requires sub-millimeter precision to ensure proper lens fitting and optimal vision correction. Traditional frame tracers rely on mechanical tools that need precise positioning and calibration, which are time-consuming and require additional equipment, creating an inefficient workflow for opticians. This work presents a novel approach based on artificial vision that utilizes multi-view information. The proposed algorithm operates on images captured from an InVision system. The full pipeline includes image acquisition, frame segmentation to isolate the eyeframe from background, depth estimation to obtain 3D spatial information, and multi-view processing that integrates segmented RGB images with depth data for precise frame contour measurement. To this end, different configurations and variants are proposed and analyzed on real data, providing competitive measurements from still color images with respect to other solutions, while eliminating the need for specialized tracing equipment and reducing workflow complexity for optical technicians.
https://arxiv.org/abs/2602.16281
In mechanistic interpretability, recent work scrutinizes transformer "circuits" - sparse, mono- or multi-layer sub-computations that may reflect human-understandable functions. Yet these network circuits are rarely acid-tested for their stability across different instances of the same deep learning architecture. Without this, it remains unclear whether reported circuits emerge universally across labs or turn out to be idiosyncratic to a particular estimation instance, potentially limiting confidence in safety-critical settings. Here, we systematically study stability across refits in increasingly complex transformer language models of various sizes. We quantify, layer by layer, how similarly attention heads learn representations across independently initialized training runs. Our rigorous experiments show that (1) middle-layer heads are the least stable yet the most representationally distinct; (2) deeper models exhibit stronger mid-depth divergence; (3) unstable heads in deeper layers become more functionally important than their peers from the same layer; (4) applying weight decay optimization substantially improves attention-head stability across random model initializations; and (5) the residual stream is comparatively stable. Our findings establish the cross-instance robustness of circuits as an essential yet underappreciated prerequisite for scalable oversight, drawing contours around possible white-box monitorability of AI systems.
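A common way to quantify cross-seed representational similarity of attention heads is linear Centered Kernel Alignment (CKA); the abstract does not state the paper's exact measure, so CKA is an illustrative choice:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n, d1) and Y (n, d2)
    collected over the same n inputs; 1.0 means the representations agree
    up to an orthogonal transform and isotropic scaling of features."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den
```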
https://arxiv.org/abs/2602.16740
Synthetic datasets are being recognized in the deep learning realm as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula-Driven Supervised Learning (FDSL), which can provide an infinite number of perfectly labeled data through a formula-driven approach, such as fractals or contours. FDSL avoids common drawbacks like manual labor, privacy, and other ethical concerns. In this work we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form a video that is used as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. Therefore, we systematically explore alternative ways of generating fractals and find that overly restrictive approaches, while generating aesthetically pleasing fractals, are detrimental to downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation-speed and fractal-diversity issues. The method achieves roughly 100 times faster sampling and superior downstream performance compared with other 3D fractal filtering methods.
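A 3D IFS attractor can be sampled with the standard chaos game; this sketch uses a fixed Sierpinski-tetrahedron system for illustration, not the paper's Targeted Smart Filtering:

```python
import numpy as np

def chaos_game_3d(transforms, n_points=10000, seed=0):
    """Render a 3D IFS attractor with the chaos game: repeatedly apply a
    randomly chosen affine map x -> A @ x + b and record visited points.
    transforms: list of (A, b) pairs with A a contractive 3x3 matrix."""
    rng = np.random.default_rng(seed)
    x = np.zeros(3)
    pts = np.empty((n_points, 3))
    for i in range(n_points):
        A, b = transforms[rng.integers(len(transforms))]
        x = A @ x + b
        pts[i] = x
    return pts

# Sierpinski tetrahedron: four maps, each halving toward one vertex.
verts = np.array([[0, 0, 0], [1, 0, 0], [0.5, 1, 0], [0.5, 0.5, 1.0]])
maps = [(0.5 * np.eye(3), 0.5 * v) for v in verts]
```

Temporally animating the transform parameters then yields the fractal videos used for pre-training.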
https://arxiv.org/abs/2602.11810