Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.
水下视频分析由于低光照、色彩失真和浑浊等因素而面临诸多挑战,这些因素会降低视觉数据质量,并直接影响机器人应用中感知模块的性能。本研究提出了一种名为AquaFeat+的即插即用管道方案,旨在增强自动视觉任务所需的特定特征,而非追求人类观察的质量标准。该架构包括颜色校正、分层特征增强和自适应残差输出等模块,这些模块被端到端训练,并直接由最终应用的损失函数引导。 在FishTrack23数据集中进行训练和评估后,AquaFeat+在目标检测、分类和跟踪指标方面取得了显著提升,验证了其对水下机器人感知任务增强的有效性。
https://arxiv.org/abs/2601.09652
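To make the plug-and-play pattern in the AquaFeat+ entry above concrete, here is a minimal PyTorch-style sketch of an enhancement front-end with an adaptive residual output that is optimized with the downstream task's loss. The layer choices, module names, and the `SomeDetector` placeholder are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class EnhancerSketch(nn.Module):
    """Toy stand-in for a plug-and-play enhancement front-end with a residual output.

    Only illustrates the pattern named in the abstract (color correction ->
    feature enhancement -> adaptive residual); the layer choices are
    placeholders, not AquaFeat+'s actual architecture.
    """
    def __init__(self, channels: int = 3):
        super().__init__()
        self.color = nn.Conv2d(channels, channels, kernel_size=1)   # crude "color correction"
        self.features = nn.Sequential(                               # crude "feature enhancement"
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1),
        )
        self.alpha = nn.Parameter(torch.zeros(1))                    # adaptive residual gate

    def forward(self, x):
        return x + self.alpha * self.features(self.color(x))        # residual output

enhancer = EnhancerSketch()
print(enhancer(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])

# Training pattern from the abstract: the enhancer is optimized with the
# downstream task's loss, so detection/tracking gradients shape the enhancement.
# (`SomeDetector` would be a hypothetical placeholder, not part of the paper.)
# loss = detector.loss(detector(enhancer(frames)), targets)
# loss.backward(); optimizer.step()
```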
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
基于强化学习(RL)的大规模语言模型(LLMs)的增强通常会导致输出多样性降低,这会削弱它们在开放式任务(如创意写作)中的实用性。当前的方法缺乏明确的机制来引导多样化探索,而是优先考虑优化效率和性能而非多样性。本文提出了一种以半结构化的长思维链(CoT)为基础的RL框架,在这种框架中,生成过程被分解为显式规划的中间步骤。我们引入了多样规划分支方法,根据多样性的变化在规划阶段战略性地引入分歧,并通过群体感知的多样性奖励来鼓励不同的轨迹。实验结果表明,在创意写作基准测试上,我们的方法显著提高了输出多样性而不会牺牲生成质量,并且始终优于现有的基线方法。
https://arxiv.org/abs/2601.09609
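As a toy illustration of a group-aware diversity reward like the one named in the abstract above (the paper's exact reward may differ), the sketch below scores each sampled output in a group by its mean lexical distance to the other outputs, so trajectories that diverge from their group-mates receive a larger bonus. The Jaccard distance and the example strings are assumptions for illustration only.

```python
from itertools import combinations

def group_diversity_rewards(outputs: list[str]) -> list[float]:
    """Illustrative group-aware diversity bonus (not the paper's exact reward).

    Each output is rewarded by its mean lexical distance (1 - Jaccard overlap
    of token sets) to the other outputs sampled in the same group.
    """
    token_sets = [set(o.split()) for o in outputs]
    n = len(outputs)
    rewards = [0.0] * n
    for i, j in combinations(range(n), 2):
        inter = len(token_sets[i] & token_sets[j])
        union = len(token_sets[i] | token_sets[j]) or 1
        dist = 1.0 - inter / union
        rewards[i] += dist / (n - 1)
        rewards[j] += dist / (n - 1)
    return rewards

# The two near-duplicate outputs receive smaller bonuses than the divergent one.
print(group_diversity_rewards([
    "a quiet harbor at dawn",
    "a quiet harbor at night",
    "robots debating philosophy on mars",
]))
```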
Current patent claim generation systems face three fundamental limitations: poor cross-jurisdictional generalization, inadequate semantic relationship modeling between claims and prior art, and unreliable quality assessment. We introduce a novel three-stage framework that addresses these challenges through relationship-aware similarity analysis, domain-adaptive claim generation, and unified quality assessment. Our approach employs multi-head attention with eight specialized heads for explicit relationship modeling, integrates curriculum learning with dynamic LoRA adapter selection across five patent domains, and implements cross-attention mechanisms between evaluation aspects for comprehensive quality assessment. Extensive experiments on the USPTO HUPD dataset, EPO patent collections, and the Patent-CE benchmark demonstrate substantial improvements: a 7.6-point ROUGE-L gain over GPT-4o, an 8.3\% BERTScore enhancement over Llama-3.1-8B, and 0.847 correlation with human experts compared to 0.623 for separate evaluation models. Our method maintains 89.4\% cross-jurisdictional performance retention versus 76.2\% for baselines, establishing a comprehensive solution for automated patent prosecution workflows.
当前的专利权利要求生成系统面临三个基本限制:跨司法辖区泛化能力差、权利要求与现有技术之间的语义关系建模不足以及质量评估不可靠。我们提出了一种新颖的三阶段框架,通过关系感知相似度分析、领域自适应权利要求生成和统一的质量评估来解决这些挑战。 我们的方法采用多头注意力机制,并使用八个专门化的头部进行显式关系建模;将课程学习与动态LoRA适配器选择集成在五个专利领域中;以及实施评价方面之间的交叉注意机制,以实现全面的质量评估。在美国专利商标局(USPTO)HUPD数据集、欧洲专利局(EPO)专利集合和Patent-CE基准上的广泛实验表明,我们的方法取得了显著改进:ROUGE-L得分比GPT-4o高出7.6分;BERTScore相对于Llama-3.1-8B提高了8.3%;与单独的评估模型相比,我们的人类专家相关性评分达到了0.847(后者为0.623)。此外,在跨司法辖区性能保持方面,我们的方法达到89.4%,而基准线仅为76.2%。这表明我们的方法为自动化专利申请工作流程提供了一个全面的解决方案。
https://arxiv.org/abs/2601.09120
Stylometry--the identification of an author through analysis of a text's style (i.e., authorship attribution)--serves many constructive purposes: it supports copyright and plagiarism investigations, aids detection of harmful content, offers exploratory cues for certain medical conditions (e.g., early signs of dementia or depression), provides historical context for literary works, and helps uncover misinformation and disinformation. In contrast, when stylometry is employed as a tool for authorship verification--confirming whether a text truly originates from a claimed author--it can also be weaponized for malicious purposes. Techniques such as de-anonymization, re-identification, tracking, profiling, and downstream effects like censorship illustrate the privacy threats that stylometric analysis can enable. Building on these concerns, this paper further explores how adversarial stylometry combined with steganography can counteract stylometric analysis. We first present enhancements to our adversarial attack, $\textit{TraceTarnish}$, providing stronger evidence of its capacity to confound stylometric systems and reduce their attribution and verification accuracy. Next, we examine how steganographic embedding can be fine-tuned to mask an author's stylistic fingerprint, quantifying the level of authorship obfuscation achievable as a function of the proportion of words altered with zero-width Unicode characters. Based on our findings, steganographic coverage of 33% or higher seemingly ensures authorship obfuscation. Finally, we reflect on the ways stylometry can be used to undermine privacy and argue for the necessity of defensive tools like $\textit{TraceTarnish}$.
语体测量法——通过分析文本的风格来识别作者(即,确定作者身份)——在许多建设性方面发挥着重要作用:它支持版权和抄袭调查,有助于检测有害内容,为某些医疗状况提供探索线索(例如痴呆症或抑郁症的早期迹象),为文学作品提供历史背景,并帮助揭露错误信息和虚假信息。然而,当语体测量被用作作者身份验证工具——确认文本是否真正来自声称的作者时——它也可能被用于恶意目的。诸如去匿名化、重新识别、跟踪、个人画像以及由此产生的审查等隐私威胁都可能因语体分析而引发。 基于这些担忧,本文进一步探讨了通过结合对抗性语体测量和隐写术来对抗语体分析的方法。首先,我们介绍了对我们的对抗攻击技术$\textit{TraceTarnish}$的改进,提供了更强有力的证据证明其能够混淆语体系统的识别能力,并降低其归因与验证的准确性。接下来,我们将研究如何通过调整隐写术嵌入以掩盖作者的独特风格特征,量化使用零宽度Unicode字符修改的比例对作者身份模糊化程度的影响。根据我们的研究成果,在33%或更高的隐写覆盖范围内,似乎能够确保有效的作者身份混淆。 最后,我们反思了语体测量如何被用于侵犯隐私,并主张需要防御性工具如$\textit{TraceTarnish}$来保护个人免受此类技术带来的风险。
https://arxiv.org/abs/2601.09056
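The sketch below shows the kind of zero-width Unicode perturbation the TraceTarnish abstract above quantifies: a chosen proportion of words receives an invisible character, with `coverage=0.33` mirroring the 33% threshold the authors report as seemingly sufficient for obfuscation. The insertion position and the specific code point are assumptions; this is not TraceTarnish's actual embedding scheme.

```python
import random

ZERO_WIDTH = "\u200b"  # ZERO WIDTH SPACE (one of several zero-width code points)

def obfuscate(text: str, coverage: float = 0.33, seed: int = 0) -> str:
    """Insert a zero-width character inside a fraction of words (illustrative only).

    Reproduces the *kind* of perturbation described in the abstract: altering a
    proportion of words with zero-width Unicode so the text looks unchanged to a
    reader but no longer tokenizes the same way for a stylometric classifier.
    """
    rng = random.Random(seed)
    words = text.split(" ")
    k = max(1, round(coverage * len(words)))
    for idx in rng.sample(range(len(words)), k):
        w = words[idx]
        mid = max(1, len(w) // 2)
        words[idx] = w[:mid] + ZERO_WIDTH + w[mid:]   # invisible to readers
    return " ".join(words)

plain = "the quick brown fox jumps over the lazy dog"
stego = obfuscate(plain)
print(len(stego) - len(plain))  # number of zero-width characters inserted
```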
Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: this https URL
视频对象分割方法如SAM2通过基于内存的架构实现了强大的性能,但因依赖于外观特征,在视角变化较大的情况下表现不佳。传统三维实例分割方法解决了视角一致性问题,但却需要摄像机姿态、深度图及昂贵的预处理步骤。我们引入了3AM,这是一种训练时增强方法,它将MUSt3R的具有三维感知特性的特征集成到了SAM2中。我们的轻量级特征融合器整合了多个级别的MUSt3R特征,这些特征编码了隐式的几何对应关系。结合SAM2的外观特征后,模型能够基于空间位置和视觉相似度实现几何一致性识别。 我们还提出了一种视场感知采样策略,确保各帧观察到空间上一致的对象区域,以实现可靠的三维对应关系学习。重要的是,我们的方法在推理时只需要RGB输入,并不需要摄像机姿态或预处理步骤。 在具有大基线运动的挑战性数据集(如ScanNet++、Replica)上,3AM显著优于SAM2及其扩展版本,在ScanNet++选定子集中达到了90.6%的IoU和71.7%的Positive IoU,分别比最先进的视频对象分割方法提高了+15.9和+30.4个百分点。 项目页面:this https URL
https://arxiv.org/abs/2601.08831
We propose a novel piecewise smooth image model with piecewise constant local parameters that are automatically adapted to each image. Technically, the model is formulated in terms of factor graphs with NUP (normal with unknown parameters) priors, and the pertinent computations amount to iterations of conjugate-gradient steps and Gaussian message passing. The proposed model and algorithms are demonstrated with applications to denoising and contrast enhancement.
我们提出了一种新颖的分段平滑图像模型,该模型具有分段常数的局部参数,并且这些参数会自动适应每一张图像。技术上讲,该模型通过带有未知参数正态先验(NUP)的因子图进行表述,相关的计算涉及共轭梯度步骤和高斯消息传递的迭代。我们通过去噪和对比度增强的应用展示了所提出的模型和算法的有效性。
https://arxiv.org/abs/2601.08749
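Since the abstract above notes that the pertinent computations amount to iterations of conjugate-gradient steps and Gaussian message passing, here is a plain conjugate-gradient solver for a symmetric positive-definite system to make the former concrete. The NUP parameter updates and the message-passing schedule are not reproduced, and the example matrix is arbitrary.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, max_iter=200):
    """Plain conjugate-gradient solver for A x = b (A symmetric positive definite).

    Shown only to make the abstract's "conjugate-gradient steps" concrete; the
    paper's full algorithm alternates such steps with Gaussian message passing
    and updates of the unknown NUP prior parameters, which are omitted here.
    """
    x = np.zeros_like(b) if x0 is None else x0.astype(float)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))  # matches np.linalg.solve(A, b)
```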
With the development of large language models (LLMs) in the field of programming, intelligent programming coaching systems have gained widespread attention. However, most research focuses on repairing the buggy code of programming learners without providing the underlying causes of the bugs. To address this gap, we introduce a novel task, namely \textbf{LPR} (\textbf{L}earner-Tailored \textbf{P}rogram \textbf{R}epair). We then propose a novel and effective framework, \textbf{\textsc{\MethodName{}}} (\textbf{L}earner-Tailored \textbf{S}olution \textbf{G}enerator), to enhance program repair while offering the bug descriptions for the buggy code. In the first stage, we utilize a repair solution retrieval framework to construct a solution retrieval database and then employ an edit-driven code retrieval approach to retrieve valuable solutions, guiding LLMs in identifying and fixing the bugs in buggy code. In the second stage, we propose a solution-guided program repair method, which fixes the code and provides explanations under the guidance of retrieval solutions. Moreover, we propose an Iterative Retrieval Enhancement method that utilizes evaluation results of the generated code to iteratively optimize the retrieval direction and explore more suitable repair strategies, improving performance in practical programming coaching scenarios. The experimental results show that our approach outperforms a set of baselines by a large margin, validating the effectiveness of our framework for the newly proposed LPR task.
随着编程领域中大型语言模型(LLMs)的发展,智能编程辅导系统获得了广泛的关注。然而,大多数研究主要集中在修复编程学习者的错误代码上,而不提供导致这些错误的根本原因。为了解决这一缺口,我们引入了一个新颖的任务,即**LPR**(面向学习者的程序修复)。随后,我们提出了一种新型且有效的框架,称为**MethodName**(面向学习者的解决方案生成器),旨在增强程序修复能力的同时提供对错误代码的描述。 在第一阶段,我们利用一个修复方案检索框架来构建解决方案检索数据库,并采用编辑驱动的代码检索方法来获取有价值的解决方案。这些操作指导LLMs识别并修正错误代码中的问题。 第二阶段中,我们提出了一种基于检索方案引导的程序修复方法,在此过程中不仅修复代码而且还提供解释说明,这都是在检索到的解决方案指导下完成的。此外,我们还提出了一个迭代检索增强的方法,该方法利用生成代码的评估结果来不断优化检索方向,并探索更合适的修复策略,从而提高实际编程辅导场景下的性能。 实验结果显示,我们的方法相对于一系列基准模型具有显著的优势,验证了我们在新提出的LPR任务中所提出框架的有效性。
https://arxiv.org/abs/2601.08545
This technical report represents the award-winning solution to the Cross-platform 3D Object Detection task in the RoboSense2025 Challenge. Our approach is built upon PVRCNN++, an efficient 3D object detection framework that effectively integrates point-based and voxel-based features. On top of this foundation, we improve cross-platform generalization by narrowing domain gaps through tailored data augmentation and a self-training strategy with pseudo-labels. These enhancements enabled our approach to secure the 3rd place in the challenge, achieving a 3D AP of 62.67% for the Car category on the phase-1 target domain, and 58.76% and 49.81% for Car and Pedestrian categories respectively on the phase-2 target domain.
这份技术报告展示了在RoboSense2025挑战赛跨平台3D物体检测任务中获奖的解决方案。我们的方法基于PVRCNN++,这是一种高效的3D物体检测框架,能够有效结合点基和体素基特征。在此基础上,我们通过定制的数据增强技术和带有伪标签的自我训练策略来缩小领域差距,从而提升跨平台泛化能力。这些改进使我们在挑战赛中获得了第三名的成绩,在第一阶段目标域的Car类别上实现了3D AP 62.67%,在第二阶段目标域的Car和行人(Pedestrian)类别上分别达到了58.76% 和49.81% 的性能。
https://arxiv.org/abs/2601.08174
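A minimal sketch of the pseudo-label self-training step mentioned in the RoboSense2025 entry above: predictions on unlabeled target-domain frames are filtered by per-class confidence thresholds and fed back as training labels. The threshold values and the data layout are assumptions, not the challenge entry's actual settings.

```python
def select_pseudo_labels(predictions, score_thresholds=None):
    """Confidence-filtered pseudo-labels for self-training (illustrative only).

    `predictions` is a list of dicts like {"label": "Car", "score": 0.91, "box": ...}
    produced by the current model on unlabeled target-domain frames.  The per-class
    thresholds below are assumptions, not the values used by the challenge entry.
    """
    thresholds = score_thresholds or {"Car": 0.7, "Pedestrian": 0.5}
    keep = [p for p in predictions
            if p["score"] >= thresholds.get(p["label"], 0.6)]
    return keep  # retained boxes are treated as ground truth in the next training round

preds = [
    {"label": "Car", "score": 0.91, "box": (1.0, 2.0, 0.5)},
    {"label": "Pedestrian", "score": 0.42, "box": (4.0, 1.0, 0.2)},
]
print(select_pseudo_labels(preds))  # the low-confidence pedestrian is dropped
```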
Geometric Representation Learning (GRL) aims to approximate the non-Euclidean topology of high-dimensional data through discrete graph structures, grounded in the manifold hypothesis. However, traditional static graph construction methods based on Euclidean distance often fail to capture the intrinsic curvature characteristics of the data manifold. Although Ollivier-Ricci Curvature Flow (OCF) has proven to be a powerful tool for dynamic topological optimization, its core reliance on Optimal Transport (Wasserstein distance) leads to prohibitive computational complexity, severely limiting its application in large-scale datasets and deep learning frameworks. To break this bottleneck, this paper proposes a novel geometric evolution framework: Resistance Curvature Flow (RCF). Leveraging the concept of effective resistance from circuit physics, RCF transforms expensive curvature optimization into efficient matrix operations. This approach achieves over 100x computational acceleration while maintaining geometric optimization capabilities comparable to OCF. We provide an in-depth exploration of the theoretical foundations and dynamical principles of RCF, elucidating how it guides the redistribution of edge weights via curvature gradients to eliminate topological noise and strengthen local cluster structures. Furthermore, we provide a mechanistic explanation of RCF's role in manifold enhancement and noise suppression, as well as its compatibility with deep learning models. We design a graph optimization algorithm, DGSL-RCF, based on this framework. Experimental results across deep metric learning, manifold learning, and graph structure learning demonstrate that DGSL-RCF significantly improves representation quality and downstream task performance.
几何表示学习(GRL)旨在通过离散图结构来近似高维数据的非欧几里得拓扑,这一过程基于流形假设。然而,传统的静态图构建方法通常依赖于欧氏距离,这种做法往往无法捕捉到数据流形内在的曲率特性。虽然奥利维尔-里奇曲率流(OCF)已被证明是动态拓扑优化的强大工具,但其核心依赖于最优传输(Wasserstein距离),这导致了计算复杂度极高,严重限制了它在大规模数据集和深度学习框架中的应用。为了突破这一瓶颈,本文提出了一种新的几何演化框架:电阻曲率流(RCF)。利用电路物理学中有效电阻的概念,RCF将昂贵的曲率优化转化为高效的矩阵运算。该方法在保持与OCF相当的几何优化能力的同时,实现了超过100倍的计算加速。 本文深入探讨了RCF的理论基础和动力学原理,阐明了它是如何通过曲率梯度指导边权重的重新分配来消除拓扑噪声并强化局部集群结构的。此外,我们还从机制层面解释了RCF在流形增强和噪声抑制方面的作用及其与深度学习模型的兼容性。 基于这一框架,我们设计了一种图优化算法——DGSL-RCF。通过跨深度度量学习、流形学习以及图结构学习的实验结果表明,DGSL-RCF显著提高了表示质量和下游任务性能。
https://arxiv.org/abs/2601.08149
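The effective resistance that RCF builds on (per the abstract above) has a standard closed form via the graph Laplacian pseudoinverse, R(i, j) = L+_ii + L+_jj - 2 L+_ij; the sketch below computes it for a toy graph. How RCF turns resistance into a curvature and an edge-weight update is not shown here.

```python
import numpy as np

def effective_resistance(adjacency: np.ndarray) -> np.ndarray:
    """All-pairs effective resistance from the Laplacian pseudoinverse.

    R[i, j] = L+[i, i] + L+[j, j] - 2 L+[i, j].  This is the standard circuit
    quantity the abstract refers to; the curvature-flow update itself is omitted.
    """
    degrees = np.diag(adjacency.sum(axis=1))
    laplacian = degrees - adjacency
    lp = np.linalg.pinv(laplacian)                    # Moore-Penrose pseudoinverse
    diag = np.diag(lp)
    return diag[:, None] + diag[None, :] - 2.0 * lp

# Unweighted triangle: every pair has resistance 2/3 (a 1-ohm edge in parallel
# with a 2-ohm two-edge path).
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
print(np.round(effective_resistance(A), 3))
```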
The transformative potential of 3D content creation has been progressively unlocked through advancements in generative models. Recently, intuitive drag editing with geometric changes has attracted significant attention in 2D editing yet remains challenging for 3D scenes. In this paper, we introduce 3DGS-Drag -- a point-based 3D editing framework that provides efficient, intuitive drag manipulation of real 3D scenes. Our approach bridges the gap between deformation-based and 2D-editing-based 3D editing methods, addressing their limitations to geometry-related content editing. We leverage two key innovations: deformation guidance utilizing 3D Gaussian Splatting for consistent geometric modifications and diffusion guidance for content correction and visual quality enhancement. A progressive editing strategy further supports aggressive 3D drag edits. Our method enables a wide range of edits, including motion change, shape adjustment, inpainting, and content extension. Experimental results demonstrate the effectiveness of 3DGS-Drag in various scenes, achieving state-of-the-art performance in geometry-related 3D content editing. Notably, the editing is efficient, taking 10 to 20 minutes on a single RTX 4090 GPU.
三维内容创作的变革潜力通过生成模型的进步逐步被解锁。最近,结合几何变化的直观拖拽编辑在二维编辑中引起了广泛关注,但在三维场景中的实现仍面临挑战。本文介绍了3DGS-Drag——一种基于点的三维编辑框架,它提供了对真实三维场景高效且直观的拖拽操作。我们的方法弥合了基于变形和基于二维编辑的三维编辑方式之间的差距,解决了它们在与几何相关的内容编辑方面的局限性。我们利用两个关键创新:利用3D高斯泼溅(3D Gaussian Splatting)实现一致几何修改的变形引导,以及用于内容校正和视觉质量增强的扩散引导。一种渐进式编辑策略进一步支持了激进的三维拖拽编辑。我们的方法能够实现一系列编辑操作,包括运动变化、形状调整、内容修补(inpainting)和内容扩展。实验结果表明,3DGS-Drag在各种场景中表现出色,在与几何相关的三维内容编辑方面达到了最先进的性能水平。值得注意的是,该编辑过程十分高效,在单个RTX 4090 GPU上仅需10到20分钟。
https://arxiv.org/abs/2601.07963
While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6\% improvement on ImageNet classification, a 6.3\% gain on NLP, a 12.6\% improvement on image generation, and a 41\% enhancement on video generation under the same time complexity.
虽然Transformer架构在许多领域占主导地位,但其二次方的自注意力复杂度限制了它在大规模应用中的使用。线性注意力提供了一种高效的替代方案,但直接应用通常会降低性能,而现有修复方法往往通过引入额外模块(如深度可分离卷积)来重新增加计算开销,从而违背了最初的意图。在这项工作中,我们确定了这些方法的关键失败模式:全局上下文崩溃,在这种情况下模型失去了表示多样性。为了解决这个问题,我们提出了多头线性注意力(MHLA),该方法通过在标记维度上划分的头部内进行注意计算来保持这一多样性。我们证明MHLA在保持线性复杂度的同时恢复了softmax注意力大部分的表达能力,并验证其在多个领域的有效性,在相同时间复杂度下实现了ImageNet分类3.6%的改进,NLP领域6.3%的增长,图像生成领域12.6%的进步以及视频生成方面41%的提升。
https://arxiv.org/abs/2601.07832
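The sketch below illustrates one plausible reading of the MHLA abstract's "attention within divided heads along the token dimension": tokens are split into groups and a kernelized linear attention is computed inside each group, preserving per-group statistics. The positive feature map and the chunking scheme are assumptions for illustration, not the authors' implementation.

```python
import torch

def linear_attention(q, k, v, eps: float = 1e-6):
    """Kernelized linear attention: approximates softmax(QK^T)V at O(N d^2) cost."""
    q, k = torch.relu(q) + eps, torch.relu(k) + eps   # simple positive feature map
    kv = torch.einsum("nd,ne->de", k, v)              # d x e summary of keys/values
    z = q @ k.sum(dim=0)                              # per-query normalizer
    return (q @ kv) / z.unsqueeze(-1)

def mhla_sketch(q, k, v, heads: int = 4):
    """Toy reading of MHLA: split the *token* dimension into `heads` groups and
    run linear attention inside each group, then concatenate.  This illustrates
    the idea stated in the abstract, not the authors' code."""
    chunks = []
    for qs, ks, vs in zip(q.chunk(heads, dim=0), k.chunk(heads, dim=0), v.chunk(heads, dim=0)):
        chunks.append(linear_attention(qs, ks, vs))
    return torch.cat(chunks, dim=0)

n, d = 16, 8
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
print(mhla_sketch(q, k, v).shape)  # torch.Size([16, 8])
```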
Automatic License Plate Recognition is a frequent research topic due to its wide-ranging practical applications. While recent studies use synthetic images to improve License Plate Recognition (LPR) results, there remain several limitations in these efforts. This work addresses these constraints by comprehensively exploring the integration of real and synthetic data to enhance LPR performance. We subject 16 Optical Character Recognition (OCR) models to a benchmarking process involving 12 public datasets acquired from various regions. Several key findings emerge from our investigation. Primarily, the massive incorporation of synthetic data substantially boosts model performance in both intra- and cross-dataset scenarios. We examine three distinct methodologies for generating synthetic data: template-based generation, character permutation, and utilizing a Generative Adversarial Network (GAN) model, each contributing significantly to performance enhancement. The combined use of these methodologies demonstrates a notable synergistic effect, leading to end-to-end results that surpass those reached by state-of-the-art methods and established commercial systems. Our experiments also underscore the efficacy of synthetic data in mitigating challenges posed by limited training data, enabling remarkable results to be achieved even with small fractions of the original training data. Finally, we investigate the trade-off between accuracy and speed among different models, identifying those that strike the optimal balance in both intra-dataset and cross-dataset settings.
自动车牌识别(Automatic License Plate Recognition,ALPR)是一个由于实际应用广泛而备受研究的课题。尽管最近的研究使用合成图像来改进车牌识别(License Plate Recognition,LPR)结果,但这些工作仍存在一些限制。这项工作通过全面探索真实数据和合成数据的整合来解决这些限制,以提高LPR性能。我们对16种光学字符识别(OCR)模型进行了基准测试,涉及从不同地区获取的12个公共数据集。我们的研究得出了一些关键发现: 首先,大量使用合成数据显著提升了模型在数据集内和跨数据集场景中的表现。 其次,我们探讨了三种不同的生成合成数据的方法:基于模板的生成、字符排列以及利用生成对抗网络(GAN)模型,每种方法都对性能提升有重大贡献。这几种方法结合使用具有明显的协同效应,使得端到端的结果超越了现有的最先进方法和已建立的商业系统。 此外,我们的实验强调了合成数据在缓解训练数据量有限带来的挑战方面的有效性,即使仅使用少量原始训练数据也能获得显著结果。 最后,我们研究了不同模型在准确性和速度之间的权衡,并在数据集内和跨数据集设置中分别识别出达到最佳平衡的模型。
https://arxiv.org/abs/2601.07671
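Of the three synthetic-data routes listed in the abstract above, character permutation is the easiest to sketch: shuffle a real plate's letters and digits while preserving its letter/digit layout, then render the new string onto a plate image. Rendering, the template-based route, and the GAN route are not shown, and the plate format below is a made-up example.

```python
import random

def permute_plate(plate: str, rng: random.Random) -> str:
    """Character-permutation augmentation for license plates (illustrative).

    New plate strings are created by shuffling the letters and digits of a real
    plate while keeping each character in a position of the same type, so the
    result still follows the original letter/digit layout.
    """
    letters = [c for c in plate if c.isalpha()]
    digits = [c for c in plate if c.isdigit()]
    rng.shuffle(letters)
    rng.shuffle(digits)
    out = []
    for c in plate:
        if c.isalpha():
            out.append(letters.pop())
        elif c.isdigit():
            out.append(digits.pop())
        else:
            out.append(c)            # keep separators such as '-'
    return "".join(out)

rng = random.Random(42)
print([permute_plate("ABC-1234", rng) for _ in range(3)])
```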
Autonomous driving systems rely heavily on multi-view images to ensure accurate perception and robust decision-making. To effectively develop and evaluate perception stacks and planning algorithms, realistic closed-loop simulators are indispensable. While 3D reconstruction techniques such as Gaussian Splatting offer promising avenues for simulator construction, the rendered novel views often exhibit artifacts, particularly in extrapolated perspectives or when available observations are sparse. We introduce ViewMorpher3D, a multi-view image enhancement framework based on image diffusion models, designed to elevate photorealism and multi-view coherence in driving scenes. Unlike single-view approaches, ViewMorpher3D jointly processes a set of rendered views conditioned on camera poses, 3D geometric priors, and temporally adjacent or spatially overlapping reference views. This enables the model to infer missing details, suppress rendering artifacts, and enforce cross-view consistency. Our framework accommodates variable numbers of cameras and flexible reference/target view configurations, making it adaptable to diverse sensor setups. Experiments on real-world driving datasets demonstrate substantial improvements in image quality metrics, effectively reducing artifacts while preserving geometric fidelity.
自动驾驶系统高度依赖多视角图像以确保准确的感知和稳健的决策制定。为了有效地开发和评估感知栈与规划算法,逼真的闭环模拟器不可或缺。虽然像高斯泼溅(Gaussian Splatting)这样的三维重建技术为构建模拟器提供了有前景的途径,但渲染的新视角往往会出现伪影,尤其是在外推视角或可用观测数据稀疏的情况下。我们推出了ViewMorpher3D,这是一个基于图像扩散模型的多视角图像增强框架,旨在提升驾驶场景中的照片真实感和多视角一致性。 与单视角方法不同,ViewMorpher3D同时处理一组渲染视角,这些视角以相机姿态、三维几何先验以及时间上相邻或空间上重叠的参考视图为条件。这使得模型能够推断缺失细节,抑制渲染伪影,并确保跨视角的一致性。我们的框架支持可变数量的摄像头和灵活的参考/目标视角配置,使其适用于各种传感器设置。 在真实驾驶数据集上的实验表明,该方法在图像质量指标上取得了显著提升,有效减少了伪影,同时保持了几何保真度。
https://arxiv.org/abs/2601.07540
Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities, enhancing environmental awareness for intelligent unmanned systems. Existing methods either focus on pixel-level fusion while overlooking downstream task adaptability or implicitly learn rigid semantics through cascaded detection/segmentation models, unable to interactively address diverse semantic target perception needs. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. The model integrates a multi-modal feature extractor, a reference prompt encoder (RPE), and a prompt-semantic fusion module (PSFM). The RPE dynamically encodes task-specific semantic prompts by fine-tuning pre-trained segmentation models with input mask guidance, while the PSFM explicitly injects these semantics into fusion features. Through synergistic optimization of parallel segmentation and fusion branches, our method achieves mutual enhancement between task performance and fusion quality. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
红外与可见光图像融合通过结合互补模态生成全天候感知能力的图像,增强了智能无人系统的环境意识。现有方法要么侧重于像素级别的融合而忽略了下游任务的适应性,要么通过级联检测/分割模型隐式学习刚性的语义信息,无法交互地解决多样化的语义目标感知需求。我们提出了CtrlFuse,这是一种可控的图像融合框架,能够根据遮罩提示进行互动动态融合。该模型集成了一个多模态特征提取器、参考提示编码器(RPE)以及提示-语义融合模块(PSFM)。RPE通过使用输入遮罩引导微调预训练分割模型来动态编码特定任务的语义提示,而PSFM则明确将这些语义注入到融合特征中。通过并行优化分割和融合分支,我们的方法实现了任务性能与融合质量之间的相互增强。实验表明,在图像融合可控性和分割准确性方面,本方法均达到了最先进的水平,并且适应的任务分支甚至超过了原始的分割模型。
https://arxiv.org/abs/2601.08619
Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. Recently, reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. Despite showing great enhancement in generalization performance, fine-tuning MCAs via RL still faces challenges in handling the extremely large text token space. To address this, we learn a compact latent action space for RL fine-tuning instead. Specifically, we adopt the learning from observation mechanism to construct the codebook for the latent action space, where future observations are leveraged to estimate current latent actions that could further be used to reconstruct future observations. However, the scarcity of paired image-text data hinders learning a codebook with sufficient coverage. Thus, we leverage both paired image-text data and text-only data to construct the latent action space, using a cross-modal projector for transforming text embeddings into image-text embeddings. We initialize the cross-modal projector on paired image-text data, and further train it on massive text-only data with a novel cycle consistency loss to enhance its robustness. We show that our latent action based method outperforms competitive baselines on two conversation tasks across various RL algorithms.
视觉-语言模型越来越多地被用作多模态对话代理(MCAs),以处理各种对话任务。最近,强化学习(RL)已被广泛探索用于将MCAs适应于多种人机交互场景。尽管在泛化性能方面表现出显著提升,通过RL微调MCAs仍然面临挑战,尤其是在处理极其庞大的文本标记空间时。为了解决这个问题,我们转而为RL微调学习一个紧凑的潜在动作空间。具体来说,我们采用从观察学习的机制来构建该潜在动作空间的代码本,其中未来观测值被用于估计当前的潜在动作,而这些潜在动作又可进一步用于重构未来观测值。 然而,成对图像-文本数据的稀缺阻碍了学习覆盖范围足够广泛的代码本。为此,我们将配对的图像-文本数据和仅文本数据结合使用,通过跨模态投影器将文本嵌入转换为图像-文本嵌入,以构建潜在动作空间。我们首先在配对的图像-文本数据上初始化该跨模态投影器,并进一步在大量纯文本数据上用一种新颖的循环一致性损失对其进行训练,以增强其鲁棒性。 实验结果表明,在两种对话任务上、跨多种RL算法,我们的基于潜在动作的方法均优于有竞争力的基线。
https://arxiv.org/abs/2601.07516
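A minimal sketch of how a cycle-consistency loss can supervise the cross-modal projector from the abstract above on text-only data: project a text embedding into the image-text space, map it back, and penalize the round-trip error. The two-linear-layer projector and the MSE form are assumptions; the paper's exact loss may differ.

```python
import torch
import torch.nn as nn

class CrossModalProjector(nn.Module):
    """Toy projector from text embeddings to the paired image-text embedding space."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.forward_proj = nn.Linear(dim, dim)   # text -> image-text space
        self.backward_proj = nn.Linear(dim, dim)  # image-text space -> text (for the cycle)

    def cycle_consistency_loss(self, text_emb: torch.Tensor) -> torch.Tensor:
        """One plausible form of the cycle loss on text-only data: project into the
        image-text space, map back, and penalize the round-trip error.  This only
        illustrates the training signal text-only data can provide without any
        paired image; it is not the paper's exact formulation."""
        round_trip = self.backward_proj(self.forward_proj(text_emb))
        return nn.functional.mse_loss(round_trip, text_emb)

proj = CrossModalProjector()
text_batch = torch.randn(8, 64)
print(proj.cycle_consistency_loss(text_batch).item())
```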
Reliable humanoid-robot interaction (HRI) in household environments is constrained by two fundamental requirements, namely robustness to unconstrained user positions and preservation of user privacy. Millimeter-wave (mmWave) sensing inherently supports privacy-preserving interaction, making it a promising modality for room-scale HRI. However, existing mmWave-based interaction-sensing systems exhibit poor spatial generalization at unseen distances or viewpoints. To address this challenge, we introduce WaveMan, a spatially adaptive room-scale perception system that restores reliable human interaction sensing across arbitrary user positions. WaveMan integrates viewpoint alignment and spectrogram enhancement for spatial consistency, with dual-channel attention for robust feature extraction. Experiments across five participants show that, under fixed-position evaluation, WaveMan achieves the same cross-position accuracy as the baseline with five times fewer training positions. In random free-position testing, accuracy increases from 33.00% to 94.33%, enabled by the proposed method. These results demonstrate the feasibility of reliable, privacy-preserving interaction for household humanoid robots across unconstrained user positions.
https://arxiv.org/abs/2601.07454
Reliable learning on low-quality multimodal data is an issue of wide concern, especially in safety-critical applications. However, multimodal noise poses a major challenge in this domain and leads existing methods to suffer from two key limitations. First, they struggle to reliably remove heterogeneous data noise, hindering robust multimodal representation learning. Second, they exhibit limited adaptability and generalization when encountering previously unseen noise. To address these issues, we propose the Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD). On one hand, TAHCD introduces Adaptive Stable Subspace Alignment and Sample-Adaptive Confidence Alignment to reliably remove heterogeneous noise. They account for noise at both the global and instance levels and enable joint removal of modality-specific and cross-modality noise, achieving robust learning. On the other hand, TAHCD introduces test-time cooperative enhancement, which adaptively updates the model in response to input noise in a label-free manner, improving adaptability and generalization. This is achieved by collaboratively enhancing the joint removal of modality-specific and cross-modality noise across the global and instance levels according to sample noise. Experiments on multiple benchmarks demonstrate that the proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
在低质量多模态数据上进行可靠学习是一个广泛关心的问题,特别是在安全关键的应用中。然而,多模态噪声在这个领域构成了一个主要挑战,并导致现有方法存在两个关键限制:首先,它们难以有效地移除异构数据噪声,从而阻碍了鲁棒的多模态表示学习;其次,当遇到以前未见过的噪声时,它们表现出有限的适应性和泛化能力。为了解决这些问题,我们提出了测试时间自适应分层协同去噪网络(TAHCD)。 一方面,TAHCD 引入了自适应稳定子空间对齐和样本自适应置信度对齐,以可靠地移除异构噪声。这些方法考虑到了全局和实例级别的噪声,并且能够同时移除模态特定的和跨模态的噪声,从而实现了稳健的学习。 另一方面,TAHCD 引入了测试时间协同增强,该方法能够在无标签的情况下根据输入噪声自适应更新模型,提高了适应性和泛化能力。这通过协作性地增强了全局和实例级别上针对样本噪声的模态特定的和跨模态噪声联合移除过程来实现。 在多个基准上的实验表明,所提出的方法相比最先进的可靠多模态学习方法,在分类性能、稳健性和泛化方面均表现出优越的结果。
https://arxiv.org/abs/2601.07163
Recent advancements in Low-Light Image Enhancement (LLIE) have focused heavily on Diffusion Probabilistic Models, which achieve high perceptual quality but suffer from significant computational latency (often exceeding 2-4 seconds per image). Conversely, traditional CNN-based baselines offer real-time inference but struggle with "over-smoothing," failing to recover fine structural details in extreme low-light conditions. This creates a practical gap in the literature: the lack of a model that provides generative-level texture recovery at edge-deployable speeds. In this paper, we address this trade-off by proposing a hybrid Attention U-Net GAN. We demonstrate that the heavy iterative sampling of diffusion models is not strictly necessary for texture recovery. Instead, by integrating Attention Gates into a lightweight U-Net backbone and training within a conditional adversarial framework, we can approximate the high-frequency fidelity of generative models in a single forward pass. Extensive experiments on the SID dataset show that our method achieves a best-in-class LPIPS score of 0.112 among efficient models, significantly outperforming efficient baselines (SID, EnlightenGAN) while maintaining an inference latency of 0.06s. This represents a 40x speedup over latent diffusion models, making our approach suitable for near real-time applications.
最近在低光图像增强(LLIE)领域的进展主要集中在扩散概率模型上,这类模型虽然能够达到很高的感知质量,但计算延迟严重(每张图片处理时间往往超过2-4秒)。相比之下,传统的基于CNN的方法则能提供实时推理,但在极端低光照条件下却难以避免“过度平滑”问题,无法恢复精细的结构细节。这在文献中造成了一个实际差距:缺乏既能生成高质量纹理又能实现边缘部署速度的模型。 在这篇论文中,我们通过提出一种混合注意力U-Net GAN来解决上述权衡问题。我们证明了扩散模型中的密集迭代采样并非进行纹理恢复所必需。相反,通过将注意力门控集成到轻量级U-Net骨干网络,并在条件对抗框架内训练,可以在一次前向传递中近似生成模型的高频保真度。 我们在SID数据集上的广泛实验表明,我们的方法达到了0.112的最佳LPIPS分数(在高效模型类别中),远超其他高效基准(如SID和EnlightenGAN)的表现,并且保持了0.06秒的推理延迟。这代表了对潜在扩散模型40倍的速度提升,使得我们的方法适用于近乎实时的应用场景。
https://arxiv.org/abs/2601.06518
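To show what the attention gates in the abstract above add to a lightweight U-Net backbone, here is a standard additive attention gate on a skip connection. Channel sizes are placeholders and, for simplicity, the gating signal is assumed to be at the same resolution as the skip features, which need not match the paper's configuration.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate on a U-Net skip connection (illustrative sketch).

    The decoder's gating signal `g` and the encoder skip features `x` are each
    projected, summed, and squashed into a per-pixel mask that re-weights the
    skip features, so the decoder attends to structurally relevant regions.
    """
    def __init__(self, x_ch: int, g_ch: int, inter_ch: int):
        super().__init__()
        self.theta_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)
        self.phi_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        att = torch.relu(self.theta_x(x) + self.phi_g(g))
        mask = torch.sigmoid(self.psi(att))      # values in (0, 1), one per pixel
        return x * mask                           # gated skip features

gate = AttentionGate(x_ch=32, g_ch=32, inter_ch=16)
x = torch.randn(1, 32, 64, 64)   # encoder skip features
g = torch.randn(1, 32, 64, 64)   # decoder features at the same resolution
print(gate(x, g).shape)          # torch.Size([1, 32, 64, 64])
```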
Radio astronomy is an indispensable discipline for observing distant celestial objects. Measurements of wave signals from radio telescopes, called visibility, need to be transformed into images for astronomical observations. The resulting dirty images blend information from real sources with artifacts. Therefore, astronomers usually perform reconstruction before imaging to obtain cleaner images. Existing methods consider only a single modality of sparse visibility data, resulting in images with remaining artifacts and insufficient modeling of correlation. To enhance the extraction of visibility information and emphasize output quality in the image domain, we propose VVTRec, a multimodal radio interferometric data reconstruction method with visibility-guided visual and textual modality enrichment. In our VVTRec, sparse visibility is transformed into image-form and text-form features to obtain enhancements in terms of spatial and semantic information, improving the structural integrity and accuracy of images. Also, we leverage Vision-Language Models (VLMs) to achieve additional training-free performance improvements. VVTRec enables sparse visibility, as a foreign modality unseen by VLMs, to accurately extract pre-trained knowledge as a supplement. Our experiments demonstrate that VVTRec effectively enhances imaging results by exploiting multimodal information without introducing excessive computational overhead.
无线电天文学是观察遥远宇宙物体不可或缺的学科。来自射电望远镜的波信号测量,称为可见度数据,需要被转换成图像以进行天文观测。由此得到的"脏图"混杂了真实源的信息与伪迹。因此,天文学家通常在成像前会进行重建工作以获得更清晰的图像。现有的方法只考虑了稀疏可见度数据单一模态,导致生成的图像中仍存在残留伪迹,并且对相关性的建模不足。 为了增强从稀疏可见度数据中提取信息的能力并强调图像域的输出质量,我们提出了一种具有可见度引导的视觉与文本模态增强的多模态射电干涉数据重建方法——VVTRec。在我们的VVTRec框架下,稀疏的可见度数据被转换为图像形式和文本形式特征,以获得空间信息和语义信息方面的增强,从而提高了图像的结构完整性和准确性。此外,我们利用视觉语言模型(Vision-Language Models, VLMs)来实现无需额外训练的性能提升。VVTRec使得稀疏的可见度数据——这种未被VLMs见过的外来模态——能够准确地提取预训练知识作为补充。 我们的实验表明,通过利用多模态信息,VVTRec能够在不引入过多计算开销的情况下有效提升成像结果。
https://arxiv.org/abs/2601.06475
Millimeter-wave radar enables robust environment perception in autonomous systems under adverse conditions yet suffers from sparse, noisy point clouds with low angular resolution. Existing diffusion-based radar enhancement methods either incur high learning complexity by modeling full LiDAR distributions or fail to prioritize critical structures due to uniform regional processing. To address these issues, we propose R3D, a regional-guided residual radar diffusion framework built on two components: residual diffusion modeling, which focuses on the concentrated LiDAR-radar residual that encodes complementary high-frequency details and thereby reduces learning difficulty, and sigma-adaptive regional guidance, which leverages radar-specific signal properties to generate attention maps and applies lightweight guidance only in low-noise stages, refining key regions while avoiding gradient imbalance. Extensive experiments on the ColoRadar dataset demonstrate that R3D outperforms state-of-the-art methods, providing a practical solution for radar perception enhancement. Our anonymous code and pretrained models are released here: this https URL
毫米波雷达能够在恶劣条件下为自主系统提供稳健的环境感知能力,但其点云稀疏、嘈杂且角分辨率低。现有的基于扩散的雷达增强方法要么因建模完整的激光雷达分布而导致学习复杂度高,要么由于均匀的区域处理而未能优先考虑关键结构。为解决这些问题,我们提出了R3D,一种区域引导的残差雷达扩散框架,它集成了残差扩散建模与sigma自适应区域引导:前者专注于集中的激光雷达-雷达残差(该残差编码互补的高频细节),以降低学习难度;后者利用雷达特有的信号特性生成注意力图,并仅在低噪声阶段施加轻量级引导,在细化关键区域的同时避免梯度不平衡。在ColoRadar数据集上的广泛实验表明,R3D优于现有的最先进方法,为雷达感知增强提供了一种实用的解决方案。我们的匿名代码和预训练模型在此发布:this https URL
https://arxiv.org/abs/2601.06465
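A schematic of the residual-diffusion idea in the R3D abstract above: instead of modeling the full LiDAR distribution, the network learns the LiDAR-minus-radar residual, and the sampled residual is added back onto the radar input at inference. The bird's-eye-view grids below are made-up toy values, and the sigma-adaptive regional guidance is not shown.

```python
import numpy as np

def make_residual_target(lidar_bev: np.ndarray, radar_bev: np.ndarray) -> np.ndarray:
    """Residual-diffusion training target (schematic): the model learns the
    LiDAR-minus-radar residual rather than the full LiDAR distribution, which
    concentrates the signal on the detail radar is missing."""
    return lidar_bev - radar_bev

def reconstruct(radar_bev: np.ndarray, predicted_residual: np.ndarray) -> np.ndarray:
    """At inference the sampled residual is added back onto the radar input."""
    return radar_bev + predicted_residual

# Tiny bird's-eye-view occupancy example (values are made up):
radar = np.array([[0.0, 0.2], [0.0, 0.0]])
lidar = np.array([[0.9, 0.3], [0.4, 0.0]])
residual = make_residual_target(lidar, radar)
print(reconstruct(radar, residual))   # recovers the LiDAR grid exactly in this toy case
```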