Recent advancements in large reasoning models (LRMs) have significantly enhanced language models' capabilities in complex problem-solving by emulating human-like deliberative thinking. However, these models often exhibit overthinking (i.e., the generation of unnecessarily verbose and redundant content), which hinders efficiency and inflates inference cost. In this work, we explore the representational and behavioral origins of this inefficiency, revealing that LRMs inherently possess the capacity for more concise reasoning. Empirical analyses show that correct reasoning paths vary significantly in length, and the shortest correct responses often suffice, indicating untapped efficiency potential. Exploiting these findings, we propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction in the model's representation space. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity by rewarding concise correct solutions. Extensive experiments on seven LRM backbones across multiple mathematical reasoning benchmarks demonstrate that our methods significantly reduce reasoning length while preserving or improving task performance. Our results highlight that reasoning efficiency can be improved by leveraging and guiding the intrinsic capabilities of existing models in a self-guided manner.
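To make Efficiency Steering concrete, below is a minimal sketch of single-direction activation steering: estimate the direction as the difference of mean hidden activations between concise and verbose reasoning traces, then shift a chosen layer's output along it at inference time. The function names, hook mechanics, and the strength `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of training-free activation steering toward concise reasoning.
# Assumes a PyTorch decoder whose layers expose forward hooks; all names and
# the strength `alpha` are illustrative, not the paper's implementation.
import torch

def steering_direction(verbose_acts: torch.Tensor, concise_acts: torch.Tensor) -> torch.Tensor:
    """Single direction = normalized difference of mean hidden states."""
    d = concise_acts.mean(dim=0) - verbose_acts.mean(dim=0)
    return d / d.norm()

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, alpha: float = 4.0):
    """Shift the layer's hidden states along `direction`; alpha sets the strength."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)  # call .remove() on the handle to undo
```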
https://arxiv.org/abs/2506.15647
Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM (Clinically-guided LGE Augmentation for realIstic and diverse Myocardial scar synthesis and segmentation), a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, enhancing both the realism of synthesized scars and the accuracy of scar segmentation. Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions than baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks.
https://arxiv.org/abs/2506.15549
Cervical cancer remains a significant health problem, especially in developing countries. Early detection is critical for effective treatment. Convolutional neural networks (CNNs) have shown promise in automated cervical cancer screening, but their performance depends on Pap smear image quality. This study investigates the impact of image preprocessing techniques on CNN performance for cervical cancer classification using the SIPaKMeD dataset. Three preprocessing techniques were evaluated: the Perona-Malik diffusion (PMD) filter for noise reduction, contrast-limited adaptive histogram equalization (CLAHE) for contrast enhancement, and the proposed hybrid PMD filter-CLAHE approach. The enhanced image datasets were evaluated with pretrained models: ResNet-34, ResNet-50, SqueezeNet-1.0, MobileNet-V2, EfficientNet-B0, EfficientNet-B1, DenseNet-121, and DenseNet-201. The results show that the hybrid PMD filter-CLAHE preprocessing improves both Pap smear image quality and CNN performance relative to the original images. The maximum metric improvements are 13.62% for accuracy, 10.04% for precision, 13.08% for recall, and 14.34% for F1-score. The proposed hybrid PMD filter-CLAHE technique offers a new perspective on improving cervical cancer classification performance with CNN architectures.
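For readers who want to reproduce the preprocessing, here is a minimal sketch of the hybrid pipeline: Perona-Malik diffusion for denoising followed by OpenCV's CLAHE for local contrast. The iteration count, kappa, and CLAHE clip limit are assumed values, not the study's tuned settings.

```python
# Sketch of the hybrid PMD-filter + CLAHE preprocessing; parameter values
# (iterations, kappa, clip limit) are illustrative assumptions.
import cv2
import numpy as np

def pmd_filter(img: np.ndarray, n_iter: int = 10, kappa: float = 30.0, gamma: float = 0.2) -> np.ndarray:
    """Perona-Malik anisotropic diffusion with exponential conductance."""
    out = img.astype(np.float32)
    for _ in range(n_iter):
        # differences toward the four neighbors (borders wrap here for brevity)
        dN = np.roll(out, -1, axis=0) - out
        dS = np.roll(out, 1, axis=0) - out
        dE = np.roll(out, -1, axis=1) - out
        dW = np.roll(out, 1, axis=1) - out
        # conductance suppresses diffusion across strong edges
        cN = np.exp(-(dN / kappa) ** 2)
        cS = np.exp(-(dS / kappa) ** 2)
        cE = np.exp(-(dE / kappa) ** 2)
        cW = np.exp(-(dW / kappa) ** 2)
        out += gamma * (cN * dN + cS * dS + cE * dE + cW * dW)
    return out

def hybrid_pmd_clahe(gray_u8: np.ndarray) -> np.ndarray:
    """Denoise with PMD, then boost local contrast with CLAHE."""
    smoothed = np.clip(pmd_filter(gray_u8), 0, 255).astype(np.uint8)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(smoothed)
```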
https://arxiv.org/abs/2506.15489
Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave sensing and advanced signal processing. SAR has found extensive use in critical maritime applications such as ship detection. However, SAR ship detection faces several challenges: significant scale variations among ships, small offshore vessels mixed with noise, and complex backgrounds around large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and improve the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. Experimental results on the SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially improves detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.
https://arxiv.org/abs/2506.15231
Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, the limitations of local coordinates give rise to irrelevant-point interference and feature-hierarchy gaps. Although some works address this by refining the spatial description through explicit modeling of cross-stage structure, such enhancements based on direct geometric structure encoding suffer from high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA), which utilizes correlation in the high-dimensional space to correct the feature distribution during aggregation, improving computational efficiency and robustness. PDSA distinguishes point correlation based on a lightweight cross-stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling. Additionally, we introduce a key-point mechanism to reduce the computational overhead. Experimental results on semantic segmentation and classification tasks with different baselines verify the generalization of the proposed method, which achieves significant performance improvements at a smaller parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method. The code and training weights are available at: this https URL
https://arxiv.org/abs/2506.15160
Ensuring equitable public transit access remains challenging, particularly in densely populated cities like New York City (NYC), where low-income and minority communities often face limited transit accessibility. Bike-sharing systems (BSS) can bridge these equity gaps by providing affordable first- and last-mile connections. However, strategically expanding BSS into underserved neighborhoods is difficult due to uncertain bike-sharing demand at newly planned ("cold-start") station locations and limitations in traditional accessibility metrics that may overlook realistic bike usage potential. We introduce Transit for All (TFA), a spatial computing framework designed to guide the equitable expansion of BSS through three components: (1) spatially informed bike-sharing demand prediction at cold-start stations using region representation learning that integrates multimodal geospatial data, (2) comprehensive transit accessibility assessment leveraging our novel weighted Public Transport Accessibility Level (wPTAL), which combines predicted bike-sharing demand with conventional transit accessibility metrics, and (3) strategic recommendations for new bike station placements that consider potential ridership and equity enhancement. Using NYC as a case study, we identify transit accessibility gaps that disproportionately impact low-income and minority communities in historically underserved neighborhoods. Our results show that strategically placing new stations guided by wPTAL notably reduces disparities in transit access related to economic and demographic factors. Our study demonstrates that TFA provides practical guidance for urban planners to promote equitable transit and enhance the quality of life in underserved urban communities.
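The abstract does not give wPTAL's exact functional form, so as a sketch under that caveat, it can be read as a convex blend of a normalized conventional PTAL score with normalized predicted cold-start demand:

```python
import numpy as np

def wptal(ptal: np.ndarray, predicted_demand: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Hypothetical weighted PTAL: blend of normalized accessibility and demand.
    `w` trades off conventional transit access vs. predicted bike-share usage."""
    def minmax(x: np.ndarray) -> np.ndarray:
        rng = np.ptp(x)
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return (1.0 - w) * minmax(ptal) + w * minmax(predicted_demand)
```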
https://arxiv.org/abs/2506.15113
Speech enhancement, particularly denoising, is vital for improving the intelligibility and quality of speech signals in real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models, Wave-U-Net, CMGAN, and U-Net, on the SpEAR, VPQAD, and Clarkson datasets. These models were chosen for their relevance in the literature and the accessibility of their code. The evaluation reveals that U-Net achieves strong noise suppression, with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN leads in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well suited for applications prioritizing natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research indicates how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.
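For context on the quoted gains, a relative SNR improvement is just a percent change over the unprocessed signal's SNR; whether the study computes it exactly this way (and on which SNR scale) is our assumption:

```python
def snr_improvement_pct(snr_before: float, snr_after: float) -> float:
    """Percent improvement of output SNR over input SNR."""
    return 100.0 * (snr_after - snr_before) / abs(snr_before)

print(f"{snr_improvement_pct(2.5, 4.3):+.1f}%")  # +72.0%
```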
https://arxiv.org/abs/2506.15000
Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy optimization for Locatability assessment and Optimized visual-clue reasoning, yielding Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance locatability assessment, visual clue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories.
https://arxiv.org/abs/2506.14674
The timely exchange of information among robots within a team is vital, but it can be constrained by limited wireless capacity. The inability to deliver information promptly can result in estimation errors that impact collaborative efforts among robots. In this paper, we propose a new metric termed Loss of Information Utility (LoIU) to quantify the freshness and utility of information critical for cooperation. The metric enables robots to prioritize information transmissions within bandwidth constraints. We also propose the estimation of LoIU using belief distributions and accordingly optimize both transmission schedule and resource allocation strategy for device-to-device transmissions to minimize the time-average LoIU within a robot team. A semi-decentralized Multi-Agent Deep Deterministic Policy Gradient framework is developed, where each robot functions as an actor responsible for scheduling transmissions among its collaborators while a central critic periodically evaluates and refines the actors in response to mobility and interference. Simulations validate the effectiveness of our approach, demonstrating an enhancement of information freshness and utility by 98%, compared to alternative methods.
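As a contrast to the learned MADDPG scheduler, the simplest LoIU-aware baseline just spends the available bandwidth on the links with the largest current utility loss; the sketch below (array layout and names assumed) shows that greedy rule:

```python
import numpy as np

def greedy_loiu_schedule(loiu: np.ndarray, n_slots: int) -> list[int]:
    """Transmit to the collaborators whose information is currently most stale/useful.
    loiu[i] = estimated loss of information utility for link i."""
    return np.argsort(-loiu)[:n_slots].tolist()

# Example: 5 candidate links, bandwidth for 2 transmissions per slot
print(greedy_loiu_schedule(np.array([0.2, 0.9, 0.4, 0.7, 0.1]), 2))  # [1, 3]
```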
https://arxiv.org/abs/2506.14237
Face super-resolution (FSR) under limited computational budgets remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performance. CNNs are relatively sensitive to high-frequency facial features, such as component contours and facial outlines, while Mamba excels at capturing low-frequency features like facial color and fine-grained texture, and does so with lower complexity than Transformers. Motivated by these observations, we propose FADPNet, a Frequency-Aware Dual-Path Network that decomposes facial features into low- and high-frequency components and processes them via dedicated branches. For low-frequency regions, we introduce a Mamba-based Low-Frequency Enhancement Block (LFEB), which combines state-space attention with squeeze-and-excitation operations to extract low-frequency global interactions and emphasize informative channels. For high-frequency regions, we design a CNN-based Deep Position-Aware Attention (DPA) module to enhance spatially dependent structural details, complemented by a lightweight High-Frequency Refinement (HFR) module that further refines frequency-specific representations. Through these designs, our method achieves an excellent balance between FSR quality and model efficiency, outperforming existing approaches.
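The dual-path split can be sketched with a simple pooling-based decomposition, where average pooling keeps the low-frequency component and the residual carries high-frequency detail; the paper's actual decomposition operator is not specified here, so treat this as an assumed form:

```python
import torch
import torch.nn.functional as F

def split_frequencies(feat: torch.Tensor, k: int = 4):
    """feat: [B, C, H, W]. Returns (low, high) with low + high == feat."""
    low = F.avg_pool2d(feat, kernel_size=k)  # low-pass by pooling
    low = F.interpolate(low, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    high = feat - low                        # residual = high frequency
    return low, high
```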
https://arxiv.org/abs/2506.14121
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or on meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering and post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
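A minimal sketch of the GAN-RM training signal, assuming outputs are already encoded as fixed-size feature vectors: the reward head is trained to separate the few hundred Preference Proxy Data samples from model generations, and its logit then ranks candidates for Best-of-N filtering. The head architecture and shapes are illustrative.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Tiny discriminator over precomputed features (architecture assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.fc(feats).squeeze(-1)

def gan_rm_step(head, opt, target_feats, model_feats) -> float:
    """One discriminator step: Preference Proxy Data (label 1) vs model outputs (label 0)."""
    bce = nn.BCEWithLogitsLoss()
    logits_t, logits_m = head(target_feats), head(model_feats)
    loss = bce(logits_t, torch.ones_like(logits_t)) + bce(logits_m, torch.zeros_like(logits_m))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def best_of_n(head, candidate_feats: torch.Tensor) -> int:
    """Test-time scaling: keep the candidate the reward head scores highest."""
    with torch.no_grad():
        return int(head(candidate_feats).argmax())
```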
https://arxiv.org/abs/2506.13846
Depth map enhancement using paired high-resolution RGB images offers a cost-effective solution for improving low-resolution depth data from lightweight ToF sensors. Nevertheless, naively adopting a depth estimation pipeline to fuse the two modalities requires ground-truth depth maps for supervision. To address this, we propose a self-supervised learning framework, SelfToF, which generates detailed and scale-aware depth maps. Starting from an image-based self-supervised depth estimation pipeline, we add low-resolution depth as input, design a new depth consistency loss, and propose a scale-recovery module, together yielding a large performance boost. Furthermore, since ToF signal sparsity varies in real-world applications, we upgrade SelfToF to SelfToF* with submanifold convolution and guided feature fusion. Consequently, SelfToF* maintains robust performance across varying sparsity levels in ToF data. Overall, our proposed method is both efficient and effective, as verified by extensive experiments on the NYU and ScanNet datasets. The code will be made public.
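SelfToF's scale-recovery module is learned, but the problem it addresses can be sketched in closed form: align the scale-ambiguous image-based prediction to the low-resolution ToF depth with a median ratio over valid pixels (a common choice in self-supervised depth work, assumed here):

```python
import torch

def recover_scale(pred: torch.Tensor, tof_depth: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """Rescale a relative depth prediction using sparse/low-res ToF measurements.
    pred, tof_depth: [H, W]; valid: boolean mask of usable ToF pixels."""
    ratio = tof_depth[valid] / pred[valid].clamp(min=1e-6)
    return pred * ratio.median()
```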
https://arxiv.org/abs/2506.13444
Typically, research on Explainable Artificial Intelligence (XAI) focuses on black-box models within the context of a general policy in a known, specific domain. This paper advocates for the need for knowledge-agnostic explainability applied to the subfield of XAI called Explainable Search, which focuses on explaining the choices made by intelligent search techniques. It proposes Monte-Carlo Tree Search (MCTS) enhancements as a solution to obtaining additional data and providing higher-quality explanations while remaining knowledge-free, and analyzes the most popular enhancements in terms of the specific types of explainability they introduce. So far, no other research has considered the explainability of MCTS enhancements. We present a proof-of-concept that demonstrates the advantages of utilizing enhancements.
https://arxiv.org/abs/2506.13223
The integration of robotics and augmented reality (AR) holds transformative potential for advancing human-robot interaction (HRI), offering enhancements in usability, intuitiveness, accessibility, and collaborative task performance. This paper introduces and evaluates a novel multimodal AR-based robot puppeteer framework that enables intuitive teleoperation of a physical robot via its virtual counterpart, through large language model (LLM)-driven voice commands and hand-gesture interactions. Using the Meta Quest 3, users interact with a virtual counterpart robot in real time, effectively "puppeteering" its physical counterpart within an AR environment. We conducted a within-subject user study with 42 participants performing robotic cube pick-and-place tasks with pattern matching under two conditions: gesture-only interaction and combined voice-and-gesture interaction. Both objective performance metrics and subjective user experience (UX) measures were assessed, including an extended comparative analysis between roboticists and non-roboticists. The results provide key insights into how multimodal input influences contextual task efficiency, usability, and user satisfaction in AR-based HRI. Our findings offer practical implications for designing effective AR-enhanced HRI systems.
https://arxiv.org/abs/2506.13189
[...] Since then, various APR approaches, especially those leveraging the power of large language models (LLMs), have been rapidly developed to fix general software bugs. Unfortunately, the effectiveness of these advanced techniques on regression bugs remains largely unexplored. This gap motivates an empirical study evaluating the effectiveness of modern APR techniques in fixing real-world regression bugs. In this work, we conduct such a study of APR techniques on Java regression bugs. To facilitate it, we introduce RegMiner4APR, a high-quality benchmark of Java regression bugs integrated into a framework designed to support APR research. The current benchmark includes 99 regression bugs collected from 32 widely used real-world Java GitHub repositories. We begin with an in-depth analysis of the benchmark, demonstrating its diversity and quality. Building on this foundation, we empirically evaluate the capabilities of both traditional APR tools and advanced LLM-based APR approaches on regression bugs. Our experimental results show that classical APR tools fail to repair any of the bugs, while LLM-based APR approaches exhibit promising potential. Motivated by these results, we investigate the impact of incorporating bug-inducing change information into LLM-based APR approaches for fixing regression bugs. Our results highlight that this context-aware enhancement significantly improves the performance of LLM-based APR, yielding 1.8x more successful repairs than LLM-based APR without such context.
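Concretely, the context-aware enhancement amounts to adding the bug-inducing change to the repair prompt. A hypothetical prompt builder (the template wording is ours, not the study's):

```python
def build_repair_prompt(buggy_code: str, failing_test: str, bug_inducing_diff: str | None) -> str:
    """Assemble an LLM repair prompt; appending the bug-inducing change is the
    context-aware variant the study evaluates (template wording assumed)."""
    prompt = (
        "The following Java method fails a regression test.\n\n"
        f"// Buggy code\n{buggy_code}\n\n"
        f"// Failing test\n{failing_test}\n"
    )
    if bug_inducing_diff:
        prompt += f"\n// Change that introduced the regression\n{bug_inducing_diff}\n"
    prompt += "\nReturn a fixed version of the method."
    return prompt
```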
https://arxiv.org/abs/2506.13182
In recent years, complexity compression of neural network (NN)-based speech enhancement (SE) models has gradually attracted the attention of researchers, especially in scenarios with limited hardware resources or strict latency requirements. The main difficulties and challenges lie in achieving a balance between complexity and performance according to the characteristics of the task. In this paper, we propose an intra-inter set knowledge distillation (KD) framework with time-frequency calibration (I²S-TFCKD) for SE. Different from previous distillation strategies for SE, the proposed framework fully utilizes the time-frequency differential information of speech while promoting global knowledge flow. Firstly, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. Secondly, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through residual fusion to form the fused feature set that enables inter-set knowledge interaction. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.
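A rough sketch of the cross-calibration idea, under our reading of the abstract: compare teacher and student time profiles and frequency profiles per layer, then cross-weight so that layers with larger time-frequency discrepancies receive more distillation weight. The tensor layout and the exact weighting rule are assumptions.

```python
import torch
import torch.nn.functional as F

def tf_calibration_weights(teacher: torch.Tensor, student: torch.Tensor) -> torch.Tensor:
    """teacher/student: [L, B, T, Fq] per-layer feature maps (layout assumed)."""
    # similarity of time profiles (freq pooled away) and freq profiles (time pooled away)
    t_sim = F.cosine_similarity(teacher.mean(dim=3), student.mean(dim=3), dim=-1)  # [L, B]
    f_sim = F.cosine_similarity(teacher.mean(dim=2), student.mean(dim=2), dim=-1)  # [L, B]
    # cross-weighting: layers where both profiles disagree get more distillation weight
    return torch.softmax((1.0 - t_sim) * (1.0 - f_sim), dim=0)  # weights over layers
```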
https://arxiv.org/abs/2506.13127
Large Multimodal Models (LMMs) demonstrate significant cross-modal reasoning capabilities. However, financial applications face challenges due to the lack of high-quality multimodal reasoning datasets and the inefficiency of existing training paradigms for reasoning enhancement. To address these issues, we propose an integrated framework, FinLMM-R1, combining an automated and scalable pipeline for data construction with enhanced training strategies to improve the multimodal reasoning of LMMs. The Automated and Scalable Pipeline (ASP) resolves textual-visual misalignment in financial reports through a separate paradigm of question-answer generation and image-question alignment, ensuring data integrity and extraction efficiency. Through ASP, we collect 89,378 aligned image-question pairs from 23,397 financial reports, covering tasks such as arithmetic reasoning, statistics reasoning, financial explanation, and financial knowledge. Moreover, we introduce Thinking with Adversarial Reward in LMM (TAR-LMM), extending the prior two-stage training framework [1] with additional reward mechanisms. In the first stage, we focus on text-only tasks with format and accuracy rewards to guide the model in generating well-structured thinking contents. In the second stage, we construct multi-image contrastive samples with additional reward components including image selection, thinking content length, and adversarial reward to jointly optimize the LMM across visual perception, reasoning efficiency, and logical coherence. Extensive experiments on 7 benchmarks show that the ASP-derived dataset and training framework significantly improve answer accuracy and reasoning depth over existing reasoning LMMs in both general and financial multimodal contexts.
https://arxiv.org/abs/2506.13066
Recent advancements in large language models (LLMs) have witnessed a surge in the development of advanced reasoning paradigms, which are now being integrated into multimodal large language models (MLLMs). However, existing approaches often fall short: methods solely employing reinforcement learning (RL) can struggle with sample inefficiency and activating entirely absent reasoning capabilities, while conventional pipelines that initiate with a cold-start supervised fine-tuning (SFT) phase before RL may restrict the model's exploratory capacity and face suboptimal convergence. In this work, we introduce Metis-RISE (RL Incentivizes and SFT Enhances) for multimodal reasoning model learning. Unlike conventional approaches, Metis-RISE distinctively omits an initial SFT stage, beginning instead with an RL phase (e.g., using a Group Relative Policy Optimization variant) to incentivize and activate the model's latent reasoning capacity. Subsequently, the targeted SFT stage addresses two key challenges identified during RL: (1) inefficient trajectory sampling for tasks where the model possesses but inconsistently applies correct reasoning, which we tackle using self-distilled reasoning trajectories from the RL model itself; and (2) fundamental capability absence, which we address by injecting expert-augmented knowledge for prompts where the model entirely fails. This strategic application of RL for incentivization followed by SFT for enhancement forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard demonstrate that both models achieve state-of-the-art performance among similar-sized models, with the 72B version ranking fourth overall.
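The self-distillation remedy for challenge (1) can be sketched as a simple filter: sample several trajectories per prompt from the RL-trained model, keep those a verifier marks correct, and reuse them as SFT targets. `sample_fn` and `verify_fn` are placeholders for the model's sampler and answer checker, not Metis-RISE's actual interface.

```python
def build_self_distilled_sft(problems, sample_fn, verify_fn, k: int = 8):
    """Keep self-generated, verified-correct reasoning trajectories as SFT data."""
    sft_pairs = []
    for prob in problems:
        correct = [traj for traj in (sample_fn(prob) for _ in range(k)) if verify_fn(prob, traj)]
        if correct:                      # model *can* solve it, just inconsistently
            sft_pairs.append((prob, correct[0]))
    return sft_pairs
```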
https://arxiv.org/abs/2506.13056
Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that reproduces a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, ours achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
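The pipeline reads naturally as a loop; here is a schematic version in which `mllm` (a vision-language query) and `render` (a code executor that returns the plotted image) are placeholder callables rather than ChartIR's actual interface:

```python
def chart_to_code(mllm, render, reference_chart, max_rounds: int = 3) -> str:
    """Iterative refinement with description/difference instructions (schematic)."""
    # Stage 1: initial generation guided by the description instruction
    description = mllm("Describe the visual elements of this chart.", reference_chart)
    code = mllm(f"Write plotting code that reproduces this chart.\n{description}", reference_chart)
    # Stage 2: refine using the difference instruction
    for _ in range(max_rounds):
        current = render(code)
        diff = mllm("List the differences between these two charts.", (reference_chart, current))
        if "no differences" in diff.lower():
            break
        code = mllm(f"Revise the code to fix these differences:\n{diff}\n\n{code}", reference_chart)
    return code
```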
https://arxiv.org/abs/2506.14837
We present our solution to the MiGA Challenge at IJCAI 2025, which aims to recognize micro-gestures (MGs) from skeleton sequences for the purpose of hidden emotion understanding. MGs are characterized by their subtlety, short duration, and low motion amplitude, making them particularly challenging to model and classify. We adopt PoseC3D as the baseline framework and introduce three key enhancements: (1) a topology-aware skeleton representation specifically designed for the iMiGUE dataset to better capture fine-grained motion patterns; (2) an improved temporal processing strategy that facilitates smoother and more temporally consistent motion modeling; and (3) the incorporation of semantic label embeddings as auxiliary supervision to improve model generalization. Our method achieves a Top-1 accuracy of 67.01% on the iMiGUE test set. As a result of these contributions, our approach ranks third on the official MiGA Challenge leaderboard. The source code is available at this https URL.
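The auxiliary supervision in (3) could look like the following: alongside the classification cross-entropy, clip features are pulled toward their class's label embedding. The loss form and weight `lam` are our assumptions, not the solution's confirmed objective.

```python
import torch
import torch.nn.functional as F

def total_loss(logits: torch.Tensor, feats: torch.Tensor, labels: torch.Tensor,
               label_emb: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus an auxiliary term aligning clip features with their
    class's semantic label embedding. logits: [B, C]; feats: [B, D];
    label_emb: [C, D] (e.g., text-encoder embeddings of the class names)."""
    ce = F.cross_entropy(logits, labels)
    target = F.normalize(label_emb[labels], dim=-1)
    aux = 1.0 - F.cosine_similarity(F.normalize(feats, dim=-1), target, dim=-1).mean()
    return ce + lam * aux
```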
https://arxiv.org/abs/2506.12848