Encountering shifted data at test time is a ubiquitous challenge when deploying predictive models. Test-time adaptation (TTA) methods address this issue by continuously adapting a deployed model using only unlabeled test data. While TTA can extend the model's lifespan, it is only a temporary solution. Eventually the model might degrade to the point that it must be taken offline and retrained. To detect such points of ultimate failure, we propose pairing TTA with risk monitoring frameworks that track predictive performance and raise alerts when predefined performance criteria are violated. Specifically, we extend existing monitoring tools based on sequential testing with confidence sequences to accommodate scenarios in which the model is updated at test time and no test labels are available to estimate the performance metrics of interest. Our extensions unlock the application of rigorous statistical risk monitoring to TTA, and we demonstrate the effectiveness of our proposed TTA monitoring framework across a representative set of datasets, distribution shift types, and TTA methods.
https://arxiv.org/abs/2507.08721
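The alarm logic described in the test-time adaptation abstract above (arXiv:2507.08721) can be illustrated with a minimal Python sketch. It is not the authors' method: it assumes that per-sample losses in [0, 1] are observed and i.i.d., which is exactly the assumption the paper relaxes for unlabeled, continuously adapted models. The function names, the Hoeffding bound, and the error-spending schedule delta / (t (t + 1)) are illustrative choices.

```python
import math


def hoeffding_cs_radius(t: int, delta: float) -> float:
    """Time-uniform radius for the running mean of t i.i.d. losses in [0, 1].

    Step t spends delta_t = delta / (t * (t + 1)). These budgets sum to delta
    over all t >= 1, so a union bound makes the interval hold at every step
    simultaneously with probability at least 1 - delta.
    """
    delta_t = delta / (t * (t + 1))
    return math.sqrt(math.log(2.0 / delta_t) / (2.0 * t))


def first_alarm(losses, threshold: float, delta: float = 0.05):
    """Return the 1-based step at which the lower confidence bound on the
    mean loss first exceeds `threshold`, or None if it never does."""
    total = 0.0
    for t, loss in enumerate(losses, start=1):
        total += loss
        lower_bound = total / t - hoeffding_cs_radius(t, delta)
        if lower_bound > threshold:
            return t  # risk is above the threshold with high confidence
    return None


if __name__ == "__main__":
    # A model that is always wrong under 0-1 loss triggers an alarm quickly.
    print(first_alarm([1.0] * 50, threshold=0.2))
    # A model that is always right never triggers one.
    print(first_alarm([0.0] * 50, threshold=0.2))
```

Because the bound is valid at all steps at once, the monitor can be checked after every batch without inflating the false-alarm rate beyond delta.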
The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset that includes visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information as human developers do, in contrast to prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation that addresses the limitations of existing text-only benchmarks. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.
https://arxiv.org/abs/2507.08719
Scaling laws have achieved success in LLMs and foundation models. To explore their potential in ISAC research, we propose Great-X. This single-engine multimodal data twin platform reconstructs the ray-tracing computation of Sionna within Unreal Engine and is deeply integrated with autonomous driving tools. This enables efficient and synchronized simulation of multimodal data, including CSI, RGB, Radar, and LiDAR. Based on this platform, we construct an open-source, large-scale, low-altitude UAV multimodal synaesthesia dataset named Great-MSD, and propose a baseline CSI-based UAV 3D localization algorithm, demonstrating its feasibility and generalizability across different CSI simulation engines. The related code and dataset are publicly available.
https://arxiv.org/abs/2507.08716
In developing innovative system architectures, modeling and optimization techniques have been central to framing the architecting process and defining the optimization and modeling problems. In this context, for systems of systems, the use of efficient dedicated approaches (often physics-based simulations) is highly recommended to reduce the computational complexity of the targeted applications. However, exploring novel architectures using such dedicated approaches might pose challenges for optimization algorithms, including increased evaluation costs and potential failures. To address these challenges, surrogate-based optimization algorithms, such as Bayesian optimization utilizing Gaussian process models, have emerged.
https://arxiv.org/abs/2507.08715
Multiple Instance Learning (MIL) offers a natural solution for settings where only coarse, bag-level labels are available, without access to instance-level annotations. This is usually the case in digital pathology, which consists of gigapixel-sized images. While deterministic attention-based MIL approaches achieve strong bag-level performance, they often overlook the uncertainty inherent in instance relevance. In this paper, we address the lack of uncertainty quantification in instance-level attention scores by introducing SGPMIL, a new probabilistic attention-based MIL framework grounded in Sparse Gaussian Processes (SGP). By learning a posterior distribution over attention scores, SGPMIL enables principled uncertainty estimation, resulting in more reliable and calibrated instance relevance maps. Our approach not only preserves competitive bag-level performance but also significantly improves the quality and interpretability of instance-level predictions under uncertainty. SGPMIL extends prior work by introducing feature scaling in the SGP predictive mean function, leading to faster training, improved efficiency, and enhanced instance-level performance. Extensive experiments on multiple well-established digital pathology datasets highlight the effectiveness of our approach across both bag- and instance-level evaluations. Our code will be made publicly available.
https://arxiv.org/abs/2507.08711
We propose a novel embedding-based captioning metric, termed L-CLIPScore, that can be used to efficiently evaluate caption quality and to train captioning models. L-CLIPScore is calculated from a lightweight CLIP (L-CLIP), a dual-encoder architecture compressed and distilled from CLIP. To compress, we apply two powerful techniques, weight multiplexing and matrix decomposition, to reduce the parameters of the encoders and the word embedding matrix, respectively. To distill, we design a novel multi-modal Similarity Regulator (SR) loss to transfer more vision-language alignment knowledge. Specifically, the SR loss amplifies the multi-modal embedding similarity if the given image-text pair is matched and diminishes the similarity if the pair is not matched. By compressing and distilling with this novel SR loss, our L-CLIP achieves multi-modal alignment ability comparable to the original CLIP while requiring fewer computational resources and less running time. We carry out exhaustive experiments to validate the efficiency and effectiveness of L-CLIPScore when it is used as the judge to evaluate caption quality. We also find that when L-CLIPScore is used as the supervisor to train the captioning model, it should be mixed with an n-gram-based metric, and we analyze why training with L-CLIPScore alone fails.
https://arxiv.org/abs/2507.08710
Inverse Reinforcement Learning (IRL) presents a powerful paradigm for learning complex robotic tasks from human demonstrations. However, most approaches make the assumption that expert demonstrations are available, which is often not the case. Those that allow for suboptimality in the demonstrations are not designed for long-horizon goals or adversarial tasks. Many desirable robot capabilities fall into one or both of these categories, thus highlighting a critical shortcoming in the ability of IRL to produce field-ready robotic agents. We introduce Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations (SPLASH), which advances the state-of-the-art in learning from suboptimal demonstrations to long-horizon and adversarial settings. We empirically validate SPLASH on a maritime capture-the-flag task in simulation, and demonstrate real-world applicability with sim-to-real translation experiments on autonomous unmanned surface vehicles. We show that our proposed methods allow SPLASH to significantly outperform the state-of-the-art in reward learning from suboptimal demonstrations.
https://arxiv.org/abs/2507.08707
We present elsciRL, an open-source Python library to facilitate the application of language solutions on reinforcement learning problems. We demonstrate the potential of our software by extending the Language Adapter with Self-Completing Instruction framework defined in (Osborne, 2024) with the use of LLMs. Our approach can be re-applied to new applications with minimal setup requirements. We provide a novel GUI that allows a user to provide text input for an LLM to generate instructions which it can then self-complete. Empirical results indicate that these instructions can improve a reinforcement learning agent's performance. Therefore, we present this work to accelerate the evaluation of language solutions on reward-based environments to enable new opportunities for scientific discovery.
https://arxiv.org/abs/2507.08705
Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured and grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model's generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. The inward pathway complements the outward pathway by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward pathway selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test time, without any parameter modification. Extensive experiments on five benchmarks verify the comparable knowledge fusion performance of KGA.
https://arxiv.org/abs/2507.08704
We present ONION, a multi-layered framework for participatory Entity-Relationship (ER) modeling that integrates insights from design justice, participatory AI, and conceptual modeling. ONION introduces a five-stage methodology: Observe, Nurture, Integrate, Optimize, Normalize. It supports progressive abstraction from unstructured stakeholder input to structured ER diagrams. Our approach aims to reduce designer bias, promote inclusive participation, and increase transparency through the modeling process. We evaluate ONION through real-world workshops focused on sociotechnical systems in Ukraine, highlighting how diverse stakeholder engagement leads to richer data models and deeper mutual understanding. Early results demonstrate ONION's potential to host diversity in early-stage data modeling. We conclude with lessons learned, limitations and challenges involved in scaling and refining the framework for broader adoption.
https://arxiv.org/abs/2507.08702
There is an imperative need to provide quality of life to a growing population of older adults living independently. Personalised solutions that focus on the person and take into consideration their preferences and context are key. In this work, we introduce a framework for representing and reasoning about the Activities of Daily Living of older adults living independently at home. The framework integrates data from sensors and contextual information that aggregates semi-structured interviews, home layouts and sociological observations from the participants. We use these data to create formal models, personalised for each participant according to their preferences and context. We formulate requirements that are specific to each individual as properties encoded in Linear Temporal Logic and use a model checker to verify whether each property is satisfied by the model. When a property is violated, a counterexample is generated giving the cause of the violation. We demonstrate the framework's generalisability by applying it to different participants, highlighting its potential to enhance the safety and well-being of older adults ageing in place.
https://arxiv.org/abs/2507.08701
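The verification step in the Activities of Daily Living abstract above (arXiv:2507.08701) can be illustrated with a small Python sketch. It checks one response-style property, G(trigger -> F response), over a single finite trace and reports the position of the violation as a counterexample. This is an assumption-laden simplification: a real model checker explores every behaviour of a formal model rather than one recorded trace, and the proposition names `stove_on` and `stove_off` are hypothetical, not taken from the paper.

```python
def always_eventually_follows(trace, trigger, response):
    """Check G(trigger -> F response) on a finite trace.

    `trace` is a list of sets of atomic propositions, one set per time step.
    Returns (True, None) if every trigger is answered at the same or a later
    step, otherwise (False, i) where i is the first unanswered trigger.
    """
    for i, state in enumerate(trace):
        if trigger in state:
            answered = any(response in later for later in trace[i:])
            if not answered:
                return False, i  # counterexample: trigger at step i
    return True, None


if __name__ == "__main__":
    # Hypothetical sensor trace: the stove is switched on twice but only
    # switched off once, so the second activation violates the property.
    trace = [{"stove_on"}, {"stove_off"}, {"stove_on"}, set()]
    print(always_eventually_follows(trace, "stove_on", "stove_off"))
```

The returned index plays the role of the counterexample mentioned in the abstract: it points at the step that caused the violation.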
Magnetic resonance imaging (MRI) enables non-invasive, high-resolution analysis of muscle structures. However, automated segmentation remains limited by high computational costs, reliance on large training datasets, and reduced accuracy in segmenting smaller muscles. Convolutional neural network (CNN)-based methods, while powerful, often suffer from substantial computational overhead, limited generalizability, and poor interpretability across diverse populations. This study proposes a training-free segmentation approach based on keypoint tracking, which integrates keypoint selection with Lucas-Kanade optical flow. The proposed method achieves a mean Dice similarity coefficient (DSC) ranging from 0.6 to 0.7, depending on the keypoint selection strategy, performing comparably to state-of-the-art CNN-based models while substantially reducing computational demands and enhancing interpretability. This scalable framework presents a robust and explainable alternative for muscle segmentation in clinical and research applications.
https://arxiv.org/abs/2507.08690
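For readers unfamiliar with the Dice similarity coefficient (DSC) reported in the muscle segmentation abstract above (arXiv:2507.08690), the metric is DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The Python sketch below computes it on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two flattened binary masks.

    `pred` and `truth` are equal-length sequences of 0/1 (or False/True).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_coefficient(predicted, reference))  # 0.5
```

A DSC of 0.6 to 0.7, as reported in the abstract, therefore means that the overlap is 60 to 70 percent of the average mask size.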
Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without the reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning -- especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.
https://arxiv.org/abs/2507.08683
We introduce ByDeWay, a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP), which improves spatial reasoning and grounding without modifying any model parameters. It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model. These structured, depth-aware captions are appended to the image-question prompt, enriching it with spatial context. This guides MLLMs to produce more grounded and less hallucinated responses. Our method is lightweight, modular, and compatible with black-box MLLMs. Experiments on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show consistent improvements across multiple MLLMs, validating the effectiveness of depth-aware prompting in a zero-training setting.
https://arxiv.org/abs/2507.08679
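The prompting pipeline in the ByDeWay abstract above (arXiv:2507.08679) can be sketched in Python as two steps: assign depth values to closest, mid-range, and farthest layers, then append one caption per layer to the question. The abstract does not say how the layer boundaries are chosen, so the equal-width split below is an assumption, as are the function names, the prompt layout, and the convention that smaller depth values are nearer. The captions would come from a vision-language model; here they are hypothetical strings.

```python
LAYERS = ("closest", "mid-range", "farthest")


def layer_of(depth, d_min, d_max):
    """Map one depth value to a layer using equal-width thirds of [d_min, d_max]."""
    if d_max <= d_min:
        return LAYERS[0]  # flat depth map: everything falls in one layer
    fraction = (depth - d_min) / (d_max - d_min)
    index = max(0, min(int(fraction * 3), 2))  # clamp out-of-range values
    return LAYERS[index]


def build_prompt(question, captions):
    """Append the available per-layer captions to the question, nearest first."""
    lines = [question]
    for name in LAYERS:
        if name in captions:
            lines.append(f"[{name}] {captions[name]}")
    return "\n".join(lines)


if __name__ == "__main__":
    depths = [0.5, 4.5, 9.0]
    print([layer_of(d, 0.0, 9.0) for d in depths])
    # Hypothetical captions standing in for a captioning model's output.
    captions = {"closest": "a dog on grass", "farthest": "a row of trees"}
    print(build_prompt("Is there a dog in the image?", captions))
```

Since the depth-aware captions are added as plain text, the sketch is consistent with the abstract's point that the approach works with black-box MLLMs.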
Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework KELPS (Knowledge-Equation based Logical Processing System) to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as Deepseek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and codes are available in the supplementary materials.
https://arxiv.org/abs/2507.08665
AI Agents rely on Large Language Models (LLMs) and Multimodal-LLMs (MLLMs) to perform interpretation and inference in text and image tasks without post-training; the LLMs and MLLMs play the most critical role and determine the initial abilities and limitations of AI Agents. Usually, AI Agents use sophisticated prompt engineering and external reasoning frameworks, e.g., Chain-of-Thought, Iteration of Thought, and Image-of-Thought, to obtain promising interactions with LLMs. However, they are still constrained by the inherent limitations of LLMs in understanding natural language, and the iterative reasoning process incurs a large inference cost. To this end, we propose a novel AI Agent reasoning framework with Introspection of Thought (INoT), built on a new LLM-Read code placed in the prompt. It enables the LLM to execute programmatic dialogue reasoning processes by following the code in the prompt. Self-denial and reflection therefore occur inside the LLM rather than outside it, which effectively reduces token cost. Experiments on six benchmarks covering three different tasks verify the effectiveness of INoT, with an average performance improvement of 7.95% over the baselines. Furthermore, the token cost of INoT is on average 58.3% lower than that of the best-performing baseline method. In addition, we demonstrate the versatility of INoT in image interpretation and inference through verification experiments.
https://arxiv.org/abs/2507.08664
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good, if not better, than attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features revealing of speaker identity.
https://arxiv.org/abs/2507.08660
Learning whole-body control for locomotion and arm motions in a single policy has challenges, as the two tasks have conflicting goals. For instance, efficient locomotion typically favors a horizontal base orientation, while end-effector tracking may benefit from base tilting to extend reachability. Additionally, current Reinforcement Learning (RL) approaches using a pose-based task specification lack the ability to directly control the end-effector velocity, making smoothly executing trajectories very challenging. To address these limitations, we propose an RL-based framework that allows for dynamic, velocity-aware whole-body end-effector control. Our method introduces a multi-critic actor architecture that decouples the reward signals for locomotion and manipulation, simplifying reward tuning and allowing the policy to resolve task conflicts more effectively. Furthermore, we design a twist-based end-effector task formulation that can track both discrete poses and motion trajectories. We validate our approach through a set of simulation and hardware experiments using a quadruped robot equipped with a robotic arm. The resulting controller can simultaneously walk and move its end-effector and shows emergent whole-body behaviors, where the base assists the arm in extending the workspace, despite a lack of explicit formulations.
https://arxiv.org/abs/2507.08656
Purpose: Ultra-high-field 7T MRI offers improved resolution and contrast over standard clinical field strengths (1.5T, 3T). However, 7T scanners are costly, scarce, and introduce additional challenges such as susceptibility artifacts. We propose an efficient transformer-based model (7T-Restormer) to synthesize 7T-quality T1 maps from routine 1.5T or 3T T1-weighted (T1W) images. Methods: Our model was validated on 35 1.5T and 108 3T T1W MRI scans paired with corresponding 7T T1 maps of patients with confirmed MS. A total of 141 patient cases (32,128 slices) were randomly divided into 105 (25; 80) training cases (19,204 slices), 19 (5; 14) validation cases (3,476 slices), and 17 (5; 14) test cases (3,145 slices), where (X; Y) denotes the number of patients with 1.5T and 3T T1W scans, respectively. The synthetic 7T T1 maps were compared against those of the ResViT and ResShift models. Results: The 7T-Restormer model achieved a PSNR of 26.0 +/- 4.6 dB, SSIM of 0.861 +/- 0.072, and NMSE of 0.019 +/- 0.011 for 1.5T inputs, and a PSNR of 25.9 +/- 4.9 dB and SSIM of 0.866 +/- 0.077 for 3T inputs. Using 10.5M parameters, our model reduced NMSE by 64% relative to the 56.7M-parameter ResShift (0.019 vs 0.052, p < .001) and by 41% relative to the 70.4M-parameter ResViT (0.019 vs 0.032, p < .001) at 1.5T, with similar advantages at 3T (0.021 vs 0.060 and 0.033; p < .001). Training with a mixed 1.5T + 3T corpus was superior to single-field strategies: restricting training to 1.5T data increased the 1.5T NMSE from 0.019 to 0.021 (p = 1.1E-3), while training solely on 3T data resulted in lower performance on 1.5T T1W inputs. Conclusion: We propose a novel method for predicting quantitative 7T MP2RAGE maps from 1.5T and 3T T1W scans with higher quality than existing state-of-the-art methods. Our approach makes the benefits of 7T MRI more accessible to standard clinical workflows.
https://arxiv.org/abs/2507.08655
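Two of the image-quality metrics quoted in the 7T-Restormer abstract above (arXiv:2507.08655) can be written out in a few lines of Python. The abstract does not give its exact definitions, so the sketch assumes the common forms NMSE = sum((x - y)^2) / sum(y^2) and PSNR = 10 log10(MAX^2 / MSE), with `max_value` supplied by the caller; the function names are illustrative.

```python
import math


def nmse(pred, ref):
    """Normalized mean squared error: squared error divided by reference energy."""
    error = sum((p - r) ** 2 for p, r in zip(pred, ref))
    energy = sum(r ** 2 for r in ref)
    if energy == 0:
        raise ValueError("reference image has zero energy")
    return error / energy


def psnr(pred, ref, max_value):
    """Peak signal-to-noise ratio in dB for a given peak intensity."""
    mse = sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [1.0, 0.0, 1.0, 0.0]
    predicted = [1.0, 0.0, 0.0, 0.0]
    print(nmse(predicted, reference))            # 0.5
    print(psnr(predicted, reference, 1.0))       # about 6.02 dB
```

Lower NMSE and higher PSNR both indicate a synthetic map closer to the reference.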
In Wireless Networked Control Systems (WNCSs), control and communication systems must be co-designed due to their strong interdependence. This paper presents, for the first time in the literature, a novel optimization-theory-based safe deep reinforcement learning (DRL) framework for ultra-reliable WNCSs that ensures constraint satisfaction while optimizing performance. The approach minimizes power consumption under key constraints, including Peak Age of Information (PAoI) violation probability, transmit power, and schedulability in the finite blocklength regime. PAoI violation probability is uniquely derived by combining stochastic maximum allowable transfer interval (MATI) and maximum allowable packet delay (MAD) constraints in a multi-sensor network. The framework consists of two stages: optimization theory and safe DRL. The first stage derives optimality conditions to establish mathematical relationships among variables, simplifying and decomposing the problem. The second stage employs a safe DRL model in which a teacher-student framework guides the DRL agent (student). The control mechanism (teacher) evaluates compliance with system constraints and suggests the nearest feasible action when needed. Extensive simulations show that the proposed framework outperforms rule-based and other optimization-theory-based DRL benchmarks, achieving faster convergence, higher rewards, and greater stability.
https://arxiv.org/abs/2507.08653
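The teacher's role in the safe DRL abstract above (arXiv:2507.08653), checking an action against constraints and suggesting the nearest feasible one, can be illustrated with a deliberately simplified Python sketch. It assumes the only constraints are independent lower and upper bounds per action dimension (for example a transmit power range), in which case clipping gives the nearest feasible action in Euclidean distance. The paper's actual constraints (PAoI violation probability, schedulability) are more involved, and the function name and bounds here are hypothetical.

```python
def teacher_review(action, lower, upper):
    """Check an action against per-dimension bounds.

    Returns (is_feasible, suggested_action). For box constraints, clipping
    each coordinate is the Euclidean projection onto the feasible set, so the
    suggestion is the nearest feasible action.
    """
    suggested = [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]
    is_feasible = suggested == list(action)
    return is_feasible, suggested


if __name__ == "__main__":
    # Hypothetical two-dimensional action with bounds [0, 1] on each dimension.
    print(teacher_review([0.5, 2.0], [0.0, 0.0], [1.0, 1.0]))
    print(teacher_review([0.5, 0.7], [0.0, 0.0], [1.0, 1.0]))
```

In a training loop, the student would execute the suggested action whenever the teacher reports the proposed one as infeasible.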