We introduce Leanabell-Prover-V2, a 7B large language model (LLM) that can produce formal theorem proofs in Lean 4 with verifier-integrated long chains of thought (CoT). Following our previous work, Leanabell-Prover-V1, we continue to post-train existing strong prover models for further performance improvement. In this V2 version, we mainly upgrade the reinforcement learning (RL) stage with feedback provided by the Lean 4 verifier. Crucially, verifier feedback, such as indicating success or detailing specific errors, allows the LLM to become ``self-aware'' of the correctness of its own reasoning process and to learn to reflexively correct errors. Leanabell-Prover-V2 directly optimizes LLM reasoning trajectories with multi-turn verifier interactions, together with feedback token masking for stable RL training and a simple reward strategy. Experiments show that Leanabell-Prover-V2 improves performance by 3.2% (pass@128) with Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) with DeepSeek-Prover-V2-7B on the MiniF2F test set. The source code, curated data, and models are available at: this https URL.
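The feedback-token masking mentioned above can be illustrated with a short sketch: in a multi-turn rollout that interleaves model-generated proof steps with Lean 4 verifier messages, only the model's own tokens should contribute to the RL loss. The following is a minimal PyTorch sketch under assumed tensor shapes and a hypothetical `verifier_mask`; it illustrates the general idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masked_policy_loss(logits, tokens, advantages, verifier_mask):
    """Policy-gradient loss over a multi-turn rollout.

    logits:        (T, vocab) model logits for each position in the rollout
    tokens:        (T,) token ids actually present in the rollout
    advantages:    (T,) per-token advantage (e.g., a proof-level reward broadcast over tokens)
    verifier_mask: (T,) bool, True where the token comes from the Lean 4 verifier
                   feedback rather than from the model; those positions are excluded
                   so the model is never trained to imitate verifier messages.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    keep = (~verifier_mask).float()
    # REINFORCE-style objective restricted to model-generated tokens.
    loss = -(token_log_probs * advantages * keep).sum() / keep.sum().clamp(min=1.0)
    return loss
```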
https://arxiv.org/abs/2507.08649
Constructing image datasets typically depends on time-intensive and inefficient manual collection and annotation. Large models offer a solution via data generation. Nonetheless, real-world data are clearly more valuable than AI-generated data, particularly for constructing image datasets. For this reason, we propose a novel method for automatically constructing datasets from real-world images with a multi-agent collaborative system, named DatasetAgent. By coordinating four different agents equipped with Multi-modal Large Language Models (MLLMs), as well as a tool package for image optimization, DatasetAgent is able to construct high-quality image datasets according to user-specified requirements. In particular, two types of experiments are conducted on a variety of open-source datasets: expanding existing datasets and creating new ones from scratch. In both cases, multiple image datasets constructed by DatasetAgent are used to train various vision models for image classification, object detection, and image segmentation.
https://arxiv.org/abs/2507.08648
Multi-view camera-based 3D perception can be conducted using bird's eye view (BEV) features obtained through perspective view-to-BEV transformations. Several studies have shown that the performance of these 3D perception methods can be further enhanced by combining sequential BEV features obtained from multiple camera frames. However, even after compensating for the ego-motion of an autonomous agent, the performance gain from temporal aggregation is limited when combining a large number of image frames. This limitation arises due to dynamic changes in BEV features over time caused by object motion. In this paper, we introduce a novel temporal 3D perception method called OnlineBEV, which combines BEV features over time using a recurrent structure. This structure increases the effective number of combined features with minimal memory usage. However, it is critical to spatially align the features over time to maintain strong performance. OnlineBEV employs the Motion-guided BEV Fusion Network (MBFNet) to achieve temporal feature alignment. MBFNet extracts motion features from consecutive BEV frames and dynamically aligns historical BEV features with current ones using these motion features. To enforce temporal feature alignment explicitly, we use Temporal Consistency Learning Loss, which captures discrepancies between historical and target BEV features. Experiments conducted on the nuScenes benchmark demonstrate that OnlineBEV achieves significant performance gains over the current best method, SOLOFusion. OnlineBEV achieves 63.9% NDS on the nuScenes test set, recording state-of-the-art performance in the camera-only 3D object detection task.
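A rough sketch of the alignment idea: a predicted BEV flow warps the historical BEV feature map toward the current frame, and a temporal consistency loss penalizes the remaining discrepancy. The flow predictor, the tensor shapes, and the plain MSE loss below are assumptions for illustration; the paper's MBFNet and loss may differ.

```python
import torch
import torch.nn.functional as F

def warp_bev(hist_bev, flow):
    """Warp a historical BEV feature map with a per-cell 2D flow field.

    hist_bev: (B, C, H, W) historical BEV features
    flow:     (B, 2, H, W) predicted displacement in BEV cells (dx, dy)
    """
    B, _, H, W = hist_bev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(hist_bev.device)   # (H, W, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)                # add predicted motion
    # Normalize to [-1, 1] as expected by grid_sample.
    gx = 2.0 * grid[..., 0] / (W - 1) - 1.0
    gy = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(hist_bev, torch.stack((gx, gy), dim=-1), align_corners=True)

def temporal_consistency_loss(hist_bev, curr_bev, flow):
    """Penalize disagreement between the motion-aligned history and the current BEV."""
    aligned = warp_bev(hist_bev, flow)
    return F.mse_loss(aligned, curr_bev)
```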
https://arxiv.org/abs/2507.08644
Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism with linear $O(n)$ time complexity that enables long-sequence processing without a performance trade-off. WERSA merges content-adaptive random spectral features with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of the data while preserving linear efficiency. Large-scale comparisons \textbf{on a single GPU}, across various benchmarks (vision, NLP, hierarchical reasoning) and various attention mechanisms (Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal consistent advantages for WERSA. It achieves the best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2\% (86.2\% vs 85.0\%) while cutting training time by 81\% (296s vs 1554s) and FLOPS by 73.4\% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla attention and FlashAttention-2 fail: on ArXiv-128k's extremely long sequences, it achieves the best accuracy (79.1\%) and AUC (0.979) among viable methods, operating on data that causes Out-Of-Memory errors for quadratic methods, while being \textbf{twice as fast} as Waveformer, its next-best competitor. By significantly reducing computational load without compromising accuracy, WERSA makes practical, affordable long-context models possible, in particular on low-resource hardware, for more sustainable and more scalable AI development.
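The two ingredients named in the abstract, random spectral features and gated Haar wavelet scales, can be loosely sketched as follows: values are filtered by a one-level Haar transform with learnable band gates, and attention is linearized with a Performer-style random feature map so the cost stays $O(n)$. This is a simplified stand-in under assumed shapes, not the actual WERSA formulation.

```python
import torch

def random_feature_map(x, proj):
    """Positive random features approximating the softmax kernel (Performer-style).
    x: (B, N, d), proj: (d, m) fixed Gaussian projection."""
    scale = x.shape[-1] ** -0.25
    u = (x * scale) @ proj
    return torch.exp(u - (x * scale).pow(2).sum(-1, keepdim=True) / 2)

def haar_filter(v, gates):
    """Gate the coarse/detail bands of a one-level Haar transform along the
    sequence axis, then reconstruct (assumes an even sequence length)."""
    even, odd = v[:, 0::2], v[:, 1::2]
    coarse = gates[0] * (even + odd) / 2.0 ** 0.5
    detail = gates[1] * (even - odd) / 2.0 ** 0.5
    recon_even = (coarse + detail) / 2.0 ** 0.5
    recon_odd = (coarse - detail) / 2.0 ** 0.5
    return torch.stack((recon_even, recon_odd), dim=2).flatten(1, 2)

def linear_attention(q, k, v, proj, gates):
    """O(n) attention over wavelet-filtered values: kernelize q and k, then use
    the associativity trick phi(q) @ (phi(k)^T V) instead of an N x N matrix."""
    v = haar_filter(v, gates)
    qf, kf = random_feature_map(q, proj), random_feature_map(k, proj)
    kv = torch.einsum("bnm,bnd->bmd", kf, v)                    # (B, m, d)
    z = (qf @ kf.sum(dim=1).unsqueeze(-1)).clamp(min=1e-6)      # (B, N, 1) normalizer
    return (qf @ kv) / z
```

As a usage example, `proj` could be `torch.randn(d, m)` and `gates` a learnable parameter initialized to ones.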
https://arxiv.org/abs/2507.08637
This study evaluates the recently proposed Document Attention Network (DAN) for extracting key-value information from Uruguayan birth certificates handwritten in Spanish. We investigate two annotation strategies for automatically transcribing handwritten documents, fine-tuning DAN with minimal training data and annotation effort. Experiments were conducted on two datasets containing the same images (201 scans of birth certificates written by more than 15 different writers) but annotated with different methods. Our findings indicate that normalized annotation is more effective for fields that can be standardized, such as dates and places of birth, whereas diplomatic annotation performs much better for fields containing names and surnames, which cannot be standardized.
https://arxiv.org/abs/2507.08636
Recent advances in generative AI have made the creation of speech deepfakes widely accessible, posing serious challenges to digital trust. To counter this, various speech deepfake detection strategies have been proposed, including Person-of-Interest (POI) approaches, which focus on identifying impersonations of specific individuals by modeling and analyzing their unique vocal traits. Despite their excellent performance, the existing methods offer limited granularity and lack interpretability. In this work, we propose a POI-based speech deepfake detection method that operates at the phoneme level. Our approach decomposes reference audio into phonemes to construct a detailed speaker profile. In inference, phonemes from a test sample are individually compared against this profile, enabling fine-grained detection of synthetic artifacts. The proposed method achieves comparable accuracy to traditional approaches while offering superior robustness and interpretability, key aspects in multimedia forensics. By focusing on phoneme analysis, this work explores a novel direction for explainable, speaker-centric deepfake detection.
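A minimal sketch of the phoneme-level comparison described above: reference audio is decomposed into phoneme segments, per-phoneme embeddings are averaged into a speaker profile, and test phonemes are scored against that profile. The embedding extractor and the cosine-similarity scoring are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def build_profile(ref_segments):
    """Average embedding per phoneme from a speaker's reference audio.
    ref_segments: list of (phoneme_label, embedding) pairs, where embeddings come
    from any speech encoder (an assumption for this sketch)."""
    buckets = {}
    for ph, emb in ref_segments:
        buckets.setdefault(ph, []).append(emb)
    return {ph: np.mean(embs, axis=0) for ph, embs in buckets.items()}

def phoneme_scores(test_segments, profile):
    """Cosine similarity of each test phoneme against the speaker profile;
    low scores flag phonemes that look synthetic for this speaker."""
    scores = []
    for ph, emb in test_segments:
        if ph not in profile:
            continue
        ref = profile[ph]
        sim = float(np.dot(emb, ref) / (np.linalg.norm(emb) * np.linalg.norm(ref) + 1e-8))
        scores.append((ph, sim))
    return scores
```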
https://arxiv.org/abs/2507.08626
This paper introduces the Ambient Intelligence Rehabilitation Support (AIRS) framework, an advanced artificial intelligence-based solution tailored for home rehabilitation environments. AIRS integrates cutting-edge technologies, including Real-Time 3D Reconstruction (RT-3DR), intelligent navigation, and large Vision-Language Models (VLMs), to create a comprehensive system for machine-guided physical rehabilitation. The general AIRS framework is demonstrated in rehabilitation scenarios following total knee replacement (TKR), utilizing a database of 263 video recordings for evaluation. A smartphone is employed within AIRS to perform RT-3DR of living spaces, and a body-matched avatar provides visual feedback about the exercise. This avatar is essential for (a) optimizing exercise configurations, including camera placement, patient positioning, and initial poses, and (b) addressing privacy concerns and promoting compliance with the AI Act. The system guides users through the recording process to ensure the collection of properly recorded videos. AIRS employs two feedback mechanisms: (i) visual 3D feedback, enabling direct comparisons between prerecorded clinical exercises and patient home recordings, and (ii) VLM-generated feedback, providing detailed explanations and corrections for exercise errors. The framework also supports people with visual and hearing impairments and features a modular design that can be adapted to broader rehabilitation contexts. AIRS software components are available for further use and customization.
https://arxiv.org/abs/2507.08624
Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have made analyzing and extracting argument semantics more efficient than traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but research and results on how these models perform on publicly available argument classification datasets are still lacking. This paper presents a study of a selection of LLMs, using diverse datasets such as this http URL and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thought algorithm. The results indicate that ChatGPT-4o outperforms the others on the argument classification benchmarks. Among models with reasoning capabilities, DeepSeek-R1 shows its superiority. However, despite their superiority, GPT-4o and DeepSeek-R1 still make errors; the most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompting algorithms. The work also exposes some weaknesses of known prompting algorithms in argument analysis while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.
https://arxiv.org/abs/2507.08621
Early-stage engineering design involves complex, iterative reasoning, yet existing large language model (LLM) workflows struggle to maintain task continuity and generate executable models. We evaluate whether a structured multi-agent system (MAS) can manage requirements extraction, functional decomposition, and simulator code generation more effectively than a simpler two-agent system (2AS). The target application is a solar-powered water filtration system as described in a cahier des charges. We introduce the Design-State Graph (DSG), a JSON-serializable representation that bundles requirements, physical embodiments, and Python-based physics models into graph nodes. A nine-role MAS iteratively builds and refines the DSG, while the 2AS collapses the process into a Generator-Reflector loop. Both systems run a total of 60 experiments (2 LLMs - Llama 3.3 70B vs reasoning-distilled DeepSeek R1 70B x 2 agent configurations x 3 temperatures x 5 seeds). We report JSON validity, requirement coverage, embodiment presence, code compatibility, workflow completion, runtime, and graph size. Across all runs, both MAS and 2AS maintained perfect JSON integrity and embodiment tagging. Requirement coverage remained minimal (less than 20\%). Code compatibility peaked at 100\% under specific 2AS settings but averaged below 50\% for MAS. Only the reasoning-distilled model reliably flagged workflow completion. Powered by DeepSeek R1 70B, the MAS generated more granular DSGs (average 5-6 nodes), whereas the 2AS mode-collapsed. Structured multi-agent orchestration enhanced design detail, and the reasoning-distilled LLM improved completion rates, yet low requirement coverage and fidelity gaps in code generation persisted.
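A hypothetical sketch of what a JSON-serializable DSG node could look like, bundling a requirement, a physical embodiment, and a Python physics model into one graph node; the exact schema and the example values below are assumptions, not reproduced from the paper.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DSGNode:
    """One Design-State Graph node: a requirement, the embodiment chosen to
    satisfy it, and a Python physics model stored as source text."""
    node_id: str
    requirement: str
    embodiment: str
    physics_model: str                               # Python source for the governing relation
    children: list = field(default_factory=list)     # ids of refined sub-nodes

# Hypothetical node for the solar-powered water filtration case study.
node = DSGNode(
    node_id="n1",
    requirement="Deliver at least 50 L/day of filtered water",
    embodiment="PV panel driving a low-pressure membrane pump",
    physics_model="def flow_rate(panel_w, pump_eff, head_m):\n    return panel_w * pump_eff / (9.81 * head_m)",
)

print(json.dumps(asdict(node), indent=2))            # JSON-serializable, as the DSG requires
```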
https://arxiv.org/abs/2507.08619
Collaborative fairness is a crucial challenge in federated learning. However, existing approaches often overlook a practical yet complex form of heterogeneity: imbalanced covariate shift. We provide a theoretical analysis of this setting, which motivates the design of FedAKD (Federated Asynchronous Knowledge Distillation), a simple yet effective approach that balances accurate prediction with collaborative fairness. FedAKD consists of client and server updates. In the client update, we introduce a novel asynchronous knowledge distillation strategy based on our preliminary analysis, which reveals that while correctly predicted samples exhibit similar feature distributions across clients, incorrectly predicted samples show significant variability. This suggests that imbalanced covariate shift primarily arises from misclassified samples. Leveraging this insight, our approach first applies traditional knowledge distillation to update client models while keeping the global model fixed. Next, we select correctly predicted high-confidence samples and update the global model using these samples while keeping the client models fixed. The server update simply aggregates all client models. We further provide a theoretical proof of FedAKD's convergence. Experimental results on public datasets (FashionMNIST and CIFAR10) and a real-world Electronic Health Records (EHR) dataset demonstrate that FedAKD significantly improves collaborative fairness, enhances predictive accuracy, and fosters client participation even under highly heterogeneous data distributions.
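A minimal PyTorch sketch of the two asynchronous steps described above: the client distills from the frozen global model, then the global model is updated only on the client's correctly predicted, high-confidence samples. Temperatures, thresholds, and the exact global-update loss are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def client_update(client_model, global_model, loader, optimizer, T=2.0, alpha=0.5):
    """Client step: supervised loss plus distillation from the fixed global model."""
    global_model.eval()
    for x, y in loader:
        logits_c = client_model(x)
        with torch.no_grad():
            logits_g = global_model(x)
        kd = F.kl_div(F.log_softmax(logits_c / T, dim=-1),
                      F.softmax(logits_g / T, dim=-1),
                      reduction="batchmean") * T * T
        loss = F.cross_entropy(logits_c, y) + alpha * kd
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def update_global(global_model, client_model, loader, optimizer, tau=0.9):
    """Reverse step: update the global model only on the client's correctly
    predicted, high-confidence samples, keeping the client model fixed."""
    client_model.eval()
    for x, y in loader:
        with torch.no_grad():
            probs = F.softmax(client_model(x), dim=-1)
            conf, pred = probs.max(dim=-1)
            keep = (pred == y) & (conf > tau)
        if keep.sum() == 0:
            continue
        # Loss choice is an assumption; here we fit the selected hard labels.
        loss = F.cross_entropy(global_model(x[keep]), y[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```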
https://arxiv.org/abs/2507.08617
Image captioning is an important problem for developing various AI systems, and these tasks require large volumes of annotated images to train the models. Since all existing labelled datasets are already used for training large Vision Language Models (VLMs), it becomes challenging to improve their performance further. Considering this, it is essential to study unsupervised image captioning, which remains relatively under-explored. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a multi-agent reinforcement learning game. The proposed method consists of two agents, a 'speaker' and a 'listener', with the objective of learning a strategy for communicating in natural language. We train the agents in the cooperative common-reward setting using the GRPO algorithm and show that improvement in image captioning performance emerges as a consequence of the agents learning to play the game. Using a pre-trained VLM as the 'speaker' and a Large Language Model (LLM) for language understanding in the 'listener', we achieve a $46$ BLEU score after fine-tuning with LoGIC without additional labels, a $2$-point absolute improvement over the $44$ BLEU score of the vanilla VLM. Additionally, we replace the VLM in the 'speaker' with lightweight components, (i) a ViT for image perception and (ii) GPT2 for language generation, and train them from scratch using LoGIC, obtaining a $31$ BLEU score in the unsupervised setting, a $10$-point advantage over existing unsupervised image-captioning methods.
https://arxiv.org/abs/2507.08610
Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but degrade significantly under \textit{temporally evolving distribution shifts} common in real-world scenarios (e.g., gradual illumination or seasonal changes). Existing continual test-time adaptation (CTTA) methods are typically built around sudden and severe distribution shifts and neglect temporal continuity, leading to three core defects: limited memory cache restricts long-range distribution modeling, causing catastrophic forgetting; entropy-based confidence becomes unreliable under temporal drift, worsening error accumulation; and static visual representations misalign with evolving inputs. We formalize this practical problem as \textit{Continual-Temporal Test-Time Adaptation (CT-TTA)}, where test distributions evolve gradually over time. To address it, we propose \textit{BayesTTA}, a Bayesian adaptation framework that enforces temporally consistent predictions and dynamically aligns visual representations. Specifically, BayesTTA incrementally estimates class-conditional Gaussian mixture distributions without storing raw data, adaptively selects covariance structures through statistical hypothesis testing, and performs calibrated inference using Gaussian discriminant analysis (GDA). These calibrated predictions supervise self-paced adaptation of normalization layers, ensuring efficient and stable representation alignment. We establish a comprehensive CT-TTA benchmark across four temporally evolving datasets and further evaluate generalization on ten standard TTA datasets. Extensive experiments show that BayesTTA consistently outperforms state-of-the-art methods, achieving significant gains while maintaining efficiency. Code is available at \href{this https URL}{this https URL}.
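A simplified sketch of the incremental Gaussian estimation and GDA inference described above, assuming a shared diagonal covariance (the paper selects the covariance structure by hypothesis testing, which is omitted here). Only running statistics are kept; no raw data is stored.

```python
import numpy as np

class IncrementalGDA:
    """Running class-conditional Gaussian estimates with a shared diagonal
    covariance, used for Gaussian discriminant analysis at test time."""

    def __init__(self, num_classes, dim):
        self.n = np.zeros(num_classes)
        self.mean = np.zeros((num_classes, dim))
        self.m2 = np.zeros((num_classes, dim))      # running sum of squared deviations

    def update(self, feat, label):
        """Welford-style update with a single feature vector (no raw data stored)."""
        self.n[label] += 1
        delta = feat - self.mean[label]
        self.mean[label] += delta / self.n[label]
        self.m2[label] += delta * (feat - self.mean[label])

    def predict_log_probs(self, feats):
        """Calibrated class log-posteriors for a batch of features (N, D)."""
        var = self.m2.sum(axis=0) / np.maximum(self.n.sum() - len(self.n), 1.0) + 1e-6
        diff = feats[:, None, :] - self.mean[None, :, :]              # (N, C, D)
        log_lik = -0.5 * ((diff ** 2) / var).sum(-1)
        log_prior = np.log(self.n + 1e-6) - np.log(self.n.sum() + 1e-6)
        logits = log_lik + log_prior
        return logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
```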
https://arxiv.org/abs/2507.08607
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
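A small sketch of what relative polar coordinates between text blocks could look like: each pair of block centers yields a distance and an angle, which can be bucketized to index learned attention biases. The bucket counts and maximum distance below are illustrative assumptions, not the paper's configuration.

```python
import math

def relative_polar(block_a, block_b, num_dist_buckets=32, num_angle_buckets=16, max_dist=1000.0):
    """Relative position of text block b with respect to block a in polar form.

    block_*: (x_center, y_center) of a text block's bounding box, in pixels.
    Returns bucketized (distance, angle) ids that could index learned attention biases.
    """
    dx = block_b[0] - block_a[0]
    dy = block_b[1] - block_a[1]
    r = math.hypot(dx, dy)
    theta = math.atan2(dy, dx)                                        # angle in (-pi, pi]
    r_bucket = min(int(r / max_dist * num_dist_buckets), num_dist_buckets - 1)
    a_bucket = int((theta + math.pi) / (2 * math.pi) * num_angle_buckets) % num_angle_buckets
    return r_bucket, a_bucket
```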
https://arxiv.org/abs/2507.08606
End-to-end Large Speech Language Models~(\textbf{LSLMs}) demonstrate strong potential in response latency and speech comprehension capabilities, showcasing general intelligence across speech understanding tasks. However, the ability to follow speech instructions has not been fully realized due to the lack of datasets and heavily biased training tasks. Leveraging rich ASR datasets, previous approaches have used Large Language Models~(\textbf{LLMs}) to continue the linguistic content of speech and thereby construct speech instruction datasets. Yet, because of the gap between LLM-generated results and real human responses, these continuation methods further amplify such shortcomings. Given the high cost of collecting and annotating speech instruction datasets by humans, using speech synthesis to construct large-scale speech instruction datasets has become a balanced and robust alternative. Although modern Text-To-Speech~(\textbf{TTS}) models have achieved near-human-level synthesis quality, it is challenging to appropriately convert out-of-distribution text instructions to speech due to the limitations of the training data distribution in TTS models. To address this issue, we propose a query rewriting framework with multi-LLM knowledge fusion, employing multiple agents to annotate and validate the synthesized speech, making it possible to construct high-quality speech instruction datasets without relying on human annotation. Experiments show that this method can transform text instructions into distributions more suitable for TTS-based speech synthesis through zero-shot rewriting, increasing data usability from 72\% to 93\%. It also demonstrates unique advantages in rewriting tasks that require complex knowledge and context-related abilities.
https://arxiv.org/abs/2507.08603
Proto-personas are commonly used during early-stage Product Discovery, such as Lean Inception, to guide product definition and stakeholder alignment. However, the manual creation of proto-personas is often time-consuming, cognitively demanding, and prone to bias. In this paper, we propose and empirically investigate a prompt engineering-based approach to generate proto-personas with the support of Generative AI (GenAI). Our goal is to evaluate the approach in terms of efficiency, effectiveness, user acceptance, and the empathy elicited by the generated personas. We conducted a case study with 19 participants embedded in a real Lean Inception, employing a qualitative and quantitative methods design. The results reveal the approach's efficiency by reducing time and effort and improving the quality and reusability of personas in later discovery phases, such as Minimum Viable Product (MVP) scoping and feature refinement. While acceptance was generally high, especially regarding perceived usefulness and ease of use, participants noted limitations related to generalization and domain specificity. Furthermore, although cognitive empathy was strongly supported, affective and behavioral empathy varied significantly across participants. These results contribute novel empirical evidence on how GenAI can be effectively integrated into software Product Discovery practices, while also identifying key challenges to be addressed in future iterations of such hybrid design processes.
https://arxiv.org/abs/2507.08594
Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations: continuous, high-dimensional image features versus discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchors that facilitate cross-modal alignment. Our approach combines: (1) instance-level alignment, which fuses visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) prototype-level alignment, which clusters VSD to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM. The code and model checkpoints are available at this https URL.
https://arxiv.org/abs/2507.08590
Large language models (LLMs) are increasingly deployed in agentic frameworks, in which prompts trigger complex tool-based analysis in pursuit of a goal. While these frameworks have shown promise across multiple domains including in finance, they typically lack a principled model-building step, relying instead on sentiment- or trend-based analysis. We address this gap by developing an agentic system that uses LLMs to iteratively discover stochastic differential equations for financial time series. These models generate risk metrics which inform daily trading decisions. We evaluate our system in both traditional backtests and using a market simulator, which introduces synthetic but causally plausible price paths and news events. We find that model-informed trading strategies outperform standard LLM-based agents, improving Sharpe ratios across multiple equities. Our results show that combining LLMs with agentic model discovery enhances market risk estimation and enables more profitable trading decisions.
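As a concrete illustration of the kind of model such an agent might propose, the sketch below simulates geometric Brownian motion, extracts a simple risk metric, and computes a Sharpe ratio. The specific SDEs discovered by the system and its actual risk metrics are not specified here; this is an assumed example.

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, horizon_days, n_paths, seed=0):
    """Closed-form log-space simulation of geometric Brownian motion
    dS = mu*S dt + sigma*S dW, one candidate SDE an agent might propose."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / 252.0
    log_steps = rng.normal((mu - 0.5 * sigma**2) * dt, sigma * np.sqrt(dt),
                           size=(n_paths, horizon_days))
    return s0 * np.exp(np.cumsum(log_steps, axis=1))

def risk_metrics(paths, s0):
    """Simple risk summary that could gate a daily position."""
    terminal_ret = paths[:, -1] / s0 - 1.0
    var_95 = np.percentile(terminal_ret, 5)          # 95% value-at-risk, expressed as a return
    return {"expected_return": float(terminal_ret.mean()), "var_95": float(var_95)}

def sharpe_ratio(daily_returns, rf=0.0):
    """Annualized Sharpe ratio of a strategy's daily returns."""
    excess = np.asarray(daily_returns) - rf / 252.0
    return float(np.sqrt(252.0) * excess.mean() / (excess.std() + 1e-12))
```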
https://arxiv.org/abs/2507.08584
Millions of biological sample records collected over the last few centuries and archived in natural history collections are un-georeferenced. Georeferencing the complex locality descriptions associated with these collection samples is a highly labour-intensive task that collection agencies struggle with. None of the existing automated methods exploits maps, which are an essential tool for georeferencing complex spatial relations. We present preliminary experiments and results of a novel method that exploits the multi-modal capabilities of recent Large Multi-Modal Models (LMMs). This method enables the model to visually contextualize the spatial relations it reads in the locality description. We use a grid-based approach to adapt these auto-regressive models to this task in a zero-shot setting. Our experiments, conducted on a small manually annotated dataset, show impressive results for our approach ($\sim$1 km average distance error) compared to uni-modal georeferencing with Large Language Models and existing georeferencing tools. The paper also discusses the findings of the experiments in light of an LMM's ability to comprehend fine-grained maps. Motivated by these results, a practical framework is proposed to integrate this method into a georeferencing workflow.
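One way a grid-based approach could be wired up: the map shown to the LMM carries a labelled grid, the model names a cell, and the cell label is converted back to coordinates. The labelling scheme and bounding-box convention below are assumptions for illustration, not the paper's exact protocol.

```python
def cell_to_latlon(cell_label, bbox, n_rows, n_cols):
    """Convert a grid cell label such as 'C4', chosen by the LMM, into the
    cell-centre latitude/longitude within the map's bounding box.

    bbox: (min_lat, min_lon, max_lat, max_lon) of the map shown to the model.
    Letters index rows from the top, numbers index columns from the left
    (an assumed labelling scheme).
    """
    row = ord(cell_label[0].upper()) - ord("A")
    col = int(cell_label[1:]) - 1
    min_lat, min_lon, max_lat, max_lon = bbox
    lat = max_lat - (row + 0.5) * (max_lat - min_lat) / n_rows
    lon = min_lon + (col + 0.5) * (max_lon - min_lon) / n_cols
    return lat, lon
```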
https://arxiv.org/abs/2507.08575
This study aims to develop a novel multi-modal fusion framework for brain tumor segmentation that integrates spatial-language-vision information through bidirectional interactive attention mechanisms to improve segmentation accuracy and boundary delineation. Methods: We propose two core components: a Multi-modal Semantic Fusion Adapter (MSFA), which integrates 3D MRI data with clinical text descriptions through hierarchical semantic decoupling, and Bidirectional Interactive Visual-semantic Attention (BIVA), which enables iterative information exchange between modalities. The framework was evaluated on the BraTS 2020 dataset comprising 369 multi-institutional MRI scans. Results: The proposed method achieved an average Dice coefficient of 0.8505 and a 95% Hausdorff distance of 2.8256 mm across the enhancing tumor, tumor core, and whole tumor regions, outperforming state-of-the-art methods including SCAU-Net, CA-Net, and 3D U-Net. Ablation studies confirmed the critical contributions of the semantic and spatial modules to boundary precision. Conclusion: Multi-modal semantic fusion combined with bidirectional interactive attention significantly enhances brain tumor segmentation performance, establishing new paradigms for integrating clinical knowledge into medical image analysis.
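A simplified stand-in for the bidirectional interactive attention: one round in which text tokens attend to (flattened) visual tokens and vice versa, which can be iterated. Dimensions, normalization, and head counts are assumptions, not the paper's exact BIVA design.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """One round of vision-to-text and text-to-vision attention, iterated to let
    the two modalities exchange information (a simplified stand-in for BIVA)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, dim) flattened 3D MRI features; txt_tokens: (B, Nt, dim)
        txt_upd, _ = self.v2t(txt_tokens, vis_tokens, vis_tokens)   # text attends to vision
        txt_tokens = self.norm_t(txt_tokens + txt_upd)
        vis_upd, _ = self.t2v(vis_tokens, txt_tokens, txt_tokens)   # vision attends to text
        vis_tokens = self.norm_v(vis_tokens + vis_upd)
        return vis_tokens, txt_tokens
```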
https://arxiv.org/abs/2507.08574
When inverse kinematics (IK) is adopted to control robotic arms in manipulation tasks, there is often a discrepancy between the end effector (EE) position of the robot model in the simulator and the physical EE in reality. In most robotic scenarios with sim-to-real transfer, we have information about joint positions in both simulation and reality, but the EE position is only available in simulation. We developed a novel method to overcome this difficulty based on haptic feedback calibration, using a touchscreen in front of the robot that provides information on the EE position in the real environment. During the calibration procedure, the robot touches specific points on the screen, and the information is stored. In the next stage, we build a transformation function from this data, based on linear transformations and neural networks, that can output all missing variables from any partial input (simulated/real joint/EE positions). Our results demonstrate that a fully nonlinear neural network model performs best, significantly reducing positioning errors.
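A minimal sketch of the nonlinear calibration model in one direction of the mapping (simulated joints and simulated EE position to the real EE position measured on the touchscreen); the paper's transformation function covers arbitrary partial inputs, which this sketch does not, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CalibrationNet(nn.Module):
    """Fully nonlinear mapping from simulated joint angles and simulated EE
    position to the real EE position measured on the touchscreen."""

    def __init__(self, n_joints, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_joints + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, sim_joints, sim_ee):
        return self.net(torch.cat((sim_joints, sim_ee), dim=-1))

def fit(model, sim_joints, sim_ee, real_ee, epochs=200, lr=1e-3):
    """Supervised fit on the stored touchscreen calibration points."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(sim_joints, sim_ee), real_ee)
        loss.backward()
        opt.step()
    return model
```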
https://arxiv.org/abs/2507.08572