We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although precisely modeling fine-grained keyboard interactions remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.
https://arxiv.org/abs/2507.08800
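To make the RNN-plus-diffusion split concrete, here is a minimal sketch under assumed shapes and module names (all hypothetical; not the paper's implementation): a GRU digests mouse/keyboard event vectors into a latent computer state, and a small denoiser predicts frame-latent noise conditioned on that state.

```python
# Illustrative sketch of an RNN state tracker + conditional diffusion denoiser.
# Shapes, sizes, and module names are assumptions, not NeuralOS's actual code.
import torch
import torch.nn as nn

class StateTracker(nn.Module):
    """GRU over (mouse, click, key) event vectors -> latent computer state."""
    def __init__(self, event_dim=16, state_dim=256):
        super().__init__()
        self.rnn = nn.GRU(event_dim, state_dim, batch_first=True)

    def forward(self, events):          # events: (B, T, event_dim)
        _, h = self.rnn(events)         # h: (1, B, state_dim)
        return h.squeeze(0)             # (B, state_dim)

class ConditionalDenoiser(nn.Module):
    """Predicts the noise in a frame latent, conditioned on state and timestep."""
    def __init__(self, latent_dim=1024, state_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + state_dim + 1, 2048), nn.SiLU(),
            nn.Linear(2048, latent_dim),
        )

    def forward(self, noisy_latent, state, t):
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([noisy_latent, state, t_feat], dim=-1))

# One training step: add noise to the target frame latent, predict it back.
tracker, denoiser = StateTracker(), ConditionalDenoiser()
events = torch.randn(4, 8, 16)          # batch of 8-step input-event sequences
frame = torch.randn(4, 1024)            # target frame latents
t = torch.randint(0, 1000, (4,))
noise = torch.randn_like(frame)
noisy = frame + noise                   # (noise schedule omitted for brevity)
loss = nn.functional.mse_loss(denoiser(noisy, tracker(events), t), noise)
loss.backward()
```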
Digital Twins (DT) have the potential to transform traffic management and operations by creating dynamic, virtual representations of transportation systems that sense conditions, analyze operations, and support decision-making. A key component for DT of the transportation system is dynamic roadway geometry sensing. However, existing approaches often rely on static maps or costly sensors, limiting scalability and adaptability. Additionally, large-scale DTs that collect and analyze data from multiple sources face challenges in privacy, communication, and computational efficiency. To address these challenges, we introduce Geo-ORBIT (Geometrical Operational Roadway Blueprint with Integrated Twin), a unified framework that combines real-time lane detection, DT synchronization, and federated meta-learning. At the core of Geo-ORBIT is GeoLane, a lightweight lane detection model that learns lane geometries from vehicle trajectory data using roadside cameras. We extend this model through Meta-GeoLane, which learns to personalize detection parameters for local entities, and FedMeta-GeoLane, a federated learning strategy that ensures scalable and privacy-preserving adaptation across roadside deployments. Our system is integrated with CARLA and SUMO to create a high-fidelity DT that renders highway scenarios and captures traffic flows in real-time. Extensive experiments across diverse urban scenes show that FedMeta-GeoLane consistently outperforms baseline and meta-learning approaches, achieving lower geometric error and stronger generalization to unseen locations while drastically reducing communication overhead. This work lays the foundation for flexible, context-aware infrastructure modeling in DTs. The framework is publicly available at this https URL.
https://arxiv.org/abs/2507.08743
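The federated meta-learning piece can be illustrated with a short sketch. Everything here is an assumption for illustration (a Reptile-flavored averaging rule, a plain regression loss standing in for geometric error); FedMeta-GeoLane's actual update may differ.

```python
# Sketch of a federated meta-update across roadside sites: each site adapts a
# copy of the global lane model on local trajectory data, and the server moves
# the global weights toward the averaged adapted weights.
import copy
import torch

def local_adapt(model, batches, lr=1e-3, steps=5):
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in batches[:steps]:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)  # geometric-error proxy
        loss.backward()
        opt.step()
    return model.state_dict()

def federated_round(global_model, clients, meta_lr=0.5):
    states = [local_adapt(global_model, c) for c in clients]
    g = global_model.state_dict()
    for k in g:
        mean_k = torch.stack([s[k] for s in states]).mean(0)
        g[k] = g[k] + meta_lr * (mean_k - g[k])   # move toward adapted average
    global_model.load_state_dict(g)

# Toy usage: 3 sites, each with a few (features -> lane-geometry) batches.
model = torch.nn.Linear(8, 4)
clients = [[(torch.randn(16, 8), torch.randn(16, 4)) for _ in range(5)]
           for _ in range(3)]
for _ in range(10):
    federated_round(model, clients)
```

Only the adapted weights cross the network in this scheme, never the raw trajectory data, which is the privacy and communication argument the abstract makes.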
Hierarchical land cover and land use (LCLU) classification aims to assign pixel-wise labels with multiple levels of semantic granularity to remote sensing (RS) imagery. However, existing deep learning-based methods face two major challenges: 1) They predominantly adopt a flat classification paradigm, which limits their ability to generate end-to-end multi-granularity hierarchical predictions aligned with tree-structured hierarchies used in practice. 2) Most cross-domain studies focus on performance degradation caused by sensor or scene variations, with limited attention to transferring LCLU models to cross-domain tasks with heterogeneous hierarchies (e.g., LCLU to crop classification). These limitations hinder the flexibility and generalization of LCLU models in practical applications. To address these challenges, we propose HieraRS, a novel hierarchical interpretation paradigm that enables multi-granularity predictions and supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies. We introduce the Bidirectional Hierarchical Consistency Constraint Mechanism (BHCCM), which can be seamlessly integrated into mainstream flat classification models to generate hierarchical predictions, while improving both semantic consistency and classification accuracy. Furthermore, we present TransLU, a dual-branch cross-domain transfer framework comprising two key components: Cross-Domain Knowledge Sharing (CDKS) and Cross-Domain Semantic Alignment (CDSA). TransLU supports dynamic category expansion and facilitates the effective adaptation of LCLU models to heterogeneous hierarchies. In addition, we construct MM-5B, a large-scale multi-modal hierarchical land use dataset featuring pixel-wise annotations. The code and MM-5B dataset will be released at: this https URL.
https://arxiv.org/abs/2507.08741
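A hierarchical consistency constraint of the kind BHCCM enforces can be sketched as follows, assuming a two-level toy hierarchy and per-pixel coarse/fine heads; the paper's bidirectional mechanism is richer than this one-way aggregation.

```python
# Sketch: penalize disagreement between the coarse head's prediction and the
# fine head's prediction aggregated up the class tree. Illustrative only.
import torch
import torch.nn.functional as F

# child -> parent mapping for a toy 2-level hierarchy (5 fine, 2 coarse classes)
parent_of = torch.tensor([0, 0, 1, 1, 1])

def aggregate_fine_to_coarse(fine_logits, parent_of, n_coarse):
    fine_prob = fine_logits.softmax(dim=1)            # (B, C_fine, H, W)
    B, _, H, W = fine_prob.shape
    coarse = fine_prob.new_zeros(B, n_coarse, H, W)
    coarse.index_add_(1, parent_of, fine_prob)        # sum children per parent
    return coarse

def consistency_loss(coarse_logits, fine_logits, parent_of):
    agg = aggregate_fine_to_coarse(fine_logits, parent_of, coarse_logits.shape[1])
    return F.kl_div(coarse_logits.log_softmax(dim=1), agg, reduction="batchmean")

coarse_logits = torch.randn(2, 2, 32, 32, requires_grad=True)
fine_logits = torch.randn(2, 5, 32, 32, requires_grad=True)
loss = consistency_loss(coarse_logits, fine_logits, parent_of)
loss.backward()
```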
Inverse Reinforcement Learning (IRL) presents a powerful paradigm for learning complex robotic tasks from human demonstrations. However, most approaches make the assumption that expert demonstrations are available, which is often not the case. Those that allow for suboptimality in the demonstrations are not designed for long-horizon goals or adversarial tasks. Many desirable robot capabilities fall into one or both of these categories, thus highlighting a critical shortcoming in the ability of IRL to produce field-ready robotic agents. We introduce Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations (SPLASH), which advances the state-of-the-art in learning from suboptimal demonstrations to long-horizon and adversarial settings. We empirically validate SPLASH on a maritime capture-the-flag task in simulation, and demonstrate real-world applicability with sim-to-real translation experiments on autonomous unmanned surface vehicles. We show that our proposed methods allow SPLASH to significantly outperform the state-of-the-art in reward learning from suboptimal demonstrations.
https://arxiv.org/abs/2507.08707
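The preference-based core that SPLASH builds on can be sketched with a standard Bradley-Terry objective over trajectory segments (assumed shapes; the paper's hierarchical, long-horizon machinery is omitted).

```python
# Sketch of preference-based reward learning: a reward network is trained so
# that segments humans preferred receive higher summed reward.
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

def preference_loss(seg_a, seg_b, prefer_a):
    # seg_*: (B, T, obs_dim); a segment's return = sum of per-step rewards
    r_a = reward_net(seg_a).sum(dim=(1, 2))
    r_b = reward_net(seg_b).sum(dim=(1, 2))
    # Bradley-Terry: P(a preferred) = sigmoid(r_a - r_b)
    return nn.functional.binary_cross_entropy_with_logits(
        r_a - r_b, prefer_a.float())

seg_a, seg_b = torch.randn(8, 20, 10), torch.randn(8, 20, 10)
prefer_a = torch.randint(0, 2, (8,))     # 1 if annotator preferred segment a
loss = preference_loss(seg_a, seg_b, prefer_a)
loss.backward()
```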
Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework KELPS (Knowledge-Equation based Logical Processing System) to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as Deepseek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and codes are available in the supplementary materials.
https://arxiv.org/abs/2507.08665
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good, if not better, than attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features revealing of speaker identity.
https://arxiv.org/abs/2507.08660
Semantic segmentation relies on many dense pixel-wise annotations to achieve the best performance, but owing to the difficulty of obtaining accurate annotations for real world data, practitioners train on large-scale synthetic datasets. Unpaired image translation is one method used to address the ensuing domain gap by generating more realistic training data in low-data regimes. Current methods for unpaired image translation train generative adversarial networks (GANs) to perform the translation and enforce pixel-level semantic matching through cycle consistency. These methods do not guarantee that the semantic matching holds, posing a problem for semantic segmentation where performance is sensitive to noisy pixel labels. We propose a novel image translation method, Domain Adversarial Kernel Prediction Network (DA-KPN), that guarantees semantic matching between the synthetic label and translation. DA-KPN estimates pixel-wise input transformation parameters of a lightweight and simple translation function. To ensure the pixel-wise transformation is realistic, DA-KPN uses multi-scale discriminators to distinguish between translated and target samples. We show DA-KPN outperforms previous GAN-based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing.
https://arxiv.org/abs/2507.08554
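A sketch of the kernel-prediction idea, under assumptions: the "lightweight and simple translation function" is taken here to be a per-pixel affine color transform whose parameters a small network predicts; the multi-scale discriminators are omitted. Because the transform is pixel-aligned, the synthetic label stays valid for the translated image.

```python
# Illustrative DA-KPN-style translation: predict per-pixel (scale, bias) maps
# and apply them to the synthetic input, leaving pixel semantics untouched.
import torch
import torch.nn as nn

class ParamPredictor(nn.Module):
    """Predicts per-pixel (scale, bias) for each RGB channel: 6 maps."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 6, 3, padding=1),
        )

    def forward(self, x):
        p = self.net(x)
        scale, bias = p[:, :3], p[:, 3:]
        # keep the transform mild so outputs stay in a realistic range
        return 1.0 + 0.1 * torch.tanh(scale), 0.1 * torch.tanh(bias)

def translate(x, predictor):
    scale, bias = predictor(x)
    return (x * scale + bias).clamp(0, 1)   # pixel-aligned: labels unchanged

x = torch.rand(2, 3, 64, 64)                # synthetic images in [0, 1]
out = translate(x, ParamPredictor())        # same shape, same semantics
```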
Occluded person re-identification aims to retrieve holistic images based on occluded ones. Existing methods often rely on aligning visible body parts, applying occlusion augmentation, or complementing missing semantics using holistic images. However, they face challenges in handling diverse occlusion scenarios not seen during training and the issue of feature contamination from holistic images. To address these limitations, we propose Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation (OGFR), which simultaneously mitigates these challenges. OGFR adopts a teacher-student distillation architecture that effectively incorporates diverse occlusion patterns into feature representation while transferring the purified discriminative holistic knowledge from the holistic to the occluded branch through reinforced knowledge distillation. Specifically, an Occlusion-Aware Vision Transformer is designed to leverage learnable occlusion pattern embeddings to explicitly model such diverse occlusion types, thereby guiding occlusion-aware robust feature representation. Moreover, we devise a Feature Erasing and Purification Module within the holistic branch, in which an agent is employed to identify low-quality patch tokens of holistic images that contain noisy negative information via deep reinforcement learning, and substitute these patch tokens with learnable embedding tokens to avoid feature contamination and further excavate identity-related discriminative clues. Afterward, with the assistance of knowledge distillation, the student branch effectively absorbs the purified holistic knowledge to precisely learn robust representation regardless of the interference of occlusions.
https://arxiv.org/abs/2507.08520
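The token purification step can be sketched as below, with assumed shapes: per-token quality scores (produced in OGFR by a deep-RL agent) select the lowest-quality holistic patch tokens, which are swapped for a shared learnable embedding.

```python
# Illustrative token substitution for feature purification; the scoring agent
# itself is not shown, only the replacement it drives.
import torch
import torch.nn as nn

def purify_tokens(tokens, scores, learnable_token, k):
    # tokens: (B, N, D); scores: (B, N); replace the k lowest-scoring tokens
    idx = scores.topk(k, dim=1, largest=False).indices      # (B, k)
    purified = tokens.clone()
    purified.scatter_(
        1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)),
        learnable_token.expand(idx.size(0), k, -1))
    return purified

B, N, D = 2, 16, 32
tokens = torch.randn(B, N, D)              # holistic-branch patch tokens
scores = torch.rand(B, N)                  # per-token quality (agent output)
learnable = nn.Parameter(torch.zeros(1, 1, D))
out = purify_tokens(tokens, scores, learnable, k=4)
```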
Prompt learning facilitates the efficient adaptation of Vision-Language Models (VLMs) to various downstream tasks. However, it faces two significant challenges: (1) inadequate modeling of class embedding distributions for unseen instances, leading to suboptimal generalization on novel classes; (2) prevailing methodologies predominantly confine cross-modal alignment to the final output layer of vision and text encoders, which fundamentally limits their capacity to preserve topological consistency with pre-trained multi-modal embedding spaces. To this end, we introduce MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning), a novel paradigm designed for conditional prompt generation. MuGCP leverages Multi-modal Large Language Models (MLLMs) as conditional prompt learners to adaptively generate Semantic Conditional Prompts (SCP) that incorporate rich, fine-grained high-level semantic knowledge for image instances. To ensure effective alignment and interaction across the multi-modal space of Vision-Language Models (VLMs), we introduce the Attention Mutual-Guidance (AMG) module, which facilitates interactions between visual and semantic information. Through mutual guidance, the AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks. Additionally, we present a Multi-Prompt Fusion (MPF) mechanism that integrates SCP and VCP with contextual prompts, ensuring seamless coordination among the different prompts and enhancing the modeling of class embeddings and instance-specific knowledge. Our MuGCP outperforms existing state-of-the-art methods on 14 different datasets. The code will be made available after publication.
https://arxiv.org/abs/2507.08410
Artificial Intelligence/Machine Learning (AI/ML) has become the most certain and prominent feature of 6G mobile networks. Unlike 5G, where AI/ML was not natively integrated but rather an add-on feature over the existing architecture, 6G shall incorporate AI from the onset to address its complexity and support ubiquitous AI applications. Based on our extensive mobile network operation and standardization experience from 2G to 5G, this paper explores the design and standardization principles of AI-Native radio access networks (RAN) for 6G, with a particular focus on its critical Day 1 architecture, functionalities, and capabilities. We investigate the framework of AI-Native RAN and present its three essential capabilities to shed some light on the standardization direction; namely, AI-driven RAN processing/optimization/automation, reliable AI lifecycle management (LCM), and AI-as-a-Service (AIaaS) provisioning. We propose the standardization of AI-Native RAN, in particular its Day 1 features, including an AI-Native 6G RAN architecture. For validation, a large-scale field trial with over 5,000 5G-A base stations was conducted; with the proposed architecture and the supporting AI functions, it delivered significant improvements in average air-interface latency, root-cause identification, and network energy consumption. This paper aims to provide a Day 1 framework for 6G AI-Native RAN standardization design, balancing technical innovation with practical deployment.
https://arxiv.org/abs/2507.08403
This study investigates the potential of a multimodal large language model (LLM), specifically ChatGPT-4o, to perform human-like interpretations of traffic scenes using static dashcam images. Herein, we focus on three judgment tasks relevant to elderly driver assessments: evaluating traffic density, assessing intersection visibility, and recognizing stop signs. These tasks require contextual reasoning rather than simple object detection. Using zero-shot, few-shot, and multi-shot prompting strategies, we evaluated the performance of the model with human annotations serving as the reference standard. Evaluation metrics included precision, recall, and F1-score. Results indicate that prompt design considerably affects performance, with recall for intersection visibility increasing from 21.7% (zero-shot) to 57.0% (multi-shot). For traffic density, agreement increased from 53.5% to 67.6%. In stop-sign detection, the model demonstrated high precision (up to 86.3%) but lower recall (approximately 76.7%), indicating a conservative response tendency. Output stability analysis revealed that both humans and the model had difficulty interpreting structurally ambiguous scenes. However, the model's explanatory texts corresponded with its predictions, enhancing interpretability. These findings suggest that, with well-designed prompts, LLMs hold promise as supportive tools for scene-level driving risk assessments. Future studies should explore scalability using larger datasets, diverse annotators, and next-generation model architectures for elderly driver assessments.
https://arxiv.org/abs/2507.08367
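Scoring such a study reduces to comparing parsed model judgments against human annotations; a minimal sketch with made-up toy labels (the actual annotations and parsing pipeline are not public in this abstract):

```python
# Illustrative evaluation loop: binary per-image judgments (e.g., "intersection
# visibility poor: yes/no") from each prompting strategy, scored against the
# human reference standard. Labels below are toy values.
from sklearn.metrics import precision_score, recall_score, f1_score

human = [1, 0, 1, 1, 0, 1, 0, 0]            # reference standard
model_zero_shot = [0, 0, 1, 0, 0, 1, 0, 0]  # parsed LLM outputs per image
model_multi_shot = [1, 0, 1, 1, 0, 1, 1, 0]

for name, pred in [("zero-shot", model_zero_shot),
                   ("multi-shot", model_multi_shot)]:
    print(name,
          "P=%.3f" % precision_score(human, pred),
          "R=%.3f" % recall_score(human, pred),
          "F1=%.3f" % f1_score(human, pred))
```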
Considerable advancements have been achieved in SLAM methods tailored for structured environments, yet their robustness under challenging corner cases remains a critical limitation. Although multi-sensor fusion approaches integrating diverse sensors have shown promising performance improvements, the research community faces two key barriers: on one hand, the lack of standardized and configurable benchmarks that systematically evaluate SLAM algorithms under diverse degradation scenarios hinders comprehensive performance assessment; on the other hand, existing SLAM frameworks primarily focus on fusing a limited set of sensor types, without effectively addressing adaptive sensor selection strategies for varying environmental conditions. To bridge these gaps, we make three key contributions. First, we introduce the M3DGR dataset: a sensor-rich benchmark with systematically induced degradation patterns including visual challenge, LiDAR degeneracy, wheel slippage and GNSS denial. Second, we conduct a comprehensive evaluation of forty SLAM systems on M3DGR, providing critical insights into their robustness and limitations under challenging real-world conditions. Third, we develop a resilient modular multi-sensor fusion framework named Ground-Fusion++, which demonstrates robust performance by coupling GNSS, RGB-D, LiDAR, IMU (Inertial Measurement Unit) and wheel odometry. Codes and datasets are publicly available.
https://arxiv.org/abs/2507.08364
Craniofacial reconstruction in forensic science is crucial for identifying the victims of crimes and disasters. The objective is to map a given skull to its corresponding face in a corpus of faces with known identities using recent advancements in computer vision, such as deep learning. In this paper, we present a framework for identifying a person from the X-ray image of a skull using convolutional Siamese networks for cross-domain identity representation. Siamese networks are twin networks that share the same architecture and can be trained to discover a feature space where similar observations are grouped together and dissimilar observations are moved apart. To do this, the network is exposed to pairs of comparable and different data, and the Euclidean distance is minimized between similar pairs and maximized between dissimilar ones. Since pairs of skull and face images are difficult to obtain, we prepared our own dataset from 40 volunteers, collecting front and side skull X-ray images and optical face images for each. Experiments were conducted on this cross-domain dataset to train and validate the Siamese networks, and the results show satisfactory performance in identifying a person from a given skull.
https://arxiv.org/abs/2507.08329
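The training objective described above is the classic contrastive loss; a minimal sketch with an assumed shared embedding network (sizes are illustrative, and the real model would use convolutional towers rather than a single linear layer):

```python
# Contrastive (Hadsell et al.-style) objective over cross-domain pairs:
# Euclidean distance is pulled down for matching identities and pushed above
# a margin for non-matching ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128))  # shared twin

def contrastive_loss(x1, x2, same_identity, margin=1.0):
    d = F.pairwise_distance(embed(x1), embed(x2))
    pos = same_identity * d.pow(2)                         # pull similar together
    neg = (1 - same_identity) * F.relu(margin - d).pow(2)  # push dissimilar apart
    return (pos + neg).mean()

skulls = torch.randn(8, 1, 64, 64)         # X-ray skull images
faces = torch.randn(8, 1, 64, 64)          # optical face images
same = torch.randint(0, 2, (8,)).float()   # 1 if the pair shares an identity
loss = contrastive_loss(skulls, faces, same)
loss.backward()
```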
We present SurfDist, a convolutional neural network architecture for three-dimensional volumetric instance segmentation. SurfDist enables prediction of instances represented as closed surfaces composed of smooth parametric surface patches, specifically bicubic Bézier triangles. SurfDist is a modification of the popular model architecture StarDist-3D which breaks StarDist-3D's coupling of instance parameterization dimension and instance voxel resolution, and it produces predictions which may be upsampled to arbitrarily high resolutions without introduction of voxelization artifacts. For datasets with blob-shaped instances, common in biomedical imaging, SurfDist can outperform StarDist-3D with more compact instance parameterizations. We detail SurfDist's technical implementation and show one synthetic and one real-world dataset for which it outperforms StarDist-3D. These results demonstrate that interpretable instance surface models can be learned effectively alongside instance membership.
https://arxiv.org/abs/2507.08223
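For readers unfamiliar with the patch primitive, a cubic Bézier triangle (ten 3D control points, evaluated in barycentric coordinates via Bernstein polynomials) can be computed as below; SurfDist's exact parameterization may differ.

```python
# Evaluating a cubic Bezier triangle patch at barycentric (u, v, w), u+v+w = 1.
import numpy as np
from math import factorial

def bezier_triangle(control, u, v, w):
    """control: dict keyed by (i, j, k) with i+j+k = 3 -> 3D control point."""
    point = np.zeros(3)
    for (i, j, k), p in control.items():
        bern = (factorial(3) / (factorial(i) * factorial(j) * factorial(k))
                * u**i * v**j * w**k)          # Bernstein basis B^3_{ijk}
        point += bern * np.asarray(p, dtype=float)
    return point

# Toy patch: a control net lifted over a triangle in the xy-plane.
control = {(i, j, k): [i / 3, j / 3, 0.1 * i * j * k]
           for i in range(4) for j in range(4 - i)
           for k in [3 - i - j]}
assert len(control) == 10
print(bezier_triangle(control, 1/3, 1/3, 1/3))  # interior surface point
```

Because such a patch is a smooth closed-form surface, it can be sampled at any density, which is why predictions upsample without voxelization artifacts.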
Generative Large Language Models (LLMs) inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM, an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs. white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets: TriviaQA, GSM8K, and FactScore-Bio. The code is available at this https URL.
https://arxiv.org/abs/2507.08203
This paper develops a neural jamming phase diagram that interprets the emergence of consciousness in large language models as a critical phenomenon in high-dimensional disordered systems. By establishing analogies with jamming transitions in granular matter and other complex systems, we identify three fundamental control parameters governing the phase behavior of neural networks: temperature, volume fraction, and noise. The theory provides a unified physical explanation for empirical scaling laws in artificial intelligence, demonstrating how computational cooling, density optimization, and noise reduction collectively drive systems toward a critical jamming surface where generalized intelligence emerges. Remarkably, the same thermodynamic principles that describe conventional jamming transitions appear to underlie the emergence of consciousness in neural networks, evidenced by shared critical signatures including divergent correlation lengths and scale invariance. This work explains neural language models' critical scaling through jamming physics, suggesting consciousness is a jamming phase that intrinsically connects knowledge components via long-range correlations.
https://arxiv.org/abs/2507.08197
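The "divergent correlation lengths and scale invariance" invoked above take the standard critical-scaling form near a jamming point; an illustrative relation in the usual notation (not a result fitted in the paper):

```latex
% Illustrative jamming-style critical scaling (standard form, assumed here;
% not the paper's fitted relation): the correlation length diverges as the
% volume fraction approaches its critical value,
\[
  \xi \;\sim\; \lvert \varphi - \varphi_c \rvert^{-\nu},
  \qquad \varphi \to \varphi_c ,
\]
% with analogous power laws in the other control parameters (temperature,
% noise), so that observables collapse onto a single critical surface.
```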
As cyber-physical systems grow increasingly interconnected and spatially distributed, ensuring their resilience against evolving cyberattacks has become a critical priority. Spatio-temporal anomaly detection plays an important role in ensuring system security and operational integrity. However, current data-driven approaches, largely driven by black-box deep learning, face challenges in interpretability, adaptability to distribution shifts, and robustness under evolving system dynamics. In this paper, we advocate for a causal learning perspective that grounds detection in structural cause-effect relationships to advance anomaly detection in spatially distributed infrastructures. We identify and formalize three key directions: causal graph profiling, multi-view fusion, and continual causal graph learning, each offering distinct advantages in uncovering dynamic cause-effect structures across time and space. Drawing on real-world insights from systems such as water treatment infrastructures, we illustrate how causal models provide early warning signals and root cause attribution, addressing the limitations of black-box detectors. Looking ahead, we outline a future research agenda centered on multi-modal, generative-AI-driven, and scalable adaptive causal frameworks. Our objective is to lay a new research trajectory toward scalable, adaptive, explainable, and spatially grounded anomaly detection systems. We hope to inspire a paradigm shift in cybersecurity research, promoting causality-driven approaches to address evolving threats in interconnected infrastructures.
https://arxiv.org/abs/2507.08177
Visually impaired people face significant challenges in their day-to-day commutes in the urban cities of Bangladesh due to the vast number of obstructions on every path. With many injuries occurring through road accidents on a daily basis, it is paramount to develop a system that can alert the visually impaired to nearby objects in advance. To overcome this issue, a novel alert system is proposed in this research to assist the visually impaired in commuting through these busy streets without colliding with any objects. The proposed system can alert the individual to objects present at a close distance. It utilizes transfer learning to train models for depth estimation and object detection, and combines both models to introduce a novel system. The models are optimized through quantization techniques to make them lightweight and efficient, allowing them to be easily deployed on embedded systems. The proposed solution achieves a lightweight, real-time depth estimation and object detection model with an mAP50 of 0.801.
https://arxiv.org/abs/2507.08165
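The quantization step can be illustrated with PyTorch's post-training dynamic quantization; the abstract does not name the authors' toolchain, so this is only one plausible instantiation:

```python
# Post-training dynamic quantization: Linear layers get int8 weights, keeping
# the same call interface while shrinking the model for embedded deployment.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
assert quantized(x).shape == (1, 10)   # same interface after quantization
```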
The emergence of large language models (LLMs) and agentic systems is enabling autonomous 6G networks with advanced intelligence, including self-configuration, self-optimization, and self-healing. However, the current implementation of individual intelligence tasks necessitates isolated knowledge retrieval pipelines, resulting in redundant data flows and inconsistent interpretations. Inspired by the service model unification effort in Open-RAN (to support interoperability and vendor diversity), we propose KP-A: a unified Network Knowledge Plane specifically designed for Agentic network intelligence. By decoupling network knowledge acquisition and management from intelligence logic, KP-A streamlines development and reduces maintenance complexity for intelligence engineers. By offering an intuitive and consistent knowledge interface, KP-A also enhances interoperability for the network intelligence agents. We demonstrate KP-A in two representative intelligence tasks: live network knowledge Q&A and edge AI service orchestration. All implementation artifacts have been open-sourced to support reproducibility and future standardization efforts.
https://arxiv.org/abs/2507.08164
Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
https://arxiv.org/abs/2507.07994
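The prototypical setup can be sketched in a few lines, with assumed embedding shapes; the grid-based locator and the prototypical domain adaptation are not shown:

```python
# Prototypical classification: average support embeddings per keypoint class
# into prototypes, then assign each query embedding to its nearest prototype.
import torch

def prototypes(support_emb, labels, n_classes):
    # support_emb: (N, D); labels: (N,) keypoint-class ids
    return torch.stack([support_emb[labels == c].mean(0)
                        for c in range(n_classes)])       # (n_classes, D)

def classify(query_emb, protos):
    d = torch.cdist(query_emb, protos)                    # (Q, n_classes)
    return d.argmin(dim=1)

support = torch.randn(30, 64)                  # sketch embeddings (assumed dim)
labels = torch.arange(3).repeat_interleave(10) # 3 keypoint classes, 10 shots
query = torch.randn(5, 64)
pred = classify(query, prototypes(support, labels, 3))
```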