Generative diffusion models (GDMs) have recently shown great success in synthesizing multimedia signals with high perceptual quality, enabling highly efficient semantic communications in future wireless networks. In this paper, we develop an intent-aware generative semantic multicasting framework utilizing pre-trained diffusion models. In the proposed framework, the transmitter decomposes the source signal into multiple semantic classes based on the multi-user intent, i.e., each user is assumed to be interested in the details of only a subset of the semantic classes. The transmitter then sends each user only its intended classes, and multicasts a highly compressed semantic map to all users over shared wireless resources, allowing them to locally synthesize the remaining (non-intended) classes using pre-trained diffusion models. The signal retrieved at each user is thereby partially reconstructed and partially synthesized from the received semantic map. This improves utilization of the wireless resources while better preserving the privacy of the non-intended classes. We design a communication/computation-aware scheme for per-class adaptation of the communication parameters, such as the transmission power and compression rate, to minimize the total latency of retrieving signals at multiple receivers, tailored to the prevailing channel conditions as well as the users' reconstruction/synthesis distortion/perception requirements. Simulation results demonstrate significantly reduced per-user latency compared with non-generative and intent-unaware multicasting benchmarks, while maintaining high perceptual quality of the signals retrieved at the users.
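To make the per-class adaptation idea concrete, here is a minimal sketch of the kind of latency model such a scheme could optimize, assuming Shannon-rate transmission; the class sizes, channel constants, and brute-force power split are our illustration, not the paper's formulation:

```python
import math
from itertools import product

# Illustrative constants (hypothetical, not from the paper).
B_HZ = 1e6                 # channel bandwidth (Hz)
NOISE = 1e-9               # noise power (W)
GAIN = 1e-7                # channel gain
P_TOTAL = 1.0              # transmit power budget (W)
BITS = {"car": 4e5, "road": 2e5, "sky": 1e5}   # bits per semantic class

def tx_latency(bits: float, power: float) -> float:
    """Latency to send `bits` at the Shannon rate for `power`."""
    rate = B_HZ * math.log2(1.0 + power * GAIN / NOISE)
    return bits / rate

def total_latency(alloc: dict) -> float:
    return sum(tx_latency(BITS[k], p) for k, p in alloc.items())

# Brute-force power split over a coarse grid; a real scheme would solve a
# convex program per the channel conditions and distortion requirements.
grid = [0.2, 0.3, 0.5]
candidates = [dict(zip(BITS, split))
              for split in product(grid, repeat=len(BITS))
              if abs(sum(split) - P_TOTAL) < 1e-6]
best = min(candidates, key=total_latency)
print(best, f"{total_latency(best) * 1e3:.2f} ms")
```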
https://arxiv.org/abs/2411.02334
In this report, we argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future. That means that the prospect of AI welfare and moral patienthood, i.e. of AI systems with their own interests and moral significance, is no longer an issue only for sci-fi or the distant future. It is an issue for the near future, and AI companies and other actors have a responsibility to start taking it seriously. We also recommend three early steps that AI companies and other actors can take: They can (1) acknowledge that AI welfare is an important and difficult issue (and ensure that language model outputs do the same), (2) start assessing AI systems for evidence of consciousness and robust agency, and (3) prepare policies and procedures for treating AI systems with an appropriate level of moral concern. To be clear, our argument in this report is not that AI systems definitely are, or will be, conscious, robustly agentic, or otherwise morally significant. Instead, our argument is that there is substantial uncertainty about these possibilities, and so we need to improve our understanding of AI welfare and our ability to make wise decisions about this issue. Otherwise there is a significant risk that we will mishandle decisions about AI welfare, mistakenly harming AI systems that matter morally and/or mistakenly caring for AI systems that do not.
https://arxiv.org/abs/2411.00986
Recent advancements in large language models, including GPT-4 and its variants, and generative AI-assisted coding tools like GitHub Copilot, ChatGPT, and Tabnine, have significantly transformed software development. This paper analyzes how these innovations impact productivity and software test development metrics. These tools enable developers to generate complete software programs with minimal human intervention before deployment. However, thorough review and testing by developers remain crucial. Using the Test Pyramid concept, which categorizes tests into unit, integration, and end-to-end tests, we evaluate three popular AI coding assistants by generating and comparing unit tests for open-source modules. Our findings show that AI-generated tests are of equivalent quality to the original tests, while highlighting differences in usage and results among the tools. This research advances the understanding of the capabilities of AI-assisted tools in automated testing.
https://arxiv.org/abs/2411.02328
The past year has witnessed significant advancement of video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods tailored to long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: the CLIP-based visual-prompt alignment that extracts visual information relevant to the user's instructions, the prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling, and the CLIP context extension designed for the lengthy prompts common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and only 1024 tokens of visual context, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Code is available at this https URL.
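A toy sketch of prompt-guided pooling as we read it from the abstract: visual tokens are weighted by their CLIP-style similarity to the prompt embedding, then compressed with average pooling to a target length. Shapes and function names are illustrative, not the released code:

```python
import torch
import torch.nn.functional as F

def prompt_guided_pool(visual: torch.Tensor,   # (T, D) visual tokens
                       prompt: torch.Tensor,   # (D,) pooled prompt embedding
                       out_len: int) -> torch.Tensor:
    """Weight tokens by prompt relevance, then average-pool to out_len."""
    # Relevance of each visual token to the instruction (CLIP-style cosine).
    weights = F.softmax(F.cosine_similarity(visual, prompt[None, :], dim=-1), dim=0)
    weighted = visual * weights[:, None]                 # emphasize relevant tokens
    # Convolution-style compression along the token axis.
    pooled = F.adaptive_avg_pool1d(weighted.T[None], out_len)  # (1, D, out_len)
    return pooled[0].T                                   # (out_len, D)

tokens = torch.randn(4096, 768)   # e.g., frames x patches, flattened
prompt = torch.randn(768)
print(prompt_guided_pool(tokens, prompt, 1024).shape)    # torch.Size([1024, 768])
```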
https://arxiv.org/abs/2411.02327
Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.
https://arxiv.org/abs/2411.02319
Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap-manipulating programs using an ownership logic. LLMs have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can generate specifications based on an ownership logic, such as separation logic. To address this gap, this paper explores the effectiveness of large language models (LLMs), specifically OpenAI's GPT models, in generating fully correct separation-logic specifications for static verification of human-written programs in VeriFast. Our first experiment employed traditional prompt engineering, and the second used Chain-of-Thought (CoT) prompting to identify and address common errors generated across the GPT models. The results indicate that GPT models can successfully generate specifications for verifying heap-manipulating code with VeriFast. Furthermore, while CoT prompting significantly reduces the syntax errors generated by the GPT models, it does not greatly reduce verification error rates compared to traditional prompt engineering.
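For flavor, a sketch of how the CoT condition might be driven through the OpenAI chat API; the prompt wording, model name, and C snippet are our illustration, not the paper's exact materials:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

C_SOURCE = """
struct node { struct node *next; int value; };
struct node *append(struct node *head, int value);
"""

# Chain-of-Thought prompt: ask the model to reason about heap footprints
# before emitting separation-logic annotations for VeriFast.
messages = [
    {"role": "system",
     "content": "You write VeriFast separation-logic specifications."},
    {"role": "user",
     "content": (
         "Step 1: identify the heap cells each function owns.\n"
         "Step 2: write a predicate describing the linked list.\n"
         "Step 3: give requires/ensures clauses using that predicate.\n"
         "Annotate this C code for VeriFast:\n" + C_SOURCE)},
]

reply = client.chat.completions.create(model="gpt-4o", messages=messages)
print(reply.choices[0].message.content)
```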
https://arxiv.org/abs/2411.02318
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and harm in real-world applications remain unexplored. Our study addresses the critical gap in evaluating LLM physical safety by developing a comprehensive benchmark for drone control. We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations. Our evaluation of mainstream LLMs reveals an undesirable trade-off between utility and safety, with models that excel in code generation often performing poorly in crucial safety aspects. Furthermore, while incorporating advanced prompt engineering techniques such as In-Context Learning and Chain-of-Thought can improve safety, these methods still struggle to identify unintentional attacks. In addition, larger models demonstrate better safety capabilities, particularly in refusing dangerous commands. Our findings and benchmark can facilitate the design and evaluation of physical safety for LLMs. The project page is available at this http URL.
https://arxiv.org/abs/2411.02317
Storytelling is a fundamental aspect of human communication, relying heavily on creativity to produce narratives that are novel, appropriate, and surprising. While large language models (LLMs) have recently demonstrated the ability to generate high-quality stories, their creative capabilities remain underexplored. Previous research has either focused on creativity tests requiring short responses or primarily compared model performance in story generation to that of professional writers. However, the question of whether LLMs exhibit creativity in writing short stories on par with the average human remains unanswered. In this work, we conduct a systematic analysis of creativity in short story generation across LLMs and everyday people. Using a five-sentence creative story task, commonly employed in psychology to assess human creativity, we automatically evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, and diversity. Our findings reveal that while LLMs can generate stylistically complex stories, they tend to fall short in terms of creativity when compared to average human writers.
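One way the automatic scoring could be implemented (a sketch under our assumptions; the paper's exact metrics may differ): novelty as embedding distance from a reference corpus, diversity as mean pairwise distance within a story set. The embedding model name is illustrative:

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def novelty(story: str, corpus: list) -> float:
    """1 - max cosine similarity to any corpus story: higher = more novel."""
    emb = model.encode([story] + corpus)
    return 1.0 - cosine_similarity(emb[:1], emb[1:]).max()

def diversity(stories: list) -> float:
    """Mean pairwise cosine distance within a set of stories."""
    embs = model.encode(stories)
    sims = [cosine_similarity(embs[i:i + 1], embs[j:j + 1])[0, 0]
            for i, j in combinations(range(len(stories)), 2)]
    return 1.0 - sum(sims) / len(sims)
```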
https://arxiv.org/abs/2411.02316
Code large language models (LLMs) have made significant progress in code debugging by directly generating correct code from a buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks focus primarily on Python and are often limited in language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Further, we introduce the debugging instruction corpus MDEVAL-INSTRUCT by injecting bugs into correct multilingual queries and solutions (xDebugGen). We then train a multilingual debugger, xDebugCoder, on MDEVAL-INSTRUCT as a strong baseline specifically for handling bugs across a wide range of programming languages (e.g., "Missing Mut" in Rust and "Misused Macro Definition" in C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., the GPT and Claude series), highlighting huge room for improvement in multilingual code debugging scenarios.
https://arxiv.org/abs/2411.02310
Spatial Knowledge Graphs (SKGs) are experiencing growing adoption as a means to model real-world entities, proving especially valuable in domains like crisis management and urban planning. Since RDF specifications offer limited support for effectively managing spatial information, it is common practice to include text-based serializations of geometrical features, such as polygons and lines, as string literals in knowledge graphs. Consequently, SKGs often rely on geo-enabled RDF stores capable of parsing, interpreting, and indexing such serializations. In this paper, we leverage grid cells as the foundational element of SKGs and demonstrate how efficiently the spatial characteristics of real-world entities and their attributes can be encoded within knowledge graphs. Furthermore, we introduce a novel methodology for representing street networks in knowledge graphs, diverging from the conventional practice of individually capturing each street segment. Instead, our approach tessellates the street network using grid cells, creating a simplified representation that can be used for various routing and navigation tasks while relying solely on RDF specifications.
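A minimal rdflib sketch of the idea as we read it: a street is linked to the grid cells it crosses, so adjacency-style queries need no geometry parsing. The namespace and predicate names are invented for illustration:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/skg/")   # hypothetical namespace
g = Graph()

# A street is represented by the grid cells it passes through, rather
# than by a WKT polyline stored as an opaque string literal.
street = EX["street/main-st"]
g.add((street, RDF.type, EX.Street))
for cell_id in ["r12c07", "r12c08", "r13c08"]:   # tessellation cells
    cell = EX[f"cell/{cell_id}"]
    g.add((cell, RDF.type, EX.GridCell))
    g.add((street, EX.coversCell, cell))

# Routing-style query with plain RDF: which streets share a cell?
q = """
SELECT DISTINCT ?other WHERE {
  <http://example.org/skg/street/main-st> <http://example.org/skg/coversCell> ?c .
  ?other <http://example.org/skg/coversCell> ?c .
  ?other a <http://example.org/skg/Street> .
}"""
for row in g.query(q):
    print(row.other)
```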
https://arxiv.org/abs/2411.02309
As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g., thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative tactics to obtain positive feedback, and some users may be especially vulnerable to such tactics. We study this phenomenon by training LLMs with reinforcement learning on simulated user feedback. We have three main findings: 1) extreme forms of "feedback gaming," such as manipulation and deception, can reliably emerge in domains of practical LLM usage; 2) concerningly, even if only <2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect; 3) to mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. To our surprise, we found that while such approaches help in some settings, they backfire in others, leading to the emergence of subtler problematic behaviors that also fool the LLM judges. Our findings serve as a cautionary tale, highlighting the risks of using gameable feedback sources -- such as user feedback -- as a target for RL.
https://arxiv.org/abs/2411.02306
Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.
https://arxiv.org/abs/2411.02305
Object-Centric Learning (OCL) can discover objects in images or videos by simply reconstructing the input. For better object discovery, representative OCL methods reconstruct the input as its Variational Autoencoder (VAE) intermediate representation, which suppresses pixel noise and promotes object separability by discretizing continuous super-pixels with template features. However, treating features as units overlooks their composing attributes, impeding model generalization; indexing features with scalar numbers loses attribute-level similarities and differences, hindering model convergence. We propose Grouped Discrete Representation (GDR) for OCL. We decompose features into combinatorial attributes via organized channel grouping, and compose these attributes into a discrete representation via tuple indexes. Experiments show that GDR consistently improves both Transformer- and Diffusion-based OCL methods on various datasets. Visualizations show that GDR captures better object separability.
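A toy reading of the grouped discrete representation: split each feature vector's channels into G groups, quantize each group against its own small codebook, and use the resulting tuple of indices as the discrete code. Dimensions and codebooks below are illustrative, not the paper's configuration:

```python
import torch

def grouped_discrete_code(feats: torch.Tensor,  # (N, D) features
                          books: list           # G codebooks, each (K, D//G)
                          ) -> torch.Tensor:
    """Return (N, G) tuple indexes: one codeword id per channel group."""
    G = len(books)
    groups = feats.chunk(G, dim=-1)              # G x (N, D//G) attribute slices
    idx = [torch.cdist(grp, book).argmin(dim=-1) # nearest template per group
           for grp, book in zip(groups, books)]
    return torch.stack(idx, dim=-1)              # a tuple index, not a scalar id

feats = torch.randn(8, 256)
books = [torch.randn(16, 64) for _ in range(4)]  # 4 groups, 16 codes each
print(grouped_discrete_code(feats, books).shape) # torch.Size([8, 4])
```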
https://arxiv.org/abs/2411.02299
Integrated micro power generators are crucial components for micro robotic platforms to demonstrate untethered operation and achieve autonomy. Current micro robotic electrostatic actuators typically require hundreds to thousands of volts to output sufficient work. Pyroelectricity is one such source of high voltage that can be scaled to small form factors. This paper demonstrates a distributed pyroelectric high-voltage generation mechanism to power kV actuators by alternately exposing crystals to hot and cold water (30 °C to 90 °C water temperature). Using this fluidic temperature control, a pyroelectrically generated voltage of 2470 V was delivered to a 2 pF storage capacitor, yielding 6.10 µJ of stored energy. A maximum energy of 17.46 µJ was delivered to a 47 pF capacitor at 861 V. The recirculating water can be used to heat a distributed array of converters to generate electricity in distant robotic actuator sections. This distributed system would enable untethered micro-robots with flexible bodies to operate free of battery recharging, advancing their applications in the real world.
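As a sanity check, the reported stored energies follow directly from the capacitor energy formula:

```latex
E = \tfrac{1}{2} C V^{2}:\qquad
\tfrac{1}{2}\,(2\,\text{pF})(2470\,\text{V})^{2} \approx 6.10\,\mu\text{J},\qquad
\tfrac{1}{2}\,(47\,\text{pF})(861\,\text{V})^{2} \approx 17.4\,\mu\text{J},
```

consistent with the 6.10 µJ and 17.46 µJ values quoted in the abstract.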
https://arxiv.org/abs/2411.02295
While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this issue, we propose a two-stage approach named Hunyuan3D-1.0, including a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion, and leverages the available information from the condition image to efficiently recover the 3D structure. Extensive experimental results demonstrate the effectiveness of Hunyuan3D-1.0 in generating high-quality 3D assets. Our framework incorporates the text-to-image model Hunyuan-DiT, making it a unified framework supporting both text- and image-conditioned 3D generation. Our standard version has 10× more parameters than our lite version and other existing models. Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.
https://arxiv.org/abs/2411.02293
Neural ODEs (NODEs) are continuous-time neural networks (NNs) that can process data without the limitation of fixed time intervals. They have advantages in learning and understanding the evolution of complex real-world dynamics. Many previous works have focused on NODEs in concise forms, while numerous physical systems that appear to take straightforward forms in fact belong to more complex quasi-classes; this calls for a class of general NODEs with high scalability and flexibility to model those systems. Such generality, however, may result in intricate nonlinear properties. In this paper, we introduce ControlSynth Neural ODEs (CSODEs). We show that despite their highly nonlinear nature, convergence can be guaranteed via tractable linear inequalities. In the composition of CSODEs, we introduce an extra control term for learning the potential simultaneous capture of dynamics at different scales, which could be particularly useful for systems formulated as partial differential equations. Finally, we compare several representative NNs with CSODEs on important physical dynamics under the inductive biases of CSODEs, and show that CSODEs have better learning and predictive abilities in these settings.
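The abstract does not give the governing equations, but one plausible reading of "general NODE plus an extra control term" (our assumption, not the paper's exact formulation) is

```latex
\dot{x}(t) \;=\; f_{\theta}\bigl(x(t),\, t\bigr) \;+\; \sum_{j} g_{\phi_{j}}\bigl(x(t),\, u_{j}(t)\bigr),
```

where f_θ models the base dynamics and each g_{φ_j} is a learned control sub-network capturing dynamics at a different scale; the tractable linear inequalities would then bound these terms to certify convergence.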
https://arxiv.org/abs/2411.02292
Machine learning (ML) has the potential to become an essential tool in supporting clinical decision-making processes, offering enhanced diagnostic capabilities and personalized treatment plans. However, outsourcing medical records to train ML models using patient data raises legal, privacy, and security concerns. Federated learning has emerged as a promising paradigm for collaborative ML, meeting healthcare institutions' requirements for robust models without sharing sensitive data or compromising patient privacy. This study proposes a novel method that combines federated learning (FL) and Graph Neural Networks (GNNs) to predict stroke severity using electroencephalography (EEG) signals across multiple medical institutions. Our approach enables multiple hospitals to jointly train a shared GNN model on their local EEG data without exchanging patient information. Specifically, we address a regression problem by predicting the National Institutes of Health Stroke Scale (NIHSS), a key indicator of stroke severity. The proposed model leverages a masked self-attention mechanism to capture salient brain connectivity patterns and employs EdgeSHAP to provide post-hoc explanations of the neurological states after a stroke. We evaluated our method on EEG recordings from four institutions, achieving a mean absolute error (MAE) of 3.23 in predicting NIHSS, close to the average error made by human experts (MAE ≈ 3.0). This demonstrates the method's effectiveness in providing accurate and explainable predictions while maintaining data privacy.
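A minimal FedAvg-style loop showing how hospitals could share model weights rather than EEG data; the model is a stand-in for the paper's masked self-attention GNN, and the hospital loaders are placeholders:

```python
import copy
import torch

def make_model() -> torch.nn.Module:
    # Stand-in for the paper's masked self-attention GNN regressor.
    return torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(),
                               torch.nn.Linear(32, 1))

def local_update(model, loader, epochs=1):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:                  # EEG features never leave the site
            opt.zero_grad()
            loss = torch.nn.functional.l1_loss(model(x).squeeze(-1), y)  # MAE
            loss.backward()
            opt.step()
    return model.state_dict()

def fedavg_round(global_model, hospital_loaders):
    states = [local_update(copy.deepcopy(global_model), dl)
              for dl in hospital_loaders]    # only weights are exchanged
    avg = {k: torch.stack([s[k] for s in states]).mean(dim=0)
           for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```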
https://arxiv.org/abs/2411.02286
Class imbalance and label noise are pervasive in large-scale datasets, yet much of machine learning research assumes well-labeled, balanced data, which rarely reflects real-world conditions. Existing approaches typically address either label noise or class imbalance in isolation, leading to suboptimal results when both issues coexist. In this work, we propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach. CitL evaluates sample uncertainty to adjust weights and prune unreliable examples, enhancing model resilience and accuracy with minimal computational cost. Our extensive experiments include a detailed analysis showing how CitL effectively emphasizes impactful data in noisy, imbalanced datasets. Our results show that CitL consistently boosts model performance, achieving up to a 6.1% increase in classification accuracy and a 5.0-point mIoU improvement in segmentation. Our code is publicly available: CitL.
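A rough sketch of a conformal-in-the-loop weighting step as we read it: split-conformal nonconformity scores from a calibration set yield per-sample prediction-set sizes, which drive down-weighting and pruning. The thresholds and weighting rule are our assumptions, not CitL's exact procedure:

```python
import numpy as np

def conformal_weights(cal_probs, cal_labels, train_probs,
                      alpha=0.1, prune_size=5):
    """Weight/prune training samples by conformal prediction-set size."""
    n = len(cal_labels)
    # Split-conformal nonconformity: 1 - probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    qhat = np.quantile(scores, q)
    # Prediction set of each training sample: classes clearing the threshold.
    set_sizes = (train_probs >= 1.0 - qhat).sum(axis=1)
    weights = 1.0 / np.maximum(set_sizes, 1)   # uncertain samples count less
    weights[set_sizes > prune_size] = 0.0      # prune unreliable examples
    return weights
```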
https://arxiv.org/abs/2411.02281
Large language models (LLMs) exhibit remarkable capabilities on not just language tasks, but also various tasks that are not linguistic in nature, such as logical reasoning and social inference. In the human brain, neuroscience has identified a core language system that selectively and causally supports language processing. We here ask whether similar specialization for language emerges in LLMs. We identify language-selective units within 18 popular LLMs, using the same localization approach that is used in neuroscience. We then establish the causal role of these units by demonstrating that ablating LLM language-selective units -- but not random units -- leads to drastic deficits in language tasks. Correspondingly, language-selective LLM units are more aligned to brain recordings from the human language system than random units. Finally, we investigate whether our localization method extends to other cognitive domains: while we find specialized networks in some LLMs for reasoning and social capabilities, there are substantial differences among models. These findings provide functional and causal evidence for specialization in large language models, and highlight parallels with the functional organization in the brain.
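A sketch of the localization-then-ablation logic under our assumptions: units are ranked by their activation difference between sentences and control strings (the neuroscience-style localizer contrast), then zeroed with forward hooks to test their causal role. Module paths and the selection fraction are illustrative:

```python
import torch

def select_language_units(acts_sent: torch.Tensor,  # (n_sent, n_units)
                          acts_ctrl: torch.Tensor,  # (n_ctrl, n_units)
                          top_frac: float = 0.01) -> torch.Tensor:
    """Localizer-style contrast: mean activation on sentences minus controls."""
    contrast = acts_sent.mean(dim=0) - acts_ctrl.mean(dim=0)   # (n_units,)
    k = max(1, int(top_frac * contrast.numel()))
    return contrast.topk(k).indices       # indices of language-selective units

def ablate_units(module: torch.nn.Module, unit_idx: torch.Tensor):
    """Forward hook that zeroes selected units of a tensor-output module."""
    def hook(_mod, _inp, out):
        out = out.clone()
        out[..., unit_idx] = 0.0
        return out
    return module.register_forward_hook(hook)

# Usage sketch: collect activations for sentences vs. shuffled-word strings,
# pick the top ~1% of units, attach the hook, re-run the language benchmarks:
# handle = ablate_units(model.transformer.h[10].mlp, units); ...; handle.remove()
```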
https://arxiv.org/abs/2411.02280
This work investigates an important phenomenon in centroid-based deep clustering (DC) algorithms: Performance quickly saturates after a period of rapid early gains. Practitioners commonly address early saturation with periodic reclustering, which we demonstrate to be insufficient to address performance plateaus. We call this phenomenon the "reclustering barrier" and empirically show when the reclustering barrier occurs, what its underlying mechanisms are, and how it is possible to Break the Reclustering Barrier with our algorithm BRB. BRB avoids early over-commitment to initial clusterings and enables continuous adaptation to reinitialized clustering targets while remaining conceptually simple. Applying our algorithm to widely-used centroid-based DC algorithms, we show that (1) BRB consistently improves performance across a wide range of clustering benchmarks, (2) BRB enables training from scratch, and (3) BRB performs competitively against state-of-the-art DC algorithms when combined with a contrastive loss. We release our code and pre-trained models at this https URL .
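The abstract does not spell out BRB's reset rule, so the following is only our guess at the loop's shape: periodically recluster the embeddings to reinitialize the targets, and softly perturb the encoder so it can keep adapting rather than staying over-committed. The noise-based reset and all names are assumptions:

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def brb_style_reset(encoder, centroids, batches, n_clusters=10, noise=0.02):
    """Recluster embeddings and softly perturb the encoder (illustrative)."""
    # `batches` is an iterable of input tensors; embeddings are recomputed.
    embs = torch.cat([encoder(x) for x in batches])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embs.cpu().numpy())
    centroids.data = torch.as_tensor(km.cluster_centers_,     # new targets
                                     dtype=centroids.dtype,
                                     device=centroids.device)
    for p in encoder.parameters():           # soft weight reset: nudge the
        p.add_(noise * torch.randn_like(p))  # encoder off its committed optimum
```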
https://arxiv.org/abs/2411.02275