To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance on text-to-motion retrieval benchmarks and on other novel, challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.
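As a rough illustration of the motion-patch idea described above, the sketch below groups the joints of a hypothetical 22-joint skeleton by body part and slices the resulting joints-by-frames "image" into ViT-style patches. The body-part grouping, joint indices, and patch size are illustrative assumptions, not the paper's specification.

```python
import numpy as np

# Hypothetical body-part grouping for a 22-joint skeleton (an assumption
# for illustration): sorting joints by part makes each part a contiguous
# block of rows in the pseudo-image.
BODY_PARTS = {
    "torso": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

def to_motion_patches(motion, patch_size=4):
    """motion: (T, J, 3) joint positions -> list of (patch_size, patch_size, 3)
    pseudo-image patches, with joints as rows, frames as columns, xyz as
    color channels. Trailing rows/columns that do not fill a patch are
    dropped here; a real pipeline would pad instead."""
    rows = []
    for joints in BODY_PARTS.values():
        part = motion[:, joints, :]           # (T, n_joints, 3)
        rows.append(part.transpose(1, 0, 2))  # joints as rows, time as cols
    image = np.concatenate(rows, axis=0)      # (J, T, 3) "color image"
    J, T, _ = image.shape
    return [image[i:i + patch_size, j:j + patch_size]
            for i in range(0, J - patch_size + 1, patch_size)
            for j in range(0, T - patch_size + 1, patch_size)]

motion = np.random.rand(16, 22, 3)  # 16 frames, 22 joints, xyz
patches = to_motion_patches(motion)
```

Each patch then plays the role of a color image patch in a standard ViT tokenizer, which is what lets the pre-trained 2D weights transfer.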
https://arxiv.org/abs/2405.04771
Collaborative filtering (CF) methods for recommendation systems have been extensively researched, ranging from matrix factorization and autoencoder-based to graph filtering-based methods. Recently, lightweight methods that require almost no training have been proposed to reduce overall computation. However, existing methods still have room to improve the trade-offs among accuracy, efficiency, and robustness. In particular, there are no well-designed closed-form studies for balanced CF in terms of the aforementioned trade-offs. In this paper, we design SVD-AE, a simple yet effective singular value decomposition (SVD)-based linear autoencoder, whose closed-form solution can be defined based on SVD for CF. SVD-AE does not require iterative training processes, as its closed-form solution can be calculated at once. Furthermore, given the noisy nature of the rating matrix, we explore the robustness of existing CF methods and our SVD-AE against such noisy interactions. As a result, we demonstrate that our simple design choice based on truncated SVD can be used to strengthen the noise robustness of the recommendation while improving efficiency. Code is available at this https URL.
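A minimal sketch of the closed-form idea: build an item-item linear autoencoder from the top-k right singular vectors of the rating matrix, so the "training" is a single SVD call rather than an iterative loop. This is an illustration of the general truncated-SVD recipe; the paper's exact SVD-AE formulation may differ.

```python
import numpy as np

def svd_ae_scores(R, k=2):
    """R: (users, items) rating matrix. Returns a dense score matrix from
    a closed-form rank-k reconstruction -- no iterative training needed."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Rank-k item-item "autoencoder": encode with Vt[:k].T, decode with Vt[:k].
    # Truncating to k also discards the small singular directions that tend
    # to carry interaction noise.
    B = Vt[:k].T @ Vt[:k]  # closed-form weight matrix
    return R @ B           # reconstructed preference scores

R = np.array([[1., 1., 0., 0.],
              [1., 1., 1., 0.],
              [0., 0., 1., 1.]])
scores = svd_ae_scores(R, k=2)
```

With k equal to the full rank, the reconstruction recovers R exactly; choosing a smaller k is what trades fidelity for noise robustness.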
https://arxiv.org/abs/2405.04746
In this paper, we present a comprehensive survey of the metaverse, envisioned as a transformative dimension of next-generation Internet technologies. This study not only outlines the structural components of our survey but also makes a substantial scientific contribution by elucidating the foundational concepts underlying the emergence of the metaverse. We analyze its architecture by defining key characteristics and requirements, thereby illuminating the nascent reality set to revolutionize digital interactions. Our analysis emphasizes the importance of collaborative efforts in developing metaverse standards, thereby fostering a unified understanding among industry stakeholders, organizations, and regulatory bodies. We extend our scrutiny to critical technologies integral to the metaverse, including interactive experiences, communication technologies, ubiquitous computing, digital twins, artificial intelligence, and cybersecurity measures. For each technological domain, we rigorously assess current contributions, principal techniques, and representative use cases, providing a nuanced perspective on their potential impacts. Furthermore, we delve into the metaverse's diverse applications across education, healthcare, business, social interactions, industrial sectors, defense, and mission-critical operations, highlighting its extensive utility. Each application is thoroughly analyzed, demonstrating its value and addressing associated challenges. The survey concludes with an overview of persistent challenges and future directions, offering insights into essential considerations and strategies necessary to harness the full potential of the metaverse. Through this detailed investigation, our goal is to articulate the scientific contributions of this survey paper, transcending a mere structural overview to highlight the transformative implications of the metaverse.
https://arxiv.org/abs/2405.04718
Policymakers frequently analyze air quality and climate change in isolation, disregarding their interactions. This study explores the influence of specific climate factors on air quality by contrasting a regression model with K-Means Clustering, Hierarchical Clustering, and Random Forest techniques. We employ Physics-based Deep Learning (PBDL) and Long Short-Term Memory (LSTM) to examine the air pollution predictions. Our analysis utilizes ten years (2009-2018) of daily traffic, weather, and air pollution data from three major cities in Norway. Findings from feature selection reveal a correlation between rising heating degree days and heightened air pollution levels, suggesting increased heating activities in Norway are a contributing factor to worsening air quality. PBDL demonstrates superior accuracy in air pollution predictions compared to LSTM. This paper contributes to the growing literature on PBDL methods for more accurate air pollution predictions using environmental variables, aiding policymakers in formulating effective data-driven climate policies.
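The heating-degree-day feature behind the correlation finding above has a standard definition: the shortfall of each day's mean temperature below a reference base, summed over days. The 17 °C base used here is one common convention and an assumption; the study's exact base may differ.

```python
def heating_degree_days(daily_mean_temps, base=17.0):
    """Sum of max(0, base - T) over daily mean temperatures (deg C).
    Higher values indicate more heating demand."""
    return sum(max(0.0, base - t) for t in daily_mean_temps)

# Three days at 20, 15, and 5 deg C: 0 + 2 + 12 = 14 degree days.
hdd = heating_degree_days([20.0, 15.0, 5.0])
```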
https://arxiv.org/abs/2405.04716
Reinforcement learning provides an appealing framework for robotic control due to its ability to learn expressive policies purely through real-world interaction. However, this requires addressing real-world constraints and avoiding catastrophic failures during training, which might severely impede both learning progress and the performance of the final policy. In many robotics settings, this amounts to avoiding certain "unsafe" states. The high-speed off-road driving task represents a particularly challenging instantiation of this problem: a high-return policy should drive as aggressively and as quickly as possible, which often requires getting close to the edge of the set of "safe" states, and therefore places a particular burden on the method to avoid frequent failures. To both learn highly performant policies and avoid excessive failures, we propose a reinforcement learning framework that combines risk-sensitive control with an adaptive action space curriculum. Furthermore, we show that our risk-sensitive objective automatically avoids out-of-distribution states when equipped with an estimator for epistemic uncertainty. We implement our algorithm on a small-scale rally car and show that it is capable of learning high-speed policies for a real-world off-road driving task. We show that our method greatly reduces the number of safety violations during the training process, and actually leads to higher-performance policies in both driving and non-driving simulation environments with similar challenges.
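One common way to make an objective risk-sensitive with an epistemic-uncertainty estimator, as described above, is to score actions by an ensemble's mean value minus a penalty on ensemble disagreement; out-of-distribution actions show high disagreement and are avoided automatically. This is a generic sketch of that mechanism, not the paper's exact objective.

```python
import numpy as np

def risk_sensitive_action(q_ensemble, risk_weight=1.0):
    """q_ensemble: (n_models, n_actions) Q-value estimates from an ensemble.
    Returns the action maximizing mean - risk_weight * std, so actions the
    ensemble disagrees on (epistemically uncertain ones) are penalized."""
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)
    return int(np.argmax(mean - risk_weight * std))

# Action 1 has a slightly higher mean value but far higher disagreement,
# so the risk-sensitive choice falls back to the safer action 0.
q = np.array([[1.0, 2.5],
              [1.1, 0.1],
              [0.9, 1.0]])
```

Setting `risk_weight=0` recovers the risk-neutral greedy choice, which is exactly the behavior that drives the policy toward the edge of the safe set.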
https://arxiv.org/abs/2405.04714
When agents that are independently trained (or designed) to complete their individual tasks are deployed in a shared environment, their joint actions may produce negative side effects (NSEs). As their training does not account for the behavior of other agents or their joint action effects on the environment, the agents have no prior knowledge of the NSEs of their actions. We model the problem of mitigating NSEs in a cooperative multi-agent system as a Lexicographic Decentralized Markov Decision Process with two objectives. The agents must optimize the completion of their assigned tasks while mitigating NSEs. We assume independence of transitions and rewards with respect to the agents' tasks but the joint NSE penalty creates a form of dependence in this setting. To improve scalability, the joint NSE penalty is decomposed into individual penalties for each agent using credit assignment, which facilitates decentralized policy computation. Our results in simulation on three domains demonstrate the effectiveness and scalability of our approach in mitigating NSEs by updating the policies of a subset of agents in the system.
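The decomposition step above can be sketched with a simple proportional credit-assignment rule: split the joint NSE penalty among agents according to their estimated contributions, so each agent can optimize a local penalty. The proportional rule here is one plausible scheme for illustration; the paper's exact credit-assignment method may differ.

```python
def decompose_penalty(joint_penalty, contributions):
    """Split a joint NSE penalty proportionally to each agent's estimated
    contribution, enabling decentralized policy computation: each agent
    then trades off its own task reward against its own penalty share."""
    total = sum(contributions)
    if total == 0:
        return [0.0] * len(contributions)
    return [joint_penalty * c / total for c in contributions]

# Agent 0 caused most of the side effect, agent 2 none of it.
shares = decompose_penalty(10.0, [3.0, 1.0, 0.0])
```

Because the shares sum to the joint penalty, only the agents that actually contribute to NSEs need their policies updated, which is what gives the approach its scalability.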
https://arxiv.org/abs/2405.04702
Large Language Models (LLMs) deployed on edge devices learn through fine-tuning and updating a certain portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall required resources remain a heavy burden on edge devices. Instead, Retrieval-Augmented Generation (RAG), a resource-efficient LLM learning method, can improve the quality of the LLM-generated content without updating model parameters. However, the RAG-based LLM may involve repetitive searches on the profile data in every user-LLM interaction. This search can lead to significant latency along with the accumulation of user data. Conventional efforts to decrease latency result in restricting the size of saved user data, thus reducing the scalability of RAG as user data continuously grows. It remains an open question: how to free RAG from the constraints of latency and scalability on edge devices? In this paper, we propose a novel framework to accelerate RAG via Computing-in-Memory (CiM) architectures. It accelerates matrix multiplications by performing in-situ computation inside the memory while avoiding the expensive data transfer between the computing unit and memory. Our framework, Robust CiM-backed RAG (RoCR), utilizing a novel contrastive learning-based training method and noise-aware training, can enable RAG to efficiently search profile data with CiM. To the best of our knowledge, this is the first work utilizing CiM to accelerate RAG.
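The retrieval step that CiM accelerates is, at its core, a matrix-vector product between a query embedding and the stored profile embeddings. The sketch below shows that operation, with optional Gaussian noise imitating the analog non-idealities of in-memory computation that noise-aware training must tolerate; the noise model is an assumption for illustration, not RoCR's actual training procedure.

```python
import numpy as np

def cim_style_retrieve(query, doc_embeddings, k=2, noise_std=0.0):
    """Top-k retrieval as a single matvec -- the operation a CiM array
    performs in situ, avoiding data transfer between memory and compute.
    noise_std > 0 simulates analog computation noise."""
    scores = doc_embeddings @ query
    if noise_std > 0:
        scores = scores + np.random.default_rng(0).normal(
            0.0, noise_std, scores.shape)
    return np.argsort(scores)[::-1][:k]

docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])
top = cim_style_retrieve(np.array([1.0, 0.1]), docs, k=1)
```

On a CPU this matvec scales linearly with the profile-data size, which is the latency/scalability bottleneck the paper targets; on CiM hardware it executes in place inside the memory array.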
https://arxiv.org/abs/2405.04700
Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. These T2V models often produce single-scene video clips that depict an entity performing a particular action (e.g., `a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos, since they are ubiquitous in the real world (e.g., `a red panda climbing a tree' followed by `the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., `a red panda climbing a tree') and second scene description (e.g., `the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and are visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence using human evaluation. The project website is this https URL.
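The time-aligned conditioning described above can be sketched as follows: each temporal segment of the video latents is conditioned only on its own scene description's embedding. The additive conditioning and toy embeddings below are simplifying assumptions standing in for the model's learned cross-attention and text encoder.

```python
import numpy as np

def time_aligned_condition(frame_latents, scene_embeddings, boundaries):
    """frame_latents: (T, D) per-frame latents; scene_embeddings: (S, D)
    caption embeddings; boundaries: length-S list of (start, end) frame
    ranges. Each segment is conditioned on its own caption only."""
    out = frame_latents.copy()
    for emb, (start, end) in zip(scene_embeddings, boundaries):
        out[start:end] += emb  # align this caption with its scene's frames
    return out

latents = np.zeros((4, 2))
# Toy embeddings for "climbing a tree" then "sleeps on top of the tree".
scenes = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
cond = time_aligned_condition(latents, scenes, [(0, 2), (2, 4)])
```

Contrast this with naive conditioning, which would feed the concatenated multi-scene prompt to every frame and lose the temporal alignment.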
https://arxiv.org/abs/2405.04682
Generative AI technologies demand new practical and critical competencies, which call on design to respond to and foster these. We present an exploratory study guided by Research-through-Design, in which we partnered with a primary school to develop a constructionist curriculum centered on students interacting with a generative AI technology. We provide a detailed account of the design of and outputs from the curriculum and learning materials, finding centrally that the reflexive and prolonged `hands-on' approach led to a co-development of students' practical and critical competencies. From the study, we contribute guidance for designing constructionist approaches to generative AI technology education; further arguing to do so with `critical responsivity.' We then discuss how HCI researchers may leverage constructionist strategies in designing interactions with generative AI technologies; and suggest that Research-through-Design can play an important role as a `rapid response methodology' capable of reacting to fast-evolving, disruptive technologies such as generative AI.
https://arxiv.org/abs/2405.04677
In recent years, convolutional neural networks (CNNs) have achieved remarkable advances in remote sensing image super-resolution. However, the textures and structures in remote sensing images (RSIs) are complex and variable, often repeating within the same image while differing across images. Current deep learning-based super-resolution models focus less on high-frequency features, which leads to suboptimal performance in capturing contours, textures, and spatial information. State-of-the-art CNN-based methods now focus on the feature extraction of RSIs using attention mechanisms. However, these methods are still incapable of effectively identifying and utilizing key content attention signals in RSIs. To solve this problem, we propose an advanced feature extraction module called Channel and Spatial Attention Feature Extraction (CSA-FE) that effectively extracts features by using channel and spatial attention incorporated with the standard vision transformer (ViT). The proposed method was trained on the UCMerced dataset at scales 2, 3, and 4. The experimental results show that our proposed method helps the model focus on the specific channels and spatial locations containing high-frequency information, so that the model can attend to relevant features and suppress irrelevant ones, enhancing the quality of super-resolved images. Our model achieved superior performance compared to various existing models.
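A minimal numpy sketch of sequential channel and spatial attention, the generic squeeze-and-gate formulation: a per-channel gate from global average pooling, followed by a per-location gate from a channel-wise mean. CSA-FE's exact layers (and its integration with ViT) may differ; this only illustrates the reweighting mechanism.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_spatial_attention(x):
    """x: (C, H, W) feature map -> feature map reweighted so that
    informative channels and spatial locations are emphasized."""
    # Channel attention: global average pool per channel -> gate in (0, 1).
    channel_gate = sigmoid(x.mean(axis=(1, 2)))  # (C,)
    x = x * channel_gate[:, None, None]
    # Spatial attention: mean over channels per location -> gate in (0, 1).
    spatial_gate = sigmoid(x.mean(axis=0))       # (H, W)
    return x * spatial_gate[None, :, :]

feat = np.random.rand(8, 4, 4)
out = channel_spatial_attention(feat)
```

In a real module the gates come from small learned layers rather than raw means, but the shape of the computation (channel gate, then spatial gate) is the same.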
https://arxiv.org/abs/2405.04595
Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. Each of these methods works in isolation instead of synergistically. Here we address this problem and build a language-driven human understanding system -- ChatHuman, which combines and integrates the skills of many different methods. To do so, we finetune a Large Language Model (LLM) to select and use a wide variety of existing tools in response to user inputs. In doing so, ChatHuman is able to combine information from multiple tools to solve problems more accurately than the individual tools themselves and to leverage tool output to improve its ability to reason about humans. The novel features of ChatHuman include leveraging academic publications to guide the application of 3D human-related tools, employing a retrieval-augmented generation model to generate in-context-learning examples for handling new tools, and discriminating and integrating tool results to enhance 3D human understanding. Our experiments show that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful, system for 3D human reasoning.
https://arxiv.org/abs/2405.04533
We report on the development of an implementable physics-data hybrid dynamic model for an articulated manipulator to plan and operate in various scenarios. Both physics-based and data-driven dynamic models are studied in this research to select the best model for planning. The physics-based model is constructed using the Lagrangian method, and its loss terms include inertia loss, viscous loss, and friction loss. For the data-driven model, three methods are explored: DNN, LSTM, and XGBoost. Our modeling results demonstrate that, after comprehensive hyperparameter optimization, the XGBoost architecture outperforms DNN and LSTM in accurately representing manipulator dynamics. The hybrid model with physics-based and data-driven terms has the best performance among all models under the RMSE criterion, and it needs only about 24k training samples. In addition, we developed a virtual force sensor for the manipulator using the external torque observed from the dynamic model, and designed a motion planner through the physics-data hybrid dynamic model. The external torque contributes to the forces and torque on the end effector, facilitating interaction with the surroundings, while the internal torque governs manipulator motion dynamics and compensates for internal losses. By estimating the external torque via the difference between the measured joint torque and internal losses, we implement a sensorless control strategy, which is demonstrated through a peg-in-hole task. Lastly, a learning-based motion planner based on the hybrid dynamic model assists in planning time-efficient trajectories for the manipulator. This comprehensive approach underscores the efficacy of integrating physics-based and data-driven models for advanced manipulator control and planning in industrial environments.
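The virtual-force-sensor idea above reduces to a torque residual: external torque is the measured joint torque minus the model's internal terms. The sketch below uses simplified single-joint forms of the inertia, viscous, and Coulomb friction terms with made-up coefficients; the paper's full Lagrangian model is considerably richer.

```python
import numpy as np

def estimate_external_torque(tau_measured, qd, qdd,
                             inertia=1.0, viscous=0.1, coulomb=0.05):
    """Sensorless external-torque estimate for one joint: subtract the
    modeled internal torque (inertia + viscous + Coulomb friction) from
    the measured joint torque. A nonzero residual indicates contact."""
    tau_internal = (inertia * qdd          # inertia term
                    + viscous * qd         # viscous loss
                    + coulomb * np.sign(qd))  # Coulomb friction loss
    return tau_measured - tau_internal

# Free motion: measured torque matches the internal model, residual ~ 0.
tau_free = estimate_external_torque(1.25, qd=2.0, qdd=1.0)
# In contact (e.g., peg-in-hole), the extra measured torque shows up as
# a nonzero external-torque estimate.
tau_contact = estimate_external_torque(2.25, qd=2.0, qdd=1.0)
```

Thresholding this residual is what lets a peg-in-hole controller detect and react to contact without a physical force sensor.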
https://arxiv.org/abs/2405.04503
When a teacher provides examples for a student to study, these examples must be informative, enabling a student to progress from their current state toward a target concept or skill. Good teachers must therefore simultaneously infer what students already know and adapt their teaching to students' changing state of knowledge. There is increasing interest in using computational models, particularly large language models, as pedagogical tools. As students, language models in particular have shown a remarkable ability to adapt to new tasks given small numbers of examples. But how effectively can these models adapt as teachers to students of different types? To study this question, we introduce a suite of models and evaluation methods we call AdapT. AdapT has two components: (1) a collection of simulated Bayesian student models that can be used for evaluation of automated teaching methods; (2) a platform for evaluation with human students, to characterize the real-world effectiveness of these methods. We additionally introduce (3) AToM, a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimizes for the correctness of future beliefs. In evaluations of simulated students across three learning domains (fraction arithmetic, English morphology, function learning), AToM systematically outperforms LLM-based and standard Bayesian teaching models. In human experiments, both AToM and LLMs outperform non-adaptive random example selection. Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
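A toy sketch of the adaptive-teaching loop described above, far simpler than AToM: the simulated student holds a Bayesian posterior over candidate hypotheses, and the teacher picks the example that most concentrates the student's belief on the target. The two-hypothesis setup and likelihood table are illustrative assumptions.

```python
import numpy as np

def bayes_update(prior, likelihoods):
    """Student's belief update over hypotheses given one example."""
    post = prior * likelihoods
    return post / post.sum()

def pick_example(prior, likelihood_table, target):
    """likelihood_table: (n_examples, n_hypotheses), entry = P(example
    consistent | hypothesis). The teacher simulates the student's update
    for each candidate example and picks the one that maximizes the
    student's posterior on the target hypothesis."""
    gains = [bayes_update(prior, row)[target] for row in likelihood_table]
    return int(np.argmax(gains))

prior = np.array([0.5, 0.5])       # student undecided between 2 hypotheses
table = np.array([[1.0, 1.0],      # example 0: consistent with both (uninformative)
                  [0.2, 1.0]])     # example 1: mostly rules out hypothesis 0
best = pick_example(prior, table, target=1)
```

The adaptive part is that `prior` is itself inferred from the student's past responses, so the chosen example changes as the teacher's estimate of the student's knowledge changes.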
https://arxiv.org/abs/2405.04495
Aligning machine learning systems with human expectations is mostly attempted by training with manually vetted human behavioral samples, typically explicit feedback. This is done on a population level, since the context capturing the subjective Point-Of-View (POV) of a concrete person in a specific situational context is not retained in the data. However, we argue that alignment on an individual level can considerably boost the subjective predictive performance for the individual user interacting with the system. Since perception differs for each person, the same situation is observed differently. Consequently, the basis for decision making and the subsequent reasoning processes and observable reactions differ. We hypothesize that individual perception patterns can be used for improving the alignment on an individual level. We test this by integrating perception information into machine learning systems and measuring their predictive performance with respect to individual subjective assessments. For our empirical study, we collect a novel data set of multimodal stimuli and corresponding eye tracking sequences for the novel task of Perception-Guided Crossmodal Entailment and tackle it with our Perception-Guided Multimodal Transformer. Our findings suggest that exploiting individual perception signals for the machine learning of subjective human assessments provides a valuable cue for individual alignment. It not only improves the overall predictive performance from the point of view of the individual user but might also contribute to steering AI systems towards every person's individual expectations and values.
https://arxiv.org/abs/2405.04443
Artificial neural networks (ANNs) perform extraordinarily well on numerous tasks, including classification and prediction, e.g., speech processing and image classification. These new functions are based on a computational model that is free to select all necessary internal model parameters, as long as it eventually delivers the functionality it is supposed to exhibit. Here, we review the connection between the model parameter selection in machine learning (ML) algorithms running on ANNs and the epistemological theory of neopragmatism, focusing on the theory's utility and anti-representationalist aspects. To understand the consequences of the model parameter selection of an ANN, we suggest drawing on neopragmatist theories, whose implications are well studied. Incidentally, neopragmatism's notion of optimization is also based on utility considerations. This means that applying this approach elegantly reveals the inherent connections between optimization in ML, using a numerical method during the learning phase, and optimization in the ethical theory of consequentialism, where it occurs as a maxim of action. We suggest that these connections originate from the way relevance is calculated in ML systems. This could ultimately reveal a tendency for specific actions in ML systems.
https://arxiv.org/abs/2405.04386
We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic manipulation, which leverages the editability of Gaussian Splatting (GSplat) scene representations to enable multi-stage manipulation tasks. Splat-MOVER consists of: (i) ASK-Splat, a GSplat representation that distills latent codes for language semantics and grasp affordance into the 3D scene. ASK-Splat enables geometric, semantic, and affordance understanding of 3D scenes, which is critical for many robotics tasks; (ii) SEE-Splat, a real-time scene-editing module using 3D semantic masking and infilling to visualize the motions of objects that result from robot interactions in the real-world. SEE-Splat creates a "digital twin" of the evolving environment throughout the manipulation task; and (iii) Grasp-Splat, a grasp generation module that uses ASK-Splat and SEE-Splat to propose candidate grasps for open-world objects. ASK-Splat is trained in real-time from RGB images in a brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in real-time during operation. We demonstrate the superior performance of Splat-MOVER in hardware experiments on a Kinova robot compared to two recent baselines in four single-stage, open-vocabulary manipulation tasks, as well as in four multi-stage manipulation tasks using the edited scene to reflect scene changes due to prior manipulation stages, which is not possible with the existing baselines. Code for this project and a link to the project page will be made available soon.
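The semantic-masking-and-editing step of SEE-Splat can be sketched as follows: select the Gaussians whose distilled language features match a query embedding, then displace them to mirror an object's real-world motion. The two-dimensional toy embeddings and cosine threshold below are illustrative assumptions; the real system distills learned vision-language features into each Gaussian.

```python
import numpy as np

def move_matching_gaussians(centers, features, query, delta, thresh=0.9):
    """centers: (N, 3) Gaussian positions; features: (N, D) distilled
    language features. Gaussians whose feature matches the query (cosine
    similarity above thresh) form a 3D semantic mask and are displaced by
    delta -- a toy 'digital twin' update of the scene."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    mask = feats @ q > thresh
    edited = centers.copy()
    edited[mask] += delta
    return edited, mask

centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
features = np.array([[1.0, 0.0], [0.0, 1.0]])  # Gaussian 0 matches the query
edited, mask = move_matching_gaussians(
    centers, features,
    query=np.array([1.0, 0.0]),
    delta=np.array([0.0, 0.0, 0.5]))
```

After the move, the vacated region would need infilling (as SEE-Splat does) so later manipulation stages see a coherent edited scene.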
https://arxiv.org/abs/2405.04378
The community plays a crucial role in understanding user behavior and network characteristics in social networks. Some users can use multiple social networks at once for a variety of objectives. These users are called overlapping users who bridge different social networks. Detecting communities across multiple social networks is vital for interaction mining, information diffusion, and behavior migration analysis among networks. This paper presents a community detection method based on nonnegative matrix tri-factorization for multiple heterogeneous social networks, which formulates a common consensus matrix to represent the global fused community. Specifically, the proposed method involves creating adjacency matrices based on network structure and content similarity, followed by alignment matrices which distinguish overlapping users in different social networks. With the generated alignment matrices, the method could enhance the fusion degree of the global community by detecting overlapping user communities across networks. The effectiveness of the proposed method is evaluated with new metrics on Twitter, Instagram, and Tumblr datasets. The results of the experiments demonstrate its superior performance in terms of community quality and community fusion.
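The core factorization above can be sketched on a single network: nonnegative matrix tri-factorization approximates the adjacency matrix as A ≈ H S Hᵀ, where nonnegative H gives soft community memberships, fit with standard multiplicative updates. This is a single-network illustration only; the paper's method adds alignment matrices for overlapping users and a consensus matrix fusing multiple networks.

```python
import numpy as np

def nmtf_communities(A, k=2, iters=200, seed=0):
    """Symmetric nonnegative matrix tri-factorization A ~ H S H^T via
    multiplicative updates (Ding-style). Returns hard community labels
    from the row-wise argmax of the membership matrix H."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    H = rng.random((n, k)) + 0.1
    S = rng.random((k, k)) + 0.1
    for _ in range(iters):
        # Multiplicative updates keep H and S nonnegative throughout.
        H *= (A @ H @ S.T) / (H @ S @ H.T @ H @ S.T + 1e-9)
        S *= (H.T @ A @ H) / (H.T @ H @ S @ H.T @ H + 1e-9)
    return H.argmax(axis=1)

# Two 3-node cliques joined by a single bridging edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
labels = nmtf_communities(A)
```

In the multi-network setting, the alignment matrices constrain the H factors of different networks to agree on overlapping users, and the consensus matrix fuses the per-network memberships into one global community structure.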
https://arxiv.org/abs/2405.04371
Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously predict hand trajectories and object affordances on human egocentric videos. They are regarded as the representation of future hand-object interactions, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works basically overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. We transform the sequential 2D images into latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to enable Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction. The experimental results show that our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our proposed new evaluation protocol. This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D will be released at this https URL.
Understanding how humans behave during hand-object interaction is vital for applications in service-robot manipulation and extended reality. To this end, some recent works simultaneously predict hand trajectories and object affordances on human egocentric videos; these are regarded as representations of future hand-object interactions, indicating potential human motion and motivation. However, existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence and accumulates errors along the time axis. Meanwhile, these works largely overlook the effect of camera egomotion on first-person-view predictions. To address these limitations, we propose Diff-IP2D, a novel diffusion-based interaction prediction method that forecasts future hand trajectories and object affordances concurrently in an iterative, non-autoregressive manner. We transform the sequential 2D images into a latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to make Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction. Experimental results show that our method significantly outperforms the state-of-the-art baselines on both off-the-shelf metrics and our proposed new evaluation protocol, highlighting the efficacy of a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D will be released at the linked URL.
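The conditional denoising objective described above can be sketched in a few lines: noise a clean future-interaction latent to a random diffusion step, then train a denoiser to predict that noise given the noisy latent and a condition summarizing the past (including egomotion). The latent dimensions, schedule, and the stand-in linear denoiser below are all hypothetical, not Diff-IP2D's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, noise):
    """Forward diffusion: corrupt a clean latent feature to step t."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# Stand-in denoiser: a linear map over [noisy latent, past condition].
W = rng.normal(scale=0.1, size=(32, 16))

def predict_noise(x_t, cond):
    return np.concatenate([x_t, cond]) @ W

x0 = rng.normal(size=16)     # clean future interaction latent
cond = rng.normal(size=16)   # encoded past observations + camera egomotion
t = rng.integers(0, T)
noise = rng.normal(size=16)
x_t = q_sample(x0, t, noise)
loss = np.mean((predict_noise(x_t, cond) - noise) ** 2)   # denoising objective
```

At inference, sampling would start from pure noise and iteratively denoise under the same condition, which is what makes the prediction non-autoregressive over the future sequence.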
https://arxiv.org/abs/2405.04370
Recent developments in large language models (LLMs), while offering a powerful foundation for developing natural language agents, raise safety concerns about them and the autonomous agents built upon them. Deception is one potential capability of AI agents of particular concern, which we refer to as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception through straight-out lying, making objective selfish decisions, or giving false information, as seen in previous AI safety research. We target a specific category of deception achieved through obfuscation and equivocation. We broadly explain the two types of deception by analogizing them with the rabbit-out-of-hat magic trick, where (i) the rabbit either comes out of a hidden trap door or (ii) (our focus) the audience is completely distracted to see the magician bring out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework displays intrinsic deception capabilities of LLM agents in a goal-driven environment when directed to be deceptive in their natural language generations in a two-agent adversarial dialogue system built upon the legislative task of "lobbying" for a bill. Along the lines of a goal-driven environment, we show developing deceptive capacity through a reinforcement learning setup, building it around the theories of language philosophy and cognitive psychology. We find that the lobbyist agent increases its deceptive capabilities by ~ 40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception detection mechanism shows a detection capability of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans towards its programmed end-goal.
Recent developments in large language models (LLMs), while offering a powerful foundation for building natural-language agents, raise safety concerns about them and the autonomous agents built upon them. Deception is one capability of AI agents of particular concern: an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception as outright lying, making objectively selfish decisions, or giving false information, as seen in previous AI safety research, and instead target a specific category of deception achieved through obfuscation and equivocation. We explain these two types of deception by analogy with the rabbit-out-of-the-hat magic trick, where (i) the rabbit either comes out of a hidden trap door, or (ii) (our focus) the audience is completely distracted while the magician brings out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework exhibits the intrinsic deception capabilities of LLM agents in a goal-driven environment, when they are directed to be deceptive in their natural-language generations, within a two-agent adversarial dialogue system built on the legislative task of "lobbying" for a bill. In this goal-driven environment, we show how deceptive capacity develops through a reinforcement learning setup, building on theories from the philosophy of language and cognitive psychology. We find that the lobbyist agent increases its deceptive capability by about 40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception-detection mechanism achieves a detection rate of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans toward their programmed end-goals.
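The reinforcement dynamic described above, an agent drifting toward obfuscation when it pays off against a detector, can be caricatured with a three-armed softmax policy trained by REINFORCE. The actions, payoffs, and detection probabilities below are invented for illustration and bear no relation to the paper's measured numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["direct claim", "obfuscate", "equivocate"]
# Hypothetical payoffs: reward = chance the lobbying goal advances,
# minus a penalty whenever the deception detector fires.
goal_payoff = np.array([0.2, 0.8, 0.6])
detect_prob = np.array([0.0, 0.3, 0.2])
theta = np.zeros(3)                       # softmax policy logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):                     # REINFORCE over adversarial trials
    p = softmax(theta)
    a = rng.choice(3, p=p)
    detected = rng.random() < detect_prob[a]
    r = goal_payoff[a] - (1.0 if detected else 0.0)
    grad = -p
    grad[a] += 1.0                        # ∇ log π(a)
    theta += 0.1 * r * grad

p = softmax(theta)
```

Because "obfuscate" has the highest expected reward under these toy numbers, the policy shifts probability mass toward it over the trials, mirroring (in cartoon form) how adversarial reinforcement can grow deceptive behavior.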
https://arxiv.org/abs/2405.04325
Offline reinforcement learning (RL) provides a promising approach to avoid costly online interaction with the real environment. However, the performance of offline RL highly depends on the quality of the datasets, which may cause extrapolation error in the learning process. In many robotic applications, an inaccurate simulator is often available. However, the data directly collected from the inaccurate simulator cannot be directly used in offline RL due to the well-known exploration-exploitation dilemma and the dynamic gap between inaccurate simulation and the real environment. To address these issues, we propose a novel approach to combine the offline dataset and the inaccurate simulation data in a better manner. Specifically, we pre-train a generative adversarial network (GAN) model to fit the state distribution of the offline dataset. Given this, we collect data from the inaccurate simulator starting from the distribution provided by the generator and reweight the simulated data using the discriminator. Our experimental results in the D4RL benchmark and a real-world manipulation task confirm that our method can benefit more from both inaccurate simulator and limited offline datasets to achieve better performance than the state-of-the-art methods.
Offline reinforcement learning (RL) offers a promising way to avoid costly online interaction with the real environment. However, the performance of offline RL depends heavily on the quality of the dataset, which may cause extrapolation error during learning. In many robotic applications, an inaccurate simulator is often available; yet data collected directly from it cannot be used in offline RL, due to the well-known exploration-exploitation dilemma and the dynamics gap between the inaccurate simulation and the real environment. To address these issues, we propose a novel approach that better combines the offline dataset with the inaccurate simulation data. Specifically, we pre-train a generative adversarial network (GAN) model to fit the state distribution of the offline dataset. Given this, we collect data from the inaccurate simulator starting from the distribution provided by the generator, and reweight the simulated data using the discriminator. Our experimental results on the D4RL benchmark and a real-world manipulation task confirm that our method benefits more from both the inaccurate simulator and the limited offline dataset, achieving better performance than state-of-the-art methods.
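The discriminator-reweighting step can be sketched independently of the full GAN: train a classifier to separate offline states from simulated states, then weight each simulated sample by the density-ratio estimate D/(1−D), so simulated states that look unlike the offline data are down-weighted. The 1-D state distributions and the tiny logistic discriminator below are illustrative stand-ins, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D states: offline data centered at 0, simulator data shifted.
offline_states = rng.normal(loc=0.0, scale=1.0, size=500)
sim_states = rng.normal(loc=1.0, scale=1.0, size=500)

# Tiny logistic discriminator: D(s) ≈ P(s came from the offline dataset).
w, b = 0.0, 0.0
for _ in range(500):
    for s, y in [(offline_states, 1.0), (sim_states, 0.0)]:
        p = sigmoid(w * s + b)
        w -= 0.01 * np.mean((p - y) * s)   # full-batch gradient descent
        b -= 0.01 * np.mean(p - y)

# Importance weight for each simulated transition.
d = sigmoid(w * sim_states + b)
weights = d / (1.0 - d)
```

Since the simulated states are shifted to the right of the offline ones, the learned slope is negative and the weights shrink for states far from the offline distribution, which is the qualitative behavior the reweighting is meant to provide.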
https://arxiv.org/abs/2405.04307