Imitation learning is a promising paradigm for training robot control policies, but these policies can suffer from distribution shift, where the conditions at evaluation time differ from those in the training data. A popular approach for increasing policy robustness to distribution shift is interactive imitation learning (i.e., DAgger and variants), where a human operator provides corrective interventions during policy rollouts. However, collecting a sufficient amount of interventions to cover the distribution of policy mistakes can be burdensome for human operators. We propose IntervenGen (I-Gen), a novel data generation system that can autonomously produce a large set of corrective interventions with rich coverage of the state space from a small number of human interventions. We apply I-Gen to 4 simulated environments and 1 physical environment with object pose estimation error and show that it can increase policy robustness by up to 39x with only 10 human interventions. Videos and more results are available at this https URL.
https://arxiv.org/abs/2405.01472
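The core idea of multiplying a few human interventions into many synthetic ones can be illustrated with a toy 2-D sketch: record the human correction as end-effector waypoints relative to the object pose, then replay those relative waypoints under freshly sampled object poses. All names, frames, and noise scales below are invented for illustration and are not I-Gen's actual pipeline.

```python
# Hypothetical 2-D sketch of intervention data generation: one human
# correction, expressed relative to the object pose, is replayed under
# many sampled object poses to cover more of the state space.
import random

def record_object_frame_deltas(correction_waypoints, object_pose):
    """Express a world-frame correction as offsets from the object pose."""
    ox, oy = object_pose
    return [(x - ox, y - oy) for x, y in correction_waypoints]

def replay_in_new_scene(deltas, new_object_pose):
    """Synthesize a correction trajectory for a perturbed object pose."""
    ox, oy = new_object_pose
    return [(ox + dx, oy + dy) for dx, dy in deltas]

def generate_interventions(human_correction, object_pose, n, noise=0.05, seed=0):
    rng = random.Random(seed)
    deltas = record_object_frame_deltas(human_correction, object_pose)
    scenes = [(object_pose[0] + rng.uniform(-noise, noise),
               object_pose[1] + rng.uniform(-noise, noise)) for _ in range(n)]
    return [replay_in_new_scene(deltas, pose) for pose in scenes]

# One human correction that ends exactly at the object:
human = [(0.10, 0.00), (0.05, 0.02), (0.00, 0.00)]
synthetic = generate_interventions(human, object_pose=(0.0, 0.0), n=100)
```

Because the final waypoint of the recorded correction coincides with the object, every synthetic trajectory ends at its own sampled object pose, which is the sense in which replayed corrections stay valid across scene perturbations.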
Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WildChat is released at this https URL under AI2 ImpACT Licenses.
https://arxiv.org/abs/2405.01470
AI Foundation models are gaining traction in various applications, including medical fields like radiology. However, medical foundation models are often tested on limited tasks, leaving their generalisability and biases unexplored. We present RayDINO, a large visual encoder trained by self-supervision on 873k chest X-rays. We compare RayDINO to previous state-of-the-art models across nine radiology tasks, from classification and dense segmentation to text generation, and provide an in-depth analysis of population, age and sex biases of our model. Our findings suggest that self-supervision yields patient-centric AI that proves useful in clinical workflows and interprets X-rays holistically. With RayDINO and small task-specific adapters, we reach state-of-the-art results and improve generalization to unseen populations while mitigating bias, illustrating the true promise of foundation models: versatility and robustness.
https://arxiv.org/abs/2405.01469
Pre-trained contrastive vision-language models have demonstrated remarkable performance across a wide range of tasks. However, they often struggle on fine-grained datasets with categories not adequately represented during pre-training, which makes adaptation necessary. Recent works have shown promising results by utilizing samples from web-scale databases for retrieval-augmented adaptation, especially in low-data regimes. Despite the empirical success, understanding how retrieval impacts the adaptation of vision-language models remains an open research question. In this work, we adopt a reflective perspective by presenting a systematic study to understand the roles of key components in retrieval-augmented adaptation. We unveil new insights on uni-modal and cross-modal retrieval and highlight the critical role of logit ensemble for effective adaptation. We further present theoretical underpinnings that directly support our empirical observations.
https://arxiv.org/abs/2405.01468
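The logit-ensemble step highlighted in the abstract above can be sketched as a convex combination of zero-shot logits and logits derived from retrieved neighbors, applied before the final softmax. The weighting scheme and numbers below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_logits(zero_shot, retrieval, alpha=0.5):
    """Convex combination of zero-shot and retrieval-based class logits."""
    return [alpha * z + (1 - alpha) * r for z, r in zip(zero_shot, retrieval)]

# The zero-shot model is unsure between classes 0 and 1;
# retrieved neighbors strongly favor class 1 and break the tie.
zero_shot = [2.0, 2.1, 0.5]
retrieval = [0.2, 3.0, 0.1]
probs = softmax(ensemble_logits(zero_shot, retrieval))
pred = max(range(len(probs)), key=probs.__getitem__)
```

The point of the ensemble is visible in the toy numbers: neither source of logits is trusted alone, and `alpha` trades off pre-trained knowledge against retrieved evidence.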
Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, dedicated to stable attention, stable prediction, and maintaining a balance in the accuracy-robustness trade-off. We present a methodology for constructing a SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while maintaining high accuracy.
https://arxiv.org/abs/2405.01461
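The instability the paper measures, divergent outputs under synonym perturbations, can be sketched as a simple evaluation metric: average the drift between the pose generated for a prompt and the poses generated for its paraphrases. The pose representation, the distance, and the mock model below are simplified assumptions for illustration.

```python
import math

def pose_distance(pose_a, pose_b):
    """Mean joint-wise Euclidean distance between two poses."""
    return sum(math.dist(a, b) for a, b in zip(pose_a, pose_b)) / len(pose_a)

def perturbation_instability(model, prompt, paraphrases):
    """Average output drift between a prompt and its synonym variants."""
    ref = model(prompt)
    return sum(pose_distance(ref, model(p)) for p in paraphrases) / len(paraphrases)

# Mock text-to-motion "model": returns a 2-joint pose per prompt.
poses = {
    "a person walks forward":   [(0.0, 0.0), (1.0, 1.0)],
    "a person strolls forward": [(0.0, 0.1), (1.0, 1.1)],  # near-identical output
    "someone walks ahead":      [(2.0, 0.0), (3.0, 1.0)],  # unstable output
}
drift = perturbation_instability(poses.get, "a person walks forward",
                                 ["a person strolls forward", "someone walks ahead"])
```

A stable model in this sense would keep `drift` small across an entire synonym-perturbation set, which is the behavior SATO's stability modules are trained to enforce.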
Unlearnable examples (UEs) seek to maximize testing error by making subtle modifications to training examples that are correctly labeled. Defenses against these poisoning attacks can be categorized based on whether specific interventions are adopted during training. The first approach is training-time defense, such as adversarial training, which can mitigate poisoning effects but is computationally intensive. The other approach is pre-training purification, e.g., image short squeezing, which consists of several simple compressions but often encounters challenges in dealing with various UEs. Our work provides a novel disentanglement mechanism to build an efficient pre-training purification method. First, we show that rate-constrained variational autoencoders (VAEs) exhibit a clear tendency to suppress the perturbations in UEs, and we conduct a theoretical analysis of this phenomenon. Building upon these insights, we introduce a disentangled variational autoencoder (D-VAE), capable of disentangling the perturbations with learnable class-wise embeddings. Based on this network, a two-stage purification approach is naturally developed. The first stage focuses on roughly eliminating perturbations, while the second stage produces refined, poison-free results, ensuring effectiveness and robustness across various scenarios. Extensive experiments demonstrate the remarkable performance of our method across CIFAR-10, CIFAR-100, and a 100-class ImageNet-subset. Code is available at this https URL.
https://arxiv.org/abs/2405.01460
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results; XLM-RoBERTa-XL achieves an F1 score of 85.99 and an EM of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at this http URL.
https://arxiv.org/abs/2405.01458
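The EATS idea (Enclose to Anchor, Translate, Seek) can be sketched as below: wrap the answer span in marker tokens, translate the whole paragraph, then seek the markers in the output to recover the translated span and its new offset. The marker strings and the stub translator are illustrative assumptions; a real pipeline would call an MT model, which is exactly why the anchors are needed, since translation reorders text and invalidates the original character offsets.

```python
def eats_translate(context, answer_start, answer_end, translate):
    """Enclose the answer span in anchor markers, translate the whole
    paragraph, then seek the markers to recover the translated span."""
    OPEN, CLOSE = "[A]", "[/A]"
    marked = (context[:answer_start] + OPEN +
              context[answer_start:answer_end] + CLOSE +
              context[answer_end:])
    out = translate(marked)
    start = out.index(OPEN)
    end = out.index(CLOSE)
    span = out[start + len(OPEN):end]
    clean = out.replace(OPEN, "").replace(CLOSE, "")
    return clean, span, start  # span starts at `start` in the cleaned text

# Identity stand-in for an MT system: enough to show that the span and
# its offset survive the round trip through marking and seeking.
def fake_translate(text):
    return text

ctx = "Urdu is spoken by over 70 million people."
clean, span, start = eats_translate(ctx, 23, 33, fake_translate)
```

With a real translator the recovered `span` would be the Urdu rendering of the answer, located wherever the markers ended up in the translated paragraph.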
In this paper, we discuss approaches for integrating Computational Creativity (CC) with research in large language and vision models (LLVMs) to address a key limitation of these models, i.e., creative problem solving. We present preliminary experiments showing how CC principles can be applied to address this limitation through augmented prompting. With this work, we hope to foster discussions of Computational Creativity in the context of ML algorithms for creative problem solving in LLVMs. Our code is at: this https URL
https://arxiv.org/abs/2405.01453
Despite remarkable advancements, mainstream gaze estimation techniques, particularly appearance-based methods, often suffer from performance degradation in uncontrolled environments due to variations in illumination and individual facial attributes. Existing domain adaptation strategies, limited by their need for target domain samples, may fall short in real-world applications. This letter introduces Branch-out Auxiliary Regularization (BAR), an innovative method designed to boost gaze estimation's generalization capabilities without requiring direct access to target domain data. Specifically, BAR integrates two auxiliary consistency regularization branches: one that uses augmented samples to counteract environmental variations, and another that aligns gaze directions with positive source domain samples to encourage the learning of consistent gaze features. These auxiliary pathways strengthen the core network and are integrated in a smooth, plug-and-play manner, facilitating easy adaptation to various other models. Comprehensive experimental evaluations on four cross-dataset tasks demonstrate the superiority of our approach.
https://arxiv.org/abs/2405.01439
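The two auxiliary branches described above can be sketched as consistency terms added to the supervised gaze loss: one tying the prediction to an augmented view of the same face, one tying it to a positive source-domain sample. This is a hypothetical simplification on raw (yaw, pitch) pairs; the real method operates on deep features, and all numbers and weights below are invented.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def bar_style_loss(pred, pred_augmented, pred_positive_source,
                   target, w_aug=0.5, w_src=0.5):
    """Supervised gaze loss plus two auxiliary consistency terms:
    one against an augmented view (robustness to appearance changes),
    one against a positive source sample (consistent gaze features)."""
    task = mse(pred, target)
    aug_consistency = mse(pred, pred_augmented)
    src_consistency = mse(pred, pred_positive_source)
    return task + w_aug * aug_consistency + w_src * src_consistency

# Gaze as (yaw, pitch) in radians; values are made up.
loss = bar_style_loss(pred=(0.10, -0.05),
                      pred_augmented=(0.12, -0.04),
                      pred_positive_source=(0.09, -0.06),
                      target=(0.11, -0.05))
```

Because the auxiliary branches only add loss terms, they can be bolted onto an existing gaze network and dropped at inference time, which matches the plug-and-play framing in the abstract.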
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at this https URL.
https://arxiv.org/abs/2405.01434
This paper investigates the use of Large Language Models (LLMs) for automating the generation of hardware description code, aiming to explore their potential in supporting and enhancing the development of efficient neuromorphic computing architectures. Building on our prior work, we employ OpenAI's ChatGPT4 and natural language prompts to synthesize an RTL Verilog module of a programmable recurrent spiking neural network, while also generating test benches to assess the system's correctness. The resultant design was validated in three case studies: the exclusive OR, IRIS flower classification, and MNIST handwritten-digit classification, achieving accuracies of up to 96.6%. To verify its synthesizability and implementability, the design was prototyped on a field-programmable gate array and implemented on SkyWater 130 nm technology by using an open-source electronic design automation flow. Additionally, we have submitted it to the Tiny Tapeout 6 chip fabrication program to further evaluate the system's on-chip performance in the future.
https://arxiv.org/abs/2405.01419
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLM requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12 increase on GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800. We are the first to explore the efficient 3D-LLM, offering new insights to the community. Code and weights are available at this https URL.
https://arxiv.org/abs/2405.01413
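The "mixture of query experts" aggregation mentioned above can be sketched as gated averaging: a gate produces softmax weights over several expert query vectors, which are then summed into one aggregated query. This is a schematic guess at the mechanism; dimensions, gate inputs, and values are illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def mix_query_experts(expert_queries, gate_logits):
    """Aggregate per-expert query vectors with softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_queries[0])
    return [sum(w * q[i] for w, q in zip(weights, expert_queries))
            for i in range(dim)]

# Three 2-dimensional expert queries; a neutral gate averages them.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = mix_query_experts(experts, gate_logits=[0.0, 0.0, 0.0])
```

In a trained model the gate logits would be produced from the input features, so different point clouds lean on different experts while the aggregation itself stays this cheap.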
Transesophageal echocardiography (TEE) plays a pivotal role in cardiology for diagnostic and interventional procedures. However, using it effectively requires extensive training due to the intricate nature of image acquisition and interpretation. To enhance the efficiency of novice sonographers and reduce variability in scan acquisitions, we propose a novel ultrasound (US) navigation assistance method based on contrastive learning as goal-conditioned reinforcement learning (GCRL). We augment the previous framework using a novel contrastive patient batching method (CPB) and a data-augmented contrastive loss, both of which we demonstrate are essential to ensure generalization to anatomical variations across patients. The proposed framework enables navigation to both standard diagnostic as well as intricate interventional views with a single model. Our method was developed with a large dataset of 789 patients and obtained an average error of 6.56 mm in position and 9.36 degrees in angle on a testing dataset of 140 patients, which is competitive or superior to models trained on individual views. Furthermore, we quantitatively validate our method's ability to navigate to interventional views such as the Left Atrial Appendage (LAA) view used in LAA closure. Our approach holds promise in providing valuable guidance during transesophageal ultrasound examinations, contributing to the advancement of skill acquisition for cardiac ultrasound practitioners.
https://arxiv.org/abs/2405.01409
The design of dialogue flows is a critical but time-consuming task when developing task-oriented dialogue (TOD) systems. We propose an approach for the unsupervised discovery of flows from dialogue history, thus making the process applicable to any domain for which such a history is available. Briefly, utterances are represented in a vector space and clustered according to their semantic similarity. Clusters, which can be seen as dialogue states, are then used as the vertices of a transition graph for representing the flows visually. We present concrete examples of flows, discovered from MultiWOZ, a public TOD dataset. We further elaborate on their significance and relevance for the underlying conversations and introduce an automatic validation metric for their assessment. Experimental results demonstrate the potential of the proposed approach for extracting meaningful flows from task-oriented conversations.
https://arxiv.org/abs/2405.01403
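The pipeline above (embed utterances, cluster them, build a transition graph over the clusters) can be sketched end-to-end. The toy 2-D "embeddings" and the tiny k-means stand in for a real sentence encoder and clustering library; the structure of the sketch, not its scale, is the point.

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means with deterministic initialization (first k points)."""
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def transition_graph(dialogues_as_labels):
    """Count transitions between consecutive dialogue states (clusters)."""
    edges = {}
    for labels in dialogues_as_labels:
        for a, b in zip(labels, labels[1:]):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges

# Toy embeddings: greetings near (0,0), requests near (5,5), closings near (10,0).
utts = [(0.1, 0.0), (5.0, 5.1), (9.9, 0.2),   # dialogue 1
        (0.0, 0.2), (5.2, 4.9), (10.0, 0.0)]  # dialogue 2
labels = kmeans(utts, k=3)
dialogues = [labels[:3], labels[3:]]
graph = transition_graph(dialogues)
```

Both toy dialogues traverse the same three states, so the graph collapses them into one flow with edge counts of 2, which is the visual summary the approach aims for.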
Controlling contact forces during interactions is critical for locomotion and manipulation tasks. While sim-to-real reinforcement learning (RL) has succeeded in many contact-rich problems, current RL methods achieve forceful interactions implicitly without explicitly regulating forces. We propose a method for training RL policies for direct force control without requiring access to force sensing. We showcase our method on a whole-body control platform of a quadruped robot with an arm. Such force control enables us to perform gravity compensation and impedance control, unlocking compliant whole-body manipulation. The learned whole-body controller with variable compliance makes it intuitive for humans to teleoperate the robot by only commanding the manipulator, and the robot's body adjusts automatically to achieve the desired position and force. Consequently, a human teleoperator can easily demonstrate a wide variety of loco-manipulation tasks. To the best of our knowledge, we provide the first deployment of learned whole-body force control in legged manipulators, paving the way for more versatile and adaptable legged robots.
https://arxiv.org/abs/2405.01402
Natural language explanations have become a proxy for evaluating explainable and multi-step Natural Language Inference (NLI) models. However, assessing the validity of explanations for NLI is challenging as it typically involves the crowd-sourcing of apposite datasets, a process that is time-consuming and prone to logical errors. To address existing limitations, this paper investigates the verification and refinement of natural language explanations through the integration of Large Language Models (LLMs) and Theorem Provers (TPs). Specifically, we present a neuro-symbolic framework, named Explanation-Refiner, that augments a TP with LLMs to generate and formalise explanatory sentences and suggest potential inference strategies for NLI. In turn, the TP is employed to provide formal guarantees on the logical validity of the explanations and to generate feedback for subsequent improvements. We demonstrate how Explanation-Refiner can be jointly used to evaluate explanatory reasoning, autoformalisation, and error correction mechanisms of state-of-the-art LLMs as well as to automatically enhance the quality of human-annotated explanations of variable complexity in different domains.
https://arxiv.org/abs/2405.01379
Reduced articulatory precision is common in speech, but for dialog its acoustic properties and pragmatic functions have been little studied. We here try to remedy this gap. This technical report contains content that was omitted from the journal article (Ward et al. 2024, submitted). Specifically, we here report 1) lessons learned about annotating for perceived reduction, 2) the finding that, unlike in read speech, the correlates of reduction in dialog include high pitch, wide pitch range, and intensity, and 3) a baseline model for predicting reduction in dialog, using simple acoustic/prosodic features, that achieves correlations with human perceptions of 0.24 for English and 0.17 for Spanish. We also provide examples of additional possible pragmatic functions of reduction in English, along with various discussions, observations, and speculations.
https://arxiv.org/abs/2405.01376
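The evaluation above reports correlations between model predictions and human reduction judgments; the computation is a standard Pearson correlation, sketched below on invented per-utterance scores (the feature pipeline producing the predictions is not shown).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores: baseline-model predictions vs. human reduction ratings.
predicted = [0.2, 0.5, 0.1, 0.9, 0.4]
perceived = [0.1, 0.6, 0.2, 0.8, 0.5]
r = pearson(predicted, perceived)
```

A correlation of 0.24 (English) or 0.17 (Spanish), as reported, is far from the near-perfect agreement in this toy example, which underlines how hard perceived reduction is to predict from simple acoustic/prosodic features alone.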
Recent works in dataset distillation seek to minimize training expenses by generating a condensed synthetic dataset that encapsulates the information present in a larger real dataset. These approaches ultimately aim to attain test accuracy levels akin to those achieved by models trained on the entirety of the original dataset. Previous studies in feature and distribution matching have achieved significant results without incurring the costs of bi-level optimization in the distillation process. Despite their convincing efficiency, many of these methods suffer from marginal downstream performance improvements, limited distillation of contextual information, and subpar cross-architecture generalization. To address these challenges in dataset distillation, we propose the ATtentiOn Mixer (ATOM) module to efficiently distill large datasets using a mixture of channel and spatial-wise attention in the feature matching process. Spatial-wise attention helps guide the learning process based on consistent localization of classes in their respective images, allowing for distillation from a broader receptive field. Meanwhile, channel-wise attention captures the contextual information associated with the class itself, thus making the synthetic image more informative for training. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets, including CIFAR10/100 and TinyImagenet. Notably, our method significantly improves performance in scenarios with a low number of images per class, thereby enhancing its potential. Furthermore, the improvement carries over across architectures and to applications such as neural architecture search.
https://arxiv.org/abs/2405.01373
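The channel- and spatial-wise attention mixture can be sketched on a toy feature map: channel attention weights each channel by how strongly it fires overall (context of the class), spatial attention weights each position by how strongly all channels fire there (where the class is). This pure-Python stand-in for the tensor ops is illustrative; ATOM's exact attention form may differ.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def atom_style_attention(fmap):
    """fmap[c][i]: C channels over N spatial positions.
    Returns the feature map reweighted by channel and spatial attention."""
    C, N = len(fmap), len(fmap[0])
    # Channel attention from per-channel means.
    ch_att = softmax([sum(row) / N for row in fmap])
    # Spatial attention from per-position means across channels.
    sp_att = softmax([sum(fmap[c][i] for c in range(C)) / C for i in range(N)])
    return [[fmap[c][i] * ch_att[c] * sp_att[i] for i in range(N)]
            for c in range(C)]

fmap = [[1.0, 0.0, 0.0],   # channel 0 fires weakly at position 0
        [0.0, 0.0, 2.0]]   # channel 1 fires strongly at position 2
out = atom_style_attention(fmap)
```

The reweighted map concentrates mass where both attentions agree, which in feature matching steers the synthetic images toward the class-relevant channels and locations.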
Bilateral teleoperation of an aerial manipulator facilitates the execution of industrial missions thanks to the combination of the aerial platform's maneuverability and the ability to conduct complex tasks with human supervision. Heretofore, research on such operations has focused on flying without any physical interaction or exerting a pushing force on a contact surface that does not involve abrupt changes in the interaction force. In this paper, we propose a human reaction time compensating haptic-based bilateral teleoperation strategy for an aerial manipulator extracting a wedged object from a static structure (i.e., plug-pulling), which incurs an abrupt decrease in the interaction force and causes additional difficulty for an aerial platform. A haptic device composed of a 4-degree-of-freedom robotic arm and a gripper is made for the teleoperation of aerial wedged object-extracting tasks, and a haptic-based teleoperation method to execute the aerial manipulator by the haptic device is introduced. We detect the extraction of the object by the estimation of the external force exerted on the aerial manipulator and generate reference trajectories for both the aerial manipulator and the haptic device after the extraction. As an example of the extraction of a wedged object, we conduct comparative plug-pulling experiments with a quadrotor-based aerial manipulator. The results validate that the proposed bilateral teleoperation method reduces the overshoot in the aerial manipulator's position and ensures fast recovery to its initial position after extracting the wedged object.
https://arxiv.org/abs/2405.01361
Large-scale machines like particle accelerators are usually run by a team of experienced operators. In the case of a particle accelerator, these operators possess suitable background knowledge on both accelerator physics and the technology comprising the machine. Due to the complexity of the machine, particular subsystems of the machine are taken care of by experts, to whom the operators can turn. In this work the reasoning and action (ReAct) prompting paradigm is used to couple an open-weights large language model (LLM) with a high-level machine control system framework and other tools, e.g. the electronic logbook or machine design documentation. By doing so, a multi-expert retrieval augmented generation (RAG) system is implemented, which assists operators in knowledge retrieval tasks, interacts with the machine directly if needed, or writes high-level control system scripts. This consolidation of expert knowledge and machine interaction can simplify and speed up machine operation tasks for both new and experienced human operators.
https://arxiv.org/abs/2405.01359
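The ReAct-style coupling of an LLM with control-system tools can be sketched as a loop that alternates model steps with tool calls, feeding observations back into the transcript. The tool names, the scripted mock "LLM", and all strings below are placeholders, not the facility's actual API.

```python
def react_loop(model, tools, question, max_steps=5):
    """Alternate reasoning and acting: the model emits either
    ("act", tool_name, arg) or ("finish", answer); tool results
    are appended to the transcript as observations."""
    transcript = [("question", question)]
    for _ in range(max_steps):
        step = model(transcript)
        if step[0] == "finish":
            return step[1]
        _, tool_name, arg = step
        observation = tools[tool_name](arg)
        transcript.append(("observation", observation))
    return None

# Mock tools: a logbook lookup and a machine readback (placeholders).
tools = {
    "logbook": lambda q: "RF cavity tripped at 09:14",
    "read_pv": lambda name: 42.0,
}

# Scripted mock "LLM": consult the logbook, then read a value, then answer.
def mock_model(transcript):
    n_obs = sum(1 for kind, _ in transcript if kind == "observation")
    if n_obs == 0:
        return ("act", "logbook", "last RF fault")
    if n_obs == 1:
        return ("act", "read_pv", "RF:GRADIENT")
    return ("finish", "RF cavity tripped at 09:14; gradient now 42.0")

answer = react_loop(mock_model, tools, "What happened to the RF system?")
```

Swapping the mock model for an open-weights LLM and the lambdas for real logbook and control-system clients yields the multi-expert RAG assistant the abstract describes, with `max_steps` bounding how long the agent may act before giving up.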