Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and apply common-sense reasoning. We contribute a new dataset PhysiCleAR, which comprises both physical/property reasoning tasks and annotated tactile videos obtained using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCleAR show that Octopi is able to effectively use intermediate physical property predictions to improve physical reasoning in both trained tasks and for zero-shot reasoning. PhysiCleAR and Octopi are available on this https URL.
https://arxiv.org/abs/2405.02794
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models of up to 2.7B parameters, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (Code repository: https://anonymous.4open.science/r/DV4LLM-D761/ ). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
https://arxiv.org/abs/2405.02774
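One way to read the selection criterion above (choosing data that nudges the pre-training distribution toward the target distribution) is as a density-ratio score between two language models. The sketch below is an illustration under that assumption, not the authors' algorithm; `logp_target` and `logp_pretrain` are hypothetical helpers returning average token log-probabilities.

```python
from typing import Callable, List, Tuple

def select_pretraining_data(
    candidates: List[str],
    logp_target: Callable[[str], float],    # avg token log-prob under a small target-domain LM (hypothetical)
    logp_pretrain: Callable[[str], float],  # avg token log-prob under a model of the pre-training distribution
    budget: int,
) -> List[str]:
    """Rank open-data candidates by how much they pull the pre-training
    distribution toward the target distribution and keep the top `budget`.

    Density-ratio reading of "nudging": samples the target LM likes more than
    the pre-training LM score highest. A sketch, not the paper's objective.
    """
    scored: List[Tuple[float, str]] = [
        (logp_target(x) - logp_pretrain(x), x) for x in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:budget]]
```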
The task of Information Retrieval (IR) requires a system to identify relevant documents based on users' information needs. In real-world scenarios, retrievers are expected to not only rely on the semantic relevance between the documents and the queries but also recognize the nuanced intents or perspectives behind a user query. For example, when asked to verify a claim, a retrieval system is expected to identify evidence from both supporting and contradicting perspectives, so that the downstream system can make a fair judgment call. In this work, we study whether retrievers can recognize and respond to different perspectives of the queries -- beyond finding relevant documents for a claim, can retrievers distinguish supporting vs. opposing documents? We reform and extend six existing tasks to create a benchmark for retrieval, where we have diverse perspectives described in free-form text, in addition to root, neutral queries. We show that current retrievers covered in our experiments have limited awareness of subtly different perspectives in queries and can also be biased toward certain perspectives. Motivated by this observation, we further explore the potential to leverage geometric features of the retriever representation space to improve the perspective awareness of retrievers in a zero-shot manner. We demonstrate the efficiency and effectiveness of our projection-based methods on the same set of tasks. Further analysis also shows how perspective awareness improves performance on various downstream tasks, with 4.2% higher accuracy on AmbigQA and 29.9% more correlation with designated viewpoints on essay writing, compared to non-perspective-aware baselines.
https://arxiv.org/abs/2405.02714
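The projection-based methods are only named in the abstract above; one plausible zero-shot variant is to shift the query embedding along a direction that separates the free-form perspective from the neutral root query before scoring documents. The sketch below assumes that reading; the embeddings and the `alpha` step size are placeholders.

```python
import numpy as np

def perspective_adjusted_query(q_emb, perspective_emb, neutral_emb, alpha=0.5):
    """Move a query embedding along the perspective-minus-neutral direction and
    re-normalize. One plausible projection-style adjustment consistent with the
    abstract, not necessarily the authors' exact formulation."""
    direction = perspective_emb - neutral_emb
    norm = np.linalg.norm(direction)
    if norm > 0:
        direction = direction / norm
    adjusted = q_emb + alpha * direction
    return adjusted / np.linalg.norm(adjusted)

def rank_documents(adjusted_q, doc_embs):
    """Cosine-rank documents against the perspective-adjusted query."""
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return np.argsort(-(doc_embs @ adjusted_q))
```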
Vision-language foundation models like CLIP have shown impressive zero-shot generalization, but fine-tuning on downstream datasets can cause overfitting and a loss of generalization ability on unseen domains. Although collecting additional data from new domains of interest is possible, this method is often impractical due to the challenges in obtaining annotated data. To address this, we propose a plug-and-play feature augmentation method called LDFS (Language-Guided Diverse Feature Synthesis) to synthesize new domain features and improve existing CLIP fine-tuning strategies. LDFS has three main contributions: 1) To synthesize novel domain features and promote diversity, we propose an instance-conditional feature augmentation strategy based on a text-guided feature augmentation loss. 2) To maintain feature quality after augmentation, we introduce a pairwise regularizer to preserve augmented feature coherence within the CLIP feature space. 3) We propose to use stochastic text feature augmentation to reduce the modality gap and further facilitate the process of text-guided feature synthesis. Extensive experiments show the superiority of LDFS in improving CLIP's generalization ability on unseen domains without collecting data from those domains. The code will be made publicly available.
https://arxiv.org/abs/2405.02586
The development of large vision-language models, notably CLIP, has catalyzed research into effective adaptation techniques, with a particular focus on soft prompt tuning. Conjointly, test-time augmentation, which utilizes multiple augmented views of a single image to enhance zero-shot generalization, is emerging as a significant area of interest. This has predominantly directed research efforts toward test-time prompt tuning. In contrast, we introduce a robust MeanShift for Test-time Augmentation (MTA), which surpasses prompt-based methods without requiring this intensive training procedure. This positions MTA as an ideal solution for both standalone and API-based applications. Additionally, our method does not rely on ad hoc rules (e.g., confidence threshold) used in some previous test-time augmentation techniques to filter the augmented views. Instead, MTA incorporates a quality assessment variable for each view directly into its optimization process, termed the inlierness score. This score is jointly optimized with a density mode seeking process, leading to an efficient, training- and hyperparameter-free approach. We extensively benchmark our method on 15 datasets and demonstrate MTA's superiority and computational efficiency. Deployed easily as a plug-and-play module on top of zero-shot models and state-of-the-art few-shot methods, MTA shows systematic and consistent improvements.
https://arxiv.org/abs/2405.02266
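To make the density-mode-seeking idea concrete, here is a simplified MeanShift-style loop over the CLIP features of the augmented views, where a per-view kernel weight stands in for the paper's inlierness score. This is a hedged sketch of the mechanism, not the MTA objective itself; the bandwidth and iteration count are illustrative.

```python
import numpy as np

def mode_embedding(view_feats, n_iters=10, bandwidth=0.5):
    """MeanShift-style mode seeking over augmented-view features.
    view_feats: (n_views, dim) CLIP image features of the augmented views.
    Returns the mode embedding and per-view weights (a stand-in for the
    inlierness score described in the abstract)."""
    feats = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    mode = feats.mean(axis=0)                      # start from the plain average
    w = np.full(len(feats), 1.0 / len(feats))
    for _ in range(n_iters):
        d2 = np.sum((feats - mode) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # views near the mode count more
        w = w / w.sum()
        mode = (w[:, None] * feats).sum(axis=0)
        mode = mode / np.linalg.norm(mode)
    return mode, w

def zero_shot_predict(mode, text_feats):
    """Classify with the mode embedding against class text features."""
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return int(np.argmax(text_feats @ mode))
```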
Compact robotic platforms with powerful compute and actuation capabilities are key enablers for practical, real-world deployments of multi-agent research. This article introduces a tightly integrated hardware, control, and simulation software stack on a fleet of holonomic ground robot platforms designed with this motivation. Our robots, a fleet of customised DJI Robomaster S1 vehicles, offer a balance between small robots that do not possess sufficient compute or actuation capabilities and larger robots that are unsuitable for indoor multi-robot tests. They run a modular ROS2-based optimal estimation and control stack for full onboard autonomy, contain ad-hoc peer-to-peer communication infrastructure, and can zero-shot run multi-agent reinforcement learning (MARL) policies trained in our vectorized multi-agent simulation framework. We present an in-depth review of other platforms currently available, showcase new experimental validation of our system's capabilities, and introduce case studies that highlight the versatility and reliability of our system as a testbed for a wide range of research demonstrations. Our system as well as supplementary material is available online: this https URL
https://arxiv.org/abs/2405.02198
This paper introduces a novel framework for zero-shot learning (ZSL), i.e., to recognize new categories that are unseen during training, by using a multi-model and multi-alignment integration method. Specifically, we propose three strategies to enhance the model's performance to handle ZSL: 1) Utilizing the extensive knowledge of ChatGPT and the powerful image generation capabilities of DALL-E to create reference images that can precisely describe unseen categories and classification boundaries, thereby alleviating the information bottleneck issue; 2) Integrating the results of text-image alignment and image-image alignment from CLIP, along with the image-image alignment results from DINO, to achieve more accurate predictions; 3) Introducing an adaptive weighting mechanism based on confidence levels to aggregate the outcomes from different prediction methods. Experimental results on multiple datasets, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our model can significantly improve classification accuracy compared to single-model approaches, achieving AUROC scores above 96% across all test datasets, and notably surpassing 99% on the CIFAR-10 dataset.
https://arxiv.org/abs/2405.02155
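The abstract above only says the aggregation is confidence-weighted; a minimal sketch of such a fusion is given below, with the top-1/top-2 probability gap used as the confidence of each alignment branch (CLIP text-image, CLIP image-image against generated references, DINO image-image). The gap-based confidence and the temperature are assumptions, not the paper's exact mechanism.

```python
import numpy as np

def softmax(x, temp=1.0):
    z = (np.asarray(x, dtype=float) - np.max(x)) / temp
    e = np.exp(z)
    return e / e.sum()

def aggregate_predictions(branch_scores, temp=1.0):
    """Confidence-weighted fusion of per-class score vectors from different
    alignment branches. Confidence = top-1 minus top-2 probability of each
    branch (an illustrative choice). Returns the predicted class index."""
    fused = np.zeros(len(branch_scores[0]))
    for scores in branch_scores:
        p = softmax(scores, temp)
        top2 = np.sort(p)[-2:]
        confidence = top2[1] - top2[0]
        fused += confidence * p
    return int(np.argmax(fused))
```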
The diversity of knowledge encoded in large language models (LLMs) and their ability to apply this knowledge zero-shot in a range of settings makes them a promising candidate for use in decision-making. However, they are currently limited by their inability to reliably provide outputs which are explainable and contestable. In this paper, we attempt to reconcile these strengths and weaknesses by introducing a method for supplementing LLMs with argumentative reasoning. Concretely, we introduce argumentative LLMs, a method utilising LLMs to construct argumentation frameworks, which then serve as the basis for formal reasoning in decision-making. The interpretable nature of these argumentation frameworks and formal reasoning means that any decision made by the supplemented LLM may be naturally explained to, and contested by, humans. We demonstrate the effectiveness of argumentative LLMs experimentally in the decision-making task of claim verification. We obtain results that are competitive with, and in some cases surpass, comparable state-of-the-art techniques.
https://arxiv.org/abs/2405.02079
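As a concrete picture of what formal reasoning over an LLM-built argumentation framework can look like, here is a toy gradual-semantics evaluation in which a claim's strength is computed from the strengths of its supporting and attacking arguments (a DF-QuAD-style combination). The semantics, base scores, and example arguments are illustrative choices, not necessarily those used in the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Argument:
    text: str
    base_score: float = 0.5                    # e.g., an LLM's confidence in the argument
    supporters: List["Argument"] = field(default_factory=list)
    attackers: List["Argument"] = field(default_factory=list)

def aggregate(strengths):
    """Probabilistic-sum aggregation of child strengths."""
    acc = 1.0
    for s in strengths:
        acc *= 1.0 - s
    return 1.0 - acc

def strength(arg: Argument) -> float:
    """Dialectical strength from supporters and attackers (DF-QuAD-style)."""
    vs = aggregate([strength(a) for a in arg.supporters])
    va = aggregate([strength(a) for a in arg.attackers])
    if vs >= va:
        return arg.base_score + (1.0 - arg.base_score) * (vs - va)
    return arg.base_score - arg.base_score * (va - vs)

# Toy claim-verification framework with hypothetical scores:
claim = Argument(
    "The claim is true", 0.5,
    supporters=[Argument("Retrieved evidence A supports it", 0.7)],
    attackers=[Argument("Retrieved evidence B contradicts it", 0.4)],
)
print(round(strength(claim), 3))  # 0.65: support outweighs attack
```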
Large Language Models (LLMs) are deep learning models designed to generate text based on textual input. Although researchers have been developing these models for more complex tasks such as code generation and general reasoning, few efforts have explored how LLMs can be applied to combinatorial problems. In this research, we investigate the potential of LLMs to solve the Travelling Salesman Problem (TSP). Utilizing GPT-3.5 Turbo, we conducted experiments employing various approaches, including zero-shot in-context learning, few-shot in-context learning, and chain-of-thought (CoT) prompting. Subsequently, we fine-tuned GPT-3.5 Turbo to solve a specific problem size and tested it using a set of various instance sizes. The fine-tuned models demonstrated promising performance on problems identical in size to the training instances and generalized well to larger problems. Furthermore, to improve the performance of the fine-tuned model without incurring additional training costs, we adopted a self-ensemble approach to improve the quality of the solutions.
https://arxiv.org/abs/2405.01997
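The self-ensemble step lends itself to a simple worked example: sample several candidate tours from the fine-tuned model for the same instance, discard invalid permutations, and keep the shortest valid tour. The selection rule below is one plausible reading of "self-ensemble", not necessarily the paper's exact procedure.

```python
import math
from typing import Dict, List, Optional, Tuple

def tour_length(tour: List[int], coords: Dict[int, Tuple[float, float]]) -> float:
    """Length of a closed tour over 2D city coordinates."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def self_ensemble_tsp(sampled_tours: List[List[int]],
                      coords: Dict[int, Tuple[float, float]]) -> Optional[List[int]]:
    """Keep the shortest valid tour among several LLM samples for one instance."""
    cities = set(coords)
    valid = [t for t in sampled_tours if len(t) == len(cities) and set(t) == cities]
    return min(valid, key=lambda t: tour_length(t, coords)) if valid else None
```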
Time series data are ubiquitous across various domains, making time series analysis critically important. Traditional time series models are task-specific, featuring singular functionality and limited generalization capacity. Recently, large language foundation models have unveiled their remarkable capabilities for cross-task transferability, zero-shot/few-shot learning, and decision-making explainability. This success has sparked interest in exploring foundation models to solve multiple time series challenges simultaneously. There are two main research lines, namely pre-training foundation models from scratch for time series and adapting large language foundation models for time series. Both contribute to the development of a unified model that is highly generalizable, versatile, and comprehensible for time series analysis. This survey offers a 3E analytical framework for comprehensive examination of related research. Specifically, we examine existing works from three dimensions, namely Effectiveness, Efficiency, and Explainability. In each dimension, we focus on discussing how related works devise tailored solutions by considering unique challenges in the realm of time series. Furthermore, we provide a domain taxonomy to help readers keep up with domain-specific advancements. In addition, we introduce extensive resources to facilitate the field's development, including datasets and open-source time series libraries. A GitHub repository is also maintained for resource updates (this https URL).
https://arxiv.org/abs/2405.02358
The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias (a preference for text with lower perplexity), (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.
https://arxiv.org/abs/2405.01724
Detecting and segmenting moving objects from a moving monocular camera is challenging in the presence of unknown camera motion, diverse object motions and complex scene structures. Most existing methods rely on a single motion cue to perform motion segmentation, which is usually insufficient when facing different complex environments. While a few recent deep learning based methods are able to combine multiple motion cues to achieve improved accuracy, they depend heavily on vast datasets and extensive annotations, making them less adaptable to new scenarios. To address these limitations, we propose a novel monocular dense segmentation method that achieves state-of-the-art motion segmentation results in a zero-shot manner. The proposed method synergistically combines the strengths of deep learning and geometric model fusion methods by performing geometric model fusion on object proposals. Experiments show that our method achieves competitive results on several motion segmentation datasets and even surpasses some state-of-the-art supervised methods on certain benchmarks, while not being trained on any data. We also present an ablation study to show the effectiveness of combining different geometric models together for motion segmentation, highlighting the value of our geometric model fusion strategy.
https://arxiv.org/abs/2405.01723
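To illustrate what geometric model fusion on object proposals can mean in practice, the sketch below scores one proposal against two camera-motion models fitted to background correspondences: a fundamental matrix and a homography. A proposal whose tracked points fit neither model is likely moving independently. This is a hedged sketch of the general idea, not the paper's pipeline, and it assumes float32 (N, 2) point arrays with enough background matches for both estimates to succeed.

```python
import cv2
import numpy as np

def motion_score(bg_prev, bg_curr, obj_prev, obj_curr):
    """Fuse epipolar and homography residuals for one object proposal.
    Higher scores mean the proposal is less consistent with camera motion."""
    # Camera-motion models estimated from background correspondences.
    F, _ = cv2.findFundamentalMat(bg_prev, bg_curr, cv2.FM_RANSAC, 1.0, 0.99)
    H, _ = cv2.findHomography(bg_prev, bg_curr, cv2.RANSAC, 3.0)

    ones = np.ones((len(obj_prev), 1))
    p1 = np.hstack([obj_prev, ones])   # homogeneous proposal points, frame t
    p2 = np.hstack([obj_curr, ones])   # homogeneous proposal points, frame t+1

    # Epipolar residual: distance of each point to its epipolar line.
    lines = (F @ p1.T).T
    epi = np.abs(np.sum(lines * p2, axis=1)) / np.linalg.norm(lines[:, :2], axis=1)

    # Homography residual: reprojection error under the background homography.
    proj = (H @ p1.T).T
    homo = np.linalg.norm(proj[:, :2] / proj[:, 2:3] - obj_curr, axis=1)

    # Fusion: flag the object only if it violates *both* camera-motion models.
    return float(np.minimum(epi, homo).mean())
```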
Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs -- including ones trained on biomedical texts -- perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.
https://arxiv.org/abs/2405.01686
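Once the per-trial numbers are extracted, the synthesis itself is standard arithmetic. As a worked example for dichotomous outcomes, the sketch below computes each trial's log odds ratio and pools them with fixed-effect (inverse-variance) weighting; these are textbook formulas, not something taken from the paper, and the two trials at the end are hypothetical.

```python
import math
from typing import List, Tuple

def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Log odds ratio and its variance for one trial's 2x2 table
    (0.5 continuity correction to avoid division by zero)."""
    a, b = events_t + 0.5, (n_t - events_t) + 0.5
    c, d = events_c + 0.5, (n_c - events_c) + 0.5
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def fixed_effect_pool(tables: List[Tuple[int, int, int, int]]):
    """Inverse-variance pooled odds ratio with a 95% confidence interval."""
    lors, weights = [], []
    for events_t, n_t, events_c, n_c in tables:
        lor, var = log_odds_ratio(events_t, n_t, events_c, n_c)
        lors.append(lor)
        weights.append(1.0 / var)
    pooled = sum(w * l for w, l in zip(weights, lors)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return math.exp(pooled), (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))

# Two hypothetical trials: (events_treatment, n_treatment, events_control, n_control)
print(fixed_effect_pool([(12, 100, 20, 100), (8, 80, 15, 80)]))
```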
Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently-sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo fine-tuned on Reddit data as well as GPT-3.5 invoked in a zero-shot manner, against human-authored texts. We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide a dataset for creative, witty text generation based on Reddit Showerthoughts posts.
https://arxiv.org/abs/2405.01660
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web, including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed-loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables zero-shot robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes: this https URL
https://arxiv.org/abs/2405.01527
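The step from 2D track predictions to object rigid transforms is described only at a high level above. If the tracked points are lifted to 3D (for example with a depth estimate, a step assumed and not shown here), the per-step rigid transform can be recovered with the standard Kabsch/Procrustes solution sketched below.

```python
import numpy as np

def estimate_rigid_transform(points_src, points_dst):
    """Least-squares rigid transform (R, t) mapping one (N, 3) point set onto
    another: e.g., tracked object points at time t and their predicted
    positions at t+1. Standard Kabsch algorithm, used here as an illustration."""
    src_c = points_src.mean(axis=0)
    dst_c = points_dst.mean(axis=0)
    H = (points_src - src_c).T @ (points_dst - dst_c)   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                            # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t
```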
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at this https URL.
https://arxiv.org/abs/2405.01434
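Consistent Self-Attention is described above only at a high level; one common reading is that each image's self-attention additionally attends to tokens sampled from the other images in the batch, which ties subject appearance together across the story. The PyTorch sketch below follows that reading and is a simplified illustration (batch size of at least 2 assumed), not the released implementation.

```python
import torch

def consistent_self_attention(q, k, v, sample_ratio=0.5):
    """Self-attention where each image also attends to key/value tokens sampled
    from the other images in the batch. q, k, v: (batch, tokens, dim)."""
    b, n, d = k.shape
    n_sample = max(1, int(n * sample_ratio))
    extra_k, extra_v = [], []
    for i in range(b):
        others = [j for j in range(b) if j != i]         # pool tokens from the other images
        pool_k = k[others].reshape(-1, d)
        pool_v = v[others].reshape(-1, d)
        idx = torch.randperm(pool_k.shape[0])[:n_sample]
        extra_k.append(pool_k[idx])
        extra_v.append(pool_v[idx])
    k_aug = torch.cat([k, torch.stack(extra_k)], dim=1)  # (b, n + n_sample, d)
    v_aug = torch.cat([v, torch.stack(extra_v)], dim=1)
    attn = torch.softmax(q @ k_aug.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_aug
```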
Temporally localizing the presence of object states in videos is crucial in understanding human activities beyond actions and objects. This task has suffered from a lack of training data due to object states' inherent ambiguity and variety. To avoid exhaustive annotation, learning from transcribed narrations in instructional videos would be intriguing. However, object states are less described in narrations compared to actions, making them less effective. In this work, we propose to extract the object state information from action information included in narrations, using large language models (LLMs). Our observation is that LLMs include world knowledge on the relationship between actions and their resulting object states, and can infer the presence of object states from past action sequences. The proposed LLM-based framework offers flexibility to generate plausible pseudo-object state labels against arbitrary categories. We evaluate our method with our newly collected Multiple Object States Transition (MOST) dataset including dense temporal annotation of 60 object state categories. Our model trained by the generated pseudo-labels demonstrates significant improvement of over 29% in mAP against strong zero-shot vision-language models, showing the effectiveness of explicitly extracting object state information from actions through LLMs.
https://arxiv.org/abs/2405.01090
Measuring public attitudes toward wildlife provides crucial insights into our relationship with nature and helps monitor progress toward Global Biodiversity Framework targets. Yet, conducting such assessments at a global scale is challenging. Manually curating search terms for querying news and social media is tedious, costly, and can lead to biased results. Raw news and social media data returned from queries are often cluttered with irrelevant content and syndicated articles. We aim to overcome these challenges by leveraging modern Natural Language Processing (NLP) tools. We introduce a folk taxonomy approach for improved search term generation and employ cosine similarity on Term Frequency-Inverse Document Frequency vectors to filter syndicated articles. We also introduce an extensible relevance filtering pipeline which uses unsupervised learning to reveal common topics, followed by an open-source zero-shot Large Language Model (LLM) to assign topics to news article titles, which are then used to assign relevance. Finally, we conduct sentiment, topic, and volume analyses on resulting data. We illustrate our methodology with a case study of news and X (formerly Twitter) data before and during the COVID-19 pandemic for various mammal taxa, including bats, pangolins, elephants, and gorillas. During the data collection period, up to 62% of articles including keywords pertaining to bats were deemed irrelevant to biodiversity, underscoring the importance of relevance filtering. At the pandemic's onset, we observed increased volume and a significant sentiment shift toward horseshoe bats, which were implicated in the pandemic, but not for other focal taxa. The proposed methods open the door to conservation practitioners applying modern and emerging NLP tools, including LLMs "out of the box," to analyze public perceptions of biodiversity during current events or campaigns.
https://arxiv.org/abs/2405.01610
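The syndicated-article filter maps directly onto standard tooling: vectorize articles with TF-IDF and drop any article whose cosine similarity to an already-kept article exceeds a threshold. The sketch below uses scikit-learn; the 0.9 threshold is an illustrative choice, not the study's setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def drop_syndicated(articles, threshold=0.9):
    """Remove near-duplicate (syndicated) articles, keeping the first
    occurrence in each group of near-duplicates."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(articles)
    sims = cosine_similarity(tfidf)
    kept_texts, kept_idx = [], []
    for i, text in enumerate(articles):
        if all(sims[i, j] < threshold for j in kept_idx):
            kept_texts.append(text)
            kept_idx.append(i)
    return kept_texts
```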
Learning to solve vehicle routing problems (VRPs) has garnered much attention. However, most neural solvers are only structured and trained independently on a specific problem, making them less generic and practical. In this paper, we aim to develop a unified neural solver that can cope with a range of VRP variants simultaneously. Specifically, we propose a multi-task vehicle routing solver with mixture-of-experts (MVMoE), which greatly enhances the model capacity without a proportional increase in computation. We further develop a hierarchical gating mechanism for the MVMoE, delivering a good trade-off between empirical performance and computational complexity. Experimentally, our method significantly promotes zero-shot generalization performance on 10 unseen VRP variants, and showcases decent results in the few-shot setting and on real-world benchmark instances. We further provide extensive studies on the effect of MoE configurations in solving VRPs. Surprisingly, the hierarchical gating can achieve much better out-of-distribution generalization performance. The source code is available at: this https URL.
https://arxiv.org/abs/2405.01029
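A generic picture of the hierarchical gating idea: a coarse gate chooses among expert groups and a fine gate mixes the experts inside each group, so the joint weight is the product of the two. The PyTorch module below is a dense sketch of that structure (the released MVMoE code uses its own routing and sparsity choices); dimensions and group counts are illustrative.

```python
import torch
import torch.nn as nn

class HierarchicalMoE(nn.Module):
    """Feed-forward mixture-of-experts with a two-level (hierarchical) gate."""
    def __init__(self, dim, n_groups=2, experts_per_group=2, hidden=256):
        super().__init__()
        self.n_groups, self.experts_per_group = n_groups, experts_per_group
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(n_groups * experts_per_group)
        ])
        self.group_gate = nn.Linear(dim, n_groups)                        # coarse gate
        self.expert_gate = nn.Linear(dim, n_groups * experts_per_group)   # fine gate

    def forward(self, x):                                 # x: (batch, dim)
        g = torch.softmax(self.group_gate(x), dim=-1)     # (batch, groups)
        e = torch.softmax(
            self.expert_gate(x).view(-1, self.n_groups, self.experts_per_group), dim=-1
        )                                                 # (batch, groups, experts/group)
        weights = (g.unsqueeze(-1) * e).view(x.shape[0], -1)              # joint weights
        outs = torch.stack([expert(x) for expert in self.experts], dim=1)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)
```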
This report presents the ECO (Ensembled Clip score and cOnsensus score) pipeline from team DSBA LAB, which is a new framework used to evaluate and rank captions for a given image. ECO selects the most accurate caption describing the image. It is made possible by combining an Ensembled CLIP score, which considers the semantic alignment between the image and captions, with a Consensus score that accounts for the essentialness of the captions. Using this framework, we achieved notable success in the CVPR 2024 Workshop Challenge on Caption Re-ranking Evaluation at the New Frontiers for Zero-Shot Image Captioning Evaluation (NICE). Specifically, we secured third place based on the CIDEr metric, second in both the SPICE and METEOR metrics, and first in the ROUGE-L and all BLEU score metrics. The code and configuration for the ECO framework are available at this https URL (DSBA-Lab/ECO).
https://arxiv.org/abs/2405.01028
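A minimal sketch of the combination described above: re-rank candidate captions by mixing an (ensembled) CLIP image-text score with a consensus score, taken here as each caption's mean similarity to the other candidates. The weighting knob `alpha` and the consensus definition are assumptions; the abstract does not spell out the exact formula.

```python
import numpy as np

def eco_rank(image_text_scores, caption_embs, alpha=0.5):
    """Re-rank candidate captions for one image.
    image_text_scores: (n,) CLIP scores of each caption against the image.
    caption_embs:      (n, dim) text embeddings of the candidate captions."""
    embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = embs @ embs.T
    np.fill_diagonal(sims, 0.0)
    consensus = sims.sum(axis=1) / (len(embs) - 1)        # mean similarity to the others
    combined = alpha * np.asarray(image_text_scores) + (1 - alpha) * consensus
    return np.argsort(-combined)                           # indices, best caption first
```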