Recent advancements in instruction-following models have made user interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts due to limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning, allowing users to easily arrange visual elements into tailored layouts by specifying canvas size and design purpose, such as for book covers, posters, brochures, or menus. We developed three layout reasoning tasks to train the model in understanding and executing layout instructions. Experiments on two benchmarks show that our method not only simplifies the design process for non-professionals but also surpasses few-shot GPT-4V models, with a 12% higher mIoU on Crello. This progress highlights the potential of multimodal instruction-following models to automate and simplify the design process, providing an approachable solution for a wide range of design tasks on visually-rich documents.
https://arxiv.org/abs/2404.15271
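Purely as an illustration of the interaction described above, and not the paper's actual interface, a layout instruction and its planned output might look like the sketch below; every field name is hypothetical.

```python
# Hypothetical instruction/output pair for multimodal layout planning;
# all field names are illustrative, not the paper's actual schema.
instruction = {
    "canvas": {"width": 1024, "height": 1448},  # user-specified canvas size
    "purpose": "book cover",                    # user-specified design purpose
    "elements": ["background image", "title text", "author name"],
}

# The planner would return one bounding box per visual element:
predicted_layout = [
    {"element": "background image", "box": [0, 0, 1024, 1448]},
    {"element": "title text",       "box": [96, 160, 928, 360]},
    {"element": "author name",      "box": [96, 1280, 928, 1360]},
]
```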
Natural language generation tools are powerful and effective for generating content. However, language models are known to display bias and fairness issues, making them impractical to deploy for many use cases. Here we focus on how fairness issues impact automatically generated test content, which can have stringent requirements to ensure the test measures only what it was intended to measure. Specifically, we identify test content that is focused on particular domains and experiences that only reflect a certain demographic, or that is potentially emotionally upsetting; both could inadvertently impact a test-taker's score. This kind of content does not reflect typical biases out of context, making it challenging even for modern models that contain safeguards. We build a dataset of 621 generated texts annotated for fairness and explore a variety of methods for classification: fine-tuning, topic-based classification, and prompting, including few-shot and self-correcting prompts. We find that combining prompt self-correction and few-shot learning performs best, yielding an F1 score of .791 on our held-out test set, while much smaller BERT- and topic-based models have competitive performance on out-of-domain data.
https://arxiv.org/abs/2404.15104
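A minimal sketch of the best-performing combination (few-shot examples plus a self-correcting pass); the `llm()` completion function and all prompt wording are assumptions, not the paper's actual prompts.

```python
# Sketch of few-shot prompting with a self-correction pass; llm(prompt) -> str
# is a hypothetical completion function, and the prompts are illustrative.
def classify_fairness(item: str, few_shot_block: str, llm) -> str:
    first = llm(
        f"{few_shot_block}\n"
        f"Text: {item}\n"
        "Is this test content fair for all test-takers? Answer Fair or Unfair:"
    )
    # Self-correction: the model re-examines its own answer, checking for
    # demographic-specific experiences or emotionally upsetting material.
    return llm(
        f"Text: {item}\nInitial answer: {first}\n"
        "Re-check for content tied to one demographic's experience or content "
        "that could be emotionally upsetting, then give a final answer:"
    )
```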
Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecular discovery. With the advent of deep learning, Graph Neural Networks (GNNs) have emerged as a cornerstone in Graph Machine Learning (Graph ML), facilitating the representation and processing of graph structures. Recently, Large Language Models (LLMs) have demonstrated unprecedented capabilities in language tasks and are widely adopted in a variety of applications such as computer vision and recommender systems. This remarkable success has also attracted interest in applying LLMs to the graph domain. Increasing efforts have been made to explore the potential of LLMs in advancing Graph ML's generalization, transferability, and few-shot learning ability. Meanwhile, graphs, especially knowledge graphs, are rich in reliable factual knowledge, which can be utilized to enhance the reasoning capabilities of LLMs and potentially alleviate their limitations such as hallucinations and the lack of explainability. Given the rapid progress of this research direction, a systematic review summarizing the latest advancements for Graph ML in the era of LLMs is necessary to provide an in-depth understanding to researchers and practitioners. Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterogeneity and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.
https://arxiv.org/abs/2404.14928
In multi-sample keyword spotting, each keyword class is represented by multiple spoken instances, called samples. A naïve approach to detect keywords in a target sequence consists of querying all samples of all classes using sub-sequence dynamic time warping. However, the resulting processing time increases linearly with respect to the number of samples belonging to each class. Alternatively, only a single Fréchet mean can be queried for each class, resulting in reduced processing time but usually also in worse detection performance as the variability of the query samples is not captured sufficiently well. In this work, multi-sample dynamic time warping is proposed to compute class-specific cost-tensors that include the variability of all query samples. To significantly reduce the computational complexity during inference, these cost tensors are converted to cost matrices before applying dynamic time warping. In experimental evaluations for few-shot keyword spotting, it is shown that this method yields a very similar performance as using all individual query samples as templates while having a runtime that is only slightly slower than when using Fréchet means.
https://arxiv.org/abs/2404.14903
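A minimal sketch of the cost-tensor idea, assuming the query samples of a class share a common length (e.g., after resampling) and using a min-reduction over the sample axis; both choices are our assumptions, not necessarily the paper's exact aggregation.

```python
# Sketch: build a class-specific cost tensor over all query samples, reduce it
# to a cost matrix, then run subsequence DTW once per class.
import numpy as np

def class_cost_matrix(samples, target):
    """samples: list of (T, d) arrays for one class; target: (N, d) array."""
    # Cost tensor: one frame-wise distance matrix per query sample ...
    tensor = np.stack([
        np.linalg.norm(s[:, None, :] - target[None, :, :], axis=-1)
        for s in samples
    ])                                   # shape (S, T, N)
    # ... reduced over the sample axis into a single cost matrix, so DTW
    # runs once per class instead of once per sample.
    return tensor.min(axis=0)            # shape (T, N)

def subsequence_dtw_cost(C):
    """Standard subsequence DTW: free start/end on the target axis."""
    T, N = C.shape
    D = np.full((T, N), np.inf)
    D[0] = C[0]                          # free starting point in the target
    for i in range(1, T):
        D[i, 0] = D[i - 1, 0] + C[i, 0]
        for j in range(1, N):
            D[i, j] = C[i, j] + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    return D[-1].min()                   # free end point in the target
```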
Testing and evaluating the safety performance of autonomous vehicles (AVs) is essential before large-scale deployment. Practically, the number of testing scenarios permissible for a specific AV is severely limited by tight constraints on testing budgets and time. Under such strict limits on the number of tests, existing testing methods often lead to significant uncertainty or difficulty in quantifying evaluation results. In this paper, we formulate this problem for the first time as the "few-shot testing" (FST) problem and propose a systematic framework to address this challenge. To alleviate the considerable uncertainty inherent in a small testing scenario set, we frame the FST problem as an optimization problem and search for the testing scenario set based on neighborhood coverage and similarity. Specifically, guided by the better generalization ability of the testing scenario set on AVs, we dynamically adjust this set and the contribution of each testing scenario to the evaluation result based on coverage, leveraging the prior information of surrogate models (SMs). Under certain hypotheses on SMs, a theoretical upper bound on the evaluation error is established to verify the sufficiency of evaluation accuracy within the given limited number of tests. Experimental results on cut-in scenarios demonstrate a notable reduction in evaluation error and variance of our method compared to conventional testing methods, especially when the number of scenarios is strictly limited.
https://arxiv.org/abs/2402.01795
This paper introduces Q-tuning, a novel approach for continual prompt tuning that enables the lifelong learning of a pre-trained language model. When learning a new task, Q-tuning trains a task-specific prompt by adding it to a prompt queue consisting of the prompts from older tasks. To better transfer the knowledge of old tasks, we design an adaptive knowledge aggregation technique that reweighs previous prompts in the queue with a learnable low-rank matrix. Once the prompt queue reaches its maximum capacity, we leverage a PCA-based eviction rule to reduce the queue's size, allowing the newly trained prompt to be added while preserving the primary knowledge of old tasks. In order to mitigate the accumulation of information loss caused by the eviction, we additionally propose a globally shared prefix prompt and a memory retention regularization based on information theory. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods substantially on continual prompt tuning benchmarks. Moreover, our approach enables lifelong learning on linearly growing task sequences while requiring constant complexity for training and inference.
https://arxiv.org/abs/2404.14607
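One plausible reading of the PCA-based eviction rule, sketched below with each prompt flattened to a vector; the keep-ratio and the use of principal components as synthetic "summary prompts" are our assumptions, not the paper's exact design.

```python
# Sketch of PCA-based queue eviction: compress a full prompt queue into
# fewer vectors spanning its principal directions.
import numpy as np

def evict_with_pca(queue, keep):
    """queue: (Q, d) array of Q prompt vectors; returns a (keep, d) array."""
    mean = queue.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(queue - mean, full_matrices=False)
    # Keep the top principal components as synthetic "summary prompts",
    # preserving most of the old tasks' variance in fewer queue slots.
    return mean + s[:keep, None] * vt[:keep]

queue = np.random.randn(10, 64)          # 10 prompts of dimension 64
queue = evict_with_pca(queue, keep=5)    # compress before adding a new prompt
```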
Medical errors in clinical text pose significant risks to patient safety. The MEDIQA-CORR 2024 shared task focuses on detecting and correcting these errors across three subtasks: identifying the presence of an error, extracting the erroneous sentence, and generating a corrected sentence. In this paper, we present our approach that achieved top performance in all three subtasks. For the MS dataset, which contains subtle errors, we developed a retrieval-based system leveraging external medical question-answering datasets. For the UW dataset, reflecting more realistic clinical notes, we created a pipeline of modules to detect, localize, and correct errors. Both approaches utilized the DSPy framework for optimizing prompts and few-shot examples in large language model (LLM) based programs. Our results demonstrate the effectiveness of LLM based programs for medical error correction. However, our approach has limitations in addressing the full diversity of potential errors in medical documentation. We discuss the implications of our work and highlight future research directions to advance the robustness and applicability of medical error detection and correction systems.
https://arxiv.org/abs/2404.14544
In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation, enabling us to conveniently and efficiently compose and render them together. In particular, we address scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge encountered in the real world. To tackle this challenge, we introduce a novel approach to optimize the 3D-GS representation in a canonical space by fusing the sparse cues in the common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping consistency with the observed 2D appearances. We demonstrate that our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot observations, and extremely sparse observations. After reconstruction, our method is capable of not only rendering the scene in any novel views at arbitrary time instances, but also editing the 3D scene by removing individual humans or applying different motions to each human. Through various experiments, we demonstrate the quality and efficiency of our method over existing alternative approaches.
https://arxiv.org/abs/2404.14410
Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: this https URL.
https://arxiv.org/abs/2404.14361
Stance detection has been widely studied as the task of determining if a social media post is positive, negative or neutral towards a specific issue, such as support towards vaccines. Research in stance detection has however often been limited to a single language and, where more than one language has been studied, research has focused on few-shot settings, overlooking the challenges of developing a zero-shot cross-lingual stance detection model. This paper makes the first such effort by introducing a novel approach to zero-shot cross-lingual stance detection, Multilingual Translation-Augmented BERT (MTAB), aiming to enhance the performance of a cross-lingual classifier in the absence of explicit training data for target languages. Our technique employs translation augmentation to improve zero-shot performance and pairs it with adversarial learning to further boost model efficacy. Through experiments on datasets labeled for stance towards vaccines in four languages (English, German, French, and Italian), we demonstrate the effectiveness of our proposed approach, showcasing improved results in comparison to a strong baseline model as well as ablated versions of our model. Our experiments demonstrate the contribution of the model's components, not least the translation-augmented data and the adversarial learning component, to its improved performance. We have made our source code accessible on GitHub.
https://arxiv.org/abs/2404.14339
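The adversarial component of such setups is commonly implemented with a gradient reversal layer; below is a standard PyTorch version as a sketch, not necessarily the authors' exact code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the
    backward pass, so the feature encoder learns language-invariant
    representations while a language discriminator trains against it."""
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: features = encoder(batch)
#               lang_logits = discriminator(grad_reverse(features))
```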
While multi-modal large language models (MLLMs) have shown significant progress on many popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks only considered a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 by 3 matrices). To evaluate MLLMs' reasoning abilities comprehensively, we introduce MARVEL, a multidimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether the model accuracy is grounded in perception and reasoning, MARVEL complements the general AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with nine representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all models show near-random performance on the AVR question, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance) and even count the panels in the puzzle (<45%), hindering their ability for abstract reasoning. We release our entire code and dataset.
https://arxiv.org/abs/2404.13591
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments. One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning with labeled data. To this end, we introduce the Socratic Planner, the first zero-shot planning method that infers without the need for any training data. The Socratic Planner first decomposes the instructions into substructural information of the task through self-questioning and answering, translating it into a high-level plan, i.e., a sequence of subgoals. Subgoals are executed sequentially, with our visually grounded re-planning mechanism adjusting plans dynamically through dense visual feedback. We also introduce an evaluation metric for high-level plans, RelaxedHLP, for a more comprehensive evaluation. Experiments demonstrate the effectiveness of the Socratic Planner, achieving competitive performance on both zero-shot and few-shot task planning in the ALFRED benchmark, particularly excelling in tasks requiring higher-dimensional inference. Additionally, precise adjustments to the plan were achieved by incorporating environmental visual information.
https://arxiv.org/abs/2404.15190
The conversion of natural language queries into SQL queries, known as Text-to-SQL, is a critical yet challenging task. This paper introduces EPI-SQL, a novel methodological framework leveraging Large Language Models (LLMs) to enhance the performance of Text-to-SQL tasks. EPI-SQL operates through a four-step process. Initially, the method involves gathering instances from the Spider dataset on which LLMs are prone to failure. These instances are then utilized to generate general error-prevention instructions (EPIs). Subsequently, LLMs craft contextualized EPIs tailored to the specific context of the current task. Finally, these context-specific EPIs are incorporated into the prompt used for SQL generation. EPI-SQL is distinguished in that it provides task-specific guidance, enabling the model to circumvent potential errors for the task at hand. Notably, the methodology rivals the performance of advanced few-shot methods despite being a zero-shot approach. An empirical assessment using the Spider benchmark reveals that EPI-SQL achieves an execution accuracy of 85.1%, underscoring its effectiveness in generating accurate SQL queries through LLMs. The findings indicate a promising direction for future research, i.e., enhancing instructions with task-specific and contextualized rules, for boosting LLMs' performance in NLP tasks.
https://arxiv.org/abs/2404.14453
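A sketch of the four EPI-SQL steps as described above; the `llm()` completion function and all prompt wording are assumptions, not the paper's actual prompts.

```python
# Sketch of the EPI-SQL pipeline; llm(prompt) -> str is hypothetical, and
# step 1 (gathering failure_cases from Spider) happens offline.
def epi_sql(question, schema, failure_cases, llm):
    # Step 2: distill general error-prevention instructions from failures.
    general_epis = llm(
        "These Text-to-SQL cases failed:\n" + "\n".join(failure_cases) +
        "\nWrite general instructions that would prevent such errors:"
    )
    # Step 3: contextualize the EPIs for the current question and schema.
    contextual_epis = llm(
        f"Schema: {schema}\nQuestion: {question}\n"
        f"General instructions: {general_epis}\n"
        "Rewrite these as instructions specific to this question:"
    )
    # Step 4: prepend the context-specific EPIs to the generation prompt.
    return llm(
        f"{contextual_epis}\nSchema: {schema}\n"
        f"Question: {question}\nSQL:"
    )
```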
This document outlines the Text-dependent Speaker Verification (TdSV) Challenge 2024, which centers on analyzing and exploring novel approaches for text-dependent speaker verification. The primary goal of this challenge is to motivate participants to develop single yet competitive systems, conduct thorough analyses, and explore innovative concepts such as multi-task learning, self-supervised learning, few-shot learning, and others for text-dependent speaker verification.
https://arxiv.org/abs/2404.13428
Evaluating the in-context learning classification performance of language models poses challenges due to small dataset sizes, extensive prompt-selection using the validation set, and intentionally difficult tasks that lead to near-random performance. The standard random baseline -- the expected accuracy of guessing labels uniformly at random -- is stable when the evaluation set is used only once or when the dataset is large. We account for the common practice of validation set reuse and existing small datasets with a stronger random baseline: the expected maximum accuracy across multiple random classifiers. When choosing the best prompt demonstrations across six quantized language models applied to 16 BIG-bench Lite tasks, more than 20% of the few-shot results that exceed the standard baseline do not exceed this stronger random baseline. When held-out test sets are available, this stronger baseline is also a better predictor of held-out performance than the standard baseline, avoiding unnecessary test set evaluations. This maximum random baseline provides an easily calculated drop-in replacement for the standard baseline.
https://arxiv.org/abs/2404.13020
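The stronger baseline has a simple exact form. As a sketch (our own derivation, not the paper's released code): with n evaluation examples, k uniform classes, and t random classifiers, each classifier's correct count is Binomial(n, 1/k), and for an integer-valued maximum E[max] = sum over m from 0 to n-1 of (1 - F(m)^t).

```python
# Sketch of the expected-maximum random baseline: the expected best accuracy
# among t independent random classifiers on the same n-example evaluation set.
from scipy.stats import binom

def max_random_baseline(n: int, k: int, t: int) -> float:
    """Each classifier's correct count is Binomial(n, 1/k); for an
    integer-valued maximum, E[max] = sum_{m=0}^{n-1} (1 - F(m)^t)."""
    cdf = binom.cdf(range(n), n, 1.0 / k)      # F(0), ..., F(n-1)
    expected_max_correct = sum(1.0 - F ** t for F in cdf)
    return expected_max_correct / n

# Reusing a 100-example validation set to pick the best of 50 prompts on a
# binary task raises the bar from the standard 0.50 to roughly 0.61:
print(max_random_baseline(n=100, k=2, t=50))
```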
While Large Language Models (LLMs) exhibit remarkable capabilities in zero-shot and few-shot scenarios, they often require computationally prohibitive sizes. Conversely, smaller Masked Language Models (MLMs) like BERT and RoBERTa achieve state-of-the-art results through fine-tuning but struggle with extending to few-shot and zero-shot settings due to their architectural constraints. Hence, we propose Statement-Tuning, a technique that models discriminative tasks as a set of finite statements and trains an Encoder model to discriminate between the potential statements to determine the label. We apply Statement-Tuning to multiple tasks to enable cross-task generalization. Experimental results demonstrate that Statement-Tuning achieves performance competitive with state-of-the-art LLMs while using significantly fewer parameters. Moreover, the study investigates the impact of several design choices on few-shot and zero-shot generalization, revealing that Statement-Tuning can achieve sufficient performance with modest training data and benefits from task and statement diversity for unseen-task generalizability.
https://arxiv.org/abs/2404.12897
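A minimal sketch of inference under Statement-Tuning; `score_statement()` stands in for the fine-tuned encoder (a hypothetical function returning the probability a statement is true), and the template is illustrative.

```python
# Sketch: verbalize each candidate label as a finite statement and let the
# encoder discriminate between the statements to determine the label.
def statement_predict(text, labels, template, score_statement):
    # One statement per candidate label about the same input ...
    statements = {l: template.format(text=text, label=l) for l in labels}
    # ... and the label whose statement the encoder finds most plausible wins.
    return max(labels, key=lambda l: score_statement(statements[l]))

# e.g. template = '"{text}" expresses a {label} sentiment.'
#      statement_predict(review, ["positive", "negative"], template, scorer)
```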
Land-cover mapping is one of the vital applications in Earth observation, aiming at classifying each pixel's land-cover type in remote-sensing images. As natural and human activities change the landscape, the land-cover map needs to be rapidly updated. However, discovering newly appeared land-cover types in existing classification systems is still a non-trivial task, hindered by the various scales of complex land objects and insufficient labeled data over a wide geographic area. In this paper, we propose a generalized few-shot segmentation-based framework, named SegLand, to update novel classes in high-resolution land-cover mapping. Specifically, the proposed framework is designed in three parts: (a) data pre-processing: the base training set and the few-shot support sets of novel classes are analyzed and augmented; (b) hybrid segmentation structure: multiple base learners and a modified Projection onto Orthogonal Prototypes (POP) network are combined to enhance base-class recognition and to mine novel classes from insufficient labeled data; (c) ultimate fusion: the semantic segmentation results of the base learners and the POP network are reasonably fused. The proposed framework won first place on the leaderboard of the OpenEarthMap Land Cover Mapping Few-Shot Challenge. Experiments demonstrate the superiority of the framework for automatically updating novel land-cover classes with limited labeled data.
https://arxiv.org/abs/2404.12721
The ability of a robot to pick an object, known as robot grasping, is crucial for several applications, such as assembly or sorting. In such tasks, selecting the right target to pick is as essential as inferring a correct configuration of the gripper. A common solution to this problem relies on semantic segmentation models, which often show poor generalization to unseen objects and require considerable time and massive amounts of data to train. To reduce the need for large datasets, some grasping pipelines exploit few-shot semantic segmentation models, which are capable of recognizing new classes given a few examples. However, this often comes at the cost of limited performance, and fine-tuning is required to be effective in robot grasping scenarios. In this work, we propose to overcome all these limitations by combining the impressive generalization capability reached by foundation models with a high-performing few-shot classifier, working as a score function to select the segmentation that is closest to the support set. The proposed model is designed to be embedded in a grasp synthesis pipeline. Extensive experiments using one or five examples show that our novel approach overcomes existing performance limitations, improving the state of the art both in few-shot semantic segmentation on the Graspnet-1B (+10.5% mIoU) and Ocid-grasp (+1.6% AP) datasets and in real-world few-shot grasp synthesis (+21.7% grasp accuracy). The project page is available at: this https URL
https://arxiv.org/abs/2404.12717
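A minimal sketch of the score-function idea, assuming a hypothetical `embed(image, mask)` that pools foundation-model features over a candidate mask; the cosine-to-support scoring is our reading of the abstract, not the authors' exact code.

```python
# Sketch: score each candidate segmentation by how close its pooled embedding
# is to the few-shot support set, and keep the best one.
import numpy as np

def pick_segmentation(candidate_masks, image, support_embeddings, embed):
    """support_embeddings: (S, d) array of L2-normalized support vectors."""
    def score(mask):
        v = embed(image, mask)               # hypothetical pooled feature
        v = v / np.linalg.norm(v)
        # Mean cosine similarity to the support set acts as the score.
        return float(np.mean(support_embeddings @ v))
    return max(candidate_masks, key=score)
```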
We present FastFit, a method and Python package designed to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach integrating batch contrastive learning and token-level similarity scores. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multiclass classification performance in speed and accuracy across FewMany, our newly curated English benchmark, and multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub and PyPI, presenting a user-friendly solution for NLP practitioners.
https://arxiv.org/abs/2404.12365
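The abstract does not spell out the token-level score; a common late-interaction formulation (max over class-name tokens, averaged over text tokens, as in ColBERT-style scoring) is sketched below under that assumption, with `emb()` a hypothetical token-embedding function.

```python
# Sketch of a late-interaction token-level similarity score.
import numpy as np

def token_level_similarity(text_tokens, class_tokens):
    """text_tokens: (Tx, d) and class_tokens: (Tc, d) L2-normalized token
    embeddings. Each text token takes its best match among the class-name
    tokens; the per-token maxima are then averaged."""
    sims = text_tokens @ class_tokens.T    # (Tx, Tc) cosine similarities
    return float(sims.max(axis=1).mean())

# Prediction picks the class whose verbalized name best matches the input:
# pred = max(classes, key=lambda c: token_level_similarity(emb(x), emb(c)))
```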
The increasing threat of disinformation calls for automating parts of the fact-checking pipeline. Identifying text segments requiring fact-checking is known as claim detection (CD) and claim check-worthiness detection (CW), the latter incorporating complex domain-specific criteria of worthiness and often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be directly used for prompting. We evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets from diverse domains, each utilizing a different worthiness criterion. We investigate two key aspects: (1) how best to distill factuality and worthiness criteria into a prompt and (2) what amount of context to provide for each claim. To this end, we experiment with varying the level of prompt verbosity and the amount of contextual information provided to the model. Our results show that optimal prompt verbosity is domain-dependent, adding context does not improve performance, and confidence scores can be directly used to produce reliable check-worthiness rankings.
https://arxiv.org/abs/2404.12174
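A minimal sketch of the prompting setup, assuming a hypothetical `llm_yes_prob(prompt)` that returns the model's probability of answering "Yes"; the finding that confidence scores yield reliable check-worthiness rankings maps directly onto the final sort.

```python
# Sketch: rank claims by the model's confidence that each is check-worthy,
# using a verbalized, domain-specific worthiness criterion in the prompt.
def rank_check_worthiness(claims, criterion, llm_yes_prob):
    def confidence(claim):
        prompt = (
            f"Criterion: {criterion}\n"
            f"Claim: {claim}\n"
            "Is this claim worth fact-checking? Answer Yes or No:"
        )
        return llm_yes_prob(prompt)
    # Higher confidence first: the confidence itself is the ranking score.
    return sorted(claims, key=confidence, reverse=True)
```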