The robustness of Transformer-based Natural Language Inference encoders is frequently compromised because they tend to rely on dataset biases rather than on the intended task-relevant features. Recent studies have attempted to mitigate this by down-weighting biased samples during training. However, these debiasing methods primarily focus on identifying which samples are biased without explicitly determining the biased components within each case, which limits their out-of-distribution inference capability. To address this issue, we aim to train models to adopt the logic humans use in explaining causality. We propose a simple, comprehensive, and interpretable method: Explanation-based Bias Decoupling Regularization (EBD-Reg). Using human explanations as criteria, EBD-Reg guides the encoder to establish a tripartite parallel supervision of Distinguishing, Decoupling, and Aligning. This enables encoders to identify and focus on the keywords that represent task-relevant features during inference while discarding the residual elements that act as biases. Empirical evidence shows that EBD-Reg effectively guides various Transformer-based encoders to decouple biases through a human-centric lens, significantly surpassing other methods in out-of-distribution inference.
https://arxiv.org/abs/2404.13390
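To make the tripartite supervision concrete, here is a minimal PyTorch sketch of an EBD-Reg-style objective under one plausible reading of the three terms; the mapping of Distinguishing, Decoupling, and Aligning to these particular losses, and all function names, are assumptions rather than the paper's actual implementation:

```python
import torch.nn.functional as F

def ebd_reg_loss(token_logits, keyword_mask, task_repr, bias_repr, labels, classifier):
    # Distinguishing: predict which tokens are explanation keywords.
    distinguish = F.binary_cross_entropy_with_logits(token_logits, keyword_mask.float())
    # Decoupling: classify from the keyword (task-relevant) representation only.
    decouple = F.cross_entropy(classifier(task_repr), labels)
    # Aligning: keep the task and residual (bias) representations dissimilar.
    align = F.cosine_similarity(task_repr, bias_repr, dim=-1).abs().mean()
    return distinguish + decouple + align
```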
Data augmentation is widely used to mitigate data bias in training datasets. However, data augmentation exposes machine learning models to privacy attacks, such as membership inference attacks. In this paper, we propose an effective combination of data augmentation and machine unlearning that reduces data bias while providing a provable defense against known attacks. Specifically, we maintain the fairness of the trained model with diffusion-based data augmentation and then use multi-shard unlearning to remove identifying information about the original data from the model, protecting against privacy attacks. Experimental evaluation across diverse datasets demonstrates that our approach achieves significant improvements in both bias reduction and robustness against state-of-the-art privacy attacks.
https://arxiv.org/abs/2404.13194
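As a rough illustration of how sharded (SISA-style) unlearning keeps retraining cheap, the sketch below partitions the augmented data and retrains only the affected shard on a deletion request; the sklearn-like `fit`/`predict` interface and the majority vote are assumptions, not the paper's actual scheme:

```python
def train_shards(dataset, make_model, n_shards=4):
    # Partition the (augmented) dataset and train one model per shard.
    shards = [list(dataset[i::n_shards]) for i in range(n_shards)]
    models = [make_model().fit(shard) for shard in shards]
    return shards, models

def unlearn(point, shards, models, make_model):
    # Only the shard that contained the removed point is retrained.
    for i, shard in enumerate(shards):
        if point in shard:
            shard.remove(point)
            models[i] = make_model().fit(shard)
    return models

def predict(models, x):
    # Aggregate shard models by majority vote.
    votes = [m.predict(x) for m in models]
    return max(set(votes), key=votes.count)
```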
Transforming two-dimensional (2D) images into three-dimensional (3D) volumes is a well-known yet challenging problem for the computer vision community. In the medical domain, a few previous studies attempted to convert two or more input radiographs into computed tomography (CT) volumes. Following their effort, we introduce a diffusion model-based technology that can rotate the anatomical content of any input radiograph in 3D space, potentially enabling the visualization of the entire anatomical content of the radiograph from any viewpoint in 3D. Like previous studies, we used CT volumes to create Digitally Reconstructed Radiographs (DRRs) as training data for our model. However, we addressed two significant limitations of previous studies: 1. We utilized conditional diffusion models with classifier-free guidance instead of Generative Adversarial Networks (GANs) to achieve higher mode coverage and improved output image quality, with the only trade-off being slower inference time, which is often less critical in medical applications; and 2. We demonstrated that the unreliable outputs of style-transfer deep learning (DL) models such as Cycle-GAN, used to transfer the style of actual radiographs to DRRs, can be replaced with a simple yet effective training transformation that randomly changes the pixel intensity histograms of the input and ground-truth imaging data during training. This transformation makes the diffusion model agnostic to distribution variations in the input data's pixel intensity, enabling a DL model to be trained reliably on input DRRs and the exact same model to be applied to conventional radiographs (or DRRs) during inference.
https://arxiv.org/abs/2404.13000
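The random intensity-histogram transformation lends itself to a compact sketch. The version below applies a random monotone remap of pixel intensities via interpolation between sorted control points; the knot-based construction is our assumption of one simple way to realize the idea, not the paper's exact transform:

```python
import numpy as np

def random_intensity_remap(img, n_knots=5, rng=None):
    # img: float array scaled to [0, 1].
    rng = rng or np.random.default_rng()
    knots_in = np.linspace(0.0, 1.0, n_knots)
    knots_out = np.sort(rng.uniform(0.0, 1.0, n_knots))
    knots_out[0], knots_out[-1] = 0.0, 1.0   # preserve the full output range
    return np.interp(img.ravel(), knots_in, knots_out).reshape(img.shape)
```

Applied independently to the input DRR and the ground-truth views during training, such a remap makes the model insensitive to the intensity distribution of the input at inference time.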
Session-based recommendation aims to predict the intents of anonymous users from their limited behaviors. Modeling user behaviors involves two distinct rationales: co-occurrence patterns reflected by item IDs, and fine-grained preferences represented by item modalities (e.g., text and images). However, existing methods typically entangle these causes, and therefore fail to achieve accurate and explainable recommendations. To this end, we propose DIMO, a novel framework that disentangles the effects of ID and modality in this task. At the item level, we introduce a co-occurrence representation schema to explicitly incorporate co-occurrence patterns into ID representations. Simultaneously, DIMO aligns different modalities into a unified semantic space to represent them uniformly. At the session level, we present a multi-view self-supervised disentanglement, including a proxy mechanism and counterfactual inference, to disentangle ID and modality effects without supervised signals. Leveraging these disentangled causes, DIMO provides recommendations via causal inference and further creates two templates for generating explanations. Extensive experiments on multiple real-world datasets demonstrate the consistent superiority of DIMO over existing methods. Further analysis also confirms DIMO's effectiveness in generating explanations.
https://arxiv.org/abs/2404.12969
Conventional diffusion models typically rely on a fixed forward process, which implicitly defines complex marginal distributions over latent variables. This often complicates the reverse process's task of learning generative trajectories and results in costly inference. To address these limitations, we introduce Neural Flow Diffusion Models (NFDM), a novel framework that enhances diffusion models by supporting a broader range of forward processes beyond the fixed linear Gaussian one. We also propose a novel parameterization technique for learning the forward process. Our framework provides an end-to-end, simulation-free optimization objective, effectively minimizing a variational upper bound on the negative log-likelihood. Experimental results demonstrate NFDM's strong performance, evidenced by state-of-the-art likelihood estimation. Furthermore, we investigate NFDM's capacity for learning generative dynamics with specific characteristics, such as deterministic straight-line trajectories. This exploration underscores NFDM's versatility and its potential for a wide range of applications.
https://arxiv.org/abs/2404.12940
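For orientation, the objective NFDM minimizes is of the standard variational form, except that the forward process now carries learnable parameters; writing $z$ for the full latent trajectory and using our own notation (the paper's parameterization may differ):

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p_\theta(x, z)}\right]
  \;\ge\; -\log p_\theta(x),
```

so minimizing $\mathcal{L}$ over both $\theta$ (the reverse process) and $\phi$ (the forward process) tightens an upper bound on the negative log-likelihood.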
While Retrieval-Augmented Generation (RAG) plays a crucial role in the application of Large Language Models (LLMs), existing retrieval methods in knowledge-dense domains like law and medicine still suffer from a lack of multi-perspective views, which are essential for improving interpretability and reliability. Previous research on multi-view retrieval has often focused solely on different semantic forms of queries, neglecting the expression of specific domain knowledge perspectives. This paper introduces MVRAG, a novel multi-view RAG framework tailored for knowledge-dense domains, which uses intention-aware query rewriting from multiple domain viewpoints to enhance retrieval precision, thereby improving the effectiveness of the final inference. Experiments on legal and medical case retrieval demonstrate that our framework yields significant improvements in recall and precision. Our multi-perspective retrieval approach unleashes the potential of multi-view information to enhance RAG tasks, accelerating the further application of LLMs in knowledge-intensive fields.
https://arxiv.org/abs/2404.12879
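A minimal sketch of intention-aware multi-view rewriting might look as follows; the `llm` and `retriever` callables, the prompt wording, and the merge step are assumed for illustration rather than taken from MVRAG:

```python
def multi_view_retrieve(query, views, llm, retriever, k=5):
    results = []
    for view in views:  # e.g., domain perspectives such as "procedural law"
        rewritten = llm(f"Rewrite the query from the perspective of {view}, "
                        f"keeping the original intent.\nQuery: {query}")
        results.extend(retriever(rewritten, k=k))
    seen, merged = set(), []
    for doc in results:  # deduplicate while preserving retrieval order
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged
```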
Query rewrite, which aims to generate more efficient queries by altering a SQL query's structure without changing the query result, has been an important research problem. To maintain equivalence between the rewritten query and the original one, traditional query rewrite methods follow certain rewrite rules. However, problems remain. First, existing methods for finding the optimal choice or sequence of rewrite rules are limited, and the process is resource-intensive. Methods for discovering new rewrite rules typically require complicated proofs of structural logic or extensive user interaction. Second, current query rewrite methods rely heavily on DBMS cost estimators, which are often inaccurate. In this paper, we address these problems by proposing LLM-R2, a novel query rewrite method that adopts a large language model (LLM) to propose possible rewrite rules for a database rewrite system. To further improve the LLM's ability to recommend rewrite rules, we train a contrastive model via curriculum learning to learn query representations and select effective query demonstrations for the LLM. Experimental results show that our method can significantly improve query execution efficiency and outperforms the baseline methods. In addition, our method enjoys high robustness across different datasets.
https://arxiv.org/abs/2404.12872
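One way to picture the rewrite-rule recommendation step: select in-context demonstrations with the learned query encoder, then prompt the LLM for applicable rules. The `encode` and `llm` callables and the demonstration format below are assumed interfaces, not LLM-R2's actual code:

```python
import numpy as np

def recommend_rules(sql, demos, encode, llm, top_k=3):
    # Rank past (query, rules) pairs by similarity of learned representations.
    q = encode(sql)
    scored = sorted(demos, key=lambda d: -float(np.dot(q, encode(d["sql"]))))
    examples = "\n".join(f"{d['sql']} -> {d['rules']}" for d in scored[:top_k])
    prompt = (f"Given these example rewrites:\n{examples}\n"
              f"Suggest applicable rewrite rules for:\n{sql}")
    return llm(prompt)
```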
Extracting structured information from unstructured text is critical for many downstream NLP applications and is traditionally achieved by closed information extraction (cIE). However, existing approaches for cIE suffer from two limitations: (i) they are often pipelines, which makes them prone to error propagation, and/or (ii) they are restricted to the sentence level, which prevents them from capturing long-range dependencies and results in expensive inference. We address these limitations by proposing REXEL, a highly efficient and accurate model for the joint task of document-level cIE (DocIE). REXEL performs mention detection, entity typing, entity disambiguation, coreference resolution, and document-level relation classification in a single forward pass to yield facts fully linked to a reference knowledge graph. It is on average 11 times faster than competitive existing approaches in a similar setting and performs competitively both when optimised for any of the individual subtasks and for a variety of combinations of joint tasks, surpassing the baselines by an average of more than 6 F1 points. The combination of speed and accuracy makes REXEL an accurate, cost-efficient system for extracting structured information at web scale. We also release an extension of the DocRED dataset to enable benchmarking of future work on DocIE, which is available at this https URL.
https://arxiv.org/abs/2404.12788
Vision-based ego-lane inference using High-Definition (HD) maps is essential in autonomous driving and advanced driver assistance systems. The traditional approach necessitates well-calibrated cameras, which constrains variation in camera configuration, as the algorithm relies on intrinsic and extrinsic calibration. In this paper, we propose learning-based ego-lane inference that directly estimates the ego-lane index from a single image. To enhance robustness, our model incorporates a two-head structure that infers the ego-lane from two perspectives simultaneously. Furthermore, we utilize an attention mechanism guided by the vanishing point and line to adapt to changes in viewpoint without requiring accurate calibration. The high adaptability of our model was validated across diverse environments, devices, and camera mounting points and orientations.
https://arxiv.org/abs/2404.12770
With the continuous development of OCR technology and the expansion of its application fields, text recognition in complex scenes has become a key challenge. Factors such as multiple fonts, mixed scenes, and complex layouts seriously affect the recognition accuracy of traditional OCR models. Although deep learning-based OCR models have performed well in specific fields or on similar datasets in recent years, their generalization ability and robustness remain a major challenge in complex multi-scene environments. Furthermore, training an OCR model from scratch or fine-tuning all of its parameters is very demanding in computing resources and inference time, which limits the flexibility of its application. In response to these challenges, this study focuses on a fundamental aspect of mixed text recognition: effectively fine-tuning a pre-trained base OCR model to achieve exceptional performance across various downstream tasks. To this end, we propose DLoRA-TrOCR, a parameter-efficient hybrid text recognition method based on a pre-trained OCR Transformer. This method embeds DoRA into the image encoder and LoRA into the internal structure of the text decoder, enabling efficient parameter fine-tuning for downstream tasks. Experimental results show that, compared to similar parameter-adjustment methods, our DLoRA-TrOCR has the smallest number of parameters and performs better. It achieves state-of-the-art performance on complex scene datasets involving simultaneous recognition of mixed handwritten, printed, and street-view texts.
https://arxiv.org/abs/2404.12734
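Since the abstract distinguishes DoRA (encoder) from LoRA (decoder), a self-contained sketch of the two adapter types may help; this follows the published LoRA and DoRA formulations rather than the paper's code, and the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (scale * B @ A)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

class DoRALinear(LoRALinear):
    """LoRA plus a learnable per-column magnitude (weight decomposition)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__(base, r, alpha)
        self.m = nn.Parameter(base.weight.norm(dim=0, keepdim=True).clone())

    def forward(self, x):
        w = self.base.weight + self.scale * (self.B @ self.A)
        w = self.m * w / w.norm(dim=0, keepdim=True)  # magnitude * unit direction
        return F.linear(x, w, self.base.bias)
```

Wrapping the encoder's projection layers in `DoRALinear` and the decoder's in `LoRALinear` would mirror the hybrid placement described above.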
Analogical reasoning is a unique human ability to address unfamiliar challenges by transferring strategies from relevant past experiences. One key finding in psychology is that, compared with irrelevant past experiences, recalling relevant ones helps humans better handle new tasks. Coincidentally, the NLP community has recently found that self-generating relevant examples in context can help large language models (LLMs) solve a given problem better than hand-crafted prompts. However, it is not yet clear whether relevance is the key factor eliciting this capability, i.e., can LLMs benefit more from self-generated relevant examples than from irrelevant ones? In this work, we systematically explore whether LLMs can truly perform analogical reasoning on a diverse set of reasoning tasks. With extensive experiments and analysis, we show that self-generated random examples can, surprisingly, achieve comparable or even better performance, e.g., a 4% performance boost on GSM8K from random biological examples. We find that the accuracy of self-generated examples is the key factor and subsequently design two improved methods with significantly reduced inference costs. Overall, we aim to advance a deeper understanding of LLM analogical reasoning and hope this work stimulates further research on the design of self-generated contexts.
https://arxiv.org/abs/2404.12728
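The self-generation setup the paper probes reduces to a two-call prompt chain; the prompt text and `llm` callable below are illustrative assumptions:

```python
def solve_with_self_generated_examples(problem, llm, n=3, relevant=True):
    kind = "relevant to the problem below" if relevant else "random"
    examples = llm(f"Write {n} {kind} worked examples with solutions.\n"
                   f"Problem: {problem}")
    # The generated examples serve as in-context demonstrations.
    return llm(f"{examples}\n\nNow solve this problem step by step:\n{problem}")
```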
Existing methods for uncertainty quantification incur massive memory and compute overhead, often requiring multiple models or inferences. Hence, they are impractical on ultra-low-power, KB-sized tinyML devices. To reduce this overhead, prior works have proposed using early-exit networks as ensembles to quantify uncertainty in a single forward pass. However, they still have a prohibitive cost for tinyML. To address these challenges, we propose QUTE, a novel resource-efficient, early-exit-assisted ensemble architecture optimized for tinyML models. QUTE adds additional output blocks at the final exit of the base network and distills the knowledge of the early exits into these blocks to create a diverse and lightweight ensemble architecture. Our results show that QUTE outperforms popular prior works and improves the quality of uncertainty estimates by 6% with 3.1x lower model size on average compared to the most relevant prior work. Furthermore, we demonstrate that QUTE is also effective in detecting covariate-shifted and out-of-distribution inputs, showing competitive performance relative to G-ODIN, a state-of-the-art generalized OOD detector.
https://arxiv.org/abs/2404.12599
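A plausible form of the distillation step, with each added output block trained against one early exit via standard cross-entropy plus temperature-scaled KL; the exact loss and architecture are assumptions based on the summary above:

```python
import torch.nn.functional as F

def qute_distill_loss(final_feats, ensemble_heads, early_exit_logits, labels, T=2.0):
    loss = 0.0
    for head, teacher_logits in zip(ensemble_heads, early_exit_logits):
        student_logits = head(final_feats)   # lightweight block at the final exit
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits.detach() / T, dim=-1),
                      reduction="batchmean") * T * T
        loss = loss + ce + kd
    return loss
```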
Over the past year, the field of Natural Language Generation (NLG) has experienced an exponential surge, largely due to the introduction of Large Language Models (LLMs). These models have exhibited the most effective performance across a range of Natural Language Processing and Generation tasks. However, their application to domain-specific tasks, such as paraphrasing, presents significant challenges. Their extensive number of parameters makes them difficult to operate on commercial hardware, and they require substantial inference time, leading to high costs in a production setting. In this study, we tackle these obstacles by employing LLMs to develop three distinct models for the paraphrasing field, applying a method referred to as sequence-level knowledge distillation. These distilled models maintain the quality of paraphrases generated by the LLM while demonstrating faster inference times and the ability to generate diverse paraphrases of comparable quality. A notable characteristic of these models is their ability to exhibit syntactic diversity while also preserving lexical diversity, a combination previously uncommon due to data quality issues in existing datasets and not typically observed in neural-based approaches. Human evaluation shows only a 4% drop in performance compared to the LLM teacher model used in the distillation process, despite the distilled models being 1000 times smaller. This research provides a significant contribution to the NLG field, offering a more efficient and cost-effective solution for paraphrasing tasks.
https://arxiv.org/abs/2404.12596
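Sequence-level knowledge distillation itself is simple to sketch: the student is trained on teacher-generated outputs rather than gold references. The `teacher_generate` interface and prompt below are assumed:

```python
def build_distillation_set(sources, teacher_generate):
    # Pair each source sentence with the teacher LLM's paraphrase; the pairs
    # then serve as ordinary supervised data for the much smaller student.
    return [(src, teacher_generate(f"Paraphrase: {src}")) for src in sources]
```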
Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods avoid training by leveraging an image-modality cache and retrieval, they overlook the text modality's importance and the cross-modal cues needed for efficient parameter adaptation in vision-language models. This work introduces XMAdapter, a cross-modal parameter-efficient approach. XMAdapter establishes cache models for both the text and image modalities, then leverages retrieval over vision-language bimodal information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling the different modal similarities to assess their respective contributions. Additionally, it explores hard samples based on differences in cross-modal affinity and enhances model performance through adaptive adjustment of the sample-learning intensity. Extensive experimental results on benchmark datasets demonstrate that XMAdapter significantly outperforms previous adapter-based methods in accuracy, generalization, and efficiency.
https://arxiv.org/abs/2404.12588
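A cache-and-affinity sketch in the spirit of the description above (close to Tip-Adapter-style cache models); the fusion ratio `alpha`, sharpness `beta`, and the assumption of L2-normalized features are ours, not XMAdapter's actual values:

```python
import torch

def xm_logits(img_feat, txt_feat, img_keys, txt_keys, values, alpha=0.5, beta=5.0):
    # img_feat/txt_feat: (B, d) query features; *_keys: (N, d) cached few-shot
    # features; values: (N, C) one-hot labels. All features L2-normalized.
    a_img = (-beta * (1.0 - img_feat @ img_keys.T)).exp()
    a_txt = (-beta * (1.0 - txt_feat @ txt_keys.T)).exp()
    affinity = alpha * a_img + (1.0 - alpha) * a_txt   # cross-modal fusion
    return affinity @ values
```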
Video editing methods based on diffusion models that rely solely on a text prompt are hindered by the limited expressive power of text prompts. Incorporating a reference target image as a visual guide therefore becomes desirable for precise control over the edit. Moreover, most existing methods struggle to accurately edit a video when the shape and size of the object in the target image differ from those of the source object. To address these challenges, we propose GenVideo for editing videos leveraging target-image-aware text-to-image (T2I) models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit using our novel target- and shape-aware InvEdit masks. Further, we propose a novel target-image-aware latent noise correction strategy during inference to improve the temporal consistency of the edits. Experimental analyses indicate that GenVideo can effectively handle edits with objects of varying shapes where existing approaches fail.
https://arxiv.org/abs/2404.12541
Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). In this context, an overwhelming number of studies have focused on analyzing the post-generation phase: refining outputs via feedback, analyzing logit output values, or deriving clues from the outputs' artifacts. We propose HalluciBot, a model that predicts the probability of hallucination $\textbf{before generation}$ for any query posed to an LLM. In essence, HalluciBot does not invoke any generation during inference. To derive empirical evidence for HalluciBot, we employ a Multi-Agent Monte Carlo Simulation using a Query Perturbator to craft $n$ variations per query at train time. The construction of our Query Perturbator is motivated by a new definition of hallucination that we introduce: $\textit{truthful hallucination}$. Our training methodology generated 2,219,022 estimates for a training corpus of 369,837 queries, spanning 13 diverse datasets and 3 question-answering scenarios. HalluciBot predicts both binary and multi-class probabilities of hallucination, providing a means to judge a query's quality with regard to its propensity to hallucinate. HalluciBot thus paves the way to revise or cancel a query before generation, avoiding the ensuing computational waste. Moreover, it provides a lucid means to measure user accountability for hallucinatory queries.
https://arxiv.org/abs/2404.12535
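The train-time labeling procedure reduces to a small Monte Carlo loop: perturb each query, sample answers, and record the empirical hallucination rate as the supervision target for the generation-free predictor. All interfaces below are assumed:

```python
def hallucination_rate(query, perturb, llm, is_correct, n=10):
    variants = [query] + [perturb(query) for _ in range(n - 1)]
    outputs = [llm(v) for v in variants]
    # Fraction of sampled generations judged incorrect becomes the label.
    return sum(not is_correct(out) for out in outputs) / len(outputs)
```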
Theorem proving is an important challenge for large language models (LLMs), as formal proofs can be checked rigorously by proof assistants such as Lean, leaving no room for hallucination. Existing LLM-based provers try to prove theorems in a fully autonomous mode without human intervention. In this mode, they struggle with novel and challenging theorems, for which human insights may be critical. In this paper, we explore LLMs as copilots that assist humans in proving theorems. We introduce Lean Copilot, a framework for running LLM inference in Lean. It enables programmers to build various LLM-based proof automation tools that integrate seamlessly into the workflow of Lean users. Using Lean Copilot, we build tools that use LLMs to suggest proof steps (tactic suggestion), complete intermediate proof goals (proof search), and select relevant premises (premise selection). Users can use our pretrained models or bring their own, running either locally (with or without GPUs) or in the cloud. Experimental results demonstrate the effectiveness of our method in assisting humans and automating the theorem-proving process compared to existing rule-based proof automation in Lean. We open-source all code under a permissive MIT license to facilitate further research.
https://arxiv.org/abs/2404.12534
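In use, the tools surface as ordinary tactics inside a proof; a minimal example, assuming the tactic names from the Lean Copilot documentation:

```lean
import LeanCopilot

example (a b : Nat) : a + b = b + a := by
  suggest_tactics  -- queries the LLM for candidate next proof steps
```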
Self-supervised monocular depth estimation has garnered considerable attention for its applications in autonomous driving and robotics. While recent methods have made strides in leveraging techniques like the Self Query Layer (SQL) to infer depth from motion, they often overlook the potential of strengthening pose information. In this paper, we introduce SPIdepth, a novel approach that prioritizes enhancing the pose network for improved depth estimation. Building upon the foundation laid by SQL, SPIdepth emphasizes the importance of pose information in capturing fine-grained scene structures. By enhancing the pose network's capabilities, SPIdepth achieves remarkable advancements in scene understanding and depth estimation. Experimental results on benchmark datasets such as KITTI and Cityscapes showcase SPIdepth's state-of-the-art performance, surpassing previous methods by significant margins. Notably, SPIdepth's performance exceeds that of unsupervised models and, after fine-tuning on metric data, outperforms all existing methods. Remarkably, SPIdepth achieves these results using only a single image for inference, surpassing even methods that utilize video sequences, demonstrating its efficacy and efficiency in real-world applications. Our approach represents a significant leap forward in self-supervised monocular depth estimation, underscoring the importance of strengthening pose information for advancing scene understanding in real-world applications.
https://arxiv.org/abs/2404.12501
Large language models primarily rely on inductive reasoning for decision making, which leads to unreliable decisions on real-world tasks that often present incomplete contexts and conditions. Accurate probability estimation and appropriate interpretation are therefore required to enhance decision-making reliability. In this paper, we propose BIRD, a Bayesian inference framework for large language models. BIRD provides controllable and interpretable probability estimation for model decisions, based on abductive factors, LLM entailment, and learnable deductive Bayesian modeling. Experiments show that BIRD produces probability estimations that align with human judgments over 65% of the time using open-source Llama models, outperforming the state-of-the-art GPT-4 by 35%. We also show that BIRD can be directly used for trustworthy decision making in many real-world applications.
https://arxiv.org/abs/2404.12494
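As a toy illustration of the deductive combination step, factor-level evidence can update a prior over a decision via likelihood ratios; the numbers and the independence assumption here are purely illustrative, not BIRD's actual model:

```python
def posterior(prior, likelihood_ratios):
    # likelihood_ratios: P(factor | decision) / P(factor | not decision),
    # one per abductive factor (independence assumed for this toy example).
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

print(posterior(0.5, [2.0, 0.8, 3.0]))  # ~0.83
```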
Joint entity and relation extraction plays a pivotal role in various applications, notably in the construction of knowledge graphs. Despite recent progress, existing approaches often fall short in two key aspects: richness of representation and coherence of output structure. These models often rely on handcrafted heuristics for computing entity and relation representations, potentially leading to the loss of crucial information. Furthermore, they disregard task- and/or dataset-specific constraints, resulting in output structures that lack coherence. In our work, we introduce EnriCo, which mitigates these shortcomings. First, to foster rich and expressive representations, our model leverages attention mechanisms that allow both entities and relations to dynamically determine the pertinent information required for accurate extraction. Second, we introduce a series of decoding algorithms designed to infer the highest-scoring solutions while adhering to task- and dataset-specific constraints, thus promoting structured and coherent outputs. Our model demonstrates competitive performance compared to baselines when evaluated on Joint IE datasets.
https://arxiv.org/abs/2404.12493
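Constraint-aware decoding of the kind described can be sketched as filtering candidate triples against a schema before ranking; the tuple format and `allowed` set are assumed for illustration:

```python
def decode_relations(scored_triples, allowed):
    # scored_triples: (head_type, relation, tail_type, score) candidates;
    # allowed: schema-permitted (head_type, relation, tail_type) combinations.
    valid = [t for t in scored_triples if (t[0], t[1], t[2]) in allowed]
    return sorted(valid, key=lambda t: -t[3])
```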