Art reinterpretation is the practice of creating a variation of a reference work, producing a paired artwork that exhibits a distinct artistic style. We ask whether such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns the stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.
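The heart of the method is a joint optimization over two LoRA weight sets with an orthogonality penalty between them. Below is a minimal PyTorch sketch of that objective on a single weight matrix, assuming MSE stand-ins for the actual diffusion denoising losses and a hypothetical penalty weight of 0.1; the paper's full parameterization and style-guidance inference step are not reproduced here.

```python
import torch

# Toy sketch: one frozen weight matrix with separate style and content LoRA factors.
torch.manual_seed(0)
d_out, d_in, r = 64, 64, 4
W = torch.randn(d_out, d_in)                         # frozen base weight
A_style = torch.randn(r, d_in, requires_grad=True)
B_style = torch.zeros(d_out, r, requires_grad=True)
A_content = torch.randn(r, d_in, requires_grad=True)
B_content = torch.zeros(d_out, r, requires_grad=True)

x = torch.randn(16, d_in)                            # stand-in activations
target_content = torch.randn(16, d_out)              # stand-in for the content-image objective
target_style = torch.randn(16, d_out)                # stand-in for the styled-image objective

opt = torch.optim.Adam([A_style, B_style, A_content, B_content], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    # Content weights alone must reproduce the original image ...
    out_content = x @ (W + B_content @ A_content).T
    # ... while content + style weights reproduce the stylized image.
    out_style = x @ (W + B_content @ A_content + B_style @ A_style).T
    # Orthogonality between the LoRA row spaces discourages the style
    # weights from re-encoding image content.
    ortho = (A_style @ A_content.T).pow(2).sum()
    loss = (out_content - target_content).pow(2).mean() \
         + (out_style - target_style).pow(2).mean() + 0.1 * ortho
    loss.backward()
    opt.step()
```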
https://arxiv.org/abs/2405.01536
Traditionally, natural language processing (NLP) models often use a rich set of features created by linguistic expertise, such as semantic representations. However, in the era of large language models (LLMs), more and more tasks are turned into generic, end-to-end sequence generation problems. In this paper, we investigate the question: what is the role of semantic representations in the era of LLMs? Specifically, we investigate the effect of Abstract Meaning Representation (AMR) across five diverse NLP tasks. We propose an AMR-driven chain-of-thought prompting method, which we call AMRCoT, and find that it generally hurts performance more than it helps. To investigate what AMR may have to offer on these tasks, we conduct a series of analysis experiments. We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions, named entities, and in the final inference step where the LLM must connect its reasoning over the AMR to its prediction. We recommend focusing on these areas for future work in semantic representations for LLMs. Our code: this https URL.
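As a rough illustration of the prompting setup, the sketch below assembles an AMR-driven chain-of-thought prompt from a sentence, its AMR parse, and a question; the exact instruction wording and AMR serialization used for AMRCoT are assumptions, not the paper's templates.

```python
def amr_cot_prompt(sentence: str, amr_graph: str, question: str) -> str:
    # Hypothetical template; the paper's exact instructions may differ.
    return (
        f"Sentence: {sentence}\n"
        f"AMR graph of the sentence:\n{amr_graph}\n"
        f"Question: {question}\n"
        "Reason step by step over the AMR graph, then give your answer."
    )

print(amr_cot_prompt(
    "The boy wants to go.",
    "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))",
    "Who wants to go?",
))
```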
https://arxiv.org/abs/2405.01502
Natural language explanations have become a proxy for evaluating explainable and multi-step Natural Language Inference (NLI) models. However, assessing the validity of explanations for NLI is challenging as it typically involves the crowd-sourcing of apposite datasets, a process that is time-consuming and prone to logical errors. To address existing limitations, this paper investigates the verification and refinement of natural language explanations through the integration of Large Language Models (LLMs) and Theorem Provers (TPs). Specifically, we present a neuro-symbolic framework, named Explanation-Refiner, that augments a TP with LLMs to generate and formalise explanatory sentences and suggest potential inference strategies for NLI. In turn, the TP is employed to provide formal guarantees on the logical validity of the explanations and to generate feedback for subsequent improvements. We demonstrate how Explanation-Refiner can be jointly used to evaluate explanatory reasoning, autoformalisation, and error correction mechanisms of state-of-the-art LLMs as well as to automatically enhance the quality of human-annotated explanations of variable complexity in different domains.
https://arxiv.org/abs/2405.01379
Natural language inference (NLI), also known as Recognizing Textual Entailment (RTE), is an important aspect of natural language understanding. Most research now uses machine learning and deep learning to perform this task on specific datasets, meaning their solution is not explainable nor explicit. To address the need for an explainable approach to RTE, we propose a novel pipeline that is based on translating text into an Abstract Meaning Representation (AMR) graph. For this we use a pre-trained AMR parser. We then translate the AMR graph into propositional logic and use a SAT solver for automated reasoning. In text, often commonsense suggests that an entailment (or contradiction) relationship holds between a premise and a claim, but because different wordings are used, this is not identified from their logical representations. To address this, we introduce relaxation methods to allow replacement or forgetting of some propositions. Our experimental results show this pipeline performs well on four RTE datasets.
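The reasoning step reduces entailment to unsatisfiability: if premise ∧ ¬claim has no satisfying assignment, the premise entails the claim. A minimal sketch with the python-sat package (the clause encoding is hypothetical, and the AMR-to-logic translation and relaxation methods are not shown):

```python
from pysat.solvers import Glucose3  # pip install python-sat

# Hypothetical propositional encoding extracted from AMR:
# 1 = boy(b), 2 = want(w, b, g), 3 = go(g, b)
premise_clauses = [[1], [2], [3]]   # the premise asserts all three
claim = 3                           # the claim: go(g, b)

solver = Glucose3()
for clause in premise_clauses:
    solver.add_clause(clause)
solver.add_clause([-claim])         # assert the negated claim

# premise AND (NOT claim) unsatisfiable  =>  the premise entails the claim
print("entailment" if not solver.solve() else "no entailment")
solver.delete()
```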
https://arxiv.org/abs/2405.01259
As human-machine interaction continues to evolve, the capacity for environmental perception is becoming increasingly crucial. Integrating the two most common types of sensory data, images, and point clouds, can enhance detection accuracy. However, currently, no model exists that can simultaneously detect an object's position in both point clouds and images and ascertain their corresponding relationship. This information is invaluable for human-machine interactions, offering new possibilities for their enhancement. In light of this, this paper introduces an end-to-end Consistency Object Detection (COD) algorithm framework that requires only a single forward inference to simultaneously obtain an object's position in both point clouds and images and establish their correlation. Furthermore, to assess the accuracy of the object correlation between point clouds and images, this paper proposes a new evaluation metric, Consistency Precision (CP). To verify the effectiveness of the proposed framework, an extensive set of experiments has been conducted on the KITTI and DAIR-V2X datasets. The study also explored how the proposed consistency detection method performs on images when the calibration parameters between images and point clouds are disturbed, compared to existing post-processing methods. The experimental results demonstrate that the proposed method exhibits excellent detection performance and robustness, achieving end-to-end consistency detection. The source code will be made publicly available at this https URL.
https://arxiv.org/abs/2405.01258
We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state-of-the-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher quality voice in noisy environments than over-the-air speech; and (iii) requires a memory footprint of less than 20.0 MB.
https://arxiv.org/abs/2405.01242
Deep learning models often encounter challenges in making accurate inferences when there are domain shifts between the source and target data. This issue is particularly pronounced in clinical settings due to the scarcity of annotated data resulting from the professional and private nature of medical data. Despite the existence of decent solutions, many of them are hindered in clinical settings due to limitations in data collection and computational complexity. To tackle domain shifts in data-scarce medical scenarios, we propose a Random frequency filtering enabled Single-source Domain Generalization algorithm (RaffeSDG), which promises robust out-of-domain inference with segmentation models trained on a single-source domain. A filter-based data augmentation strategy is first proposed to promote domain variability within a single-source domain by introducing variations in frequency space and blending homologous samples. Then Gaussian filter-based structural saliency is also leveraged to learn robust representations across augmented samples, further facilitating the training of generalizable segmentation models. To validate the effectiveness of RaffeSDG, we conducted extensive experiments involving out-of-domain inference on segmentation tasks for three human tissues imaged by four diverse modalities. Through thorough investigations and comparisons, compelling evidence was observed in these experiments, demonstrating the potential and generalizability of RaffeSDG. The code is available at this https URL.
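A plausible reading of the filter-based augmentation, sketched in numpy under the assumption of grayscale images in [0, 1]: perturb the amplitude spectrum in frequency space, then blend two such views of the same (homologous) sample. RaffeSDG's actual filters, perturbation ranges, and blending scheme may differ.

```python
import numpy as np

def freq_augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    spec = np.fft.fft2(img)
    amp, phase = np.abs(spec), np.angle(spec)
    amp *= rng.uniform(0.5, 1.5, size=amp.shape)       # randomly rescale amplitudes
    out = np.fft.ifft2(amp * np.exp(1j * phase)).real  # keep the original phase
    return np.clip(out, 0.0, 1.0)

def blend_homologous(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    a, b = freq_augment(img, rng), freq_augment(img, rng)
    lam = rng.uniform(0.0, 1.0)
    return lam * a + (1.0 - lam) * b                   # mix two views of the same image

rng = np.random.default_rng(0)
augmented = blend_homologous(np.random.default_rng(1).random((128, 128)), rng)
```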
https://arxiv.org/abs/2405.01228
Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems, while also offering an opportunity to audit these models with regard to user data. This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models. To the best of our knowledge, this approach has not yet been investigated. We compare our proposed features with commonly used error-based features and find that the proposed features greatly enhance performance for sample-level MI. For speaker-level MI, these features improve results, though by a smaller margin, as error-based features already obtained a high performance for this task. Our findings emphasise the importance of considering different feature sets and levels of access to target models for effective MI in ASR systems, providing valuable insights for auditing such models.
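A sketch of how such loss-based features could be assembled, assuming a generic PyTorch model and loss; the ASR specifics (e.g., CTC loss over transcripts) and the adversarial-perturbation variant are omitted.

```python
import torch

def loss_features(model, loss_fn, x, y, sigmas=(0.0, 0.01, 0.05), n=8):
    """Mean/std of the loss under Gaussian input perturbations of
    increasing strength; members tend to show lower, flatter losses."""
    feats = []
    with torch.no_grad():
        for sigma in sigmas:
            losses = []
            for _ in range(n if sigma > 0 else 1):
                noisy = x + sigma * torch.randn_like(x)
                losses.append(loss_fn(model(noisy), y).item())
            losses = torch.tensor(losses)
            feats += [losses.mean().item(),
                      losses.std().item() if sigma > 0 else 0.0]
    return feats  # fed to a small member-vs-non-member classifier

model = torch.nn.Linear(10, 2)      # stand-in for an ASR model
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
print(loss_features(model, torch.nn.functional.cross_entropy, x, y))
```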
https://arxiv.org/abs/2405.01207
Transformer-based entropy models have gained prominence in recent years due to their superior ability to capture long-range dependencies in probability distribution estimation compared to convolution-based methods. However, previous transformer-based entropy models suffer from a sluggish coding process due to pixel-wise autoregression or duplicated computation during inference. In this paper, we propose a novel transformer-based entropy model called GroupedMixer, which enjoys both faster coding speed and better compression performance than previous transformer-based methods. Specifically, our approach builds upon group-wise autoregression by first partitioning the latent variables into groups along spatial-channel dimensions, and then entropy coding the groups with the proposed transformer-based entropy model. The global causal self-attention is decomposed into more efficient group-wise interactions, implemented using inner-group and cross-group token-mixers. The inner-group token-mixer incorporates contextual elements within a group while the cross-group token-mixer interacts with previously decoded groups. Alternate arrangement of two token-mixers enables global contextual reference. To further expedite the network inference, we introduce context cache optimization to GroupedMixer, which caches attention activation values in cross-group token-mixers and avoids complex and duplicated computation. Experimental results demonstrate that the proposed GroupedMixer yields the state-of-the-art rate-distortion performance with fast compression speed.
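A toy illustration of the group-wise coding order, assuming a latent tensor of shape (C, H, W) split into channel slices and a checkerboard spatial split; the transformer token-mixers and the context cache are not reproduced here.

```python
import numpy as np

def partition_groups(latent: np.ndarray, k_channel: int = 2) -> list:
    C, H, W = latent.shape
    groups = []
    for kc in range(k_channel):                        # channel slices
        chans = latent[kc::k_channel]
        for parity in (0, 1):                          # checkerboard spatial split
            mask = (np.add.outer(np.arange(H), np.arange(W)) % 2) == parity
            groups.append((kc, parity, chans[:, mask]))
    # Groups are entropy-coded in this order, each conditioned on the
    # previously decoded groups.
    return groups

groups = partition_groups(np.random.default_rng(0).normal(size=(8, 4, 4)))
print([g[2].shape for g in groups])
```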
https://arxiv.org/abs/2405.01170
Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness, i.e., no failures during tracking. To achieve that, one needs to efficiently tackle challenges such as device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, and the continuous movement due to cardiac and respiratory motion. To overcome the aforementioned challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance and, in particular, robustness compared to ultra-optimized reference solutions (that use multi-stage feature fusion, multi-task learning, and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), achieving a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
https://arxiv.org/abs/2405.01156
Although pre-trained language models have exhibited great flexibility and versatility with prompt-based few-shot learning, they suffer from the extensive parameter size and limited applicability for inference. Recent studies have suggested that PLMs be used as dataset generators and a tiny task-specific model be trained to achieve efficient inference. However, their applicability to various domains is limited because they tend to generate domain-specific datasets. In this work, we propose a novel approach to universal domain generalization that generates a dataset regardless of the target domain. This allows for generalization of the tiny task model to any domain that shares the label space, thus enhancing the real-world applicability of the dataset generation paradigm. Our experiments indicate that the proposed method accomplishes generalizability across various domains while using a parameter set that is orders of magnitude smaller than PLMs.
https://arxiv.org/abs/2405.01022
Recent transformer-based ASR models have achieved word-error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources, contributing to significant carbon footprints. The traditional server-based architecture of ASR also presents privacy concerns, alongside reliability and latency issues due to network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by effectively balancing energy use and accuracy for specific applications. This study examines the effects of quantization, memory demands, and energy consumption on the performance of various ASR model inference on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speed, quantization, energy efficiency, and memory needs. We found that changing precision from FP32 to FP16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. A larger model size and parameter count neither guarantees better resilience to noise nor predicts the energy consumption for a given transcription load. These, along with several other findings, offer novel insights for optimizing ASR systems within energy- and memory-limited environments, crucial for the development of efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open-sourced and available at [this https URL].
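The FP32-to-FP16 switch studied here amounts to casting weights and activations, as in the PyTorch sketch below. The two-layer model is a stand-in rather than a real ASR network, and GPU execution is assumed since CPU FP16 support varies by PyTorch build; energy measurement on the Orin Nano is outside the snippet.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in for an ASR encoder; FP16 matmul may be unsupported on some CPU
# builds, so a GPU (e.g., the Jetson Orin Nano) is assumed.
model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 32)).to(device)
x = torch.randn(1, 80, device=device)

with torch.no_grad():
    y32 = model(x)                        # FP32 baseline
    y16 = model.half()(x.half())          # FP16: roughly half the memory traffic
print((y32 - y16.float()).abs().max())    # the accuracy gap is typically small
```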
https://arxiv.org/abs/2405.01004
Designing preference elicitation (PE) methodologies that can quickly ascertain a user's top item preferences in a cold-start setting is a key challenge for building effective and personalized conversational recommendation (ConvRec) systems. While large language models (LLMs) constitute a novel technology that enables fully natural language (NL) PE dialogues, we hypothesize that monolithic LLM NL-PE approaches lack the multi-turn, decision-theoretic reasoning required to effectively balance the NL exploration and exploitation of user preferences towards an arbitrary item set. In contrast, traditional Bayesian optimization PE methods define theoretically optimal PE strategies, but fail to use NL item descriptions or generate NL queries, unrealistically assuming users can express preferences with direct item ratings and comparisons. To overcome the limitations of both approaches, we formulate NL-PE in a Bayesian Optimization (BO) framework that seeks to generate NL queries which actively elicit natural language feedback to reduce uncertainty over item utilities to identify the best recommendation. We demonstrate our framework in a novel NL-PE algorithm, PEBOL, which uses Natural Language Inference (NLI) between user preference utterances and NL item descriptions to maintain preference beliefs and BO strategies such as Thompson Sampling (TS) and Upper Confidence Bound (UCB) to guide LLM query generation. We numerically evaluate our methods in controlled experiments, finding that PEBOL achieves up to 131% improvement in MAP@10 after 10 turns of cold start NL-PE dialogue compared to monolithic GPT-3.5, despite relying on a much smaller 400M parameter NLI model for preference inference.
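A stripped-down sketch of the Bayesian PE loop: Beta beliefs over item utilities, a Thompson Sampling choice of which item to probe, and a soft belief update driven by an NLI-style entailment score. The NLI model and LLM query generation are stubbed out, and the update rule shown is an illustrative simplification of PEBOL's.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 50
alpha, beta = np.ones(n_items), np.ones(n_items)   # Beta(1, 1) utility priors

def nli_entail_prob(utterance: str, item_id: int) -> float:
    # Placeholder for an NLI model scoring whether the user's reply
    # entails liking item `item_id`'s description.
    return rng.uniform()

for _ in range(10):                                # ten cold-start dialogue turns
    samples = rng.beta(alpha, beta)                # Thompson Sampling draw
    item = int(np.argmax(samples))                 # ask about the most promising item
    p = nli_entail_prob("user feedback here", item)
    alpha[item] += p                               # soft pseudo-count update
    beta[item] += 1.0 - p

recommendation = int(np.argmax(alpha / (alpha + beta)))
print(recommendation)
```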
https://arxiv.org/abs/2405.00981
Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. Despite that, we show that only model-related biases are amplified by quantization, affecting low-resource languages and smaller models the most. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
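The distillation half of the recipe uses a standard knowledge-distillation objective; a sketch assuming teacher and student logits over a shared vocabulary (the language-specific expert gating is omitted):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    # KL(teacher || student) on temperature-softened distributions.
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

s, t = torch.randn(8, 100), torch.randn(8, 100)   # student / teacher logits
print(kd_loss(s, t))
```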
https://arxiv.org/abs/2405.00966
We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enables collaborative information exchange, enhancing controllable and consistent generation aware of global constraints. This is achieved through an information echo scheme in both shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced.
https://arxiv.org/abs/2405.00915
Traditional language models operate autoregressively, i.e., they predict one token at a time. The rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models $\textit{dynamically}$ predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of generated text for non-autoregressive generation. One of the models in our suite, DynaMo-7.3B-T3, achieves the same quality of generated text as the baseline (Pythia-6.9B) while achieving a 2.57$\times$ speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively.
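A toy decoding loop for the confidence-gated multi-token step: accept the k-th lookahead token only while the running joint probability stays above a threshold. The prediction heads are simulated, and co-occurrence weighted masking and the adaptive threshold schedule itself are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_heads(context, k=3, vocab=100):
    # Stand-in for k prediction heads, each a distribution over the vocab.
    logits = rng.normal(size=(k, vocab))
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def dynamic_step(context, threshold=0.05):
    probs = predict_heads(context)
    accepted, joint = [], 1.0
    for head in probs:
        tok = int(head.argmax())
        joint *= head[tok]                   # running joint probability
        if joint < threshold and accepted:   # always emit at least one token
            break
        accepted.append(tok)
    return accepted

print(dynamic_step(context=[1, 2, 3]))
```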
https://arxiv.org/abs/2405.00888
Motion prediction is a challenging problem in autonomous driving as it demands the system to comprehend stochastic dynamics and the multi-modal nature of real-world agent interactions. Diffusion models have recently risen to prominence, and have proven particularly effective in pedestrian motion prediction tasks. However, the significant time consumption and sensitivity to noise have limited the real-time predictive capability of diffusion models. In response to these impediments, we propose a novel diffusion-based, acceleratable framework that adeptly predicts future trajectories of agents with enhanced resistance to noise. The core idea of our model is to learn a coarse-grained prior distribution of trajectory, which can skip a large number of denoise steps. This advancement not only boosts sampling efficiency but also maintains the fidelity of prediction accuracy. Our method meets the rigorous real-time operational standards essential for autonomous vehicles, enabling prompt trajectory generation that is vital for secure and efficient navigation. Through extensive experiments, our method speeds up the inference time to 136ms compared to standard diffusion model, and achieves significant improvement in multi-agent motion prediction on the Argoverse 1 motion forecasting dataset.
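Schematically, the speed-up comes from sampling a coarse trajectory prior and running only a few refinement steps instead of the full reverse chain. Everything in this sketch (prior, denoiser, step counts) is a stub for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, refine_steps = 100, 5            # run 5 steps instead of all 100

def coarse_prior(history):
    # Stand-in for the learned coarse-grained trajectory prior.
    return history[-1] + rng.normal(scale=0.5, size=2)

def denoise_step(x, t):
    # Stand-in for one reverse-diffusion refinement step.
    return x - 0.1 * x + rng.normal(scale=0.01, size=2)

history = [np.zeros(2), np.array([1.0, 0.5])]   # observed agent positions
x = coarse_prior(history)           # start near the answer, not at pure noise
for t in range(refine_steps, 0, -1):
    x = denoise_step(x, t)
print(x)                            # predicted next waypoint
```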
https://arxiv.org/abs/2405.00797
To plan safely in uncertain environments, agents must balance utility with safety constraints. Safe planning problems can be modeled as a chance-constrained partially observable Markov decision process (CC-POMDP) and solutions often use expensive rollouts or heuristics to estimate the optimal value and action-selection policy. This work introduces the ConstrainedZero policy iteration algorithm that solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy with an additional network head that estimates the failure probability given a belief. This failure probability guides safe action selection during online Monte Carlo tree search (MCTS). To avoid overemphasizing search based on the failure estimates, we introduce $\Delta$-MCTS, which uses adaptive conformal inference to update the failure threshold during planning. The approach is tested on a safety-critical POMDP benchmark, an aircraft collision avoidance system, and the sustainability problem of safe CO$_2$ storage. Results show that by separating safety constraints from the objective we can achieve a target level of safety without optimizing the balance between rewards and costs.
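The threshold adaptation follows the adaptive-conformal pattern: nudge the tolerable failure probability down after an observed failure and up otherwise, so the long-run failure rate tracks the target. A minimal sketch (the MCTS tree, belief states, and network heads are not shown; the update form and learning rate are assumptions):

```python
def update_threshold(delta: float, target_alpha: float, failed: bool,
                     lr: float = 0.01) -> float:
    # Tighten the threshold after an observed failure, relax it otherwise,
    # so the long-run failure rate tracks target_alpha.
    err = 1.0 if failed else 0.0
    return delta + lr * (target_alpha - err)

delta = 0.1                          # initial tolerable failure probability
for failed in [False, False, True, False]:
    delta = update_threshold(delta, target_alpha=0.05, failed=failed)
print(delta)
```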
https://arxiv.org/abs/2405.00644
Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as language model type and scale as contributors to quantization loss. Firstly, we reveal that quantization with GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different language models. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
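One way to compute the confidence statistic in question: the mean probability a model assigns to the true label, compared between full-precision and quantized models. The sketch below uses stand-in linear models; a GPTQ-quantized checkpoint would be substituted for `model_q` in practice.

```python
import torch
import torch.nn.functional as F

def true_label_confidence(model, x, y):
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    return probs.gather(1, y.unsqueeze(1)).mean().item()

model_fp = torch.nn.Linear(16, 4)    # stand-ins for full-precision and
model_q = torch.nn.Linear(16, 4)     # quantized language models
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
print(true_label_confidence(model_fp, x, y) - true_label_confidence(model_q, x, y))
```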
https://arxiv.org/abs/2405.00632
Large language models (LLMs) with their strong zero-shot topic extraction capabilities offer an alternative to probabilistic topic modelling and closed-set topic classification approaches. As zero-shot topic extractors, LLMs are expected to understand human instructions to generate relevant and non-hallucinated topics based on the given documents. However, LLM-based topic modelling approaches often face difficulties in generating topics with adherence to granularity as specified in human instructions, often resulting in many near-duplicate topics. Furthermore, methods for addressing hallucinated topics generated by LLMs have not yet been investigated. In this paper, we focus on addressing the issues of topic granularity and hallucinations for better LLM-based topic modelling. To this end, we introduce a novel approach that leverages Direct Preference Optimisation (DPO) to fine-tune open-source LLMs, such as Mistral-7B. Our approach does not rely on traditional human annotation to rank preferred answers but employs a reconstruction pipeline to modify raw topics generated by LLMs, thus enabling a fast and efficient training and inference framework. Comparative experiments show that our fine-tuning approach not only significantly improves the LLM's capability to produce more coherent, relevant, and precise topics, but also reduces the number of hallucinated topics.
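The preference-optimization step uses the standard DPO objective; a PyTorch sketch (the paper's contribution, the reconstruction pipeline that produces the preferred/rejected topic pairs, is not shown):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta: float = 0.1):
    # -log sigmoid(beta * [(pi - ref) margin on chosen minus rejected]).
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

lp_c, lp_r = torch.tensor([-5.0]), torch.tensor([-7.0])     # policy log-probs
ref_c, ref_r = torch.tensor([-5.5]), torch.tensor([-6.5])   # reference log-probs
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```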
https://arxiv.org/abs/2405.00611