Transforming large pre-trained low-resolution diffusion models to meet higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at simplifying and accelerating the diffusion extrapolation process, making it more affordable and improving performance. CutDiffusion follows the existing patch-wise extrapolation paradigm but cuts a standard patch diffusion process into an initial phase focused on comprehensive structure denoising and a subsequent phase dedicated to specific detail refinement. Comprehensive experiments highlight the advantages of CutDiffusion: (1) simple method construction that enables a concise higher-resolution diffusion process without third-party engagement; (2) fast inference speed achieved through a single-step higher-resolution diffusion process and fewer required inference patches; (3) cheap GPU cost resulting from patch-wise inference and fewer patches during comprehensive structure denoising; (4) strong generation performance stemming from the emphasis on specific detail refinement.
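As a rough illustration of the two-phase cut, here is a minimal sketch in which a pre-trained low-resolution denoiser is first applied to randomly sampled patches to settle global structure, then to a fixed non-overlapping grid to refine details. The `denoise_step` callable, the 50/50 cut ratio, and the random-patch sampling are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def cut_diffusion(latent_hr, denoise_step, timesteps, cut_ratio=0.5, patch=64):
    """latent_hr: (C, H, W) high-resolution noise, H and W assumed multiples
    of `patch`; denoise_step(crop, t) stands in for one reverse-diffusion
    step of a pre-trained low-resolution model (an assumed interface)."""
    t_cut = int(len(timesteps) * cut_ratio)

    # Phase 1: comprehensive structure denoising. Random patch locations so
    # that overlapping low-resolution views gradually settle a coherent
    # global layout across the whole canvas.
    for t in timesteps[:t_cut]:
        y = torch.randint(0, latent_hr.shape[1] - patch + 1, (1,)).item()
        x = torch.randint(0, latent_hr.shape[2] - patch + 1, (1,)).item()
        crop = latent_hr[:, y:y + patch, x:x + patch]
        latent_hr[:, y:y + patch, x:x + patch] = denoise_step(crop, t)

    # Phase 2: specific detail refinement. A fixed non-overlapping grid is
    # denoised to completion, patch by patch.
    for t in timesteps[t_cut:]:
        for y in range(0, latent_hr.shape[1], patch):
            for x in range(0, latent_hr.shape[2], patch):
                crop = latent_hr[:, y:y + patch, x:x + patch]
                latent_hr[:, y:y + patch, x:x + patch] = denoise_step(crop, t)
    return latent_hr
```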
https://arxiv.org/abs/2404.15141
The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we develop MedDr, a generalist foundation model for healthcare capable of handling diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, during inference, we propose a simple but effective retrieval-augmented medical diagnosis strategy, which enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.
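The retrieval-augmented diagnosis strategy can be pictured as a nearest-neighbor lookup over embedded training cases whose labels are folded into the prompt. A minimal sketch, where `vlm_generate`, the embedding source, and the prompt wording are all assumptions:

```python
import numpy as np

def retrieval_augmented_diagnosis(query_emb, case_bank, vlm_generate, k=3):
    """case_bank: list of (embedding, diagnosis_label) pairs built from the
    training corpus; vlm_generate is a hypothetical handle to the
    vision-language model's text generation."""
    embs = np.stack([e for e, _ in case_bank])
    # Cosine similarity between the query image embedding and stored cases.
    sims = embs @ query_emb / (np.linalg.norm(embs, axis=1)
                               * np.linalg.norm(query_emb) + 1e-8)
    top = np.argsort(-sims)[:k]
    context = "; ".join(case_bank[i][1] for i in top)
    prompt = (f"Similar reference cases were diagnosed as: {context}. "
              f"Considering these references, diagnose the given image.")
    return vlm_generate(prompt)
```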
https://arxiv.org/abs/2404.15127
Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) low expert activation, where only a small subset of experts are activated for optimization; (2) a lack of fine-grained analytical capability for the multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhancing expert activation, thus deepening context understanding and alleviating overfitting. Moreover, MH-MoE is straightforward to implement and decoupled from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks (English-focused language modeling, multilingual language modeling, and masked multi-modality modeling) demonstrate the effectiveness of MH-MoE.
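A minimal sketch of the sub-token mechanism: each token is projected, split into head-sized sub-tokens, routed to experts (top-1 here for brevity), and merged back. Hyperparameters and the routing rule are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MHMoE(nn.Module):
    """Minimal Multi-Head Mixture-of-Experts sketch (assumed sizes/routing)."""

    def __init__(self, d_model=512, n_heads=4, n_experts=8):
        super().__init__()
        self.h, self.d_sub = n_heads, d_model // n_heads
        self.split = nn.Linear(d_model, d_model)   # multi-head projection
        self.merge = nn.Linear(d_model, d_model)   # re-integration
        self.router = nn.Linear(self.d_sub, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.d_sub, 4 * self.d_sub), nn.GELU(),
                          nn.Linear(4 * self.d_sub, self.d_sub))
            for _ in range(n_experts)])

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, s, d = x.shape
        sub = self.split(x).view(b, s * self.h, self.d_sub)  # sub-tokens
        gate = F.softmax(self.router(sub), dim=-1)
        top = gate.argmax(dim=-1)               # top-1 expert per sub-token
        out = torch.zeros_like(sub)
        for i, expert in enumerate(self.experts):
            mask = top == i                     # dispatch sub-token groups
            if mask.any():
                out[mask] = expert(sub[mask]) * gate[..., i][mask].unsqueeze(-1)
        return self.merge(out.view(b, s, d))    # back to original token form
```

With more sub-tokens than tokens competing for experts, a larger fraction of experts receives traffic, which is the intuition behind the improved expert activation.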
https://arxiv.org/abs/2404.15045
This paper studies the interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from fully connected layers with gradients obtained through variants of backpropagation. However, it is well understood that gradients are noisy, and alternatives like guided backpropagation have been proposed to obtain better visualizations at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and quantitatively improves the interpretability properties of different networks, as measured by several interpretability methods.
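The proposed regularizer can be sketched as follows: compute the input gradient once by standard backpropagation (kept differentiable) and once by guided backpropagation (as a detached target), then penalize their mismatch. The hook handling and the 0.1 weight below are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def guided_relu_hook(module, grad_in, grad_out):
    # Guided backprop: let only positive gradients pass through ReLUs
    # (the ReLU backward has already masked positions with negative input).
    return (torch.clamp(grad_in[0], min=0.0),)

def gradient_alignment_loss(model, images, labels):
    """Sketch of the regularizer; assumes non-inplace nn.ReLU activations."""
    images = images.requires_grad_(True)

    # Standard backprop input-gradient, kept in the graph (create_graph=True)
    # so the alignment term itself can be optimized.
    task_loss = F.cross_entropy(model(images), labels)
    grad_std = torch.autograd.grad(task_loss, images, create_graph=True)[0]

    # Guided-backprop input-gradient, used as a fixed (detached) target.
    handles = [m.register_full_backward_hook(guided_relu_hook)
               for m in model.modules() if isinstance(m, torch.nn.ReLU)]
    guided_loss = F.cross_entropy(model(images), labels)
    grad_guided = torch.autograd.grad(guided_loss, images)[0].detach()
    for h in handles:
        h.remove()

    reg = F.mse_loss(grad_std, grad_guided)
    return task_loss + 0.1 * reg   # 0.1 is an assumed weight, not the paper's
```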
https://arxiv.org/abs/2404.15024
Explanations obtained from transformer-based architectures in the form of raw attention can be seen as a class-agnostic saliency map. Additionally, attention-based pooling serves as a form of masking in feature space. Motivated by this observation, we design an attention-based pooling mechanism intended to replace Global Average Pooling (GAP) at inference. This mechanism, called Cross-Attention Stream (CA-Stream), comprises a stream of cross-attention blocks interacting with features at different network depths. CA-Stream enhances interpretability in models while preserving recognition performance.
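A minimal sketch of such an attention-based replacement for GAP: a learned class token cross-attends to the flattened spatial features, so the pooled vector is an attention-weighted combination rather than a uniform average. One block is shown; CA-Stream stacks one per network depth.

```python
import torch
import torch.nn as nn

class CrossAttentionPool(nn.Module):
    """Attention-based pooling sketch in the spirit of CA-Stream."""

    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learned class token
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                  # feats: (B, C, H, W) CNN features
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)      # (B, H*W, C)
        q = self.cls.expand(b, -1, -1)                 # query: class token
        pooled, _ = self.attn(self.norm(q), tokens, tokens)
        return pooled.squeeze(1)                       # (B, C), replaces GAP
```

The attention weights over the H*W positions are exactly the feature-space mask alluded to above, which is what makes the mechanism interpretable.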
https://arxiv.org/abs/2404.14996
Understanding emotions and expressions is a task of interest across multiple disciplines, especially for improving user experiences. Contrary to common perception, it has been shown that emotions are not discrete entities but instead exist along a continuum. People understand discrete emotions differently due to a variety of factors, including cultural background, individual experiences, and cognitive biases. Therefore, most approaches to expression understanding, particularly those relying on discrete categories, are inherently biased. In this paper, we present a comparative in-depth analysis of two common datasets (AffectNet and EMOTIC) equipped with the components of the circumplex model of affect. Further, we propose a model for the prediction of facial expressions tailored for lightweight applications. Using a small-scale MaxViT-based model architecture, we evaluate the impact of training with discrete expression category labels alongside continuous valence and arousal labels. We show that considering valence and arousal in addition to discrete category labels significantly improves expression inference. The proposed model outperforms the current state-of-the-art models on AffectNet, establishing it as the best-performing model for inferring valence and arousal, achieving a 7% lower RMSE. Training scripts and trained weights to reproduce our results can be found here: this https URL.
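The training setup suggested by the abstract (a shared backbone with a categorical head and a continuous valence/arousal head, optimized jointly) can be sketched as below; the head dimensions and the loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionHead(nn.Module):
    """Joint discrete + circumplex head; the MaxViT backbone that produces
    `feats` is omitted, and the sizes here are illustrative."""

    def __init__(self, feat_dim=512, n_classes=8):
        super().__init__()
        self.cls = nn.Linear(feat_dim, n_classes)   # discrete categories
        self.va = nn.Linear(feat_dim, 2)            # (valence, arousal)

    def forward(self, feats):
        return self.cls(feats), torch.tanh(self.va(feats))  # VA in [-1, 1]

def joint_loss(logits, va_pred, labels, va_true, w_va=1.0):
    # Categorical term plus regression on the circumplex coordinates; the
    # abstract reports that adding the continuous terms improves inference.
    return F.cross_entropy(logits, labels) + w_va * F.mse_loss(va_pred, va_true)
```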
https://arxiv.org/abs/2404.14975
Learning-based image stitching techniques typically involve three distinct stages: registration, fusion, and rectangling. These stages are often performed sequentially and trained independently, leading to potential cascading error propagation and complex parameter tuning challenges. In rethinking the mathematical modeling of the fusion and rectangling stages, we discovered that these processes can be effectively combined into a single inpainting problem with varying intensity. Therefore, we propose the Simple and Robust Stitcher (SRStitcher), an efficient training-free image stitching method that merges the fusion and rectangling stages into a unified model. By employing a weighted mask and a large-scale generative model, SRStitcher solves the fusion and rectangling problems in a single inference, without additional training or fine-tuning of other models. Our method not only simplifies the stitching pipeline but also enhances fault tolerance towards misregistration errors. Extensive experiments demonstrate that SRStitcher outperforms state-of-the-art (SOTA) methods in both quantitative assessments and qualitative evaluations. The code is released at this https URL
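Conceptually, the unified stage can be sketched as: coarsely blend the registered images with a weighted mask, then hand a single mask covering both seam artifacts and empty borders to a generative inpainting model. The mask construction below is a simplified assumption, and `inpaint_model` stands in for the large-scale generative model.

```python
import numpy as np

def stitch_with_inpainting(warped_a, warped_b, inpaint_model):
    """warped_a, warped_b: (H, W, 3) registered images on a shared canvas,
    zeros outside their valid regions. A guess at the mechanism, not
    SRStitcher's exact mask design."""
    valid_a = (warped_a.sum(-1, keepdims=True) > 0).astype(np.float32)
    valid_b = (warped_b.sum(-1, keepdims=True) > 0).astype(np.float32)
    overlap = valid_a * valid_b

    # Weighted fusion in the overlap, hard copy elsewhere.
    coarse = (warped_a * valid_a + warped_b * valid_b) / np.maximum(
        valid_a + valid_b, 1.0)

    # One inpainting mask covers both jobs: the overlap band (fusion seams)
    # and the holes outside either view (rectangling the irregular boundary).
    hole = 1.0 - np.maximum(valid_a, valid_b)
    mask = np.clip(hole + overlap, 0.0, 1.0)
    return inpaint_model(image=coarse, mask=mask)   # single inference
```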
https://arxiv.org/abs/2404.14951
Anomaly detection in real-world scenarios poses challenges due to dynamic and often unknown anomaly distributions, requiring robust methods that operate under an open-world assumption. This challenge is exacerbated in practical settings, where models are employed by private organizations, precluding data sharing due to privacy and competitive concerns. Despite potential benefits, the sharing of anomaly information across organizations is restricted. This paper addresses the question of enhancing outlier detection within individual organizations without compromising data confidentiality. We propose a novel method leveraging representation learning and federated learning techniques to improve the detection of unknown anomalies. Specifically, our approach utilizes latent representations obtained from client-owned autoencoders to refine the decision boundary of inliers. Notably, only model parameters are shared between organizations, preserving data privacy. The efficacy of our proposed method is evaluated on two standard financial tabular datasets and an image dataset for anomaly detection in a distributed setting. The results demonstrate a strong improvement in the classification of unknown outliers during the inference phase for each organization's model.
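The privacy-preserving core, that only autoencoder parameters ever leave each organization, can be sketched with plain FedAvg over reconstruction-trained autoencoders. The paper's latent-representation refinement of the inlier decision boundary is richer than this; at inference, each organization would typically score samples by reconstruction error against the updated global model.

```python
import copy
import torch
import torch.nn as nn

def make_autoencoder(d_in=32, d_hid=8):
    return nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                         nn.Linear(d_hid, d_in))

def fedavg(global_model, client_loaders, rounds=5, local_epochs=1):
    """Plain FedAvg sketch; client_loaders yield (features,) tuples and
    stand in for each organization's private data."""
    for _ in range(rounds):
        client_states = []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)      # data never leaves
            opt = torch.optim.Adam(local.parameters(), lr=1e-3)
            for _ in range(local_epochs):
                for (x,) in loader:
                    opt.zero_grad()
                    loss = nn.functional.mse_loss(local(x), x)  # reconstruction
                    loss.backward()
                    opt.step()
            client_states.append(local.state_dict())
        # Server: average parameters only, across organizations.
        avg = {k: torch.stack([s[k] for s in client_states]).mean(0)
               for k in client_states[0]}
        global_model.load_state_dict(avg)
    return global_model
```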
https://arxiv.org/abs/2404.14933
Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecular discovery. With the advent of deep learning, Graph Neural Networks (GNNs) have emerged as a cornerstone in Graph Machine Learning (Graph ML), facilitating the representation and processing of graph structures. Recently, LLMs have demonstrated unprecedented capabilities in language tasks and are widely adopted in a variety of applications such as computer vision and recommender systems. This remarkable success has also attracted interest in applying LLMs to the graph domain. Increasing efforts have been made to explore the potential of LLMs in advancing Graph ML's generalization, transferability, and few-shot learning ability. Meanwhile, graphs, especially knowledge graphs, are rich in reliable factual knowledge, which can be utilized to enhance the reasoning capabilities of LLMs and potentially alleviate their limitations such as hallucinations and the lack of explainability. Given the rapid progress of this research direction, a systematic review summarizing the latest advancements for Graph ML in the era of LLMs is necessary to provide an in-depth understanding to researchers and practitioners. Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterogeneity and out-of-distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.
https://arxiv.org/abs/2404.14928
In multi-sample keyword spotting, each keyword class is represented by multiple spoken instances, called samples. A naïve approach to detect keywords in a target sequence consists of querying all samples of all classes using sub-sequence dynamic time warping. However, the resulting processing time increases linearly with respect to the number of samples belonging to each class. Alternatively, only a single Fréchet mean can be queried for each class, resulting in reduced processing time but usually also in worse detection performance as the variability of the query samples is not captured sufficiently well. In this work, multi-sample dynamic time warping is proposed to compute class-specific cost-tensors that include the variability of all query samples. To significantly reduce the computational complexity during inference, these cost tensors are converted to cost matrices before applying dynamic time warping. In experimental evaluations for few-shot keyword spotting, it is shown that this method yields a very similar performance as using all individual query samples as templates while having a runtime that is only slightly slower than when using Fréchet means.
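A minimal sketch of the idea: build the per-sample cost tensor, reduce it over the sample axis into one class-specific cost matrix, and run subsequence DTW on that matrix once. The mean reduction and the crude truncation of queries to a common length are assumptions; the paper handles the alignment of query samples more carefully.

```python
import numpy as np

def multi_sample_cost_matrix(queries, target, reduce=np.mean):
    """queries: list of (len_i, feat) arrays for one class;
    target: (T, feat) array. Collapses the per-sample cost tensor into one
    class-specific cost matrix that reflects the variability of all samples."""
    L = min(len(q) for q in queries)
    tensor = np.stack([np.linalg.norm(q[:L, None] - target[None], axis=-1)
                       for q in queries])      # (n_samples, L, T)
    return reduce(tensor, axis=0)              # (L, T) cost matrix

def subsequence_dtw(cost):
    """Standard subsequence DTW: free start and end along the target axis."""
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0] = cost[0]                           # may start anywhere in target
    for i in range(1, n):
        for j in range(m):
            best = acc[i - 1, j]
            if j > 0:
                best = min(best, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = cost[i, j] + best
    return acc[-1].min()                       # may end anywhere in target
```

A keyword is detected when this matching cost falls below a per-class threshold; only one DTW pass per class is needed, instead of one per sample.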
https://arxiv.org/abs/2404.14903
With the increasingly giant scale of (causal) large language models (LLMs), inference efficiency has become one of the core concerns alongside improved performance. In contrast to the memory footprint, the latency bottleneck seems to be of greater importance, as there can be billions of requests to an LLM (e.g., GPT-4) per day. The bottleneck is mainly due to the autoregressive nature of LLMs, where tokens can only be generated sequentially during decoding. To alleviate the bottleneck, the idea of speculative execution, which originates from the field of computer architecture, has been introduced to LLM decoding in a \textit{draft-then-verify} style. Under this regime, a sequence of tokens is drafted at a fast pace using some heuristics, and the tokens are then verified in parallel by the LLM. As the costly sequential inference is parallelized, LLM decoding speed can be significantly boosted. Driven by the success of LLMs over the past couple of years, a growing literature in this direction has emerged. Yet, a position survey summarizing the current landscape and drawing a roadmap for the future development of this promising area is still lacking. To meet this demand, we present the very first survey paper that reviews and unifies the literature on speculative execution in LLMs (e.g., blockwise parallel decoding, speculative decoding, etc.) in a comprehensive framework and a systematic taxonomy. Based on the taxonomy, we present a critical review and comparative analysis of the current approaches. Finally, we highlight various key challenges and future directions to further develop the area.
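The draft-then-verify loop itself is simple to sketch. Below is a greedy variant in which a small model drafts a few tokens and the large model verifies them all in one parallel pass; speculative sampling replaces the exact-match test with rejection sampling, which is omitted here. Both models are assumed callables returning next-token logits for a batch-1 sequence.

```python
import torch

@torch.no_grad()
def speculative_decode(target_lm, draft_lm, prompt_ids, n_draft=4, steps=32):
    """Greedy draft-then-verify sketch; ids: (1, seq) token tensor."""
    ids = prompt_ids
    for _ in range(steps):
        # Draft: the small model proposes n_draft tokens autoregressively.
        draft = ids
        for _ in range(n_draft):
            nxt = draft_lm(draft)[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=-1)

        # Verify: one parallel pass of the large model scores all proposals.
        preds = target_lm(draft).argmax(-1)   # target's choice at each position
        n_ok = 0
        for k in range(n_draft):
            pos = ids.shape[1] + k
            if draft[0, pos] != preds[0, pos - 1]:
                break                          # first disagreement: stop
            n_ok += 1

        # Keep accepted tokens plus one free token from the target model.
        cut = ids.shape[1] + n_ok
        ids = torch.cat([draft[:, :cut], preds[:, cut - 1:cut]], dim=-1)
    return ids
```

When the draft model agrees with the target most of the time, each iteration emits several tokens for roughly the cost of one large-model forward pass.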
https://arxiv.org/abs/2404.14897
Although the convolutional neural network (CNN) has achieved excellent performance in vision tasks by extracting the intra-sample representation, it incurs a higher training expense due to stacking numerous convolutional layers. Recently, graph neural networks (GNNs), as bilinear models, have succeeded in exploring the underlying topological relationships among graph data with only a few graph neural layers. Unfortunately, a GNN cannot be directly applied to non-graph data due to the lack of graph structure, and it suffers high inference latency in large-scale scenarios. Inspired by these complementary strengths and weaknesses, \textit{we discuss a natural question: how can these two heterogeneous networks be bridged?} In this paper, we propose a novel CNN2GNN framework to unify the CNN and GNN via distillation. Firstly, to break the limitations of the GNN, a differentiable sparse graph learning module is designed as the head of the network to dynamically learn the graph for inductive learning. Then, response-based distillation is introduced to transfer the knowledge from the CNN to the GNN and bridge the two heterogeneous networks. Notably, by extracting the intra-sample representation of a single instance and the topological relationship among the dataset simultaneously, the distilled ``boosted'' two-layer GNN achieves much higher performance on Mini-ImageNet than CNNs containing dozens of layers, such as ResNet152.
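The response-based distillation that bridges the two networks is, at its core, standard softened-logit matching; a sketch with assumed temperature and mixing weight:

```python
import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, labels,
                               T=4.0, alpha=0.5):
    """The GNN student mimics the CNN teacher's softened class responses
    while also fitting the labels (T and alpha are assumed values)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T   # rescale for T-softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Here the student's logits would come from the two-layer GNN running on the dynamically learned sparse graph, while the teacher's logits come from the deep CNN.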
https://arxiv.org/abs/2404.14822
Chain-of-thought (CoT) prompting can guide language models to engage in complex multi-step reasoning. The quality of the provided demonstrations significantly impacts the success of downstream inference tasks. While existing automated methods prioritize accuracy and semantics in these demonstrations, we show that the underlying reasoning patterns play a more crucial role in such tasks. In this paper, we propose Pattern-Aware CoT (PA-CoT), a prompting method that considers the diversity of demonstration patterns. By incorporating patterns such as step length and the reasoning process within intermediate steps, PA-CoT effectively mitigates the bias induced by demonstrations and enables better generalization to diverse scenarios. We conduct experiments on nine reasoning benchmark tasks using two open-source LLMs. The results show that our method substantially enhances reasoning performance and exhibits robustness to errors. The code will be made publicly available.
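One way to picture pattern-aware selection: score candidate demonstrations by a pattern feature such as step count, then pick demonstrations spread across that spectrum rather than clustered at one pattern. The step-count proxy below is a stand-in assumption for the paper's richer notion of pattern.

```python
def select_pattern_diverse_demos(candidates, k=4):
    """candidates: list of (question, chain_of_thought, answer) strings,
    with chain steps separated by newlines. Returns k demonstrations spread
    across the step-length spectrum (a simplified pattern feature)."""
    by_length = sorted(candidates, key=lambda c: len(c[1].splitlines()))
    if k <= 1 or len(by_length) <= k:
        return by_length[:k]
    # Evenly spaced indices so that short and long reasoning patterns are
    # both represented in the prompt, reducing pattern-induced bias.
    idx = [round(i * (len(by_length) - 1) / (k - 1)) for i in range(k)]
    return [by_length[i] for i in dict.fromkeys(idx)]
```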
https://arxiv.org/abs/2404.14812
A graph is a fundamental data model that represents various entities and their complex relationships in society and nature, such as social networks, transportation networks, financial networks, and biomedical systems. Recently, large language models (LLMs) have showcased a strong generalization ability for handling various NLP and multimodal tasks, answering users' arbitrary questions and generating domain-specific content. Compared with graph learning models, LLMs enjoy superior advantages in addressing the challenges of generalizing graph tasks by eliminating the need to train graph learning models and reducing the cost of manual annotation. In this survey, we conduct a comprehensive investigation of existing LLM studies on graph data, which summarizes the relevant graph analytics tasks solved by advanced LLM models and points out the remaining challenges and future directions. Specifically, we study the key problems of LLM-based generative graph analytics (LLM-GGA) in three categories: LLM-based graph query processing (LLM-GQP), LLM-based graph inference and learning (LLM-GIL), and graph-LLM-based applications. LLM-GQP focuses on the integration of graph analytics techniques and LLM prompts, including graph understanding and knowledge graph (KG) based augmented retrieval, while LLM-GIL focuses on learning and reasoning over graphs, including graph learning, graph-formed reasoning, and graph representation. We summarize the useful prompts incorporated into LLMs to handle different graph downstream tasks. Moreover, we give a summary of LLM model evaluation, benchmark datasets/tasks, and an in-depth analysis of the pros and cons of LLM models. We also explore open problems and future directions in this exciting interdisciplinary research area of LLMs and graph analytics.
https://arxiv.org/abs/2404.14809
Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in the dialogue history, without any model parameter update. Despite this convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayesian in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on Bayes' theorem, ByCS focuses on the inverse inference conditioned on the test input. Following the assumption that an accurate inverse inference probability (likelihood) will result in an accurate inference probability (posterior), in-context examples are selected based on their inverse inference results. Diverse and extensive cross-task and cross-modality experiments are performed with speech, text, and image examples. Experimental results show the efficacy and robustness of our ByCS method on various models, tasks, and modalities.
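A sketch of the inverse-inference scoring: obtain a provisional answer for the test input, then rank each candidate example by how likely the LLM is to reproduce that example's output when conditioned on the test input. `zero_shot_answer` and `lm_loglikelihood` are hypothetical helpers supplied by the caller, and the prompt format is an assumption.

```python
def select_examples_bycs(candidates, test_input, zero_shot_answer,
                         lm_loglikelihood, k=4):
    """candidates: list of (example_input, example_output) pairs.
    zero_shot_answer(x) -> provisional answer for x (hypothetical helper);
    lm_loglikelihood(prompt, completion) -> log p(completion | prompt)
    under the LLM (hypothetical helper)."""
    provisional = zero_shot_answer(test_input)
    scored = []
    for ex_input, ex_output in candidates:
        # Inverse inference: condition on the test input-output pair and
        # measure how well the candidate example is then predicted.
        prompt = f"{test_input} -> {provisional}\n{ex_input} ->"
        scored.append((lm_loglikelihood(prompt, f" {ex_output}"),
                       (ex_input, ex_output)))
    # By Bayes' theorem, a high inverse-inference likelihood should indicate
    # an example that also yields an accurate posterior at inference time.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [ex for _, ex in scored[:k]]
```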
https://arxiv.org/abs/2404.14716
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5\% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in this https URL.
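The reason one or two sampling steps suffice is the consistency property: the network maps a latent at any noise level directly to a clean estimate, so a second step only re-noises and refines. A generic sketch of multistep consistency sampling follows; the sigma values are illustrative, and FlashSpeech additionally conditions on the audio prompt and its prosody generator.

```python
import torch

@torch.no_grad()
def consistency_sample(f, shape, sigmas=(80.0, 2.0), cond=None):
    """f(x, sigma, cond) -> clean estimate is the (latent) consistency model;
    its interface here is an assumption. One step = the first call; each
    extra sigma adds a re-noise-and-refine step."""
    x = torch.randn(shape) * sigmas[0]
    x = f(x, sigmas[0], cond)              # step 1: direct jump to the data
    for sigma in sigmas[1:]:               # optional step 2: re-noise, refine
        x = f(x + torch.randn_like(x) * sigma, sigma, cond)
    return x
```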
https://arxiv.org/abs/2404.14700
Accurate and efficient language translation is an extremely important information processing task, and machine-learning-enabled automated translation that is accurate and fast is a major topic of interest in the machine learning and data science communities. In this study, we examine using local Generative Pretrained Transformer (GPT) models to perform automated zero-shot, black-box, sentence-wise translation from multiple natural languages into English text. We benchmark 16 different open-source GPT models, with no custom fine-tuning, from the Huggingface LLM repository for translating 50 different non-English languages into English, using translated TED Talk transcripts as the reference dataset. These GPT model inference calls are performed strictly locally, on single A100 Nvidia GPUs. The reported benchmark metrics are language translation accuracy, using the BLEU, GLEU, METEOR, and chrF text overlap measures, and wall-clock time for each sentence translation. The best overall performing GPT model for translating into English text is ReMM-v2-L2-13B for the BLEU metric (mean score across all tested languages of $0.152$), the GLEU metric (mean score of $0.256$), and the METEOR metric (mean score of $0.438$), while Llama2-chat-AYT-13B performs best for the chrF metric (mean score across all tested languages of $0.448$).
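The reported overlap metrics are all computable with standard toolkits. A sketch of the scoring side of such a benchmark using sacreBLEU and NLTK (which may differ from the authors' exact tooling), with per-sentence wall-clock timing left to the caller:

```python
import sacrebleu
from nltk.translate.gleu_score import corpus_gleu
from nltk.translate.meteor_score import meteor_score

def evaluate_translations(hypotheses, references):
    """hypotheses, references: parallel lists of sentence strings.
    NLTK's METEOR needs the wordnet corpus (nltk.download('wordnet'))."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    gleu = corpus_gleu([[r.split()] for r in references],
                       [h.split() for h in hypotheses])
    meteor = sum(meteor_score([r.split()], h.split())
                 for r, h in zip(references, hypotheses)) / len(hypotheses)
    return {"BLEU": bleu, "chrF": chrf, "GLEU": gleu, "METEOR": meteor}
```

Note that sacreBLEU reports BLEU/chrF on a 0-100 scale while the paper's means are on 0-1, so a direct comparison would require dividing by 100.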
https://arxiv.org/abs/2404.14680
The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo while requiring $2\times$ fewer pre-training tokens. Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to the MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors. Our source code along with pre-trained model weights and training recipes is available at \url{this https URL}. Additionally, OpenELM models can be found on HuggingFace at: \url{this https URL}.
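Layer-wise scaling means the per-layer attention width and FFN multiplier are interpolated across depth rather than held constant, so the parameter budget concentrates where it helps most. A sketch with illustrative interpolation ranges (not OpenELM's released configuration):

```python
def layer_wise_scaling(n_layers=28, alpha=(0.5, 1.0), beta=(0.5, 4.0),
                       d_model=2048, head_dim=64):
    """Returns one config dict per transformer layer. `alpha` scales the
    attention width and `beta` the FFN multiplier, each interpolated
    linearly from the first layer to the last (assumed ranges)."""
    cfgs = []
    for i in range(n_layers):
        t = i / (n_layers - 1)
        a = alpha[0] + t * (alpha[1] - alpha[0])   # attention scaler
        b = beta[0] + t * (beta[1] - beta[0])      # FFN multiplier
        cfgs.append({"layer": i,
                     "n_heads": max(1, int(a * d_model / head_dim)),
                     "ffn_dim": int(b * d_model)})
    return cfgs
```

Summing the implied parameter counts over layers recovers the total budget, which is how such a schedule can match a uniform model's size while redistributing capacity toward deeper layers.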
https://arxiv.org/abs/2404.14619
Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower-cost (e.g., edge) devices tend to lag behind in response quality. Therefore, in this work we propose a hybrid inference approach that combines their respective strengths to save cost and maintain quality. Our approach uses a router that assigns queries to the small or large model based on the predicted query difficulty and the desired quality level. The desired quality level can be tuned dynamically at test time to seamlessly trade quality for cost as per the scenario requirements. In experiments, our approach allows us to make up to 40% fewer calls to the large model, with no drop in response quality.
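The router reduces to a thresholded difficulty score, with the threshold exposed as the test-time quality knob. A minimal sketch under assumed interfaces:

```python
def route_query(query, difficulty_model, small_llm, large_llm, quality=0.5):
    """difficulty_model(query) -> probability in [0, 1] that the small model
    would fail on this query (an assumed scorer interface). Raising `quality`
    routes more traffic to the large model; lowering it saves cost."""
    p_hard = difficulty_model(query)
    if p_hard <= 1.0 - quality:
        return small_llm(query)    # easy enough for the cheap model
    return large_llm(query)        # predicted hard: pay for the large model
```

Sweeping `quality` at test time traces out the cost-quality curve; the abstract's result corresponds to a point with up to 40% fewer large-model calls at unchanged response quality.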
https://arxiv.org/abs/2404.14618
This paper introduces \textbf{Q-tuning}, a novel approach for continual prompt tuning that enables the lifelong learning of a pre-trained language model. When learning a new task, Q-tuning trains a task-specific prompt by adding it to a prompt queue consisting of the prompts from older tasks. To better transfer the knowledge of old tasks, we design an adaptive knowledge aggregation technique that reweights previous prompts in the queue with a learnable low-rank matrix. Once the prompt queue reaches its maximum capacity, we leverage a PCA-based eviction rule to reduce the queue's size, allowing the newly trained prompt to be added while preserving the primary knowledge of old tasks. In order to mitigate the accumulation of information loss caused by the eviction, we additionally propose a globally shared prefix prompt and a memory retention regularization based on information theory. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods substantially on continual prompt tuning benchmarks. Moreover, our approach enables lifelong learning on linearly growing task sequences while requiring constant complexity for training and inference.
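As a guess at how a PCA-based eviction rule could work, the full queue can be compressed into a few synthetic prompts spanning its principal directions; the paper's exact rule may differ.

```python
import torch

def evict_with_pca(prompt_queue, keep):
    """prompt_queue: list of (prompt_len, dim) tensors, one per old task.
    Returns `keep` synthetic prompts along the queue's top principal
    directions, preserving its dominant variation in fewer slots
    (an assumed mechanism, not Q-tuning's published rule)."""
    L, D = prompt_queue[0].shape
    flat = torch.stack(prompt_queue).flatten(1)        # one row per prompt
    mean = flat.mean(0, keepdim=True)
    u, s, v = torch.pca_lowrank(flat - mean, q=keep)   # v: (L*D, keep)
    scaled_axes = v.T * s[:, None]                     # dominant variation
    return [(mean + axis).view(L, D) for axis in scaled_axes]
```

For example, `evict_with_pca([torch.randn(20, 768) for _ in range(10)], keep=4)` shrinks a ten-prompt queue to four slots, freeing room for the newly trained prompt.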
https://arxiv.org/abs/2404.14607