Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at the image level. Nevertheless, such a criterion may become insufficient for downstream tasks that require fine-grained vision representations, especially when region-level understanding is demanded of MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) that complements CLIP with a region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
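To make the region-text contrastive idea concrete, here is a minimal sketch (ours, not the paper's code): ViT patch embeddings are mean-pooled inside a box (the "spatial hint") to form a region embedding, then contrasted against region-caption embeddings with a symmetric InfoNCE loss. The pooling rule, tensor shapes, and grid size are all assumptions.

```python
import torch
import torch.nn.functional as F

def pool_region(patches: torch.Tensor, box: tuple, grid: int) -> torch.Tensor:
    """Mean-pool ViT patch embeddings inside a box given in [0,1] coords.

    patches: (grid*grid, d) patch embeddings from the image encoder.
    box: (x0, y0, x1, y1) spatial hint.
    """
    x0, y0, x1, y1 = (int(round(v * grid)) for v in box)
    grid_feats = patches.view(grid, grid, -1)
    region = grid_feats[y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)]
    return region.reshape(-1, region.shape[-1]).mean(dim=0)

def region_text_infonce(region_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE between N region embeddings and N region captions."""
    r = F.normalize(region_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    logits = r @ t.T / temperature
    labels = torch.arange(len(r))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# Toy usage: a 14x14 patch grid, 512-d embeddings, two boxed regions.
patches = torch.randn(14 * 14, 512)
boxes = [(0.1, 0.1, 0.5, 0.5), (0.5, 0.4, 0.9, 0.9)]
regions = torch.stack([pool_region(patches, b, grid=14) for b in boxes])
captions = torch.randn(2, 512)  # stand-in for text-encoder outputs
loss = region_text_infonce(regions, captions)
```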
https://arxiv.org/abs/2410.02746
Information Retrieval (IR) methods aim to identify documents relevant to a given query and have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, overlooking the fact that documents can contain multiple modalities, including text, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models to process and integrate text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we merge the representations of segmented passages into one single document representation, and we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse information retrieval scenarios covering both textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to its unified treatment of the multimodal information interleaved within documents.
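A hedged sketch of the holistic retrieval-then-rerank flow described above: per-passage embeddings are merged into one document vector (mean pooling is our stand-in for the paper's merger), documents are ranked as wholes, and a second stage recovers the most relevant passage inside the winning document.

```python
import numpy as np

def doc_embedding(passage_embs: np.ndarray) -> np.ndarray:
    """Merge per-passage embeddings into one document vector (mean pooling
    is an assumption; the paper may use a learned merger)."""
    v = passage_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def retrieve(query: np.ndarray, docs: list, k: int = 3):
    """Rank whole documents by cosine similarity to the query embedding."""
    q = query / np.linalg.norm(query)
    scores = [float(q @ doc_embedding(d)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])[:k]

def rerank_passages(query: np.ndarray, passage_embs: np.ndarray) -> int:
    """Second stage: identify the relevant passage within a retrieved doc."""
    q = query / np.linalg.norm(query)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    return int(np.argmax(p @ q))

# Toy usage with 768-d random embeddings for three documents.
rng = np.random.default_rng(0)
docs = [rng.normal(size=(n, 768)) for n in (4, 7, 2)]
query = rng.normal(size=768)
top = retrieve(query, docs)
best_passage = rerank_passages(query, docs[top[0]])
```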
https://arxiv.org/abs/2410.02729
Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming attack research further underscores the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.
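One plausible mechanism for this kind of text-embedding steering, sketched under our own assumptions: estimate an "inappropriate concept" direction as a difference of means and project it out of the prompt embedding before it reaches the diffusion model. The paper's adaptor is a learned module, not this closed form.

```python
import torch

def concept_direction(unsafe_embs: torch.Tensor, safe_embs: torch.Tensor) -> torch.Tensor:
    """Estimate a concept direction as the difference of mean embeddings
    between unsafe and safe prompts (a simple stand-in for the learned
    adaptor described in the paper)."""
    d = unsafe_embs.mean(dim=0) - safe_embs.mean(dim=0)
    return d / d.norm()

def steer(prompt_emb: torch.Tensor, direction: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
    """Remove the component of the prompt embedding along the concept
    direction, guiding generation away from the harmful concept."""
    return prompt_emb - strength * (prompt_emb @ direction) * direction

# Toy usage with 768-d CLIP-like text embeddings.
unsafe = torch.randn(32, 768)
safe = torch.randn(32, 768)
d = concept_direction(unsafe, safe)
steered = steer(torch.randn(768), d)
```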
https://arxiv.org/abs/2410.02710
For many Automatic Speech Recognition (ASR) tasks, audio features such as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the high dimensionality of the feature space. This paper presents an alternative approach to generating compressed spectrogram representations, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was then used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model using MFCC features.
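A minimal sketch of the kind of convolutional VAE described above: a small spectrogram fragment is compressed into a 13-dimensional embedding via the reparameterization trick and decoded back. The layer sizes and the toy 32x8 input shape are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramVAE(nn.Module):
    """Convolutional VAE compressing a toy 32x8 spectrogram fragment
    into a 13-dim embedding; exact architecture is assumed."""

    def __init__(self, latent_dim: int = 13):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> (16, 16, 4)
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> (32, 8, 2)
            nn.Flatten(),
        )
        self.mu = nn.Linear(512, latent_dim)
        self.logvar = nn.Linear(512, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Unflatten(1, (32, 8, 2)),
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # -> (16, 16, 4)
            nn.ConvTranspose2d(16, 1, 2, stride=2),                # -> (1, 32, 8)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    rec = F.mse_loss(recon, x, reduction="sum")                  # reconstruction
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL term
    return rec + kld

model = SpectrogramVAE()
x = torch.randn(4, 1, 32, 8)  # batch of toy spectrogram fragments
recon, mu, logvar = model(x)
loss = vae_loss(recon, x, mu, logvar)
```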
https://arxiv.org/abs/2410.02560
SDO-FM is a foundation model using data from NASA's Solar Dynamics Observatory (SDO) spacecraft, integrating three separate instruments to encapsulate the Sun's complex physical interactions into a multi-modal embedding space. This model can be used to streamline scientific investigations involving SDO by making its enormous datasets more computationally accessible for heliophysics research, and to enable investigations that require instrument fusion. We discuss four key components: an ingestion pipeline to create machine-learning-ready datasets, the model architecture and training approach, the resultant embeddings and fine-tunable models, and finally downstream fine-tuned applications. A key component of this effort has been to include subject-matter specialists at each stage of development, reviewing the scientific value and providing guidance for model architecture, dataset, and training paradigm decisions. This paper marks the release of our pretrained models and embedding datasets, available to the community on Hugging Face and this http URL.
https://arxiv.org/abs/2410.02530
We introduce Learning from Offline Foundation Features with Tensor Augmentations (LOFF-TA), an efficient training scheme designed to harness the capabilities of foundation models in limited resource settings where their direct development is not feasible. LOFF-TA involves training a compact classifier on cached feature embeddings from a frozen foundation model, resulting in up to $37\times$ faster training and up to $26\times$ reduced GPU memory usage. Because the embeddings of augmented images would be too numerous to store, yet the augmentation process is essential for training, we propose to apply tensor augmentations to the cached embeddings of the original non-augmented images. LOFF-TA makes it possible to leverage the power of foundation models, regardless of their size, in settings with limited computational capacity. Moreover, LOFF-TA can be used to apply foundation models to high-resolution images without increasing compute. In certain scenarios, we find that training with LOFF-TA yields better results than directly fine-tuning the foundation model.
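A sketch of the two-stage recipe under stated assumptions: features are cached once from a frozen foundation model (simulated here with random tensors), and a compact classifier trains only on tensor-augmented cached embeddings. The specific augmentations (random scaling plus Gaussian noise) are placeholders for the paper's tensor augmentations.

```python
import torch
import torch.nn as nn

def tensor_augment(emb: torch.Tensor) -> torch.Tensor:
    """Augment cached embeddings directly, in lieu of image augmentations
    whose embeddings would be too numerous to store (assumed forms)."""
    scale = 1.0 + 0.1 * torch.randn(emb.size(0), 1)
    return emb * scale + 0.05 * torch.randn_like(emb)

# Stage 1 (run once, offline): cache frozen foundation-model features.
# cached = torch.stack([foundation(img) for img in dataset])  # expensive
cached = torch.randn(1000, 1024)            # stand-in cache of 1024-d features
labels = torch.randint(0, 10, (1000,))

# Stage 2: train a compact classifier on augmented cached features only.
clf = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.AdamW(clf.parameters(), lr=1e-3)
for step in range(100):
    idx = torch.randint(0, len(cached), (64,))
    x = tensor_augment(cached[idx])          # no image decoding, no backbone pass
    loss = nn.functional.cross_entropy(clf(x), labels[idx])
    opt.zero_grad(); loss.backward(); opt.step()
```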
https://arxiv.org/abs/2410.02527
Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
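To make the second idea concrete, here is a toy contextual encoder under our own assumptions: the final document embedding fuses the document's own vector with a mean over neighbor-document embeddings. The fusion choice and dimensions are placeholders; the paper's architecture is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualEncoder(nn.Module):
    """Conditions a document embedding on its corpus neighbors by fusing
    the document's own vector with a mean over neighbor embeddings
    (the fusion rule is an assumption)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, doc_emb: torch.Tensor, neighbor_embs: torch.Tensor) -> torch.Tensor:
        ctx = neighbor_embs.mean(dim=0, keepdim=True).expand_as(doc_emb)
        return F.normalize(self.fuse(torch.cat([doc_emb, ctx], dim=-1)), dim=-1)

# Toy usage: a batch of base embeddings plus a sample of corpus neighbors.
enc = ContextualEncoder()
docs = torch.randn(8, 256)        # first-stage (biencoder-style) embeddings
neighbors = torch.randn(32, 256)  # embeddings of in-context corpus documents
contextualized = enc(docs, neighbors)
```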
https://arxiv.org/abs/2410.02525
As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential harms from deceptive behavior underscore the need to faithfully interpret their decision-making. While traditional probing methods have shown some effectiveness, they remain best suited to narrowly scoped tasks, while more comprehensive explanations are still necessary. To this end, we investigate meta-models: an architecture in which a "meta-model" takes activations from an "input-model" and answers natural language questions about the input-model's behavior. We evaluate meta-models' ability to generalize by training them on selected task types and assessing their out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area.
https://arxiv.org/abs/2410.02472
In this work, we describe our submissions for the Voice Privacy Challenge 2024. Rather than proposing a novel speech anonymization system, we enhance the provided baselines to meet all required conditions and improve evaluated metrics. Specifically, we implement emotion embedding and experiment with WavLM and ECAPA2 speaker embedders for the B3 baseline. Additionally, we compare different speaker and prosody anonymization techniques. Furthermore, we introduce Mean Reversion F0 for B5, which helps to enhance privacy without a loss in utility. Finally, we explore disentanglement models, namely $\beta$-VAE and NaturalSpeech3 FACodec.
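Mean Reversion F0 is not specified in detail above; the sketch below shows one natural reading, pulling each voiced F0 frame toward the utterance mean so speaker-specific pitch dynamics are flattened. The `alpha` knob and the voiced-frame convention (unvoiced frames encoded as 0) are our assumptions, not the submission's exact formula.

```python
import numpy as np

def mean_reversion_f0(f0: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Pull voiced F0 frames toward the utterance mean by a factor alpha,
    reducing speaker-identifying pitch variation while keeping intelligible
    prosody (our reading of 'Mean Reversion F0')."""
    voiced = f0 > 0                       # unvoiced frames conventionally 0
    mean_f0 = f0[voiced].mean()
    out = f0.copy()
    out[voiced] = alpha * mean_f0 + (1 - alpha) * f0[voiced]
    return out

# Toy usage on a synthetic pitch contour (Hz).
f0 = np.array([0.0, 180.0, 195.0, 210.0, 0.0, 170.0])
anonymized = mean_reversion_f0(f0)
```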
https://arxiv.org/abs/2410.02371
Text plays a crucial role in the transmission of human civilization, and teaching machines to generate online handwritten text in various styles presents an interesting and significant challenge. However, most prior work has concentrated on generating individual Chinese fonts, leaving complete text line generation largely unexplored. In this paper, we identify that text lines can naturally be divided into two components: layout and glyphs. Based on this division, we design a text line layout generator coupled with a diffusion-based stylized font synthesizer to address this challenge hierarchically. More concretely, the layout generator performs in-context-like learning based on the text content and the provided style references to generate positions for each glyph autoregressively. Meanwhile, the font synthesizer, which consists of a character embedding dictionary, a multi-scale calligraphy style encoder, and a 1D U-Net-based diffusion denoiser, generates each glyph at its position while imitating the calligraphy style extracted from the given style references. Qualitative and quantitative experiments on CASIA-OLHWDB demonstrate that our method is capable of generating structurally correct and indistinguishable imitation samples.
https://arxiv.org/abs/2410.02309
Phrases are fundamental linguistic units through which humans convey semantics. This study critically examines the capacity of API-based large language models (LLMs) to comprehend phrase semantics, utilizing three human-annotated datasets. We assess the performance of LLMs in executing phrase semantic reasoning tasks guided by natural language instructions and explore the impact of common prompting techniques, including few-shot demonstrations and Chain-of-Thought reasoning. Our findings reveal that LLMs greatly outperform traditional embedding methods across the datasets; however, they do not show a significant advantage over fine-tuned methods. The effectiveness of advanced prompting strategies shows variability. We conduct detailed error analyses to interpret the limitations faced by LLMs in comprehending phrase semantics. Code and data can be found at this https URL.
https://arxiv.org/abs/2410.02308
Modeling temporal characteristics plays a significant role in the representation learning of audio waveforms. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent improvement in retrieval accuracy compared with baselines. We also show the pretrained CoLLAP models can be transferred to various music information retrieval tasks with heterogeneous long-form multimodal contexts.
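As a rough illustration of the attention-weighted fusion score, here is a sketch under assumptions: a softmax over per-clip similarities serves as the attention that weighs each clip's contribution to the song-level score. The paper's mechanism is learned; this closed form is only illustrative.

```python
import torch
import torch.nn.functional as F

def collap_style_score(text_emb: torch.Tensor, clip_embs: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Attention-weighted fusion score between one caption embedding and the
    per-clip embeddings of a segmented song (assumed simplification).

    text_emb: (d,), clip_embs: (num_clips, d)
    """
    t = F.normalize(text_emb, dim=-1)
    c = F.normalize(clip_embs, dim=-1)
    sims = c @ t                               # per-clip similarity
    attn = torch.softmax(sims / temperature, dim=0)
    return (attn * sims).sum()                 # fused song-level score

# Toy usage: a 5-minute song split into 20 clips with 512-d embeddings.
song_clips = torch.randn(20, 512)
caption = torch.randn(512)
score = collap_style_score(caption, song_clips)
```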
https://arxiv.org/abs/2410.02271
With hundreds of thousands of language models available on Huggingface today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing in both accuracy and latency. Additionally, we demonstrate that our method can forecast a model's performance on multiple benchmarks without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g., whether a model is specialized for coding tasks, even without being explicitly trained on them. We open-source our dataset, code and embedder to facilitate further research and application.
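A toy sketch of an encoder-decoder scheme in this spirit: each LLM gets a learned vector, and a decoder predicts whether that LLM answers a given (pre-embedded) benchmark question correctly; routing then picks the model with the highest predicted logit. All dimensions and the routing rule are assumptions.

```python
import torch
import torch.nn as nn

class ModelEmbedder(nn.Module):
    """Learns one compact vector per LLM; a decoder predicts correctness
    on (model, question) pairs. Sizes are assumptions."""

    def __init__(self, num_models: int, dim: int = 64, q_dim: int = 384):
        super().__init__()
        self.model_emb = nn.Embedding(num_models, dim)
        self.decode = nn.Sequential(nn.Linear(dim + q_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, model_ids: torch.Tensor, question_embs: torch.Tensor):
        z = torch.cat([self.model_emb(model_ids), question_embs], dim=-1)
        return self.decode(z).squeeze(-1)  # logit: P(model answers correctly)

# Training signal: observed correctness of models on benchmark questions.
net = ModelEmbedder(num_models=100)
model_ids = torch.randint(0, 100, (256,))
questions = torch.randn(256, 384)          # from any off-the-shelf text embedder
correct = torch.randint(0, 2, (256,)).float()
loss = nn.functional.binary_cross_entropy_with_logits(net(model_ids, questions), correct)

# Routing: send a new query to the model with the highest predicted logit.
query = torch.randn(1, 384).expand(100, 384)
best_model = net(torch.arange(100), query).argmax().item()
```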
https://arxiv.org/abs/2410.02223
In-context learning (ICL) enables large language models (LLMs) to generalize to new tasks by incorporating a few in-context examples (ICEs) directly in the input, without updating parameters. However, the effectiveness of ICL heavily relies on the selection of ICEs, and conventional text-based embedding methods are often inadequate for tasks that require multi-step reasoning, such as mathematical and logical problem solving. This is due to the bias introduced by shallow semantic similarities that fail to capture the deeper reasoning structures required for these tasks. We present GraphIC, a novel approach that leverages graph-based representations of reasoning processes, coupled with Bayesian Networks (BNs) to select ICEs. Graph structures inherently filter out shallow semantics while preserving the core reasoning structure. Importantly, BNs capture the dependency of a node's attributes on its parent nodes, closely mirroring the hierarchical nature of human cognition, where each thought is shaped by preceding ones. This makes BNs particularly well-suited for multi-step reasoning tasks, aligning the process more closely with human-like reasoning. Extensive experiments across three types of reasoning tasks (mathematical reasoning, code generation, and logical reasoning) demonstrate that GraphIC outperforms both training-free and training-based models in selecting ICEs, excelling in terms of both effectiveness and efficiency. We show that GraphIC enhances ICL's performance and interpretability, significantly advancing ICE selection for multi-step reasoning tasks.
https://arxiv.org/abs/2410.02203
Recent work showed that retrieval based on embedding similarity (e.g., for retrieval-augmented generation) is vulnerable to poisoning: an adversary can craft malicious documents that are retrieved in response to broad classes of queries. We demonstrate that previous, HotFlip-based techniques produce documents that are very easy to detect using perplexity filtering. Even if generation is constrained to produce low-perplexity text, the resulting documents are recognized as unnatural by LLMs and can be automatically filtered from the retrieval corpus. We design, implement, and evaluate a new controlled generation technique that combines an adversarial objective (embedding similarity) with a "naturalness" objective based on soft scores computed using an open-source, surrogate LLM. The resulting adversarial documents (1) cannot be automatically detected using perplexity filtering and/or other LLMs, except at the cost of significant false positives in the retrieval corpus, yet (2) achieve similar poisoning efficacy to easily-detectable documents generated using HotFlip, and (3) are significantly more effective than prior methods for energy-guided generation, such as COLD.
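The combined objective can be pictured as below: embedding similarity to the targeted queries (the attack objective) plus a weighted "naturalness" term, here a mean token log-probability from a surrogate LM. This is a hedged sketch; the weighting, the candidate set, and the log-probability inputs are placeholders, not the paper's optimizer.

```python
import torch

def combined_objective(doc_emb, query_emb, logprob_per_token, lam=0.5):
    """Score a candidate adversarial document: cosine similarity to the
    query embedding plus lam times the mean token log-probability under an
    open-source surrogate LM. lam trades attack strength against
    detectability (value is illustrative)."""
    sim = torch.nn.functional.cosine_similarity(doc_emb, query_emb, dim=0)
    return sim + lam * logprob_per_token

# Toy usage: pick the best of several candidate documents.
query_emb = torch.randn(768)
candidates = [(torch.randn(768), torch.tensor(-3.2)),   # (embedding, mean log-prob)
              (torch.randn(768), torch.tensor(-2.1))]
scores = [combined_objective(e, query_emb, lp) for e, lp in candidates]
best = int(torch.stack(scores).argmax())
```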
https://arxiv.org/abs/2410.02163
Like any other useful technology, cryptocurrencies are sometimes used for criminal activities. While transactions are recorded on the blockchain, there exists a need for a more rapid and scalable method to detect addresses associated with fraudulent activities. We present RiskSEA, a scalable risk scoring system capable of effectively handling the dynamic nature of large-scale blockchain transaction graphs. The risk scoring system, which we implement for Ethereum, consists of: (1) a scalable approach to generating node2vec embeddings for the entire set of addresses to capture the graph topology; (2) transaction-based features to capture the transactional behavioral pattern of an address; and (3) a classifier model that combines the node2vec embedding and behavioral features to generate a risk score for each address. Efficiently generating node2vec embeddings for large-scale and dynamically evolving blockchain transaction graphs is challenging; we present two novel approaches for generating node2vec embeddings and effectively scaling them to the entire set of blockchain addresses: (1) node2vec embedding propagation and (2) dynamic node2vec embeddings. We present a comprehensive analysis of the proposed approaches. Our experiments show that combining both behavioral and node2vec features boosts classification performance significantly, and that the dynamic node2vec embeddings perform better than the propagated node2vec embeddings.
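One plausible reading of "node2vec embedding propagation", sketched with assumed details: an unseen address inherits the transaction-weighted mean of its known neighbors' node2vec embeddings, avoiding a full re-run of node2vec on the evolving graph.

```python
import numpy as np

def propagate_embedding(neighbors: list, weights: list, table: dict) -> np.ndarray:
    """Assign an embedding to a new, unseen address as the transaction-volume
    weighted mean of its neighbors' node2vec embeddings (our reading of
    'embedding propagation', not the exact formula)."""
    known = [(table[a], w) for a, w in zip(neighbors, weights) if a in table]
    if not known:
        return np.zeros(next(iter(table.values())).shape)
    vecs, ws = zip(*known)
    return np.average(np.stack(vecs), axis=0, weights=np.array(ws))

# Toy usage: a fresh address that transacted with two known addresses.
table = {"0xabc": np.random.randn(128), "0xdef": np.random.randn(128)}
new_emb = propagate_embedding(["0xabc", "0xdef", "0x123"], [2.0, 1.0, 5.0], table)
```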
https://arxiv.org/abs/2410.02160
Multimodal learning plays a crucial role in enabling machine learning models to fuse and utilize diverse data sources, such as text, images, and audio, to support a variety of downstream tasks. A unified representation across various modalities is particularly important for improving efficiency and performance. Recent binding methods, such as ImageBind (Girdhar et al., 2023), typically use a fixed anchor modality to align multimodal data in the anchor modality's embedding space. In this paper, we mathematically analyze fixed-anchor binding methods and uncover notable limitations: (1) over-reliance on the choice of the anchor modality, (2) failure to capture intra-modal information, and (3) failure to account for inter-modal correlation among non-anchored modalities. To address these limitations, we propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor; instead, it employs dynamically adjustable centroid-based anchors generated from all available modalities, resulting in a balanced and rich representation space. We theoretically demonstrate that our method captures three crucial properties of multimodal learning: intra-modal learning, inter-modal learning, and multimodal alignment, while also constructing a robust unified representation across all modalities. Our experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed method, showing that dynamic anchor methods outperform all fixed-anchor binding methods as the former capture more nuanced multimodal interactions.
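A simplified sketch of centroid-based anchoring under our assumptions: the anchor for each sample is the mean of its modality embeddings, and every modality is aligned to its own sample's centroid with an InfoNCE loss. The normalization and loss choices are ours, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def centroid_anchor_loss(modality_embs: list, temperature: float = 0.07):
    """Align every modality to a dynamic centroid anchor instead of a fixed
    anchor modality (a simplified reading of the approach above).

    modality_embs: list of (batch, d) tensors, one per modality.
    """
    embs = [F.normalize(e, dim=-1) for e in modality_embs]
    anchor = F.normalize(torch.stack(embs).mean(dim=0), dim=-1)  # (batch, d)
    labels = torch.arange(anchor.size(0))
    loss = 0.0
    for e in embs:
        logits = e @ anchor.T / temperature
        loss = loss + F.cross_entropy(logits, labels)
    return loss / len(embs)

# Toy usage: text / image / audio embeddings for a batch of 16 samples.
loss = centroid_anchor_loss([torch.randn(16, 512) for _ in range(3)])
```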
https://arxiv.org/abs/2410.02086
Temporal point processes (TPPs) are widely used to model the timing and occurrence of events in domains such as social networks, transportation systems, and e-commerce. In this paper, we introduce TPP-LLM, a novel framework that integrates large language models (LLMs) with TPPs to capture both the semantic and temporal aspects of event sequences. Unlike traditional methods that rely on categorical event type representations, TPP-LLM directly utilizes the textual descriptions of event types, enabling the model to capture rich semantic information embedded in the text. While LLMs excel at understanding event semantics, they are less adept at capturing temporal patterns. To address this, TPP-LLM incorporates temporal embeddings and employs parameter-efficient fine-tuning (PEFT) methods to effectively learn temporal dynamics without extensive retraining. This approach improves both predictive accuracy and computational efficiency. Experimental results across diverse real-world datasets demonstrate that TPP-LLM outperforms state-of-the-art baselines in sequence modeling and event prediction, highlighting the benefits of combining LLMs with TPPs.
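A small sketch of one common way to realize the temporal-embedding idea: sinusoidal encodings of (possibly irregular) event times, concatenated with the LLM's text embeddings of event-type descriptions. The paper's exact encoding and fusion may differ, and all sizes here are assumptions.

```python
import torch

def temporal_embedding(times: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Sinusoidal embedding of event timestamps, injecting temporal
    information alongside semantic text embeddings.

    times: (seq_len,) event timestamps.
    """
    freqs = torch.exp(-torch.arange(0, dim, 2).float() / dim
                      * torch.log(torch.tensor(10000.0)))
    angles = times[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# Toy usage: fuse time embeddings with text embeddings of event descriptions.
times = torch.tensor([0.0, 1.3, 1.7, 4.2])
text_embs = torch.randn(4, 4096)              # from the (PEFT-tuned) LLM
event_tokens = torch.cat([text_embs, temporal_embedding(times)], dim=-1)
```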
https://arxiv.org/abs/2410.02062
We introduce Cadenza, a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas as well as unconditional generations. To accomplish this, we propose a novel MIDI encoding method, PerTok (Performance Tokenizer), that captures minute expressive details whilst reducing sequence length by up to 59% and vocabulary size by up to 95% for polyphonic, monophonic and rhythmic tasks. The proposed framework comprises two sequential stages: 1) Composer and 2) Performer. The Composer model is a transformer-based Variational Autoencoder (VAE) with Rotary Positional Embeddings (RoPE) and an autoregressive decoder modified to more effectively integrate the latent codes of the input musical idea. The Performer model is a bidirectional transformer encoder that is separately trained to predict velocities and microtimings on MIDI sequences. Objective and human evaluations demonstrate Cadenza's versatile capability in 1) matching other unconditional state-of-the-art symbolic models in musical quality whilst sounding more expressive, and 2) composing new, expressive ideas that are stylistically related to the input whilst providing novel ideas to the user. Our framework is designed, researched and implemented with the objective of ethically providing inspiration for musicians.
https://arxiv.org/abs/2410.02060
Large language models (LLMs) have demonstrated remarkable progress in healthcare. However, a significant gap remains regarding LLMs' professionalism in domain-specific clinical practices, limiting their application in real-world diagnostics. In this work, we introduce ZODIAC, an LLM-powered framework with cardiologist-level professionalism designed to engage LLMs in cardiological diagnostics. ZODIAC assists cardiologists by extracting clinically relevant characteristics from patient data, detecting significant arrhythmias, and generating preliminary reports for review and refinement by cardiologists. To achieve cardiologist-level professionalism, ZODIAC is built on a multi-agent collaboration framework, enabling the processing of patient data across multiple modalities. Each LLM agent is fine-tuned using real-world patient data adjudicated by cardiologists, reinforcing the model's professionalism. ZODIAC undergoes rigorous clinical validation with independent cardiologists, evaluated across eight metrics that measure clinical effectiveness and address security concerns. Results show that ZODIAC outperforms industry-leading models, including OpenAI's GPT-4o, Meta's Llama-3.1-405B, and Google's Gemini-pro, as well as medical-specialist LLMs like Microsoft's BioGPT. ZODIAC demonstrates the transformative potential of specialized LLMs in healthcare by delivering domain-specific solutions that meet the stringent demands of medical practice. Notably, ZODIAC has been successfully integrated into electrocardiography (ECG) devices, exemplifying the growing trend of embedding LLMs into Software-as-Medical-Device (SaMD).
https://arxiv.org/abs/2410.02026