In this paper, we study semi-supervised graph classification, which aims at accurately predicting the categories of graphs in scenarios with limited labeled graphs and abundant unlabeled graphs. Despite the promising capability of graph neural networks (GNNs), they typically require a large number of costly labeled graphs, while a wealth of unlabeled graphs fail to be effectively utilized. Moreover, GNNs are inherently limited to encoding local neighborhood information using message-passing mechanisms, thus lacking the ability to model higher-order dependencies among nodes. To tackle these challenges, we propose a Hypergraph-Enhanced DuAL framework named HEAL for semi-supervised graph classification, which captures graph semantics from the perspectives of the hypergraph and the line graph, respectively. Specifically, to better explore the higher-order relationships among nodes, we design a hypergraph structure learning module that adaptively learns complex node dependencies beyond pairwise relations. Meanwhile, based on the learned hypergraph, we introduce a line graph to capture the interactions between hyperedges, thereby better mining the underlying semantic structures. Finally, we develop a relational consistency learning scheme that facilitates knowledge transfer between the two branches and provides better mutual guidance. Extensive experiments on real-world graph datasets verify the effectiveness of the proposed method against existing state-of-the-art methods.
https://arxiv.org/abs/2405.04773
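The hyperedge-interaction idea above can be illustrated with a toy construction: treat each hyperedge as a node and connect two such nodes whenever the hyperedges share a vertex. This is a minimal sketch of the generic line-graph construction, not HEAL's implementation; the toy hypergraph and the function name are illustrative assumptions.

```python
def hyperedge_line_graph(hyperedges):
    """hyperedges: list of vertex sets. Returns edges between the
    indices of hyperedges that intersect (the line-graph edges)."""
    edges = []
    for i in range(len(hyperedges)):
        for j in range(i + 1, len(hyperedges)):
            if hyperedges[i] & hyperedges[j]:  # shared vertex?
                edges.append((i, j))
    return edges

# Three hyperedges; only the first two overlap (they share "c").
H = [{"a", "b", "c"}, {"c", "d"}, {"e", "f"}]
print(hyperedge_line_graph(H))  # → [(0, 1)]
```

Methods like the paper's would typically also weight these line-graph edges (e.g. by intersection size) before message passing over them.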
To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.
https://arxiv.org/abs/2405.04771
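The "motion patch" construction described above can be sketched as follows: a motion sequence of per-frame joint coordinates is regrouped by body part, so each part becomes a small 2D grid (frames x joints) with 3 channels per cell, analogous to an RGB image patch fed to ViT. The body-part groupings, sizes, and function name here are illustrative assumptions, not the paper's exact layout.

```python
# Hypothetical body-part grouping for a 16-joint skeleton.
BODY_PARTS = {
    "torso": [0, 1, 2, 3],
    "left_arm": [4, 5, 6],
    "right_arm": [7, 8, 9],
    "left_leg": [10, 11, 12],
    "right_leg": [13, 14, 15],
}

def to_motion_patches(sequence):
    """sequence: list of frames, each a list of 16 (x, y, z) joint tuples.
    Returns one patch per body part: rows = frames, cols = that part's
    sorted joints, with the 3 coordinates playing the role of channels."""
    patches = {}
    for part, joint_ids in BODY_PARTS.items():
        patch = [[frame[j] for j in sorted(joint_ids)] for frame in sequence]
        patches[part] = patch
    return patches

# Toy example: 2 frames, 16 joints, joint j fixed at (j, j, j).
seq = [[(float(j), float(j), float(j)) for j in range(16)] for _ in range(2)]
patches = to_motion_patches(seq)
print(len(patches))                 # 5 body-part patches
print(len(patches["left_arm"][0]))  # 3 joints per row in the arm patch
```

Because each patch only depends on a body-part grouping, a different skeleton just needs a different `BODY_PARTS` table, which is one way to read the claimed robustness to varying skeleton structures.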
Modern large language models (LLMs) have a significant amount of world knowledge, which enables strong performance in commonsense reasoning and knowledge-intensive tasks when harnessed properly. Language models can also learn social biases, which carry significant potential for societal harm. Many mitigation strategies have been proposed for LLM safety, but it is unclear how effective they are at eliminating social biases. In this work, we propose a new methodology for attacking language models with knowledge-graph-augmented generation. We refactor natural language stereotypes into a knowledge graph, and use adversarial attacking strategies to induce biased responses from several open- and closed-source language models. We find our method increases bias in all models, even those trained with safety guardrails. This demonstrates the need for further research in AI safety, and further work in this new adversarial space.
https://arxiv.org/abs/2405.04756
Attack knowledge graph construction seeks to convert textual cyber threat intelligence (CTI) reports into structured representations, portraying the evolutionary traces of cyber attacks. Even though previous research has proposed various methods to construct attack knowledge graphs, they generally suffer from limited generalization capability to diverse knowledge types, as well as the requirement of expertise in model design and tuning. To address these limitations, we seek to utilize Large Language Models (LLMs), which have achieved enormous success in a broad range of tasks owing to exceptional capabilities in both language understanding and zero-shot task fulfillment. Thus, we propose a fully automatic LLM-based framework for constructing attack knowledge graphs, named AttacKG+. Our framework consists of four consecutive modules: rewriter, parser, identifier, and summarizer, each of which is implemented by instruction prompting and in-context learning empowered by LLMs. Furthermore, we upgrade the existing attack knowledge schema and propose a comprehensive version. We represent a cyber attack as a temporally unfolding event, each temporal step of which encapsulates three layers of representation, including behavior graph, MITRE TTP labels, and state summary. Extensive evaluation demonstrates that: 1) our formulation seamlessly satisfies the information needs of threat event analysis, 2) our construction framework is effective in faithfully and accurately extracting the information defined by AttacKG+, and 3) our attack graph directly benefits downstream security practices such as attack reconstruction. All code and datasets will be released upon acceptance.
https://arxiv.org/abs/2405.04753
We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and quantifiable properties pertaining to them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging, as the agent needs to figure out not just which target objects the query pertains to, but also a consensus on their states for the query to be answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries using multiple forms of semantic similarity. We validate the generated dataset via a large-scale user study conducted on M-Turk, and introduce it as S-EQA, the first dataset tackling EQA with situational queries. Our user study establishes the authenticity of S-EQA, with a high 97.26% of the generated queries being deemed answerable given the consensus object data. Conversely, we observe a low correlation of 46.2% between the LLM-predicted answers and human-evaluated ones, indicating the LLM's poor capability in directly answering situational queries, while establishing S-EQA's usability in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which, unlike other simulators, contains several objects with modifiable states that also visually appear different upon modification -- enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.
https://arxiv.org/abs/2405.04732
When agents that are independently trained (or designed) to complete their individual tasks are deployed in a shared environment, their joint actions may produce negative side effects (NSEs). As their training does not account for the behavior of other agents or their joint action effects on the environment, the agents have no prior knowledge of the NSEs of their actions. We model the problem of mitigating NSEs in a cooperative multi-agent system as a Lexicographic Decentralized Markov Decision Process with two objectives. The agents must optimize the completion of their assigned tasks while mitigating NSEs. We assume independence of transitions and rewards with respect to the agents' tasks but the joint NSE penalty creates a form of dependence in this setting. To improve scalability, the joint NSE penalty is decomposed into individual penalties for each agent using credit assignment, which facilitates decentralized policy computation. Our results in simulation on three domains demonstrate the effectiveness and scalability of our approach in mitigating NSEs by updating the policies of a subset of agents in the system.
https://arxiv.org/abs/2405.04702
Large Language Models (LLMs) deployed on edge devices learn through fine-tuning and updating a certain portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall required resources remain a heavy burden on edge devices. Instead, Retrieval-Augmented Generation (RAG), a resource-efficient LLM learning method, can improve the quality of the LLM-generated content without updating model parameters. However, the RAG-based LLM may involve repetitive searches on the profile data in every user-LLM interaction. This search can lead to significant latency along with the accumulation of user data. Conventional efforts to decrease latency result in restricting the size of saved user data, thus reducing the scalability of RAG as user data continuously grows. It remains an open question: how to free RAG from the constraints of latency and scalability on edge devices? In this paper, we propose a novel framework to accelerate RAG via Computing-in-Memory (CiM) architectures. It accelerates matrix multiplications by performing in-situ computation inside the memory while avoiding the expensive data transfer between the computing unit and memory. Our framework, Robust CiM-backed RAG (RoCR), utilizing a novel contrastive learning-based training method and noise-aware training, can enable RAG to efficiently search profile data with CiM. To the best of our knowledge, this is the first work utilizing CiM to accelerate RAG.
https://arxiv.org/abs/2405.04700
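The retrieval step that a CiM architecture would accelerate reduces to a matrix-vector product: scoring a query embedding against every stored profile embedding, which in-memory computing performs in situ instead of streaming the matrix to a compute unit. This is a minimal sketch of that scoring step only; the embeddings, similarity choice, and function name are illustrative assumptions, not the RoCR design.

```python
def retrieve(query_vec, doc_vecs, top_k=2):
    """Return indices of the top_k documents by dot-product similarity."""
    scores = []
    for i, doc in enumerate(doc_vecs):
        # One inner product per document: together these form W @ q,
        # the operation CiM computes inside the memory array.
        score = sum(q * d for q, d in zip(query_vec, doc))
        scores.append((score, i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]

docs = [[1.0, 0.0, 0.0],   # doc 0
        [0.0, 1.0, 0.0],   # doc 1
        [0.7, 0.7, 0.0]]   # doc 2
print(retrieve([1.0, 0.2, 0.0], docs))  # → [0, 2]
```

On a CPU this loop is O(Nd) per query, which is exactly the latency growth with accumulating user data that the paper targets.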
This paper describes a new research paradigm for studying human-AI collaboration, named "human-AI mutual learning", defined as the process where humans and AI agents preserve, exchange, and improve knowledge during human-AI collaboration. We describe relevant methodologies, motivations, domain examples, benefits, challenges, and future research agenda under this paradigm.
https://arxiv.org/abs/2405.04687
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations, with a special focus on Turkish. We conduct an in-depth analysis to evaluate the impact of training strategies, model choices, and data availability on the performance of LLMs designed for underrepresented languages. Our approach includes two methodologies: (i) adapting existing LLMs originally pretrained in English to understand Turkish, and (ii) developing a model from the ground up using Turkish pretraining data, both supplemented with supervised fine-tuning on a novel Turkish instruction-tuning dataset aimed at enhancing reasoning capabilities. The relative performance of these methods is evaluated through the creation of a new leaderboard for Turkish LLMs, featuring benchmarks that assess different reasoning and knowledge skills. Furthermore, we conducted experiments on data and model scaling, both during pretraining and fine-tuning, simultaneously emphasizing the capacity for knowledge transfer across languages and addressing the challenges of catastrophic forgetting encountered during fine-tuning on a different language. Our goal is to offer a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making natural language processing (NLP) benefits more globally accessible.
https://arxiv.org/abs/2405.04685
Large language models (LLMs) have demonstrated substantial commonsense understanding through numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs have a significant discrepancy in performance when tested on culture-specific commonsense knowledge for different cultures; (2) LLMs' general commonsense capability is affected by cultural context; and (3) The language used to query the LLMs can impact their performance on cultural-related tasks. Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally aware language models.
https://arxiv.org/abs/2405.04655
When a teacher provides examples for a student to study, these examples must be informative, enabling a student to progress from their current state toward a target concept or skill. Good teachers must therefore simultaneously infer what students already know and adapt their teaching to students' changing state of knowledge. There is increasing interest in using computational models, particularly large language models, as pedagogical tools. As students, language models in particular have shown a remarkable ability to adapt to new tasks given small numbers of examples. But how effectively can these models adapt as teachers to students of different types? To study this question, we introduce a suite of models and evaluation methods we call AdapT. AdapT has two components: (1) a collection of simulated Bayesian student models that can be used for evaluation of automated teaching methods; (2) a platform for evaluation with human students, to characterize the real-world effectiveness of these methods. We additionally introduce (3) AToM, a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimizes for the correctness of future beliefs. In evaluations of simulated students across three learning domains (fraction arithmetic, English morphology, function learning), AToM systematically outperforms LLM-based and standard Bayesian teaching models. In human experiments, both AToM and LLMs outperform non-adaptive random example selection. Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
https://arxiv.org/abs/2405.04495
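The adaptive-teaching loop above can be made concrete with a toy simulated Bayesian student: the student holds a posterior over candidate concepts, and the teacher picks the example that most increases posterior mass on the target concept. This is a generic illustration of the idea (in the spirit of AdapT's simulated students), not AToM itself; the divisibility-rule concept space and all names are illustrative assumptions.

```python
# Candidate concepts the student entertains, each a membership test.
CONCEPTS = {"even": lambda x: x % 2 == 0,
            "div3": lambda x: x % 3 == 0,
            "div6": lambda x: x % 6 == 0}

def update(posterior, example):
    """Bayesian update on a positive example: concepts inconsistent
    with the example lose all mass; the rest are renormalised."""
    new = {c: p for c, p in posterior.items() if CONCEPTS[c](example)}
    total = sum(new.values())
    return {c: p / total for c, p in new.items()}

def best_example(posterior, target, pool):
    """Teacher: pick the example maximising the student's posterior
    probability of the target concept after the update."""
    return max(pool, key=lambda x: update(posterior, x).get(target, 0.0))

posterior = {"even": 1 / 3, "div3": 1 / 3, "div6": 1 / 3}
x = best_example(posterior, "div6", pool=[2, 3, 6, 9])
posterior = update(posterior, x)
print(x, round(posterior["div6"], 2))  # → 6 0.33
```

Note how the teacher avoids 2, 3, and 9 (each would rule out "div6" entirely under this positive-example update), while even the best single example leaves the student uncertain; the paper's point is precisely that good teaching must track such residual uncertainty across a sequence of examples.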
Traditional knowledge graph embedding (KGE) methods typically require preserving the entire knowledge graph (KG) with significant training costs when new knowledge emerges. To address this issue, the continual knowledge graph embedding (CKGE) task has been proposed to train the KGE model by learning emerging knowledge efficiently while simultaneously preserving decent old knowledge. However, the explicit graph structure in KGs, which is critical for the above goal, has been heavily ignored by existing CKGE methods. On the one hand, existing methods usually learn new triples in a random order, destroying the inner structure of new KGs. On the other hand, old triples are preserved with equal priority, failing to alleviate catastrophic forgetting effectively. In this paper, we propose a competitive method for CKGE based on incremental distillation (IncDE), which considers the full use of the explicit graph structure in KGs. First, to optimize the learning order, we introduce a hierarchical strategy, ranking new triples for layer-by-layer learning. By employing the inter- and intra-hierarchical orders together, new triples are grouped into layers based on the graph structure features. Secondly, to preserve the old knowledge effectively, we devise a novel incremental distillation mechanism, which facilitates the seamless transfer of entity representations from the previous layer to the next one, promoting old knowledge preservation. Finally, we adopt a two-stage training paradigm to avoid the over-corruption of old knowledge influenced by under-trained new knowledge. Experimental results demonstrate the superiority of IncDE over state-of-the-art baselines. Notably, the incremental distillation mechanism contributes to improvements of 0.2%-6.5% in the mean reciprocal rank (MRR) score.
https://arxiv.org/abs/2405.04453
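The layer-by-layer ordering described above can be sketched by grouping new triples according to how far their entities sit from the already-known graph, so structurally "closer" knowledge is learned first. The BFS-distance heuristic, toy triples, and function name here are illustrative assumptions, not IncDE's exact ranking rule.

```python
from collections import deque

def layer_new_triples(old_entities, new_triples):
    """Group new (head, relation, tail) triples into layers by BFS
    distance from the old entity set over the new triples' graph."""
    adj = {}
    for h, r, t in new_triples:
        adj.setdefault(h, []).append(t)
        adj.setdefault(t, []).append(h)
    # Seed BFS with old entities that actually touch the new graph.
    dist = {e: 0 for e in old_entities if e in adj}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = {}
    for h, r, t in new_triples:
        d = min(dist.get(h, float("inf")), dist.get(t, float("inf")))
        layers.setdefault(d, []).append((h, r, t))
    return [layers[d] for d in sorted(layers)]

old = {"Paris", "France"}
new = [("Paris", "capital_of", "France"),
       ("Paris", "located_in", "IleDeFrance"),
       ("IleDeFrance", "borders", "Normandy")]
for i, layer in enumerate(layer_new_triples(old, new)):
    print(i, layer)
```

Training then proceeds layer by layer, with each layer's entity representations initialized (and, in the paper, distilled) from the previous one.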
Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling -- vector retrieval -- can be computationally complex. This is exacerbated when retrieving vectors which have high dimension $d$ relative to the number of vectors, $N$, in the database. Exact nearest neighbor retrieval has been generally acknowledged to be an $O(Nd)$ problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. $d=1$ vector retrieval) carries, there ought to be a way to retrieve an organized representation of vectors without brute-forcing our way to a solution. For low dimension (e.g. $d=2$ or $d=3$ cases), \texttt{kd-trees} provide an $O(d\log N)$ algorithm for retrieval. Unfortunately, in practice the algorithm deteriorates rapidly to an $O(dN)$ solution at high dimensions (e.g. $d=128$). We propose a novel algorithm for logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by \texttt{kd-trees}. The algorithm achieves $O(d\log N)$ look-up with 100\% recall on 10 million $d=128$ uniformly randomly generated vectors.\footnote{Code available at this https URL}
https://arxiv.org/abs/2405.04435
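A minimal pure-Python \texttt{kd-tree} illustrates the retrieval behavior the paper builds on (the classic structure, not FERN itself): lookups prune one subtree whenever the splitting plane is farther away than the current best candidate, which is what yields near-logarithmic lookups in low dimension and degrades toward a full scan as $d$ grows. The toy points are illustrative.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree; split axes cycle through dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Exact nearest-neighbour lookup with branch pruning."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    d = math.dist(point, query)
    if best is None or d < best[1]:
        best = (point, d)
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only descend the far side if the splitting plane is closer than the
    # current best -- the pruning that breaks down in high dimension.
    if abs(diff) < best[1]:
        best = nearest(far, query, best)
    return best

pts = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(pts)
point, dist = nearest(tree, (9.0, 2.0))
print(point)  # → (8.0, 1.0), the true nearest neighbour
```

In high dimension the pruning test `abs(diff) < best[1]` almost always passes, so both subtrees get visited and the lookup collapses to $O(dN)$, matching the deterioration described above.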
Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability in better addressing various downstream tasks (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on the dataset, we decouple the two types of features by the supervision design. Concretely, we directly split the visual representation into style and content features; the content features are supervised by a text recognition loss, while an alignment loss aligns the style features in the image pairs. Then, style features are employed in reconstructing the counterpart image via an image decoder with a prompt that indicates the counterpart's content. Such an operation effectively decouples the features based on their distinctive properties. To the best of our knowledge, this is the first time in the field of scene text that disentangles the inherent properties of the text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing.
https://arxiv.org/abs/2405.04377
Since late 2022, generative AI has taken the world by storm, with widespread use of tools including ChatGPT, Gemini, and Claude. Generative AI and large language model (LLM) applications are transforming how individuals find and access data and knowledge. However, the intricate relationship between open data and generative AI, and the vast potential it holds for driving innovation in this field, remain underexplored areas. This white paper seeks to unpack the relationship between open data and generative AI and explore possible components of a new Fourth Wave of Open Data: Is open data becoming AI ready? Is open data moving towards a data commons approach? Is generative AI making open data more conversational? Will generative AI improve open data quality and provenance? Towards this end, we provide a new Spectrum of Scenarios framework. This framework outlines a range of scenarios in which open data and generative AI could intersect and what is required from a data quality and provenance perspective to make open data ready for those specific scenarios. These scenarios include: pretraining, adaptation, inference and insight generation, data augmentation, and open-ended exploration. Through this process, we found that in order for data holders to embrace generative AI to improve open data access and develop greater insights from open data, they first must make progress around five key areas: enhance transparency and documentation, uphold quality and integrity, promote interoperability and standards, improve accessibility and usability, and address ethical considerations.
https://arxiv.org/abs/2405.04333
In this paper, we present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation, based on posterior sampling with diffusion models. We parameterize the reverberation operator using a filter with exponential decay for each frequency subband, and iteratively estimate the corresponding parameters as the speech utterance gets refined along the reverse diffusion trajectory. A measurement consistency criterion enforces the fidelity of the generated speech with the reverberant measurement, while an unconditional diffusion model implements a strong prior for clean speech generation. Without any knowledge of the room impulse response nor any coupled reverberant-anechoic data, we can successfully perform dereverberation in various acoustic scenarios. Our method significantly outperforms previous blind unsupervised baselines, and we demonstrate its increased robustness to unseen acoustic conditions in comparison to blind supervised methods. Audio samples and code are available online.
https://arxiv.org/abs/2405.04272
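The reverberation operator's parameterisation can be sketched as a per-subband filter whose magnitude decays exponentially over time, governed by one decay rate per band. This is a toy illustration of that parameterisation only (no diffusion model, no STFT machinery); the decay value, frame step, and function names are illustrative assumptions.

```python
import math

def subband_filter(decay_rate, n_frames, dt=0.01):
    """Exponentially decaying filter weights for one subband:
    w[t] = exp(-decay_rate * t * dt)."""
    return [math.exp(-decay_rate * t * dt) for t in range(n_frames)]

def apply_reverb(dry, weights):
    """Convolve a dry subband envelope with the decaying filter."""
    wet = [0.0] * len(dry)
    for t in range(len(dry)):
        for k, w in enumerate(weights):
            if t - k >= 0:
                wet[t] += w * dry[t - k]
    return wet

weights = subband_filter(decay_rate=5.0, n_frames=4)
wet = apply_reverb([1.0, 0.0, 0.0, 0.0], weights)  # unit impulse in
print([round(v, 3) for v in wet])  # → [1.0, 0.951, 0.905, 0.861]
```

In the paper's setting, such decay parameters would be estimated jointly with the clean speech along the reverse diffusion trajectory, rather than fixed in advance as here.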
We develop the first (to the best of our knowledge) provably correct neural networks for a precise computational task, with the proof of correctness generated by an automated verification algorithm without any human input. Prior work on neural network verification has focused on partial specifications that, even when satisfied, are not sufficient to ensure that a neural network never makes errors. We focus on applying neural network verification to computational tasks with a precise notion of correctness, where a verifiably correct neural network provably solves the task at hand with no caveats. In particular, we develop an approach to train and verify the first provably correct neural networks for compressed sensing, i.e., recovering sparse vectors from a number of measurements smaller than the dimension of the vector. We show that for modest problem dimensions (up to 50), we can train neural networks that provably recover a sparse vector from linear and binarized linear measurements. Furthermore, we show that the complexity of the network (number of neurons/layers) can be adapted to the problem difficulty and solve problems where traditional compressed sensing methods are not known to provably work.
https://arxiv.org/abs/2405.04260
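The compressed-sensing task the verified networks solve can be illustrated in its simplest form: recover a sparse $x$ from $y = Ax$ with fewer measurements than dimensions. For a 1-sparse $x$, picking the column of $A$ most correlated with $y$ already suffices; the matrix, signal, and function name below are illustrative assumptions, not the paper's networks.

```python
def recover_1sparse(A, y):
    """A: m x n measurement matrix (m < n), y: m measurements of a
    1-sparse x. Returns (index, coefficient) of the nonzero entry,
    chosen as the column of A most correlated with y."""
    m, n = len(A), len(A[0])
    best_j, best_corr, best_norm = 0, 0.0, 1.0
    for j in range(n):
        col = [A[i][j] for i in range(m)]
        corr = sum(c * v for c, v in zip(col, y))
        if abs(corr) > abs(best_corr):
            best_j = j
            best_corr = corr
            best_norm = sum(c * c for c in col)
    return best_j, best_corr / best_norm

# 2 measurements of a 3-dimensional signal x = [0, 0, 3].
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [3.0, 3.0]  # A @ [0, 0, 3]
print(recover_1sparse(A, y))  # → (2, 3.0)
```

The paper's contribution is, roughly, to train a network computing such a recovery map and then mechanically prove it correct for every admissible input, rather than testing it on samples.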
Corruptions due to data perturbations and label noise are prevalent in datasets from unreliable sources, which poses significant threats to model training. Despite existing efforts in developing robust models, current learning methods commonly overlook the possible co-existence of both corruptions, limiting the effectiveness and practicability of the model. In this paper, we develop an Effective and Robust Adversarial Training (ERAT) framework to simultaneously handle two types of corruption (i.e., data and label) without prior knowledge of their specifics. We propose a hybrid adversarial training scheme over multiple potential adversarial perturbations, alongside semi-supervised learning based on class-rebalancing sample selection, to enhance the resilience of the model to dual corruption. On the one hand, in the proposed adversarial training, the perturbation generation module learns multiple surrogate malicious data perturbations by taking a DNN model as the victim, while the model is trained to maintain semantic consistency between the original data and the hybrid perturbed data. This is expected to enable the model to cope with unpredictable perturbations in real-world data corruption. On the other hand, a class-rebalancing data selection strategy is designed to fairly differentiate clean labels from noisy labels. Semi-supervised learning is performed accordingly by discarding noisy labels. Extensive experiments demonstrate the superiority of the proposed ERAT framework.
https://arxiv.org/abs/2405.04191
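The class-rebalancing sample selection can be sketched with the common small-loss heuristic: keep the lowest-loss samples *per class* as presumably clean, so that easy classes cannot crowd out hard ones. The abstract does not give ERAT's exact selection rule, so this helper and its parameters are assumptions for illustration only.

```python
import numpy as np

def class_rebalanced_selection(losses, labels, keep_ratio=0.5):
    """Select the lowest-loss samples per class as presumably clean.

    Selecting per class (rather than globally over all samples) keeps the
    retained subset class-balanced; the rest are treated as noisy-labeled
    and handed to semi-supervised learning without their labels.
    """
    losses = np.asarray(losses, dtype=float)
    labels = np.asarray(labels)
    selected = np.zeros(len(losses), dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_keep = max(1, int(round(keep_ratio * len(idx))))
        # Keep the n_keep samples of class c with the smallest loss.
        keep = idx[np.argsort(losses[idx])[:n_keep]]
        selected[keep] = True
    return selected
```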
All fields of knowledge are being impacted by Artificial Intelligence. In particular, the Deep Learning paradigm enables the development of data analysis tools that support subject matter experts in a variety of sectors, from physics up to the recognition of ancient languages. Palaeontology is now observing this trend as well. This study explores the capability of Convolutional Neural Networks (CNNs), a particular class of Deep Learning algorithms specifically crafted for computer vision tasks, to classify images of isolated fossil shark teeth gathered from online datasets as well as from the authors' experience on Peruvian Miocene and Italian Pliocene fossil assemblages. The shark taxa that are included in the final, composite dataset (which consists of more than one thousand images) are representative of both extinct and extant genera, namely, Carcharhinus, Carcharias, Carcharocles, Chlamydoselachus, Cosmopolitodus, Galeocerdo, Hemipristis, Notorynchus, Prionace and Squatina. We developed a CNN, named SharkNet-X, specifically tailored on our recognition task, reaching a 5-fold cross validated mean accuracy of 0.85 to identify images containing a single shark tooth. Furthermore, we elaborated a visualization of the features extracted from images using the last dense layer of the CNN, achieved through the application of the clustering technique t-SNE. In addition, in order to understand and explain the behaviour of the CNN while giving a paleontological point of view on the results, we introduced the explainability method SHAP. To the best of our knowledge, this is the first instance in which this method is applied to the field of palaeontology. The main goal of this work is to showcase how Deep Learning techniques can aid in identifying isolated fossil shark teeth, paving the way for developing new information tools for automating the recognition and classification of fossils.
All fields of knowledge are being impacted by Artificial Intelligence. In particular, the Deep Learning paradigm enables the development of data analysis tools that support subject matter experts across many sectors, from physics to the recognition of ancient languages, and palaeontology is now following this trend as well. This study explores the capability of Convolutional Neural Networks (CNNs), a class of Deep Learning algorithms specifically designed for computer vision tasks, to classify images of isolated fossil shark teeth gathered from online datasets as well as from the authors' experience with Peruvian Miocene and Italian Pliocene fossil assemblages. The shark taxa included in the final composite dataset (comprising more than one thousand images) represent both extinct and extant genera, namely Carcharhinus, Carcharias, Carcharocles, Chlamydoselachus, Cosmopolitodus, Galeocerdo, Hemipristis, Notorynchus, Prionace and Squatina. We developed a CNN named SharkNet-X, tailored specifically to our recognition task, which reaches a 5-fold cross-validated mean accuracy of 0.85 in identifying images containing a single shark tooth. Furthermore, we visualized the features extracted by the last dense layer of the CNN by applying the clustering technique t-SNE. In addition, to understand and explain the behaviour of the CNN while offering a palaeontological perspective on the results, we introduced the explainability method SHAP; to the best of our knowledge, this is the first time this method has been applied in the field of palaeontology. The main goal of this work is to showcase how Deep Learning techniques can aid in identifying isolated fossil shark teeth, paving the way for new information tools that automate the recognition and classification of fossils.
https://arxiv.org/abs/2405.04189
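The 5-fold cross-validated mean accuracy reported above is a standard evaluation protocol and can be sketched as follows. This is a generic NumPy sketch, not SharkNet-X's code; `fit_predict` stands in for any classifier's train-then-predict routine.

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Split n sample indices into k shuffled, near-equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cv_mean_accuracy(fit_predict, X, y, k=5, seed=0):
    """Mean accuracy over k folds; fit_predict(X_tr, y_tr, X_te) -> predictions."""
    folds = kfold_indices(len(y), k, seed)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        preds = fit_predict(X[train], y[train], X[test])
        accs.append(float(np.mean(preds == y[test])))
    return float(np.mean(accs))
```

Each sample is held out exactly once, so the mean over folds estimates accuracy on unseen teeth rather than on the training images.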
In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic with such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability and possibility for recourse. This second part acts as a position for future research steps in the field and a caveat to a flourishing market of fake content checkers.
In the face of a new era of generative models, detecting artificially generated content has become a matter of utmost importance. The ability to create credible, minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and of unfair competition to human artists. This paper demonstrates the possibility, and surprising ease, of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic for such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability, and the possibility of recourse. This second part serves as a position statement on future research directions in the field and as a caveat for the flourishing market of fake-content checkers.
https://arxiv.org/abs/2405.04181
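Calibration, the first facet listed above, is commonly quantified with the expected calibration error (ECE): the confidence-weighted gap between a detector's reported confidence and its actual accuracy. This is a standard metric sketch, not code from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence and average |confidence - accuracy|.

    confidences: predicted probabilities in (0, 1]; correct: 0/1 outcomes.
    A well-calibrated detector (e.g. 90% confident -> right 90% of the time)
    scores near 0 even when its raw accuracy is high or low.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)
```

A 99.8%-accurate detector can still have a poor ECE if its confidences are extreme on the cases it gets wrong, which is exactly the deployment concern the paper raises.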