The limited diversity of standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained SSL models of different sizes. ARCH streamlines the benchmarking of ARL techniques through its unified access to a wide range of domains and its ability to readily incorporate new datasets and models. To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets. We argue that this wide-ranging evaluation provides valuable insights into state-of-the-art ARL methods and is useful for pinpointing promising research directions.
https://arxiv.org/abs/2405.00934
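A minimal sketch of the evaluation protocol a benchmark like ARCH implies: a frozen pre-trained encoder produces fixed embeddings, and only a lightweight classifier is fit per dataset. The summary-feature "encoder", the nearest-centroid probe, and the toy "acoustic event" clips below are all illustrative assumptions, not ARCH's actual models or datasets.

```python
# Sketch of linear-probe-style evaluation on frozen SSL embeddings.
# The encoder is a hypothetical stand-in for a pre-trained model.

def frozen_encoder(waveform):
    """Stand-in for a frozen pre-trained SSL model: mean/energy features."""
    n = len(waveform)
    mean = sum(waveform) / n
    energy = sum(x * x for x in waveform) / n
    return [mean, energy]

def fit_centroids(embeddings, labels):
    """Fit a nearest-centroid probe on the frozen embeddings."""
    sums, counts = {}, {}
    for emb, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, emb):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda y: sq_dist(centroids[y], emb))

# Toy "acoustic event" clips: a low-energy hum vs. a high-energy click.
train = [([0.1, 0.1, 0.1, 0.1], "hum"), ([0.9, -0.9, 0.9, -0.9], "click")]
embs = [frozen_encoder(w) for w, _ in train]
cents = fit_centroids(embs, [y for _, y in train])
pred = predict(cents, frozen_encoder([0.8, -0.8, 0.7, -0.7]))
```

Because the encoder stays frozen, swapping in a different pre-trained model or dataset only changes the embeddings, which is what makes this protocol easy to extend.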
Voice conversion is the task of transforming the voice characteristics of source speech while preserving its content. Self-supervised representation learning models are now increasingly used for content extraction. However, these representations retain considerable hidden speaker information, which leads to timbre leakage, while the prosodic information carried by the hidden units goes largely unused. To address these issues, we propose SAVC, a novel framework for expressive voice conversion based on soft speech units from HuBERT-Soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features separately. Specifically, we first introduce statistic perturbation, imposed by adversarial style augmentation, to eliminate speaker information. The prosody is then modeled implicitly on the soft speech units with knowledge distillation. Experimental results show that the intelligibility and naturalness of the converted speech outperform those of previous work.
https://arxiv.org/abs/2405.00603
We introduce a formal information-theoretic framework for image captioning by regarding it as a representation learning task. Our framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building upon this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. This approach leverages the intuition that a detailed examination of local patches can reduce error risks and address inaccuracies in global captions, either by correcting hallucinations or by adding missing details. Based on our theoretical framework, we formalize this intuition and provide a formal proof demonstrating the effectiveness of PoCa under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions while maintaining brevity and interpretability.
https://arxiv.org/abs/2405.00485
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values, with remarkable recent successes in fine-tuning large language models. Most existing RLHF paradigms make the underlying assumption that human preferences are relatively homogeneous and can be encoded by a single reward model. In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as humans' potential strategic behavior in providing feedback. Specifically, we propose two frameworks to address heterogeneous human feedback in principled ways: a personalization-based one and an aggregation-based one. For the former, we propose two approaches, based on representation learning and clustering respectively, for learning multiple reward models that trade off the bias (due to preference heterogeneity) and the variance (due to using fewer data to learn each model under personalization). We then establish sample complexity guarantees for both approaches. For the latter, we aim to adhere to the single-model framework, as already deployed in the current RLHF paradigm, by carefully aggregating diverse and truthful preferences from humans. We propose two approaches, based on reward and preference aggregation respectively: the former utilizes both utilitarian and Leximin approaches to aggregate individual reward models, with sample complexity guarantees; the latter directly aggregates the human feedback in the form of probabilistic opinions. Under the probabilistic-opinion-feedback model, we also develop an approach to handle strategic human labelers who may bias and manipulate the aggregated preferences with untruthful feedback. Based on ideas from mechanism design, our approach ensures truthful preference reporting, with the induced aggregation rule maximizing social welfare functions.
https://arxiv.org/abs/2405.00254
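To make the two reward-aggregation rules named above concrete, here is a toy sketch contrasting utilitarian aggregation (average reward across people) with a Leximin-style choice (maximize the worst-off person's reward). The numeric values are illustrative, not the paper's setup, and the Leximin rule here is a simplified tuple-comparison version.

```python
# Toy contrast between utilitarian and Leximin aggregation of
# per-person reward estimates over candidate responses.

def utilitarian(rewards_per_person):
    """Aggregate each candidate by its average reward across people."""
    return [sum(col) / len(col) for col in zip(*rewards_per_person)]

def leximin_choice(rewards_per_person):
    """Pick the candidate whose worst-off person is best off,
    breaking ties by the full sorted reward profile."""
    profiles = [tuple(sorted(col)) for col in zip(*rewards_per_person)]
    return max(range(len(profiles)), key=lambda i: profiles[i])

# Two labelers, two candidate responses: candidate 0 delights person A
# but leaves person B with nothing; candidate 1 is moderate for both.
rewards = [[2.0, 0.6],   # person A's reward for each candidate
           [0.0, 0.5]]   # person B's reward for each candidate
util_scores = utilitarian(rewards)
best_util = max(range(len(util_scores)), key=lambda i: util_scores[i])
best_leximin = leximin_choice(rewards)
```

The two rules disagree here: the utilitarian average favors candidate 0 (mean 1.0 vs. 0.55), while Leximin favors candidate 1 because its worst-off person gets 0.5 instead of 0.0, illustrating why the choice of aggregation rule matters under heterogeneous preferences.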
Despite great success in modeling visual perception, deep neural network-based image quality assessment (IQA) remains unreliable in real-world applications due to its vulnerability to adversarial perturbations and its inexplicit black-box structure. In this paper, we propose to build a trustworthy IQA model via Causal Perception inspired Representation Learning (CPRL), together with a score reflection attack method for IQA models. More specifically, we assume that each image is composed of a Causal Perception Representation (CPR) and a non-causal perception representation (N-CPR). The CPR serves as the causation of the subjective quality label and is invariant to imperceptible adversarial perturbations. Conversely, the N-CPR presents spurious associations with the subjective quality label and may change significantly under adversarial perturbations. To extract the CPR from each input image, we develop a soft-ranking-based channel-wise activation function to mediate the causally sufficient (beneficial for high prediction accuracy) and causally necessary (beneficial for high robustness) deep features, and optimize it with an intervention-based minimax game. Experiments on four benchmark databases show that the proposed CPRL method outperforms many state-of-the-art adversarial defense methods and provides explicit model interpretation.
https://arxiv.org/abs/2404.19567
Instance perception tasks (object detection, instance segmentation, pose estimation, counting) play a key role in industrial applications of visual models. Since supervised learning methods suffer from high labeling costs, few-shot learning methods that learn effectively from a limited number of labeled examples are desirable. Existing few-shot learning methods primarily focus on a restricted set of tasks, presumably due to the challenges involved in designing a generic model capable of representing diverse tasks in a unified manner. In this paper, we propose UniFS, a universal few-shot instance perception model that unifies a wide range of instance perception tasks by reformulating them into a dynamic point representation learning framework. Additionally, we propose Structure-Aware Point Learning (SAPL) to exploit the higher-order structural relationships among points and further enhance representation learning. Our approach makes minimal assumptions about the tasks, yet it achieves results competitive with highly specialized and well-optimized specialist models. Code will be released soon.
https://arxiv.org/abs/2404.19401
Knowledge graphs (KGs), which store an extensive number of relational facts (head, relation, tail), serve various applications. While many downstream tasks rely heavily on expressive modeling and predictive embedding of KGs, most current KG representation learning methods, in which each entity is embedded as a vector in Euclidean space and each relation as a transformation, follow an entity ranking protocol. On the one hand, such an embedding design cannot capture many-to-many relations. On the other hand, in many retrieval cases, users wish to get an exact set of answers without any ranking, especially when the results are expected to be precise, e.g., which genes cause an illness. Such scenarios are commonly referred to as "set retrieval". This work presents a pioneering study of the KG set retrieval problem. We show that set retrieval depends heavily on expressive modeling of many-to-many relations, and we propose a new KG embedding model, SpherE, to address this problem. SpherE is based on rotational embedding methods, but each entity is embedded as a sphere instead of a vector. While inheriting the high interpretability of rotation-based models, SpherE can more expressively model one-to-many, many-to-one, and many-to-many relations. Through extensive experiments, we show that SpherE addresses the set retrieval problem well while retaining good predictive ability to infer missing facts. The code is available at this https URL.
https://arxiv.org/abs/2404.19130
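A toy sketch of the core geometric idea described above: each entity is a sphere (a center plus a radius) rather than a point, a relation acts as a rotation, and a query (h, r, ?) is answered by the *set* of entities whose sphere contains the rotated head center, with no ranking. The 2-D embedding, the specific entities, and the containment rule are illustrative assumptions, not SpherE's actual parameterization.

```python
import math

def rotate(point, angle):
    """Apply a 2-D rotation, standing in for a rotational relation."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def in_sphere(point, center, radius):
    return math.dist(point, center) <= radius

def set_retrieve(head_center, relation_angle, entities):
    """Return every entity whose sphere contains the rotated head center."""
    q = rotate(head_center, relation_angle)
    return {name for name, (center, r) in entities.items()
            if in_sphere(q, center, r)}

# Hypothetical entities as (center, radius) spheres.
entities = {
    "gene_a": ((0.0, 1.0), 0.3),
    "gene_b": ((0.1, 0.9), 0.3),
    "gene_c": ((1.0, 0.0), 0.1),
}
# Head entity at (1, 0); the relation is a 90-degree rotation,
# which maps the head center to (0, 1).
answers = set_retrieve((1.0, 0.0), math.pi / 2, entities)
```

Because spheres can overlap, several tails can simultaneously satisfy the same (head, relation) query, which is exactly the many-to-many behavior a single-point embedding struggles to express.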
Talking head generation has recently attracted considerable attention due to its widespread application prospects, especially for digital avatars and 3D animation design. Inspired by this practical demand, several works have explored Neural Radiance Fields (NeRF) to synthesize talking heads. However, NeRF-based methods face two challenges: (1) difficulty in generating style-controllable talking heads, and (2) displacement artifacts around the neck in rendered images. To overcome these challenges, we propose a novel generative paradigm, the \textit{Embedded Representation Learning Network} (ERLNet), with two learning stages. First, an \textit{audio-driven FLAME} (ADF) module is constructed to produce facial expression and head pose sequences synchronized with the content audio and style video. Second, given the sequences deduced by the ADF, a novel \textit{dual-branch fusion NeRF} (DBF-NeRF) exploits these contents to render the final images. Extensive empirical studies demonstrate that the collaboration of these two stages enables our method to render more realistic talking heads than existing algorithms.
https://arxiv.org/abs/2404.19038
Understanding the severity of conditions shown in medical images is crucial for diagnosis, serving as a key guide for clinical assessment, treatment, and the evaluation of longitudinal progression. This paper proposes ConPrO: a novel representation learning method for severity assessment in medical images using Contrastive learning-integrated Preference Optimization. Unlike conventional contrastive learning methods that maximize the distance between classes, ConPrO injects into the latent vector the distance-preference knowledge between the various severity classes and the normal class. We systematically examine the key components of our framework to illuminate how contrastive prediction tasks acquire valuable representations. We show that our representation learning framework offers valuable severity ordering in the feature space while outperforming previous state-of-the-art methods on classification tasks, achieving 6% and 20% relative improvements over a supervised and a self-supervised baseline, respectively. In addition, we discuss severity indicators and related applications of preference comparison in the medical domain.
https://arxiv.org/abs/2404.18831
Text-rich graphs, which exhibit rich textual information on nodes and edges, are prevalent across a wide range of real-world business applications. Large Language Models (LLMs) have demonstrated remarkable abilities in understanding text, which also opens up the potential for more expressive modeling of text-rich graphs. Despite these capabilities, efficiently applying LLMs to representation learning on graphs presents significant challenges. Recently, parameter-efficient fine-tuning methods for LLMs have enabled efficient generalization to new tasks with minimal time and memory consumption. Inspired by this, we introduce Graph-aware Parameter-Efficient Fine-Tuning (GPEFT), a novel approach for efficient graph representation learning with LLMs on text-rich graphs. Specifically, we utilize a graph neural network (GNN) to encode structural information from neighboring nodes into a graph prompt. This prompt is then inserted at the beginning of the text sequence. To improve the quality of graph prompts, we pre-train the GNN to assist the frozen LLM in predicting the next token in the node text. Compared with existing joint GNN and LM approaches, our method directly generates the node embeddings from large language models at an affordable fine-tuning cost. We validate our approach through comprehensive experiments conducted on 8 different text-rich graphs, observing an average improvement of 2% in hit@1 and Mean Reciprocal Rank (MRR) in link prediction evaluations. Our results demonstrate the efficacy and efficiency of our model, showing that it can be smoothly integrated with various large language models, including OPT, LLaMA, and Falcon.
https://arxiv.org/abs/2404.18271
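A minimal sketch of the graph-prompt mechanism described above: a stand-in "GNN" pools neighbor features into a single prompt vector, which is then prepended to the node's token-embedding sequence before it enters the (frozen) LLM. The mean-pool aggregation, the 2-dimensional embeddings, and all values below are toy assumptions, not GPEFT's actual architecture.

```python
# Sketch: encode neighborhood structure into a graph prompt and
# insert it at the beginning of the text sequence.

def gnn_prompt(node_feat, neighbor_feats):
    """One-layer mean aggregation as a stand-in for the GNN encoder:
    mix the node's own features with the mean of its neighbors'."""
    agg = [sum(col) / len(neighbor_feats) for col in zip(*neighbor_feats)]
    return [0.5 * a + 0.5 * b for a, b in zip(node_feat, agg)]

def build_llm_input(prompt_vec, token_embeddings):
    """Prepend the graph prompt to the node text's embedding sequence."""
    return [prompt_vec] + token_embeddings

node = [1.0, 0.0]                      # the target node's own features
neighbors = [[0.0, 2.0], [0.0, 4.0]]   # its neighbors' features
tokens = [[0.1, 0.1], [0.2, 0.2]]      # embeddings of the node's text
seq = build_llm_input(gnn_prompt(node, neighbors), tokens)
```

The point of this design is that only the small GNN (and, in practice, the parameter-efficient adapter) needs training; the text tokens and the LLM that consumes `seq` stay untouched.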
Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge. Although rehearsal-based approaches have been fairly successful in mitigating catastrophic forgetting, they suffer from overfitting on buffered samples and prior information loss, hindering generalization under low-buffer regimes. Inspired by how humans learn using strong inductive biases, we propose IMEX-Reg to improve the generalization performance of experience rehearsal in CL under low-buffer regimes. Specifically, we employ a two-pronged implicit-explicit regularization approach using contrastive representation learning (CRL) and consistency regularization. To further leverage the global relationship between representations learned using CRL, we propose a regularization strategy that guides the classifier toward the activation correlations in the unit hypersphere of the CRL. Our results show that IMEX-Reg significantly improves generalization performance and outperforms rehearsal-based approaches in several CL scenarios. It is also robust to natural and adversarial corruptions and exhibits less task-recency bias. Additionally, we provide theoretical insights to further support our design decisions.
https://arxiv.org/abs/2404.18161
Graph Neural Networks (GNNs) have demonstrated state-of-the-art performance in various graph representation learning tasks. Recently, studies have revealed their vulnerability to adversarial attacks. In this work, we theoretically define the concept of expected robustness in the context of attributed graphs and relate it to the classical definition of adversarial robustness in the graph representation learning literature. Our definition allows us to derive an upper bound on the expected robustness of Graph Convolutional Networks (GCNs) and Graph Isomorphism Networks subject to node feature attacks. Building on these findings, we connect the expected robustness of GNNs to the orthonormality of their weight matrices and consequently propose an attack-independent, more robust variant of the GCN, called Graph Convolutional Orthonormal Robust Networks (GCORNs). We further introduce a probabilistic method to estimate the expected robustness, which allows us to evaluate the effectiveness of GCORN on several real-world datasets. Our experiments show that GCORN outperforms available defense methods. Our code is publicly available at: \href{this https URL}{this https URL}.
https://arxiv.org/abs/2404.17947
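To illustrate the orthonormality idea behind GCORN, here is a toy regularizer measuring how far a weight matrix W is from satisfying WᵀW = I; the abstract links closeness to orthonormality with robustness to node-feature perturbations. The plain-Python linear algebra and the squared-Frobenius-norm penalty are a sketch of the general principle, not the paper's actual training objective.

```python
# Toy orthonormality penalty: squared Frobenius norm of (W^T W - I).
# A penalty of zero means the columns of W are orthonormal.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def orthonormal_penalty(w):
    """Sum of squared entries of W^T W - I."""
    wtw = matmul(transpose(w), w)
    n = len(wtw)
    return sum((wtw[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))

identity = [[1.0, 0.0], [0.0, 1.0]]   # already orthonormal
skewed = [[2.0, 0.0], [0.0, 1.0]]     # stretches one direction
```

An orthonormal W preserves the norm of its input, so a small feature perturbation cannot be amplified as it passes through the layer; the skewed matrix, by contrast, doubles perturbations along its first axis.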
Modeling the dynamics of interacting entities using an evolving graph is an essential problem in fields such as financial networks and e-commerce. Traditional approaches focus primarily on pairwise interactions, limiting their ability to capture the complexity of real-world interactions involving multiple entities and their intricate relationship structures. This work addresses the problem of forecasting higher-order interaction events in multi-relational recursive hypergraphs, using a dynamic graph representation learning framework that can capture complex relationships involving multiple entities. The proposed model, \textit{Relational Recursive Hyperedge Temporal Point Process} (RRHyperTPP), uses an encoder that learns a dynamic node representation based on historical interaction patterns, followed by a hyperedge link prediction-based decoder to model the occurrence of events. The learned representations are then used for downstream tasks of forecasting the type and time of interactions. The main challenge in learning from hyperedge events is that the number of possible hyperedges grows exponentially with the number of nodes in the network, which makes computing the negative log-likelihood of the temporal point process expensive, as calculating the survival function requires a summation over all possible hyperedges. In our work, we use noise contrastive estimation to learn the parameters of our model, and our experiments show that it outperforms previous state-of-the-art methods for interaction forecasting.
https://arxiv.org/abs/2404.17943
To deal with the heterogeneity resulting from label distribution skew and data scarcity in distributed machine learning scenarios, this paper proposes a novel Personalized Federated Learning (PFL) algorithm, named Federated Contrastive Representation Learning (FedCRL). FedCRL introduces contrastive representation learning (CRL) on shared representations to facilitate knowledge acquisition by clients. Specifically, both local model parameters and averaged values of local representations are treated as shareable information to the server, and both are aggregated globally. CRL is applied between local and global representations to regularize personalized training, drawing similar representations closer and separating dissimilar ones, thereby enhancing local models with external knowledge and avoiding harm from label distribution skew. Additionally, FedCRL adopts local aggregation between each local model and the global model to tackle data scarcity. A loss-wise weighting mechanism, guided by each local model's contrastive loss, coordinates the global model's involvement in each client's local aggregation, thus helping clients with scarce data. Our simulations demonstrate FedCRL's effectiveness in mitigating label heterogeneity, achieving accuracy improvements over existing methods on datasets with varying degrees of label heterogeneity.
https://arxiv.org/abs/2404.17916
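A toy sketch of the loss-wise weighting idea described above: each client mixes the global model into its local model, and the mixing weight grows with the client's contrastive loss, so a struggling (e.g., data-scarce) client pulls in more global knowledge. The specific loss-to-weight mapping and the scalar parameters below are illustrative assumptions, not FedCRL's actual formulation.

```python
# Sketch: loss-wise weighted local aggregation between a client's
# local model and the global model.

def mix_weight(contrastive_loss, scale=1.0):
    """Map a client's contrastive loss to a global-model weight in [0, 1):
    higher loss -> lean more on the global model (assumed mapping)."""
    return contrastive_loss / (contrastive_loss + scale)

def local_aggregate(local_params, global_params, loss):
    """Convex combination of local and global parameters."""
    w = mix_weight(loss)
    return [(1 - w) * l + w * g for l, g in zip(local_params, global_params)]

local_p = [0.0, 0.0]
global_p = [1.0, 1.0]
rich_client = local_aggregate(local_p, global_p, loss=0.0)    # keeps local
scarce_client = local_aggregate(local_p, global_p, loss=3.0)  # leans global
```

With zero contrastive loss the client keeps its local parameters untouched, while a loss of 3.0 shifts it three-quarters of the way toward the global model, which is the coordination behavior the weighting mechanism is meant to provide.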
The representation of events in text plays a significant role in various NLP tasks. Recent research demonstrates that contrastive learning can improve the event comprehension capabilities of Pre-trained Language Models (PLMs) and enhance the performance of event representation learning. However, the efficacy of event representation learning based on contrastive learning and PLMs is limited by the short length of event texts, which differs significantly from the text lengths used in PLM pre-training. The resulting inconsistency in the distribution of text lengths between pre-training and event representation learning may undermine the learning of event representations based on PLMs. In this study, we present PromptCL, a novel framework for event representation learning that effectively elicits the capabilities of PLMs to comprehensively capture the semantics of short event texts. PromptCL utilizes a prompt template, borrowed from prompt learning, to expand the input text during contrastive learning. This enhances event representation learning by providing a structured outline of the event components. Moreover, we propose Subject-Predicate-Object (SPO) word order and Event-oriented Masked Language Modeling (EventMLM) to train PLMs to understand the relationships between event components. Our experimental results demonstrate that PromptCL outperforms state-of-the-art baselines on event-related tasks. Additionally, we conduct a thorough analysis and demonstrate that using a prompt improves the generalization capabilities of event representations. Our code will be available at this https URL.
https://arxiv.org/abs/2404.17877
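A minimal sketch of the prompt-template expansion described above: a short event triple is wrapped in a template that names its components, lengthening the input toward the text-length distribution PLMs saw during pre-training. The template wording below is an illustrative assumption, not the actual template used in PromptCL.

```python
# Sketch: expand a short SPO event into a structured prompt sentence
# before feeding it to the PLM during contrastive learning.

TEMPLATE = ("The subject is {subject}, the predicate is {predicate}, "
            "the object is {object}.")

def expand_event(subject, predicate, obj):
    """Expand an (S, P, O) event triple into a structured prompt."""
    return TEMPLATE.format(subject=subject, predicate=predicate, object=obj)

raw_event = "the company acquired a startup"
prompt = expand_event("the company", "acquired", "a startup")
```

The expanded text is both longer and explicitly structured, outlining which span plays which role, which is what the abstract credits for the improved event representations.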
The prevalent solution for BioNER couples representation learning techniques with sequence labeling. However, such methods are inherently task-specific, generalize poorly, and often require a dedicated model for each dataset. To leverage the versatile capabilities of recent, remarkably capable large language models (LLMs), several endeavors have explored generative approaches to entity extraction. Yet these approaches often fall short of the effectiveness of previous sequence labeling approaches. In this paper, we use the open-source LLM LLaMA2 as the backbone model and design specific instructions to distinguish between different types of entities and datasets. By combining the LLM's understanding of instructions with sequence labeling techniques, we use a mix of datasets to train a model capable of extracting various types of entities. Given that the backbone LLM lacks specialized medical knowledge, we also integrate external entity knowledge bases and employ instruction tuning to compel the model to densely recognize carefully curated entities. Our model, VANER, trained by updating only a small fraction of the parameters, significantly outperforms previous LLM-based models and, for the first time for an LLM-based model, surpasses the majority of conventional state-of-the-art BioNER systems, achieving the highest F1 scores across three datasets.
https://arxiv.org/abs/2404.17835
Federated learning ensures the privacy of clients by conducting distributed training on individual client devices and sharing only the model weights with a central server. However, in real-world scenarios, the heterogeneity of data among clients necessitates appropriate personalization methods. In this paper, we aim to address this heterogeneity using a form of parameter decoupling known as representation learning. Representation learning divides deep learning models into 'base' and 'head' components. The base component, capturing common features across all clients, is shared with the server, while the head component, capturing features unique to individual clients, remains local. We propose a new representation learning-based approach that decouples the entire deep learning model into more finely divided parts, together with suitable scheduling methods, which can benefit not only data heterogeneity but also class heterogeneity. In this paper, we compare and analyze two layer scheduling approaches, namely forward (\textit{Vanilla}) and backward (\textit{Anti}), in the context of data and class heterogeneity among clients. Our experimental results show that the proposed algorithm, compared to existing personalized federated learning algorithms, achieves higher accuracy, especially under challenging conditions, while reducing computation costs.
https://arxiv.org/abs/2404.17799
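A toy sketch of the base/head parameter decoupling described above: the model's layers are split so the shared "base" is averaged on the server while each client keeps its own "head". The one-scalar-per-layer "models", the split point, and the plain federated averaging are toy assumptions; the paper divides the model more finely and adds layer scheduling on top.

```python
# Sketch: split each client's model into a shared base and a
# personal head, and federally average only the base.

def split_model(layers, n_base):
    """Return (base, head): server-shared layers vs. client-local layers."""
    return layers[:n_base], layers[n_base:]

def server_average(client_bases):
    """Federated averaging over the shared base layers only."""
    return [sum(ws) / len(ws) for ws in zip(*client_bases)]

# Two clients with 4-layer toy models (one scalar weight per layer).
client_a = [1.0, 2.0, 3.0, 4.0]
client_b = [3.0, 4.0, 5.0, 6.0]
base_a, head_a = split_model(client_a, n_base=2)
base_b, head_b = split_model(client_b, n_base=2)
shared_base = server_average([base_a, base_b])  # heads never leave clients
```

Only `shared_base` travels back to the clients; `head_a` and `head_b` stay local, which is what gives each client its personalization under heterogeneous data.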
Diffusion probabilistic models (DPMs) have become the state of the art in high-quality image generation. However, DPMs have an arbitrary noisy latent space with no interpretable or controllable semantics. Although there has been significant research effort to improve image sample quality, there is little work on representation-controlled generation using diffusion models. Specifically, causal modeling and controllable counterfactual generation using DPMs is an underexplored area. In this work, we propose CausalDiffAE, a diffusion-based causal representation learning framework that enables counterfactual generation according to a specified causal model. Our key idea is to use an encoder to extract high-level, semantically meaningful causal variables from high-dimensional data and to model stochastic variation using reverse diffusion. We propose a causal encoding mechanism that maps high-dimensional data to causally related latent factors and parameterizes the causal mechanisms among the latent factors using neural networks. To enforce the disentanglement of causal variables, we formulate a variational objective and leverage auxiliary label information in a prior to regularize the latent space. We propose a DDIM-based counterfactual generation procedure subject to do-interventions. Finally, to address the limited-label-supervision scenario, we also study the application of CausalDiffAE when part of the training data is unlabeled, which additionally enables granular control over the strength of interventions when generating counterfactuals at inference time. We empirically show that CausalDiffAE learns a disentangled latent space and is capable of generating high-quality counterfactual images.
https://arxiv.org/abs/2404.17735
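The do-intervention procedure the abstract describes follows the classic abduction-action-prediction recipe over a latent structural causal model. A minimal plain-Python sketch on a toy two-variable graph z1 → z2, with a hypothetical linear mechanism standing in for the paper's neural parameterization (all names and the mechanism are illustrative, not CausalDiffAE's actual code):

```python
# Toy latent SCM: z1 -> z2, mechanism f2 plays the role of the
# neural causal mechanisms CausalDiffAE learns among latent factors.
def f2(z1, u2):
    return 0.5 * z1 + u2  # illustrative linear mechanism with exogenous noise u2

def ancestral_sample(u1, u2):
    """Sample latents in topological order from exogenous noise."""
    z1 = u1
    z2 = f2(z1, u2)
    return z1, z2

def counterfactual(u1, u2, do_z1):
    """Abduction: recover exogenous noise from the observation;
    Action: clamp z1 via do(z1 = do_z1);
    Prediction: re-propagate descendants with the abducted noise."""
    z1_obs, z2_obs = ancestral_sample(u1, u2)
    u2_abducted = z2_obs - 0.5 * z1_obs  # invert the toy mechanism
    z1_cf = do_z1
    z2_cf = f2(z1_cf, u2_abducted)
    return z1_cf, z2_cf
```

In the paper's setting the prediction step would additionally run DDIM decoding conditioned on the intervened latents to produce the counterfactual image.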
Optical Doppler Tomography (ODT) is a blood flow imaging technique popularly used in bioengineering applications. The fundamental unit of ODT is the 1D frequency response along the A-line (depth), named raw A-scan. A 2D ODT image (B-scan) is obtained by first sensing raw A-scans along the B-line (width), and then constructing the B-scan from these raw A-scans via magnitude-phase analysis and post-processing. To obtain a high-resolution B-scan with a precise flow map, densely sampled A-scans are required in current methods, causing both computational and storage burdens. To address this issue, in this paper we propose a novel sparse reconstruction framework with four main sequential steps: 1) early magnitude-phase fusion that encourages rich interaction of the complementary information in magnitude and phase, 2) State Space Model (SSM)-based representation learning, inspired by recent successes in Mamba and VMamba, to naturally capture both the intra-A-scan sequential information and between-A-scan interactions, 3) an Inception-based Feedforward Network module (IncFFN) to further boost the SSM-module, and 4) a B-line Pixel Shuffle (BPS) layer to effectively reconstruct the final results. In the experiments on real-world animal data, our method shows clear effectiveness in reconstruction accuracy. As the first application of SSM for image reconstruction tasks, we expect our work to inspire related explorations in not only efficient ODT imaging techniques but also generic image enhancement.
Optical Doppler Tomography (ODT) is a blood flow imaging technique widely used in bioengineering applications. The fundamental unit of ODT is the 1D frequency response along the A-line (depth), called the raw A-scan. A 2D ODT image (B-scan) is obtained by first sensing raw A-scans along the B-line (width), and then constructing the B-scan from these raw A-scans via magnitude-phase analysis and post-processing. To obtain a high-resolution B-scan with a precise flow map, current methods require densely sampled A-scans, incurring both computational and storage burdens. To address this, this paper proposes a novel sparse reconstruction framework with four main sequential steps: 1) early magnitude-phase fusion that encourages rich interaction of the complementary information in magnitude and phase; 2) State Space Model (SSM)-based representation learning, inspired by the recent successes of Mamba and VMamba, to naturally capture both the intra-A-scan sequential information and the between-A-scan interactions; 3) an Inception-based Feedforward Network module (IncFFN) to further boost the SSM module; and 4) a B-line Pixel Shuffle (BPS) layer to effectively reconstruct the final results. Experiments on real-world animal data show that the method is clearly effective in reconstruction accuracy. As the first application of SSMs to image reconstruction tasks, the authors expect this work to inspire related explorations in both efficient ODT imaging techniques and generic image enhancement.
https://arxiv.org/abs/2404.17484
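A pixel-shuffle layer upsamples a spatial axis by rearranging channels into it; the B-line Pixel Shuffle presumably does this along the width (B-line) axis only, since that is the sparsely sampled direction. A minimal numpy sketch of that idea (the actual BPS layer's layout and shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def b_line_pixel_shuffle(x, r):
    """Rearrange channel groups into the B-line (width) axis.

    x: feature map of shape (C*r, H, W); returns (C, H, W*r),
    interleaving the r channel groups along width.
    """
    cr, h, w = x.shape
    c = cr // r
    x = x.reshape(c, r, h, w)       # split the channel axis into (C, r)
    x = x.transpose(0, 2, 3, 1)     # (C, H, W, r): bring r next to width
    return x.reshape(c, h, w * r)   # fold r into the width axis
```

Unlike the standard 2D pixel shuffle (which needs C*r*r channels and upsamples both axes), this variant only enlarges width, matching a setting where A-scans are dense in depth but sparse along the B-line.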
Self-Supervised Learning (SSL) is a valuable and robust training methodology for contemporary Deep Neural Networks (DNNs), enabling unsupervised pretraining on a `pretext task' that does not require ground-truth labels/annotation. This allows efficient representation learning from massive amounts of unlabeled training data, which in turn leads to increased accuracy in a `downstream task' by exploiting supervised transfer learning. Despite the relatively straightforward conceptualization and applicability of SSL, it is not always feasible to collect and/or to utilize very large pretraining datasets, especially when it comes to real-world application settings. In particular, in cases of specialized and domain-specific application scenarios, it may not be achievable or practical to assemble a relevant image pretraining dataset in the order of millions of instances or it could be computationally infeasible to pretrain at this scale. This motivates an investigation on the effectiveness of common SSL pretext tasks, when the pretraining dataset is of relatively limited/constrained size. In this context, this work introduces a taxonomy of modern visual SSL methods, accompanied by detailed explanations and insights regarding the main categories of approaches, and, subsequently, conducts a thorough comparative experimental evaluation in the low-data regime, targeting to identify: a) what is learnt via low-data SSL pretraining, and b) how do different SSL categories behave in such training scenarios. Interestingly, for domain-specific downstream tasks, in-domain low-data SSL pretraining outperforms the common approach of large-scale pretraining on general datasets. Grounded on the obtained results, valuable insights are highlighted regarding the performance of each category of SSL methods, which in turn suggest straightforward future research directions in the field.
Self-Supervised Learning (SSL) is a valuable and robust training methodology for contemporary Deep Neural Networks (DNNs), enabling unsupervised pretraining on a 'pretext task' that requires no ground-truth labels/annotation. This allows efficient representation learning from massive amounts of unlabeled training data, which in turn increases accuracy on a 'downstream task' by exploiting supervised transfer learning. Despite the relatively straightforward conceptualization and applicability of SSL, it is not always feasible to collect and/or use very large pretraining datasets, especially in real-world application settings. In particular, in specialized and domain-specific scenarios, it may be impractical to assemble a relevant image pretraining dataset on the order of millions of instances, or it may be computationally infeasible to pretrain at that scale. This motivates investigating the effectiveness of common SSL pretext tasks when the pretraining dataset is of relatively limited/constrained size. In this context, this work introduces a taxonomy of modern visual SSL methods, with detailed explanations and insights into the main categories of approaches, and then conducts a thorough comparative experimental evaluation in the low-data regime, aiming to identify: a) what is learned via low-data SSL pretraining, and b) how different SSL categories behave in such training scenarios. Interestingly, for domain-specific downstream tasks, in-domain low-data SSL pretraining outperforms the common approach of large-scale pretraining on general datasets. Based on the obtained results, valuable insights into the performance of each category of SSL methods are highlighted, which in turn suggest straightforward future research directions in the field.
https://arxiv.org/abs/2404.17202
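To make the notion of a label-free pretext task concrete, here is a minimal numpy sketch of one classic visual pretext task, RotNet-style rotation prediction: each image is rotated by k·90° and the network is trained to predict k, so the pseudo-labels come for free from the transformation itself (this is a generic illustration of a pretext task, not the specific methods the survey taxonomizes):

```python
import numpy as np

def make_rotation_pretext(batch):
    """Build a rotation-prediction pretext dataset from unlabeled images.

    batch: (N, H, W) array of images.
    Returns (4N, H, W) rotated inputs and (4N,) pseudo-labels k in {0,1,2,3},
    where input i is the source image rotated by k_i * 90 degrees.
    """
    inputs, labels = [], []
    for img in batch:
        for k in range(4):
            inputs.append(np.rot90(img, k))  # pseudo-label = rotation index
            labels.append(k)
    return np.stack(inputs), np.array(labels)
```

A classifier trained on these (input, k) pairs must learn object orientation cues, and its intermediate features then transfer to the downstream task; in the low-data regime studied above, such cheap pseudo-labels are exactly what makes in-domain pretraining possible without millions of images.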