Recurrent Neural Networks (RNNs) have revolutionized many areas of machine learning, particularly natural language and data sequence processing. Long Short-Term Memory (LSTM) networks have demonstrated their ability to capture long-term dependencies in sequential data. Inspired by Kolmogorov-Arnold Networks (KANs), a promising alternative to Multi-Layer Perceptrons (MLPs), we propose a new neural network architecture that draws on both KANs and LSTMs: Temporal Kolmogorov-Arnold Networks (TKANs). TKANs combine the strengths of both networks; they are composed of Recurring Kolmogorov-Arnold Network (RKAN) layers that embed memory management. This innovation enables us to perform multi-step time series forecasting with enhanced accuracy and efficiency. By addressing the limitations of traditional models in handling complex sequential patterns, the TKAN architecture offers significant potential for advancement in fields requiring more than one-step-ahead forecasting.
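As a rough illustration of the idea (not the exact TKAN layer from the paper), a KAN edge can be sketched as a learnable one-dimensional function expanded over a fixed basis, wired into an LSTM-style memory update. The Gaussian basis, the gating wiring, and the update rule below are all simplifying assumptions:

```python
import numpy as np

def kan_edge(x, coeffs, centers, width=0.5):
    """A KAN-style edge: a learnable 1-D function phi(x) expressed as a
    weighted sum of Gaussian radial basis functions (a stand-in for the
    B-spline parameterization typically used in KANs)."""
    basis = np.exp(-((x[..., None] - centers) ** 2) / (2 * width ** 2))
    return basis @ coeffs

def tkan_step(x_t, h_prev, c_prev, params):
    """One hypothetical recurrent step: a KAN edge transforms the input,
    and LSTM-style gates manage the memory state."""
    coeffs, centers, W_f, W_o = params
    z = kan_edge(x_t, coeffs, centers)                  # KAN feature map
    zh = np.concatenate([z, h_prev])
    f = 1 / (1 + np.exp(-(W_f @ zh)))                   # forget gate
    o = 1 / (1 + np.exp(-(W_o @ zh)))                   # output gate
    c_t = f * c_prev + (1 - f) * z                      # memory update
    h_t = o * np.tanh(c_t)                              # hidden state
    return h_t, c_t
```

Unrolling `tkan_step` over a sequence yields a recurrent KAN layer in the spirit described above, though the paper's actual parameterization may differ.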
https://arxiv.org/abs/2405.07344
The quantitative analysis of political ideological positions is a difficult task. In the past, various literature focused on parliamentary voting data of politicians, party manifestos, and parliamentary speech to estimate political disagreement and polarization in various political systems. However, previous methods of quantitative political analysis suffered from a common challenge: the amount of data available for analysis. Moreover, previous methods frequently focused on a more general analysis of politics, such as overall polarization of the parliament or party-wide political ideological positions. In this paper, we present a method to analyze the ideological positions of individual parliamentary representatives by leveraging the latent knowledge of LLMs. The method allows us to evaluate the stance of politicians on an axis of our choice, letting us flexibly measure their stance with regard to a topic or controversy of our choosing. We achieve this by using a fine-tuned BERT classifier to extract the opinion-based sentences from the speeches of representatives and projecting the average BERT embeddings for each representative onto a pair of reference seeds. These reference seeds are either manually chosen representatives known to have opposing views on a particular topic, or generated sentences which were created using OpenAI's GPT-4 model. We created the sentences by prompting GPT-4 to generate a speech that would come from a politician defending a particular position.
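The projection step can be sketched as a scalar position along the axis spanned by the two reference-seed embeddings; the function and variable names below are ours, not the paper's:

```python
import numpy as np

def stance_score(rep_embedding, seed_a, seed_b):
    """Project an embedding onto the axis from seed_a to seed_b.
    Returns ~0 near seed_a and ~1 near seed_b; values outside [0, 1]
    indicate a position beyond either seed."""
    axis = seed_b - seed_a
    return float((rep_embedding - seed_a) @ axis / (axis @ axis))
```

For instance, a representative whose average embedding sits halfway between the two seeds scores 0.5 on that axis.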
https://arxiv.org/abs/2405.07320
Generating robot demonstrations through simulation is widely recognized as an effective way to scale up robot data. Previous work often trained reinforcement learning agents to generate expert policies, but this approach lacks sample efficiency. Recently, a line of work has attempted to generate robot demonstrations via differentiable simulation, which is promising but heavily relies on reward design, a labor-intensive process. In this paper, we propose DiffGen, a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model to enable automatic and efficient generation of robot demonstrations. Given a simulated robot manipulation scenario and a natural language instruction, DiffGen can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation after manipulation. The embeddings are obtained from the vision-language model, and the optimization is achieved by calculating and descending gradients through the differentiable simulation, differentiable rendering, and vision-language model components, thereby accomplishing the specified task. Experiments demonstrate that with DiffGen, we can efficiently and effectively generate robot data with minimal human effort or training time.
https://arxiv.org/abs/2405.07309
Dense vector representations for sentences have made significant progress in recent years, as can be seen on sentence similarity tasks. Real-world phrase retrieval applications, on the other hand, still encounter challenges in making effective use of dense representations. We show that when target phrases reside inside noisy context, representing the full sentence with a single dense vector is not sufficient for effective phrase retrieval. We therefore look into the notion of representing multiple, sub-sentence, consecutive word spans, each with its own dense vector. We show that this technique is much more effective for phrase mining, yet requires considerable compute to obtain useful span representations. Accordingly, we make an argument for contextualized word/token embeddings that can be aggregated over arbitrary word spans while maintaining the span's semantic meaning. We introduce a modification to the common contrastive loss used for sentence embeddings that encourages word embeddings to have this property. To demonstrate the effect of this method, we present a dataset based on the STS-B dataset with additional generated text, which requires finding the best-matching paraphrase residing in a larger context and reporting the degree of similarity to the original phrase. We demonstrate on this dataset how our proposed method can achieve better results without a significant increase in compute.
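A minimal sketch of the span-based retrieval idea, assuming mean pooling as the aggregation function (the paper trains the token embeddings so that such aggregates stay semantically meaningful; the names below are ours):

```python
import numpy as np

def span_embeddings(token_embs, span_len):
    """Mean-pool contextualized token embeddings over every consecutive
    span of span_len tokens, yielding one vector per span."""
    n = token_embs.shape[0] - span_len + 1
    return np.stack([token_embs[i:i + span_len].mean(axis=0) for i in range(n)])

def best_span(token_embs, query_emb, span_len):
    """Return the start index of the span most cosine-similar to the query."""
    spans = span_embeddings(token_embs, span_len)
    sims = spans @ query_emb / (
        np.linalg.norm(spans, axis=1) * np.linalg.norm(query_emb))
    return int(np.argmax(sims))
```

This makes the compute trade-off concrete: every span of every length needs its own vector unless token embeddings are aggregation-friendly.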
https://arxiv.org/abs/2405.07263
Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either capture the correspondence of image-text pairs or utilize the temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, that can learn tri-modal representations in a unified self-supervised transformer. Specifically, our VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we utilize local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines. In addition, qualitative visualizations vividly showcase the superiority of our VLSA in learning discriminative visual-textual representations.
https://arxiv.org/abs/2405.07202
Semitic morphologically-rich languages (MRLs) are characterized by extreme word ambiguity. Because most vowels are omitted in standard texts, many of the words are homographs with multiple possible analyses, each with a different pronunciation and different morphosyntactic properties. This ambiguity goes beyond word-sense disambiguation (WSD), and may include token segmentation into multiple word units. Previous research on MRLs claimed that standardly trained pre-trained language models (PLMs) based on word-pieces may not sufficiently capture the internal structure of such tokens in order to distinguish between these analyses. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated and analyzed using PLMs. We evaluate all existing models for contextualized Hebrew embeddings on novel Hebrew homograph challenge sets that we deliver. Our empirical results demonstrate that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings; and that they are most effective for disambiguating segmentation and morphosyntactic features, less so regarding pure word-sense disambiguation. We show that these embeddings are more effective when the number of word-piece splits is limited, and they are more effective for 2-way and 3-way ambiguities than for 4-way ambiguity. We show that the embeddings are equally effective for homographs of both balanced and skewed distributions, whether calculated as masked or unmasked tokens. Finally, we show that these embeddings are as effective for homograph disambiguation with extensive supervised training as with a few-shot setup.
https://arxiv.org/abs/2405.07099
The Chinese numerical string corpus serves as a valuable resource for speaker verification, particularly in financial transactions. Research indicates that in short-speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, which can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor, and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss comprising text classification loss, connectionist temporal classification (CTC) loss, and decoder loss; in the speaker embedding extractor, we create a multi-scale pooling method by combining sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP). To mitigate the scarcity of data, we have recorded a publicly available Chinese numerical corpus named SHALCAS22A (hereinafter called SHAL), which can be accessed on Open-SLR. Moreover, we employ data augmentation techniques using Tacotron2 and HiFi-GAN. Our method achieves equal error rate (EER) performance improvements of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.
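A plain-numpy sketch of attentive statistics pooling and a sliding-window variant; the single attention vector `w` and the averaging of per-window statistics are simplifying assumptions rather than the paper's exact SWASP design:

```python
import numpy as np

def attentive_stats_pool(frames, w):
    """Attentive statistics pooling (ASP): softmax attention over frames,
    then attention-weighted mean and standard deviation, concatenated."""
    scores = frames @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    mu = (alpha[:, None] * frames).sum(axis=0)
    var = (alpha[:, None] * (frames - mu) ** 2).sum(axis=0)
    return np.concatenate([mu, np.sqrt(var + 1e-9)])

def sliding_window_asp(frames, w, win, hop):
    """SWASP-style variant (sketch): apply ASP inside each sliding
    window, then combine the per-window statistics by averaging."""
    outs = [attentive_stats_pool(frames[i:i + win], w)
            for i in range(0, len(frames) - win + 1, hop)]
    return np.mean(outs, axis=0)
```

Combining both pooling scales, as the paper's multi-scale method does, would amount to concatenating the two outputs.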
https://arxiv.org/abs/2405.07029
The pitch contours of Mandarin two-character words are generally understood as being shaped by the underlying tones of the constituent single-character words, in interaction with articulatory constraints imposed by factors such as speech rate, co-articulation with adjacent tones, segmental make-up, and predictability. This study shows that tonal realization is also partially determined by words' meanings. We first show, on the basis of a Taiwan corpus of spontaneous conversations, using a generalized additive regression model, and focusing on the rise-fall tone pattern, that after controlling for effects of speaker and context, word type is a stronger predictor of pitch realization than all the previously established word-form related predictors combined. Importantly, the addition of information about meaning in context improves prediction accuracy even further. We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data, and that context-sensitive, token-specific embeddings can predict the shape of pitch contours with 30% accuracy. These accuracies, which are an order of magnitude above chance level, suggest that the relation between words' pitch contours and their meanings is sufficiently strong to be functional for language users. The theoretical implications of these empirical findings are discussed.
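As a toy stand-in for the paper's embedding-based models, the "pitch contour predicts word type" direction can be illustrated with a nearest-centroid rule over token contours (the centroids and contour values below are made up, and the paper's actual classifier is more sophisticated):

```python
import numpy as np

def predict_word_type(contour, centroids):
    """Nearest-centroid sketch of 'pitch contour -> word type': each word
    type is summarized by the mean contour of its training tokens, and a
    held-out token is assigned to the closest centroid."""
    labels = list(centroids)
    dists = [np.linalg.norm(contour - centroids[w]) for w in labels]
    return labels[int(np.argmin(dists))]
```

Accuracy of such a classifier on held-out tokens is what quantifies how much word identity is recoverable from tonal realization alone.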
https://arxiv.org/abs/2405.07006
In this report, we introduce Piccolo2, an embedding model that surpasses other models in a comprehensive evaluation over 6 tasks on the CMTEB benchmark, setting a new state of the art. Piccolo2 primarily leverages an efficient multi-task hybrid loss training approach, effectively harnessing textual data and labels from diverse downstream tasks. In addition, Piccolo2 scales up the embedding dimension and uses MRL training to support more flexible vector dimensions. The latest information on the piccolo models can be accessed via: this https URL
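MRL (Matryoshka Representation Learning) trains embeddings so that nested prefixes of the full vector remain usable on their own, which is what enables flexible vector dimensions at inference time. A minimal sketch of prefix-wise scoring, with hypothetical dimension cut-offs:

```python
import numpy as np

def mrl_scores(query, doc, dims=(64, 128, 256)):
    """MRL-style scoring (sketch): cosine similarity computed on nested
    prefixes of the embedding. During MRL training, a loss is applied at
    each cut-off so that truncated vectors stay informative."""
    out = {}
    for d in dims:
        q, v = query[:d], doc[:d]
        out[d] = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    return out
```

At deployment, a user can keep only the first 64 or 128 dimensions to trade accuracy for index size.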
https://arxiv.org/abs/2405.06932
Recognizing human actions from point cloud sequence has attracted tremendous attention from both academia and industry due to its wide applications. However, most previous studies on point cloud action recognition typically require complex networks to extract intra-frame spatial features and inter-frame temporal features, resulting in an excessive number of redundant computations. This leads to high latency, rendering them impractical for real-world applications. To address this problem, we propose a Plane-Fit Redundancy Encoding point cloud sequence network named PRENet. The primary concept of our approach involves the utilization of plane fitting to mitigate spatial redundancy within the sequence, concurrently encoding the temporal redundancy of the entire sequence to minimize redundant computations. Specifically, our network comprises two principal modules: a Plane-Fit Embedding module and a Spatio-Temporal Consistency Encoding module. The Plane-Fit Embedding module capitalizes on the observation that successive point cloud frames exhibit unique geometric features in physical space, allowing for the reuse of spatially encoded data for temporal stream encoding. The Spatio-Temporal Consistency Encoding module amalgamates the temporal structure of the temporally redundant part with its corresponding spatial arrangement, thereby enhancing recognition accuracy. We have done numerous experiments to verify the effectiveness of our network. The experimental results demonstrate that our method achieves almost identical recognition accuracy while being nearly four times faster than other state-of-the-art methods.
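Plane fitting over a point neighborhood can be done in closed form with an SVD; the sketch below is not the paper's exact module, but it shows the core operation and the per-point residuals that would separate near-planar (spatially redundant) points from off-plane ones:

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit via SVD: returns (centroid, unit normal).
    The normal is the direction of least variance of the centered points."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return centroid, normal

def plane_residuals(points, centroid, normal):
    """Absolute point-to-plane distances; near-zero residuals mark points
    that the fitted plane already explains."""
    return np.abs((points - centroid) @ normal)
```
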
https://arxiv.org/abs/2405.06929
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs). It becomes even more pronounced in the generation of customized concepts, due to the scarcity of user-provided concept visual examples. By revisiting the two major stages leading to the success of TGDMs -- 1) contrastive image-language pre-training (CLIP) for the text encoder that encodes visual semantics, and 2) training the TGDM that decodes the textual embeddings into pixels -- we point out that existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one. To this end, we propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning. Specifically, given a few samples of customized concepts, we obtain non-confusing textual embeddings of a concept by fine-tuning CLIP via contrasting a concept and the over-segmented visual regions of other concepts. Experimental results demonstrate the effectiveness of CLIF in preventing the confusion of multi-customized concept generation.
https://arxiv.org/abs/2405.06914
Inductive representation learning on temporal heterogeneous graphs is crucial for scalable deep learning on heterogeneous information networks (HINs) which are time-varying, such as citation networks. However, most existing approaches are not inductive and thus cannot handle new nodes or edges. Moreover, previous temporal graph embedding methods are often trained with the temporal link prediction task to simulate the link formation process of temporal graphs, while ignoring the evolution of high-order topological structures on temporal graphs. To fill these gaps, we propose a Continuous-Time Representation Learning (CTRL) model on temporal HINs. To preserve heterogeneous node features and temporal structures, CTRL integrates three parts in a single layer: 1) a \emph{heterogeneous attention} unit that measures the semantic correlation between nodes, 2) an \emph{edge-based Hawkes process} to capture temporal influence between heterogeneous nodes, and 3) \emph{dynamic centrality} that indicates the dynamic importance of a node. We train the CTRL model with a future event (a subgraph) prediction task to capture the evolution of the high-order network structure. Extensive experiments have been conducted on three benchmark datasets. The results demonstrate that our model significantly boosts performance and outperforms various state-of-the-art approaches. Ablation studies are conducted to demonstrate the effectiveness of the model design.
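The conditional intensity of a univariate Hawkes process, the building block behind the edge-based variant above, can be computed directly; the parameter values below are placeholders:

```python
import numpy as np

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, delta=1.0):
    """Conditional intensity of a univariate Hawkes process:
    lambda(t) = mu + sum_{t_i < t} alpha * exp(-delta * (t - t_i)).
    Past events excite the present, with exponentially decaying influence."""
    past = event_times[event_times < t]
    return mu + alpha * np.exp(-delta * (t - past)).sum()
```

In the edge-based setting, `event_times` would be the timestamps of past interactions on an edge, so recent interactions raise the likelihood of new ones.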
https://arxiv.org/abs/2405.08013
The task of medical image recognition is notably complicated by the presence of varied and multiple pathological indications, presenting a unique challenge in multi-label classification with unseen labels. This complexity underlines the need for computer-aided diagnosis methods employing multi-label zero-shot learning. Recent advancements in pre-trained vision-language models (VLMs) have showcased notable zero-shot classification abilities on medical images. However, these methods have limitations in leveraging extensive pre-trained knowledge from broader image datasets, and often depend on manual prompt construction by expert radiologists. By automating the process of prompt tuning, prompt learning techniques have emerged as an efficient way to adapt VLMs to downstream tasks. Yet, existing CoOp-based strategies fall short in performing class-specific prompts on unseen categories, limiting generalizability in fine-grained scenarios. To overcome these constraints, we introduce a novel prompt generation approach inspired by text generation in natural language processing (NLP). Our method, named Pseudo-Prompt Generating (PsPG), capitalizes on the prior knowledge of multi-modal features. Featuring an RNN-based decoder, PsPG autoregressively generates class-tailored embedding vectors, i.e., pseudo-prompts. Comparative evaluations on various multi-label chest radiograph datasets affirm the superiority of our approach against leading medical vision-language and multi-label prompt learning methods. The source code is available at this https URL
https://arxiv.org/abs/2405.06468
Achieving gender equality is a pivotal factor in realizing the UN's Global Goals for Sustainable Development. Gender bias studies work towards this and rely on name-based gender inference tools to assign individual gender labels when gender information is unavailable. However, these tools often inaccurately predict gender for Chinese Pinyin names, leading to potential bias in such studies. With the growing participation of Chinese in international activities, this situation is becoming more severe. Specifically, current tools focus on pronunciation (Pinyin) information, neglecting the fact that the latent connections between Pinyin and the underlying Chinese characters (Hanzi) convey critical information. As a first effort, we formulate the Pinyin name-gender guessing problem and design a Multi-Task Learning Network assisted by Knowledge Distillation that enables the Pinyin embeddings in the model to possess semantic features of Chinese characters and to learn gender information from Chinese character names. Our open-sourced method surpasses commercial name-gender guessing tools by a relative 9.70\% to 20.08\%, and also outperforms the state-of-the-art algorithms.
https://arxiv.org/abs/2405.06221
Traffic prediction is pivotal for rational transportation supply scheduling and allocation. Existing research into short-term traffic prediction, however, faces challenges in adequately addressing exceptional circumstances and integrating non-numerical contextual information like weather into models. Large language models offer a promising solution due to their inherent world knowledge. However, directly using them for traffic prediction presents drawbacks such as high cost, lack of determinism, and limited mathematical capability. To mitigate these issues, this study proposes a novel approach. Instead of directly employing large models for prediction, it utilizes them to process textual information and obtain embeddings. These embeddings are then combined with historical traffic data and input into traditional spatiotemporal forecasting models. The study investigates two types of special scenarios: regional-level and node-level. For regional-level scenarios, textual information is represented as a node connected to the entire network. For node-level scenarios, embeddings from the large model represent additional nodes connected only to the corresponding nodes. This approach shows a significant improvement in prediction accuracy in our experiments on the New York Bike dataset.
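The regional-level wiring can be sketched as appending the text embedding as one extra graph node connected to the whole network, so a downstream spatiotemporal model can propagate the contextual signal; the function and variable names are ours:

```python
import numpy as np

def add_context_node(adjacency, node_features, text_embedding):
    """Regional-level scenario (sketch): append the LLM-derived text
    embedding as an extra node connected to every existing node."""
    n = adjacency.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = adjacency
    A[n, :n] = A[:n, n] = 1.0          # context node linked to all nodes
    X = np.vstack([node_features, text_embedding])
    return A, X
```

The node-level scenario would instead connect each extra embedding node only to its corresponding sensor node.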
https://arxiv.org/abs/2405.06719
Transformer-based large language models (LLMs) have been widely used in language processing applications. However, most of them restrict the context window within which the model can attend to every input token. Previous work on recurrent models can memorize past tokens to enable unlimited context and maintain effectiveness. However, these models have "flat" memory architectures, which are limited in selecting and filtering information. Since humans are good at learning and self-adjustment, we speculate that imitating the brain's memory hierarchy is beneficial for model memorization. We propose the Hierarchical Memory Transformer (HMT), a novel framework that enables and improves models' long-context processing ability by imitating human memorization behavior. Leveraging memory-augmented segment-level recurrence, we organize the memory hierarchy by preserving tokens from early input token segments, passing memory embeddings along the sequence, and recalling relevant information from history. Evaluating general language modeling (Wikitext-103, PG-19) and question-answering tasks (PubMedQA), we show that HMT steadily improves the long-context processing ability of context-constrained and long-context models. With an additional 0.5% - 2% of parameters, HMT can easily plug in and augment future LLMs to handle long context effectively. Our code is open-sourced on Github: this https URL.
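A toy sketch of memory-augmented segment-level recurrence: the sequence is processed segment by segment while a running memory embedding is passed along. The encoder and the memory update rule here are placeholders, not HMT's actual components:

```python
import numpy as np

def process_with_memory(tokens, segment_len, encode, memory_size):
    """Segment-level recurrence (sketch): split a long sequence into
    segments, give the running memory embedding to the encoder at each
    step, and update the memory from the segment's encoding."""
    memory = np.zeros(memory_size)
    outputs = []
    for i in range(0, len(tokens), segment_len):
        seg = tokens[i:i + segment_len]
        enc = encode(seg, memory)                        # user-supplied encoder
        memory = 0.5 * memory + 0.5 * enc[:memory_size]  # toy recall/update rule
        outputs.append(enc)
    return outputs, memory
```

HMT additionally recalls relevant memories from history rather than keeping a single running state, but the segment-by-segment flow is the same.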
https://arxiv.org/abs/2405.06067
Time Series Representation Learning (TSRL) focuses on generating informative representations for various Time Series (TS) modeling tasks. Traditional Self-Supervised Learning (SSL) methods in TSRL fall into four main categories: reconstructive, adversarial, contrastive, and predictive, each with a common challenge of sensitivity to noise and intricate data nuances. Recently, diffusion-based methods have shown advanced generative capabilities. However, they primarily target specific application scenarios like imputation and forecasting, leaving a gap in leveraging diffusion models for generic TSRL. Our work, Time Series Diffusion Embedding (TSDE), bridges this gap as the first diffusion-based SSL TSRL approach. TSDE segments TS data into observed and masked parts using an Imputation-Interpolation-Forecasting (IIF) mask. It applies a trainable embedding function, featuring dual-orthogonal Transformer encoders with a crossover mechanism, to the observed part. We train a reverse diffusion process conditioned on the embeddings, designed to predict noise added to the masked part. Extensive experiments demonstrate TSDE's superiority in imputation, interpolation, forecasting, anomaly detection, classification, and clustering. We also conduct an ablation study, present embedding visualizations, and compare inference speed, further substantiating TSDE's efficiency and validity in learning representations of TS data.
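A minimal sketch of an Imputation-Interpolation-Forecasting mask as described above; the masking ratio, gap width, and positions below are illustrative assumptions:

```python
import numpy as np

def iif_mask(length, rng, p_missing=0.2, interp_start=None, forecast_len=3):
    """Imputation-Interpolation-Forecasting mask (sketch): True marks
    observed points, False marks targets to reconstruct. Combines random
    missingness (imputation), a contiguous interior gap (interpolation),
    and a hidden suffix (forecasting)."""
    mask = rng.random(length) > p_missing              # imputation: random dropout
    if interp_start is not None:
        mask[interp_start:interp_start + 2] = False    # interpolation: interior gap
    mask[length - forecast_len:] = False               # forecasting: hide the future
    return mask
```

Conditioning the reverse diffusion on embeddings of the observed (True) part and denoising the masked (False) part covers all three tasks with one objective.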
https://arxiv.org/abs/2405.05959
Weather forecasting is a crucial task for meteorological research, with direct social and economic impacts. Recently, data-driven weather forecasting models based on deep learning have shown great potential, achieving superior performance compared with traditional numerical weather prediction methods. However, these models often require massive training data and computational resources. In this paper, we propose EWMoE, an effective model for accurate global weather forecasting, which requires significantly less training data and computational resources. Our model incorporates three key components to enhance prediction accuracy: meteorology-specific embedding, a core Mixture-of-Experts (MoE) layer, and two specific loss functions. We conduct our evaluation on the ERA5 dataset using only two years of training data. Extensive experiments demonstrate that EWMoE outperforms current models such as FourCastNet and ClimaX at all forecast lead times, achieving competitive performance compared with the state-of-the-art Pangu-Weather model in evaluation metrics such as Anomaly Correlation Coefficient (ACC) and Root Mean Square Error (RMSE). Additionally, ablation studies indicate that applying the MoE architecture to weather forecasting offers significant advantages in improving accuracy and resource efficiency.
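The two evaluation metrics mentioned above are standard and can be computed directly; here `climatology` denotes the long-term mean field against which anomalies are taken:

```python
import numpy as np

def rmse(pred, truth):
    """Root Mean Square Error over a (flattened) forecast field."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def acc(pred, truth, climatology):
    """Anomaly Correlation Coefficient: correlation between predicted
    and observed anomalies relative to the climatological mean."""
    p, t = pred - climatology, truth - climatology
    return float((p * t).sum() / np.sqrt((p ** 2).sum() * (t ** 2).sum()))
```

A perfect forecast gives RMSE 0 and ACC 1; in practice these are aggregated per variable and per lead time.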
https://arxiv.org/abs/2405.06004
In the digital era, with escalating privacy concerns, it is imperative to devise robust strategies that protect private data while maintaining the intrinsic value of textual information. This research embarks on a comprehensive examination of text anonymisation methods, focusing on Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), Embeddings from Language Models (ELMo), and the transformative capabilities of the Transformers architecture. Each model presents unique strengths: LSTM models long-term dependencies, CRF captures dependencies among word sequences, ELMo delivers contextual word representations using deep bidirectional language models, and Transformers introduce self-attention mechanisms that provide enhanced scalability. Our study is positioned as a comparative analysis of these models, emphasising their synergistic potential in addressing text anonymisation challenges. Preliminary results indicate that CRF, LSTM, and ELMo individually outperform traditional methods. The inclusion of Transformers, when compared alongside the other models, offers a broader perspective on achieving optimal text anonymisation in contemporary settings.
https://arxiv.org/abs/2405.06709
Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models together tends to introduce significant interference between tasks and reduce video clarity because the interaction space is limited to the low-level semantic RGB space. To address this issue, we propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space. Referring to recent work on face generation, we choose the VQ-embedding space due to its excellent editability and fidelity performance. To enhance the framework's generalization capabilities for unseen identities, we incorporate identity loss during the training of the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space during the training of the lip synchronization module to elevate synchronization quality. In the evaluation phase, previous studies primarily focused on the self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we expand the evaluation scope to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric to more comprehensively assess the identity consistency over time series in generated facial videos. Experimental results on the HDTF demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency. Our demo is available at this http URL.
https://arxiv.org/abs/2405.05636